# Introduction to statistics

The following is a glossary of statistical terms used throughout this module. For more information on any of the terms please click on the relevant link.

## A

Alternative hypothesis Also known as the research hypothesis, and denoted by $$\textrm{H}_\textrm{A}$$, the alternative hypothesis always states the opposite of the null hypothesis; i.e. it states that there _is_ a difference or relationship between variables in a population.
See Hypothesis testing, Null hypothesis, Variable.

## C

Categorical data Data which is grouped into categories, such as data for a ‘gender’ or ‘smoking status’ variable. Categorical data can be further classified as nominal or ordinal.
See Data, Nominal data, Ordinal data.

Chi-square test A non-parametric test used to determine whether there is a statistically significant association between two categorical variables. The chi-square value is represented by $$\chi^2$$.
See Categorical data, Non-parametric test.
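As a rough illustration of how a chi-square value can be obtained, the sketch below compares observed counts in a 2 × 2 table with the counts expected if the two variables were unrelated. The table and its labels are invented for the example, and no continuity correction is applied.

```python
# Hypothetical observed counts (made-up numbers for illustration only).
# Rows: smoker / non-smoker. Columns: condition present / condition absent.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Count expected in this cell if the variables were not associated.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi_square, degrees_of_freedom)
```

For this table every expected count is $$25$$, so the chi-square value works out to $$4$$ with $$1$$ degree of freedom. In practice a statistics package would also report the corresponding $$p$$ value.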

Cohen’s $$d$$ A measure of effect size which determines how many standard deviations two means are separated by. It is commonly used to evaluate practical significance for $$t$$ tests and ANOVA.
See Effect size, $$t$$ test, One-way ANOVA.
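One common way to compute Cohen’s $$d$$ for two independent groups divides the difference between the group means by a pooled standard deviation. The sketch below assumes that version of the formula; the function name `cohens_d` and the data are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up scores for two unrelated groups.
group_a = [2, 4, 6, 8, 10]
group_b = [1, 3, 5, 7, 9]

d = cohens_d(group_a, group_b)
print(round(d, 3))
```

Here the means differ by $$1$$ and the pooled standard deviation is $$\sqrt{10}$$, so the two means are separated by roughly a third of a standard deviation.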

Confidence interval A range of values that a population parameter (e.g. the population mean) is expected to lie within, with a given level of confidence (known as the confidence coefficient). The confidence coefficient is typically $$95\%$$, in which case it is referred to as a $$95\%$$ confidence interval.
See Population.
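As a minimal sketch, a $$95\%$$ confidence interval for a population mean can be approximated as the sample mean plus or minus a critical value multiplied by the standard error. The example below assumes a normal (z) critical value for simplicity; with small samples a $$t$$ critical value is normally used instead, which gives a slightly wider interval. The sample values are invented.

```python
import math
import statistics

# Made-up sample measurements.
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.7, 5.1, 5.0]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Critical value leaving 2.5% in each tail of the standard normal
# distribution (approximately 1.96).
z_critical = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z_critical * standard_error
upper = sample_mean + z_critical * standard_error

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```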

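Confounding variable A variable, other than the independent variable, that may have an influence on the dependent (outcome) variable. Because a confounding variable is typically also related to the independent variable, it can distort the apparent relationship between the two.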
See Variable.

Continuous data Data which is measured on a continuous numerical scale and which can take on a large number of possible values, such as data for a ‘weight’ or ‘distance’ variable. Continuous data can be further classified as interval or ratio.
See Interval data, Ratio data, Variable.

Cross-tabulation (contingency table) A table used to display information for two categorical variables. Categories of the independent variable are listed in the rows and categories of the dependent variable are listed in the columns, with each cell containing the frequency (number) of subjects that fall into that combination of categories. Percentages are often also included, along with totals.
See Categorical data, Dependent variable, Independent variable.

## D

Data Observations and measurements which have been collected in some way, often through research. Quantitative data measures quantities and is recorded as numbers, while qualitative data records qualities in terms of different categories or in terms of thoughts, feelings and opinions.

Discrete data Data which measures counts or numbers of events, such as data for a ‘class attendance’ variable. It can be treated as either categorical or continuous, depending on how many values are possible.
See Data.

Degrees of freedom The number of values that are free to vary when calculating an estimate. This is commonly reported as part of the results of various hypothesis tests.
See Hypothesis testing.

Dependent variable (outcome variable) When testing for a relationship between pairs of variables, the dependent variable is the one that is potentially influenced, affected or predicted by the other variable.
See Variable.

Descriptive statistics Statistics that are used to summarise and describe a variable or variables for a sample of data.
See Data, Sample, Variable.

## E

Effect size Effect size measures the magnitude of a difference or relationship between variables. It is used to provide evidence of whether it is meaningful in real life (i.e. has practical significance), and is calculated differently for different statistical tests.
See Cohen’s $$d$$, Odds ratio, Practical significance, Variable.

## H

Hypothesis testing Hypothesis testing is used to determine whether a difference or relationship observed in a sample is statistically significant in terms of the population from which the sample was drawn. This can be assessed by interpreting the resulting $$p$$ value.
See Alternative hypothesis, Null hypothesis, $$p$$ value, Population, Sample, Statistical significance.

## I

Independent samples $$t$$ test A parametric inferential statistical test used to determine whether there is a statistically significant difference between the mean of a continuous variable for two independent (unrelated) groups.
See Continuous data, Inferential statistics, Mean, Parametric test, Statistical significance.

Independent variable (predictor or exposure variable) When testing for a relationship between pairs of variables, the independent variable is the one that potentially influences, affects or predicts the other variable.
See Variable.

Inferential statistics Statistics that are used to draw inferences about the wider population from which a sample of data was drawn.
See Population, Sample.

Interquartile range The interquartile range is a measure of dispersion appropriate in situations where the median is used as the measure of central tendency. It is calculated by finding the difference between the first and third quartiles.
See Measure of central tendency, Measure of dispersion, Median, Quartiles.

Interval data Continuous data that does not have an absolute zero, and where negative numbers also have meaning, such as for a ‘temperature in degrees Celsius’ variable.
See Continuous data.

## L

Level of significance In a hypothesis test, the level of significance (denoted by $$\alpha$$) is the threshold that the $$p$$ value must fall below in order for the null hypothesis to be rejected. It is typically $$5\%$$ ($$.05$$).
See Hypothesis testing, Null hypothesis, $$p$$ value.

## M

Mean The mean is the arithmetic average of a data set, calculated by adding all of the data together and dividing by the total number of values. It is the most commonly used measure of central tendency. The sample mean is denoted by $$\bar{x}$$, while the population mean is denoted by $$\mu$$.
See Measure of central tendency, Population, Sample.

Measure of central tendency A descriptive statistic which summarises a continuous variable by finding the average, central or typical member. Examples of measures of central tendency are the mean, median and mode.
See Continuous data, Descriptive statistic, Mean, Median, Mode.

Measure of dispersion A descriptive statistic which summarises a continuous variable by finding out how widely it is spread or dispersed. Examples of measures of dispersion are the range, interquartile range, variance and standard deviation.
See Continuous data, Descriptive statistic, Interquartile range, Range, Standard deviation, Variance.

Median The median is a more appropriate measure of central tendency than the mean when the data is affected by outliers or is skewed. It is calculated by finding the middle value (or average of two middle values) when the data set is sorted from smallest to largest.
See Mean, Measure of central tendency, Outlier, Skewness.

Mode The mode is the most frequently occurring value (or values) in the data set; it is a less commonly used measure of central tendency.
See Measure of central tendency.
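The short sketch below illustrates the mean, median and mode entries using Python’s standard `statistics` module, and shows why the median is preferred when an outlier is present. The data values are invented.

```python
import statistics

values = [2, 3, 3, 5, 7]

mean_value = statistics.mean(values)      # (2 + 3 + 3 + 5 + 7) / 5
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

# Adding one extreme value pulls the mean upward far more than the median.
with_outlier = values + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean_value, median_value, mode_value)
print(mean_with_outlier, median_with_outlier)
```

With the extra value of $$100$$ the mean jumps from $$4$$ to $$20$$, while the median only moves from $$3$$ to $$4$$.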

## N

Nominal data Categorical data where the categories do not have an order, such as for a ‘marital status’ variable. If there are only two categories, then the terms binary and/or dichotomous are often also used.
See Categorical data.

Non-parametric test An inferential statistical test that doesn’t require the variable(s) to be normally distributed, and doesn’t require continuous data.
See Continuous data, Inferential statistics, Normal distribution.

Normal distribution A distribution (spread) of data that has two key properties:
1. The mean, median and mode are all equal.
2. Fixed proportions of the data lie within certain standard deviations of the mean ($$68\%$$ within one standard deviation, $$95\%$$ within two standard deviations and $$99.7\%$$ within three standard deviations).
Many naturally occurring data sets approximate this distribution, and it is an assumption for parametric tests.
See Data, Mean, Median, Mode, Parametric test, Standard deviation.
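The fixed proportions listed above can be checked numerically. The sketch below uses `NormalDist` from Python’s standard `statistics` module (available from Python 3.8) to find the proportion of a standard normal distribution lying within one, two and three standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

proportions = {}
for k in (1, 2, 3):
    # Area under the curve between -k and +k standard deviations.
    proportions[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)

for k, proportion in proportions.items():
    print(f"within {k} standard deviation(s): {proportion:.1%}")
```

The exact values are slightly different from the rounded figures: about $$68.3\%$$, $$95.4\%$$ and $$99.7\%$$ respectively.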

Null hypothesis Denoted by $$\textrm{H}_\textrm{0}$$, the null hypothesis always states that there is _no_ difference or relationship between variables in a population.
See Alternative hypothesis, Hypothesis testing, Variable.

## O

Odds ratio A measure of effect size used when testing for association between an exposure and an outcome (e.g. using a Chi-square test), an odds ratio compares the odds of exposure in the group with the outcome, to the odds of exposure in the group without the outcome. An odds ratio of $$1$$ indicates no difference between the two groups, while an odds ratio greater than $$1$$ indicates that the group with the outcome are more likely to have had the exposure, and an odds ratio less than $$1$$ indicates that the group with the outcome are less likely to have had the exposure.
See Chi-square test.
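Following the definition above, an odds ratio can be sketched from the four counts of a 2 × 2 table. The counts and variable names below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Made-up counts for illustration.
exposed_with_outcome = 30
unexposed_with_outcome = 20
exposed_without_outcome = 20
unexposed_without_outcome = 30

# Odds of exposure within each outcome group.
odds_with_outcome = exposed_with_outcome / unexposed_with_outcome
odds_without_outcome = exposed_without_outcome / unexposed_without_outcome

odds_ratio = odds_with_outcome / odds_without_outcome
print(round(odds_ratio, 2))
```

Here the odds of exposure are $$1.5$$ in the group with the outcome and about $$0.67$$ in the group without it, giving an odds ratio of $$2.25$$; the group with the outcome was more likely to have had the exposure.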

One sample $$t$$ test A parametric inferential statistical test used to determine whether there is a statistically significant difference between the mean of a continuous variable and a test value (some hypothesised value).
See Continuous data, Inferential statistics, Mean, Parametric test, Statistical significance.
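The test statistic for a one sample $$t$$ test is commonly computed as $$t = (\bar{x} - \mu_0) / (s / \sqrt{n})$$, where $$\mu_0$$ is the test value. The sketch below applies this formula to invented data; the names `sample` and `test_value` are assumptions made for the example, and obtaining the $$p$$ value would require a $$t$$ distribution from a statistics package.

```python
import math
import statistics

# Made-up sample and hypothesised test value.
sample = [2, 4, 6, 8, 10]
test_value = 4

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

t_statistic = (sample_mean - test_value) / standard_error
degrees_of_freedom = n - 1

print(round(t_statistic, 3), degrees_of_freedom)
```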

One-way ANOVA (analysis of variance) A parametric inferential statistical test used to determine whether there are any statistically significant differences between the means of a continuous variable for three or more independent (unrelated) groups.
See Continuous data, Inferential statistics, Mean, Parametric test, Statistical significance.

Ordinal data Categorical data where the categories do have an order, such as for a ‘satisfaction level’ variable.
See Categorical data.

Outlier An outlier is any data point that lies well above or below the other data; in particular, over $$1.5$$ interquartile ranges below the first quartile or $$1.5$$ interquartile ranges above the third quartile.
See Interquartile range, Quartiles.
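The $$1.5$$ interquartile range rule can be sketched as follows. Note that software packages use slightly different conventions for locating quartiles; this example assumes the `"inclusive"` method of `statistics.quantiles` (Python 3.8 or later), and the data are invented.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```

With these values the interquartile range is $$4.5$$, so anything above $$14.5$$ is flagged and only the value $$50$$ is identified as an outlier.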

## P

$$p$$ value The $$p$$ value for a hypothesis test is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis is true. A small $$p$$ value indicates that such a result would be unlikely under the null hypothesis; in particular, if the $$p$$ value is less than the level of significance, this is evidence to reject the null hypothesis (and hence evidence of statistical significance).
See Hypothesis testing, Level of significance, Null hypothesis, Statistical significance, Test statistic.

Paired samples $$t$$ test A parametric inferential statistical test used to determine whether there is a statistically significant difference between the means of continuous variables for two related groups.
See Continuous data, Inferential statistics, Mean, Parametric test, Statistical significance.

Parametric test An inferential statistical test that requires at least one continuous variable, and which requires continuous variables to be normally distributed.
See Continuous data, Inferential statistics, Normal distribution.

Pearson’s correlation coefficient Pearson’s correlation coefficient, denoted by $$r$$, measures the strength and direction of the linear correlation (straight line relationship) between two continuous variables. It can range from $$-1$$ to $$1$$, with values close to $$-1$$ indicating strong negative correlation, values close to $$1$$ indicating strong positive correlation, and values close to $$0$$ indicating little or no linear correlation.
See Continuous data.
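As an illustration, Pearson’s $$r$$ can be computed directly from the deviations of each variable from its mean. The helper name `pearson_r` and the data below are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases in a straight line with hours
falling = [10, 8, 6, 4, 2]  # decreases in a straight line with hours

print(pearson_r(hours, rising), pearson_r(hours, falling))
```

The first pair lies exactly on an upward-sloping straight line and the second on a downward-sloping one, so the coefficients are $$1$$ and $$-1$$ respectively.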

Percentiles A measure of dispersion that measures position from the beginning of an ordered data set, and can be used to measure the relative standing of a particular data point.
See Measure of dispersion.

Population A population is every member of a group of interest. Normally it is not possible or feasible to collect data from the entire population, so a random sample is used instead to draw inferences about the population.
See Data, Inferential statistics, Sample.

Power The power of a hypothesis test is the probability that the test will find an effect if one actually exists; in other words, that an incorrect null hypothesis will in fact be rejected.
See Hypothesis testing, Null hypothesis.

Practical significance Practical significance refers to whether or not a difference or relationship between variables is meaningful in a practical sense (i.e. in real life). It is determined by calculating an effect size.
See Effect size, Variable.

## Q

Quartiles A specific type of percentiles which divide the data set into quarters. In particular, the $$25$$th percentile is known as the first or lower quartile, the $$50$$th percentile is known as the median, and the $$75$$th percentile is known as the third or upper quartile.
See Median, Percentiles.

## R

Range The simplest measure of dispersion, the range is the difference between the smallest and largest value in a data set.
See Measure of dispersion.

Ratio data Continuous data that does have an absolute zero, and where negative numbers do not have meaning, such as for a ‘height’ variable.
See Continuous data.

## S

Sample A sample is a subset of a population. It can be analysed using descriptive statistics, or used to draw inferences about the wider population using inferential statistics.
See Descriptive statistics, Inferential statistics, Population.

Standard deviation Standard deviation is the most commonly used measure of dispersion, appropriate in situations where the mean is used as the measure of central tendency. It is the square root of the variance, and measures how much deviation there is from the mean. Sample standard deviation is denoted by $$s$$, while population standard deviation is denoted by $$\sigma$$.
See Mean, Measure of central tendency, Measure of dispersion, Population, Sample, Variance.

Statistical significance Statistical significance refers to whether or not a difference or relationship between variables observed in a sample could have occurred due to random chance alone. It is determined by conducting a hypothesis test.
See Hypothesis testing, Sample, Variable.

## T

Test statistic A value calculated as the result of a hypothesis test, the test statistic compares the value of the sample statistic (for example, the sample mean) with the value specified by the null hypothesis for the corresponding population parameter.
See Hypothesis testing, Mean, Null hypothesis, Population, Sample.

## V

Variable A characteristic or attribute that you are observing, measuring and recording data for, e.g. height, weight, eye colour, dog breed, etc.
See Data.

Variance A measure of dispersion that measures how much deviation there is from the mean, based on the squared deviations of the data from the mean. The square root is usually taken in order to find the standard deviation.
See Mean, Measure of dispersion, Standard deviation.
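The relationship between variance and standard deviation can be sketched with Python’s standard `statistics` module, which provides both sample versions (dividing by $$n - 1$$) and population versions (dividing by $$n$$). The data values are invented.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample statistics: the squared deviations are divided by n - 1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population statistics: the squared deviations are divided by n.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

print(sample_variance, sample_sd)
print(population_variance, population_sd)
```

The mean of this data set is $$5$$ and the squared deviations sum to $$32$$, so the population variance is $$32 / 8 = 4$$ (standard deviation $$2$$), while the sample variance is $$32 / 7$$, which is slightly larger.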