Regression analysis - Introduction to statistics - UniSkills

Regression analysis is a statistical method used to model and explain the relationship between variables, and to make predictions about one variable based on one or more others. Different types of regression are required depending on the type of variables involved, and this page outlines two of the most common types.

In brief, it covers the following:

When to use linear regression, and how to interpret the results
When to use logistic regression, and how to interpret the results

Linear regression

Linear regression is used to model the relationship between one or more independent variables (which can be continuous or categorical) and a continuous dependent variable. It builds on the idea of correlation, as it also describes a linear relationship between variables, but it allows this relationship to be expressed as an equation that can be used to make predictions. Furthermore, while simple linear regression involves only one independent variable (as with correlation), in practice several factors usually influence the outcome, and multiple linear regression allows all of them to be accounted for in the same model.

For example, multiple linear regression can be used to examine how factors such as heart rate before exercise, age and gender influence heart rate after exercise. It does this by providing information about how well the variables explain heart rate after exercise overall, as well as the effect of each variable individually while accounting for the other variables in the model.

Before conducting linear regression, you need to check that the following assumptions are valid:

Assumption 1: The observations are independent, meaning that measurements for one subject have no bearing on any other subject’s measurements.

Assumption 2: There is a linear relationship between each independent variable and the dependent variable. This can be assessed by examining scatter plots.

Assumption 3: There are no significant outliers that unduly influence the results.

Assumption 4: The independent variables are not too highly correlated with each other (multicollinearity), as this can make it difficult to determine how each variable is influencing the dependent variable.

Assumption 5: The residuals (differences between the observed and predicted values) are approximately normally distributed, and their variability is similar across all values of the predicted variable (homoscedasticity).
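As a rough illustration of how Assumption 4 might be screened, the sketch below computes the Pearson correlation between two independent variables using only the Python standard library. The data values and the variable names (`age`, `resting_hr`) are hypothetical and are not taken from the example on this page. A correlation close to \(1\) or \(-1\) between two independent variables would suggest that multicollinearity is worth investigating further, although the exact threshold used varies between sources.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for two independent variables (illustration only)
age = [21, 25, 30, 34, 40, 46]
resting_hr = [62, 70, 66, 75, 71, 80]

r = pearson_r(age, resting_hr)
print(f"Correlation between the two independent variables: r = {r:.3f}")
```

With more than two independent variables, the same function could be applied to each pair in turn.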

In addition, it is important to have a sufficiently large sample size relative to the number of independent variables included in the regression model. A commonly used rule of thumb is to have around \(10\) to \(15\) cases for each independent variable.

Provided the assumptions for linear regression are met, and the analysis is conducted using statistical software (e.g. SPSS as in this example), the results should include the following three output tables:

The first of these output tables is the Model Summary, which includes \(R^2\) and adjusted \(R^2\) values. While these both provide an indication of how well the model explains the variation in the dependent variable, the adjusted \(R^2\) value takes into account the number of independent variables in the model and provides a more accurate estimate of how well the model would perform in the population, so is typically the one used. In this case, the adjusted \(R^2\) value of \(.131\) indicates that \(13.1\%\) of the variation in heart rate after exercise can be explained by the independent variables in the model (heart rate before exercise, age and gender).
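To show how the adjustment for the number of independent variables works, the sketch below applies the standard adjusted \(R^2\) formula, \(1 - (1 - R^2)\frac{n - 1}{n - k - 1}\), where \(n\) is the sample size and \(k\) is the number of independent variables. The input values are hypothetical, as the sample size and unadjusted \(R^2\) for the example on this page are not stated in the text.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R-squared from R-squared, sample size and number of predictors."""
    if n_cases - n_predictors - 1 <= 0:
        raise ValueError("n_cases must exceed n_predictors + 1")
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Hypothetical inputs (illustration only): R-squared of .30 with 3 predictors
small_sample = adjusted_r_squared(0.30, n_cases=50, n_predictors=3)
large_sample = adjusted_r_squared(0.30, n_cases=500, n_predictors=3)

print(f"n = 50:  adjusted R-squared = {small_sample:.3f}")
print(f"n = 500: adjusted R-squared = {large_sample:.3f}")
```

The penalty shrinks as the sample size grows relative to the number of independent variables, which is why the two values move closer to the unadjusted \(R^2\) in larger samples.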

The second output table is the ANOVA table, which includes the overall \(p\) value for the model. This is used to determine whether the model as a whole is statistically significant. In this case, the \(p\) value of \(.045\) indicates that the model is statistically significant at the \(.05\) level.

The final output table is the Coefficients table, which shows the effect of each independent variable on the dependent variable while accounting for the other variables in the model. It includes both unstandardised coefficients (\(B\)) and standardised coefficients (\(\beta\)), along with associated \(p\) values. While both types of coefficient provide information about the relationship between the independent and dependent variables, the unstandardised coefficients are typically used for interpretation, as they indicate the expected change in the dependent variable for a one-unit increase in the independent variable.

In this example, heart rate before exercise has a positive coefficient (\(B = 0.763\)) and is statistically significant (\(p = .020\)). This indicates that, after accounting for age and gender, heart rate after exercise is expected to increase by \(0.763\) beats per minute for each additional beat per minute in heart rate before exercise.

Age also has a positive coefficient (\(B = 0.217\)) but is not statistically significant (\(p = .346\)). This indicates that, after accounting for the other variables, heart rate after exercise is expected to increase by \(0.217\) beats per minute for each additional year of age, although there is insufficient evidence that this relationship exists in the population.

Gender also has a positive coefficient (\(B = 7.239\)) but is not statistically significant (\(p = .066\)). Given that gender is coded as \(1\) for male and \(2\) for female, this indicates that, after accounting for the other variables, heart rate after exercise is expected to be \(7.239\) beats per minute higher for females than for males, although there is insufficient evidence that this relationship exists in the population.
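The interpretation of the unstandardised coefficients can be checked by writing the regression equation out directly. The sketch below uses the three \(B\) values reported above. The intercept (constant) is not given in the text, so the value used here is a hypothetical placeholder, and the predicted heart rates themselves should not be read as real results. Only the differences between predictions are meaningful in this sketch.

```python
# Unstandardised coefficients (B) reported in the example above
B_HR_BEFORE = 0.763
B_AGE = 0.217
B_GENDER = 7.239  # gender coded 1 = male, 2 = female

# The intercept is not reported in the text; this is a hypothetical
# placeholder used only so that the equation can be evaluated.
INTERCEPT = 0.0

def predict_hr_after(hr_before, age, gender):
    """Predicted heart rate after exercise from the regression equation."""
    return (INTERCEPT
            + B_HR_BEFORE * hr_before
            + B_AGE * age
            + B_GENDER * gender)

# Hypothetical subject: 70 bpm before exercise, aged 30, male
baseline = predict_hr_after(70, 30, 1)
one_more_bpm = predict_hr_after(71, 30, 1)   # heart rate before raised by 1
same_but_female = predict_hr_after(70, 30, 2)

print(round(one_more_bpm - baseline, 3))     # 0.763
print(round(same_but_female - baseline, 3))  # 7.239
```

Because the intercept cancels out when two predictions are subtracted, the differences equal the corresponding coefficients regardless of the placeholder value chosen.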

Overall, while the model is statistically significant, heart rate before exercise appears to be the only independent variable with clear evidence of a relationship with heart rate after exercise once the effects of age and gender are taken into account.

Logistic regression

Logistic regression is used to model the relationship between one or more independent variables (which can be continuous or categorical) and a categorical dependent variable. Most commonly, the dependent variable has two categories (for example yes/no or pass/fail), although logistic regression can also be extended to outcomes with more than two categories. Like linear regression, logistic regression allows the effect of multiple independent variables on an outcome to be examined while accounting for the other variables in the model. However, rather than modelling a linear relationship, it models the probability of an outcome occurring.
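To make the idea of modelling a probability more concrete, the sketch below shows the logistic function that converts a linear combination of the independent variables (the log odds) into a probability between \(0\) and \(1\). The intercept, coefficients and input values are hypothetical and are chosen only for illustration.

```python
import math

def logistic(log_odds):
    """Convert log odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predicted_probability(intercept, coefficients, values):
    """Probability of the outcome for one case in a logistic regression model."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    return logistic(log_odds)

# Log odds of zero correspond to even odds, i.e. a probability of one half
print(logistic(0))  # 0.5

# Hypothetical model with two independent variables (illustration only)
p = predicted_probability(intercept=-4.0, coefficients=[0.5, 0.03], values=[6, 80])
print(f"Predicted probability of the outcome: {p:.3f}")
```

However large or small the log odds become, the result always stays between \(0\) and \(1\), which is what makes this form suitable for a categorical outcome.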

For example, logistic regression can be used to examine how factors such as study time, attendance and gender influence whether a student passes or fails a unit. It does this by providing information about how well the variables explain pass status overall, as well as the effect of each variable individually while accounting for the other variables in the model.

Before conducting logistic regression, you need to check that the following assumptions are valid:

Assumption 1: The observations are independent, meaning that measurements for one subject have no bearing on any other subject’s measurements.

Assumption 2: There is a linear relationship between any continuous independent variables and the log odds of the dependent variable.

Assumption 3: There are no significant outliers that unduly influence the results.

In addition, it is important to have a sufficiently large sample size relative to the number of independent variables included in the regression model. A commonly used rule of thumb for logistic regression is to have at least \(10\) outcome events for each independent variable.
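The rule of thumb above can be checked with a short calculation. The sketch below assumes that "outcome events" refers to the less frequent of the two outcome categories, which is a common convention but is not stated explicitly on this page. The counts are hypothetical.

```python
def events_per_variable(n_outcome_a, n_outcome_b, n_predictors):
    """Events per independent variable, counting the less frequent outcome.

    Assumption: "events" means the smaller of the two outcome groups.
    """
    return min(n_outcome_a, n_outcome_b) / n_predictors

# Hypothetical counts (illustration only): 45 passes, 25 fails, 3 predictors
epv = events_per_variable(45, 25, 3)
print(f"Events per variable: {epv:.1f}")
print("Meets the rule of thumb" if epv >= 10 else "Below the rule of thumb")
```

In this hypothetical case the total sample of \(70\) looks adequate for three independent variables, but the \(25\) fails give fewer than \(10\) events per variable.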

Provided the assumptions for logistic regression are met, and the analysis is conducted using statistical software (e.g. SPSS as in this example), the results should include the following three output tables:

The first of these output tables is the Omnibus Tests of Model Coefficients table, which includes the overall \(p\) value for the model (given in the ‘Model’ row). This is used to determine whether the model as a whole is statistically significant. In this case, the \(p\) value is less than \(.001\), indicating that the model is statistically significant at the \(.05\) level.

The second output table is the Model Summary table, which includes pseudo \(R^2\) values (Cox & Snell and Nagelkerke \(R^2\)). While these both provide an indication of how well the model explains the variation in the dependent variable, the Nagelkerke \(R^2\) value is typically preferred, as it is scaled to range from \(0\) to \(1\) and therefore provides a more interpretable estimate. In this case, the Nagelkerke \(R^2\) value of \(.605\) indicates that the independent variables explain a substantial amount of the variation in pass status, although a proportion remains unexplained.

The final output table is the Variables in the Equation table, which shows the effect of each independent variable on the dependent variable while accounting for the other variables in the model. It includes the regression coefficients (\(B\)), their associated \(p\) values, and odds ratios (\(Exp(B)\)). While the regression coefficients are used to construct the model, the odds ratios are typically used for interpretation, as they indicate how the odds of the outcome change for a one-unit increase in the independent variable.

In this example, study time has an odds ratio of \(2.125\), but is not statistically significant (\(p=.661\)). This indicates that, after accounting for attendance and gender, each additional hour of study time is associated with approximately a doubling of the odds of passing the unit, although there is insufficient evidence that this relationship exists in the population.

Attendance has an odds ratio of \(1.066\), but is not statistically significant (\(p=.895\)). This indicates that, after accounting for the other variables, each additional percentage point in attendance is associated with a small increase in the odds of passing, although there is insufficient evidence that this relationship exists in the population.

Gender has an odds ratio of \(2.599\), but is also not statistically significant (\(p=.351\)). Given that gender is coded as \(1\) for male and \(2\) for female, this indicates that, after accounting for the other variables, females are predicted to have more than twice the odds of passing compared with males, although there is insufficient evidence that this difference exists in the population.
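The link between the regression coefficients and the odds ratios can be sketched as follows. Since \(Exp(B)\) is the exponential of \(B\), each coefficient can be recovered as the natural logarithm of its odds ratio. The odds ratios below are those reported in the example. The baseline probability of \(.5\) is a hypothetical value, used only to show how an odds ratio translates into a change in probability.

```python
import math

# Odds ratios, Exp(B), reported in the example above
ODDS_RATIOS = {"study time": 2.125, "attendance": 1.066, "gender": 2.599}

for name, odds_ratio in ODDS_RATIOS.items():
    b = math.log(odds_ratio)                  # B = ln(Exp(B))
    percent_change = (odds_ratio - 1) * 100   # change in odds per one-unit increase
    print(f"{name}: B = {b:.3f}, odds change = {percent_change:+.1f}%")

def apply_odds_ratio(probability, odds_ratio):
    """Probability after multiplying the odds of an outcome by an odds ratio."""
    odds = probability / (1 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# Hypothetical baseline: a student with a .5 probability of passing
# who studies one additional hour (odds ratio 2.125)
print(round(apply_odds_ratio(0.5, 2.125), 2))  # 0.68
```

Note that an odds ratio describes a multiplicative change in the odds, not in the probability, so the same odds ratio produces different changes in probability depending on the starting point.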

Overall, while the model is statistically significant, there is insufficient evidence that any of the individual variables are significantly associated with pass status once the other variables are taken into account.