Data & variable types - Introduction to statistics - UniSkills

This page details some important concepts that will be referred to in subsequent pages, including what data and variables are, and how to distinguish between different types.

In brief, it covers the following:

A definition of data
A definition of a variable
How to distinguish between categorical and continuous data
How to distinguish between independent and dependent variables

What is data?

The word data refers to observations and measurements which have been collected in some way, often through research.

Data that is recorded as numbers (and therefore measures quantities) is quantitative data, while data that is recorded as text (and therefore records qualities) is qualitative data. Quantitative data can be analysed using statistics. So can qualitative data that records qualities as categories (for example a person’s hair colour, country of birth or marital status), as opposed to qualitative data that records thoughts, feelings and opinions.

It is the first two of these types (quantitative data, and qualitative data recorded as categories) that you will be working with in this module, and shortly we will introduce some other terms that are typically used in statistics to describe data of this nature.

What is a variable?

Variables are the characteristics or attributes that you are observing, measuring and recording data for. Some examples include height, weight, eye colour, dog breed, climate, electrical conductivity, customer service satisfaction and class attendance, just to name a few.

As the word suggests, the value of a variable varies from one subject (i.e. person, place or thing) to another. For example, the variable ‘height’ could have a value of \(170\textrm{cm}\) for one person, \(163\textrm{cm}\) for another and \(154\textrm{cm}\) for a third. Similarly, the variable ‘climate’ could have a value of arid for one city, tropical for another and Mediterranean for a third, while the variable ‘class attendance’ could have a value of \(17\) for one class, \(25\) for another and \(32\) for a third.

Categorical and continuous data

Choosing the correct statistic or statistical test to analyse your data depends on the type of data, and hence type of variable(s), so it is very important to be able to distinguish between these. Most of the time you will simply need to classify your data (and hence variables) as either categorical or continuous, but each of these types can also be sub-classified. Definitions and sub-classifications for each are as follows:

Categorical data is data which is grouped into categories, such as data for a ‘gender’ or ‘smoking status’ variable. Categorical data can be further classified as:

Nominal when the categories do not have an order, such as for a ‘marital status’ variable. Furthermore, if there are only two categories then the terms binary and/or dichotomous are sometimes used.
Ordinal when the categories do have an order, such as for a ‘satisfaction level’ variable.
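
To make the nominal/ordinal distinction concrete, here is a minimal Python sketch. The responses and the category labels (including the low/medium/high ordering) are made up for illustration. It shows that nominal data can be counted but not ranked, whereas ordinal data can also be sorted, so a median category makes sense.

```python
from collections import Counter

# Hypothetical responses; the values and labels are assumptions for illustration.
marital_status = ["single", "married", "married", "divorced", "single", "married"]
satisfaction = ["low", "high", "medium", "medium", "high", "low", "medium"]

# Nominal data: the categories have no order, so counting them
# (and reporting the most common one) is the natural summary.
status_counts = Counter(marital_status)
most_common_status = status_counts.most_common(1)[0][0]

# Ordinal data: the categories have an order, which we state explicitly.
SATISFACTION_ORDER = ["low", "medium", "high"]

# Convert each response to its position in the order, sort, and take the
# middle one (the list has an odd length, so the middle is unambiguous).
ranks = sorted(SATISFACTION_ORDER.index(s) for s in satisfaction)
median_satisfaction = SATISFACTION_ORDER[ranks[len(ranks) // 2]]

print(most_common_status)    # married
print(median_satisfaction)   # medium
```

Sorting the marital status values would only give alphabetical order, which has no statistical meaning. That is the practical difference between the two types.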

Continuous data is data which is measured on a continuous numerical scale and which can take on a large number of possible values, such as data for a ‘weight’ or ‘distance’ variable. Continuous data can be further classified as:

Interval when it does not have an absolute zero, and negative numbers also have meaning, such as for a ‘temperature in degrees Celsius’ variable.
Ratio when it does have an absolute zero, and negative numbers don’t have meaning, such as for a ‘height’ variable.
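
One practical consequence of the interval/ratio distinction is that ratios such as "twice as much" are only meaningful for ratio data. The sketch below uses made-up heights and temperatures to illustrate this. The conversion to kelvin (a temperature scale that does have an absolute zero) is added here purely for comparison.

```python
# Ratio data: a height of zero means "no height", so ratios are meaningful.
height_a_cm, height_b_cm = 180.0, 90.0
height_ratio = height_a_cm / height_b_cm  # a is twice as tall as b

# Interval data: 0 degrees Celsius is an arbitrary reference point,
# so dividing two Celsius values does not give a meaningful ratio.
temp_a_c, temp_b_c = 20.0, 10.0
naive_ratio = temp_a_c / temp_b_c  # 2.0, but 20 C is not "twice as hot" as 10 C


def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin, a scale with an absolute zero."""
    return celsius + 273.15


# On the kelvin scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(temp_a_c) / celsius_to_kelvin(temp_b_c)

# Differences, on the other hand, are meaningful on an interval scale:
# the gap between the two temperatures is 10 degrees on either scale.
difference_c = temp_a_c - temp_b_c
difference_k = celsius_to_kelvin(temp_a_c) - celsius_to_kelvin(temp_b_c)

print(height_ratio, naive_ratio, round(kelvin_ratio, 3))
```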

One other type of data that you might hear mentioned is discrete data, which can be defined as follows:

Discrete data measures counts or numbers of events, such as data for a ‘class attendance’ variable. So while it is numerical data it is not measured on a continuous numerical scale, and hence doesn’t fit neatly into either of the classifications above. Instead you can think of it as a special kind of data that can be treated as either categorical or continuous, depending on how many values are possible.

Discrete data is usually treated as continuous data, but if there are only a small number of possible values (such as for a ‘number of units studied in semester one’ variable) you might choose to treat the values as categories instead.
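
As a rough illustration of this decision, the sketch below counts how many distinct values a discrete variable takes and suggests a treatment. The function name `suggest_treatment` and the cut-off of five values are hypothetical choices made for this example, not a rule from this module. Apart from the attendance values 17, 25 and 32 used earlier, the data is made up.

```python
def suggest_treatment(values, max_categories=5):
    """Suggest how to treat discrete (count) data.

    max_categories is a hypothetical cut-off chosen for illustration:
    at or below it the values are treated as categories, above it as
    continuous data. In practice this is a judgement call.
    """
    distinct_values = len(set(values))
    if distinct_values <= max_categories:
        return "categorical"
    return "continuous"


# Number of units studied in semester one: only a few possible values.
units_studied = [4, 3, 4, 2, 4, 1, 3, 4]

# Class attendance: many possible values.
class_attendance = [17, 25, 32, 28, 19, 30, 22, 26]

print(suggest_treatment(units_studied))     # categorical
print(suggest_treatment(class_attendance))  # continuous
```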

One final thing to note on this topic is that any continuous data can be turned into categorical data by creating categories out of it, which can be useful if you want to analyse your continuous data using statistics and statistical tests designed for categorical data. For example, continuous data for a ‘weight’ variable could be turned into categorical data by creating categories of \(50\text{–}59\textrm{kg}\), \(60\text{–}69\textrm{kg}\), \(70\text{–}79\textrm{kg}\), etc. You can’t go the other way and turn categorical data into continuous data, though, because the original measurements cannot be recovered from the categories. If you have the choice, it is therefore generally preferable to collect continuous data for maximum flexibility.
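
The weight example can be sketched in Python as follows. The function name `weight_category` and the sample weights are made up for illustration. The bands are assumed to be 10 kg wide, with each band including its lower bound, so the band labelled 50-59kg covers weights from 50 kg up to (but not including) 60 kg.

```python
def weight_category(weight_kg):
    """Return the 10 kg band (e.g. '60-69kg') that a weight falls into."""
    lower = int(weight_kg // 10) * 10  # round down to the nearest 10
    return f"{lower}-{lower + 9}kg"


# Hypothetical continuous measurements in kilograms.
weights_kg = [52.3, 68.0, 71.5, 64.9, 79.9]

# Continuous -> categorical: each weight is replaced by its band.
categories = [weight_category(w) for w in weights_kg]
print(categories)
# ['50-59kg', '60-69kg', '70-79kg', '60-69kg', '70-79kg']
```

Note that 68.0 and 64.9 end up in the same category. Given only the label 60-69kg there is no way to tell which original weight it came from, which is why the conversion cannot be reversed.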

Independent and dependent variables

When using bivariate analysis to test for relationships between pairs of variables, as detailed later in this module, you will usually have one independent and one dependent variable. Definitions of these are as follows:

The independent variable (otherwise known as the predictor variable) is the one that potentially influences, affects or predicts the other variable. For example, if you are investigating whether age influences income, then age is the independent variable.

The dependent variable (otherwise known as the outcome variable) is the one that is potentially influenced, affected or predicted. For example, if you are investigating whether income can be predicted by age, then income is the dependent variable.

It is important to be able to distinguish between these two types, as this determines where you put each variable in tables and graphs. Keep in mind, though, that which variable is which depends on the context: some variables (for example age) will always be independent, while others (for example smoking status) might be independent or dependent depending on what you are trying to test.
