Often when you are doing your analysis you will find that it is helpful to create new variables, or to make changes to existing variables. This page details some of the transformation facilities provided by SPSS which enable you to do this, all of which are found under the Transform menu.
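In brief, this page covers computing new variables, recoding categorical variables, recoding string variables, and grouping continuous variables into categories.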
Note that the examples covered here make use of the Household energy consumption data.sav file, which contains fictitious data for 80 people based on a short ‘Household energy consumption’ questionnaire. If you want to work through the examples provided you can download the data file using the following link:
If you would like to read the sample questionnaire to which the data relates, you can do so using this link:
Before commencing the analysis, note that SPSS dialogue boxes display variable labels by default, rather than variable names. You may find this helpful, but if you would prefer to view the variable names instead then from the menu choose:
Sometimes you may wish to create a new variable or variables to add to your data file, either from scratch or using the data from an existing variable or variables. For example, in the sample data file you may wish to create a new variable which gives the difference between summer and winter household energy consumption for each survey participant. You can do this by choosing the following from the SPSS menu (either from the Data Editor or Output window):
Next:
If you then navigate to the Data View of the Data Editor window, you will see that a new ‘Consumption_difference’ variable has been added to the end of the data file, with the difference for each of the 80 cases determined using the numeric expression entered. You can then analyse this variable as you would any of the original variables.
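The calculation behind the new variable is a case-by-case subtraction. As a minimal sketch in Python (not SPSS syntax), assuming hypothetical field names `summer` and `winter` and invented values, since the actual variable names and the direction of the subtraction are not given here:

```python
# Hypothetical cases: the field names and values are invented for illustration.
cases = [
    {"summer": 18.5, "winter": 24.0},
    {"summer": 20.0, "winter": 20.0},
    {"summer": 15.25, "winter": None},  # None stands in for a missing value
]


def consumption_difference(case):
    """Return winter minus summer, or None if either value is missing."""
    if case["summer"] is None or case["winter"] is None:
        return None
    return case["winter"] - case["summer"]


# One result per case, in the same order as the cases.
differences = [consumption_difference(c) for c in cases]
print(differences)
```

The third case has no result because one of its inputs is missing, which mirrors how a plain arithmetic expression behaves for incomplete cases.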
Note that you can also move the new variable if you wish, either in the Data View or in the Variable View, by dragging and dropping. For example, you could move the new variable to sit after the ‘q7’ variable by selecting the variable name in the Data View, then holding down the left mouse button and dragging it to the required position.
As another example of when you might want to compute a new variable, consider questions q9 through q12, which all relate to satisfaction with different aspects of the participants’ electricity provider. As these questions all use the same rating system (measured on a scale of 1 to 5, with 1 indicating ‘Very unsatisfied’ and 5 indicating ‘Very satisfied’), the four variables representing these questions can be combined to come up with an overall satisfaction score.
One way of doing this is by adding all the variables together to create a score out of 20. To do this, you could enter the new variable name Overall_satisfaction and the numeric expression q9+q10+q11+q12:
If you run the Frequencies procedure on this new variable (as described in the Descriptive statistics page of this module) you will see that there are only 78 satisfaction scores, whereas there are 80 cases in the data file. Looking at the actual data reveals why: the data in row 30 is missing for all four of the variables ‘q9’ to ‘q12’, and the data in row 31 is missing for variables ‘q10’ and ‘q12’. Since the numeric expression shown above only calculates new values for those cases that have complete data, the new variable has not been computed for rows 30 and 31.
Sometimes this will be what you want, but at other times you will need a value for the new variable even when some of the data is missing. To achieve this, use a different numeric expression within the Compute Variable procedure, one which makes use of the sum function. (Note that if the data is missing for all of the variables, the new variable will still be missing.)
For example, you could alter the numeric expression for the variable you have just created to sum(q9 to q12) (note that the word ‘to’ can be used between the variables in this case as they occur one after the other in the data file; if this isn’t the case, you would need to list the variable names separated by commas instead):
With this new numeric expression, there is now a value for the ‘Overall_satisfaction’ variable in row 31.
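The difference between the two expressions can be sketched in Python (not SPSS syntax). The function names and response values below are hypothetical; `None` stands in for a missing value:

```python
def add_all(values):
    """Mimic q9+q10+q11+q12: any missing input makes the result missing."""
    if any(v is None for v in values):
        return None
    return sum(values)


def sum_valid(values):
    """Mimic sum(q9 to q12): add the valid values, missing only if all are missing."""
    valid = [v for v in values if v is not None]
    return sum(valid) if valid else None


# Invented responses for illustration only.
complete = [4, 5, 3, 4]
partly_missing = [4, None, 3, None]      # same pattern as row 31 (q10 and q12 missing)
all_missing = [None, None, None, None]   # same pattern as row 30

for row in (complete, partly_missing, all_missing):
    print(add_all(row), sum_valid(row))
```

Keep in mind that a total built from only two answered questions is out of 10 rather than 20, so it is not directly comparable with totals from complete cases.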
You may also like to experiment with other formulas. For example, if you wanted to calculate an average overall satisfaction score instead, you could try either of two similar numeric expressions:
Regardless of which formula you choose to use, the new variable can then be analysed in the usual way.
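Two common ways to form an average are to divide the total by the number of questions, or to average only the answers that are present. The following Python sketch (not SPSS syntax) shows how these differ when data is missing; the function names and response values are hypothetical:

```python
def total_over_items(values):
    """Divide the total by the number of items; missing if any item is missing."""
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


def mean_valid(values):
    """Average only the valid values; missing only if every item is missing."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


# Invented responses for illustration only.
complete = [4, 5, 3, 4]
partly_missing = [4, None, 3, None]

print(total_over_items(complete), mean_valid(complete))
print(total_over_items(partly_missing), mean_valid(partly_missing))
```

Unlike a total, an average of the valid answers stays on the original 1 to 5 scale even when some answers are missing.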
Sometimes you may wish to recode an existing categorical variable, most likely to reduce the number of categories by combining some of them. For example, in the sample data file you may wish to recode the ‘q8’ variable to reduce the number of categories from five to three. You can do this by choosing the following from the SPSS menu (either from the Data Editor or Output window):
The second part of the process is to decide how the categories of the existing variable are going to map to categories of the new variable. Sometimes this can require quite a bit of thought and planning, but with so few categories in this example it is more straightforward. In particular, the existing categories lend themselves to being recoded into three new categories (‘Agree’, ‘Neutral’ and ‘Disagree’), as follows:
Existing category | New category |
---|---|
1 (Strongly disagree) | 1 (Disagree) |
2 (Disagree) | 1 (Disagree) |
3 (Neutral) | 2 (Neutral) |
4 (Agree) | 3 (Agree) |
5 (Strongly agree) | 3 (Agree) |
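The mapping in the table above can be sketched as a simple lookup. This is a Python illustration of the logic only (not SPSS syntax); the names `RECODE_MAP`, `NEW_LABELS` and `recode` are hypothetical, and the sample responses are invented:

```python
# Old category code -> new category code, taken from the table above.
RECODE_MAP = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

# Labels for the new codes, as given in the table.
NEW_LABELS = {1: "Disagree", 2: "Neutral", 3: "Agree"}


def recode(value):
    """Return the new category code, or None for missing or unmapped values."""
    return RECODE_MAP.get(value)


# Invented responses; None stands in for a missing value.
responses = [5, 2, 3, None, 1, 4]
recoded = [recode(v) for v in responses]
print(recoded)
print([NEW_LABELS.get(code) for code in recoded])
```

Writing the mapping out explicitly like this is a useful check that every existing category has been assigned to exactly one new category.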
To specify this in SPSS, do the following in the Recode into Different Variables: Old and New Values dialogue box:
Next:
If you then navigate to the Data View of the Data Editor window, you will see that a new ‘q8_recoded’ variable has been added to the end of the data file (note that you can move it if you wish, either in the Data View or in the Variable View, by dragging and dropping). The category values do not yet have any labels (e.g. ‘Disagree’, ‘Neutral’ and ‘Agree’), and you may need to change the variable's Measure (from Nominal to Ordinal), but you can do both of these things as described in the Getting started page of this module.
Once you have finished setting up the variable, you can analyse it in the usual way. For example, you could run the Frequencies procedure (as described in the Descriptive statistics page of this module) on the new variable, which should result in the following table:
Although SPSS does allow alphabetic/string information to be entered as part of the data file, the more in-depth statistical analysis procedures require numeric data only (even if those numbers are simply codes or values representing categories).
At the questionnaire design stage, however, it may be very difficult to anticipate the responses that will be given, so a tick-box type question can be too complicated or too restrictive. Allowing open-ended responses may therefore be preferable, and the choice then is either to code the data numerically before keying it in, or to recode the responses once they have been entered into SPSS. This section details how to do the latter using the Automatic Recode and Recode into Different/Same Variables procedures, and uses the ‘q13’ variable in the sample data file as an example. This variable stores participant responses to the question:
What kind of hot water system do you use at your property?
The variable is defined as String under Variable View, and is a nominal variable. A frequency table of the responses is as follows:
This output shows only five different types of hot water system, but because of differences in spelling, terminology and the use of upper and lower case characters, twelve different responses are listed. To reduce these twelve responses to the five genuine categories, the different responses need to be combined (i.e. recoded).
To complete the first part of this two-step process, from the menus choose:
Now in the dialogue box that opens:
The resultant output should be as follows:
Note that the original responses have been sorted into alphabetical order and assigned a value from 1 to 12. The original data has been used to create the Value Labels for those values and all this has been put into a new variable at the end of the data file called ‘q13_autorecode’.
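The idea behind this step can be sketched in Python (not SPSS syntax): collect the distinct responses, sort them, and number them from 1. The sample responses are invented, and the tie-breaking rule in `sort_key` (lowercase before uppercase) is an assumption chosen for illustration; the ordering SPSS applies may differ:

```python
# Invented string responses for illustration only.
responses = ["solar", "Gas storage", "Solar", "Heat pump", "gas storage", "solar"]


def sort_key(text):
    # Assumption: sort alphabetically ignoring case, and place the
    # lowercase spelling first when two responses differ only in case.
    return (text.lower(), text.swapcase())


# Distinct responses in sorted order.
distinct = sorted(set(responses), key=sort_key)

# Numeric code -> original text (the equivalent of value labels).
value_labels = {code: text for code, text in enumerate(distinct, start=1)}

# Original text -> numeric code, used to build the new numeric variable.
codes = {text: code for code, text in value_labels.items()}
autorecoded = [codes[r] for r in responses]

print(value_labels)
print(autorecoded)
```

Note that responses differing only in case still receive different codes, which is why a second recoding step is needed to merge them.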
The second step of the process is then to reduce these 12 categories to the 5 required ones, using the standard Recode into Different Variables command described previously (or you could use the Recode into Same Variables command in this instance if preferred). In this case, the existing and new categories could be as follows:
Existing category | New category |
---|---|
1 (electric) | 1 (Electric) |
2 (Electric) | 1 (Electric) |
3 (gas instant) | 2 (Instantaneous gas) |
4 (Gas instant) | 2 (Instantaneous gas) |
5 (Gas instantaneous) | 2 (Instantaneous gas) |
6 (gas storage) | 3 (Gas storage) |
7 (Gas storage) | 3 (Gas storage) |
8 (Heat pump) | 4 (Heat pump) |
9 (Hot water heat pump) | 4 (Heat pump) |
10 (solar) | 5 (Solar) |
11 (Solar) | 5 (Solar) |
12 (Solar hot water) | 5 (Solar) |
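The table above amounts to a many-to-one lookup from the twelve automatic codes to the five final categories. A minimal Python sketch of that lookup (not SPSS syntax), with hypothetical names `GROUPS`, `COLLAPSE` and `FINAL_LABELS`:

```python
# New category code -> the automatic codes it absorbs, taken from the table above.
GROUPS = {
    1: [1, 2],        # Electric
    2: [3, 4, 5],     # Instantaneous gas
    3: [6, 7],        # Gas storage
    4: [8, 9],        # Heat pump
    5: [10, 11, 12],  # Solar
}

FINAL_LABELS = {
    1: "Electric",
    2: "Instantaneous gas",
    3: "Gas storage",
    4: "Heat pump",
    5: "Solar",
}

# Invert the groups into an old code -> new code lookup.
COLLAPSE = {old: new for new, olds in GROUPS.items() for old in olds}

# Invented automatic codes for a few cases.
autorecoded = [12, 1, 5, 9, 7]
final = [COLLAPSE[code] for code in autorecoded]
print([FINAL_LABELS[code] for code in final])
```

Grouping the old codes under each new code makes it easy to confirm that all twelve codes are covered and that none is assigned twice.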
Sometimes it is helpful to transform a continuous variable into a categorical variable, as this provides additional analysis options. For example, in the sample data file you may wish to transform the continuous ‘q1’ variable into categories, perhaps in order to make some comparisons for different age groups.
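The grouping idea can be sketched in Python (not SPSS syntax). The cutpoints of 30 and 50 are hypothetical, chosen only for illustration, and each cutpoint is treated as the upper limit of its group (so a value equal to a cutpoint falls in the lower group):

```python
from bisect import bisect_left

# Hypothetical cutpoints: up to 30, 31 to 50, and over 50.
CUTPOINTS = [30, 50]


def age_group(age):
    """Return a group code from 1 to 3, or None if the age is missing."""
    if age is None:
        return None
    # bisect_left counts the cutpoints strictly below the value, so a value
    # equal to a cutpoint stays in the lower group.
    return bisect_left(CUTPOINTS, age) + 1


# Invented ages; None stands in for a missing value.
ages = [18, 30, 31, 50, 51, None]
print([age_group(a) for a in ages])
```

Whatever cutpoints are chosen, it is worth deciding explicitly which group a boundary value belongs to, since this affects the resulting frequencies.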
While you can in fact do this using either of the procedures outlined above, the purpose-built procedure for this in SPSS is Visual Binning. You can make use of this by choosing the following from the SPSS menu (either from the Data Editor or Output window):
Next:
Next:
If you then navigate to the Data View of the Data Editor window, you will see that a new ‘q1_grouped’ variable has been added to the end of the data file (note that you can move it if you wish, either in the Data View or in the Variable View, by dragging and dropping). You can analyse it in the usual way. For example, you could run the Frequencies procedure (as described in the Descriptive statistics page of this module) on the new variable, which should result in the following table: