This part of the module covers a few extra tips and tricks you may find helpful when analysing data in SPSS. In particular, it covers the following:

  • Using subsets of the data
  • More data transformations

The examples covered make use of the Household energy consumption data.sav file, which contains fictitious data for 80 people based on a short ‘Household energy consumption’ questionnaire. If you want to work through the examples provided you can download the data file using the following link:

If you would like to read the sample questionnaire to which the data relates, you can do so using this link:

Before commencing the analysis, note that the default is for dialog boxes in SPSS to display any variable labels, rather than variable names. You may find this helpful, but if you would prefer to view the variable names instead then from the menu choose:

  • Edit
  • Options…
  • Change the Variable Lists option to Display names

Using subsets of the data

While the default in SPSS is for all of the cases in the data file to be processed every time, this doesn’t mean you need to have separate data files for each little subset of cases in order to process them separately. Instead, you can use a filter to select and process particular subsets of your data file as required.

As an example, suppose that for reporting purposes you need to analyse just the female responses, temporarily ignoring the other data. To select this subset of data, choose the following from the SPSS menu (either from the Data Editor or Output window):

  • Data
  • Select cases…

Then to select cases according to certain criteria (e.g. if they are female):

  • Select If condition is satisfied
  • Click If…

The expression that defines the required condition in this case is that the ‘q2’ variable (the gender variable) is equal to the value 2 (the code representing female). To define this:

  • move the ‘q2’ variable into the box and add in the = 2

Next:

  • Click on Continue
  • Click on OK

In the Data View of the Data Editor window, the cases that do not satisfy the selection criteria (i.e. those of other genders) have now been temporarily filtered out. Depending on the version of the software, these cases will either be hidden or be shown with a line through their row numbers. Any analyses run from this point will report only on the selected cases, i.e. the females.

For example, to find out how many females are in each of the different categories for the ‘q8’ variable (which relates to consumption reduction), run the Frequencies procedure (as described in the Descriptive statistics page of this module) on the available data. Note that the number of cases reported in the output should be 69, the number of females, and not the 80 that constitutes the full data file.
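
For readers who like to see the logic spelled out, the filter-then-count behaviour described above can be sketched outside SPSS in a few lines of Python. This is only an illustration under stated assumptions: the six cases are made up (they are not the sample data file), the code 1 for a non-female gender and the ‘q8’ category codes are hypothetical, and `select_cases` is a helper name invented for this sketch.

```python
from collections import Counter

# Hypothetical mini data set (NOT the sample data file).
# q2 is gender (2 = female, as in the text; 1 is an assumed code for another gender).
# q8 holds assumed category codes.
cases = [
    {"q2": 2, "q8": 1},
    {"q2": 1, "q8": 2},
    {"q2": 2, "q8": 2},
    {"q2": 2, "q8": 1},
    {"q2": 1, "q8": 1},
    {"q2": 2, "q8": 3},
]


def select_cases(rows, condition):
    """Return only the rows that satisfy the condition; the original list is untouched."""
    return [row for row in rows if condition(row)]


# Counterpart of the 'q2 = 2' condition entered in the Select Cases dialog.
females = select_cases(cases, lambda row: row["q2"] == 2)

# Counterpart of running Frequencies on q8 for the selected cases only.
q8_counts = Counter(row["q8"] for row in females)

print(len(females))     # number of selected cases
print(dict(q8_counts))  # frequency of each q8 category among the selected cases
print(len(cases))       # the full data set is still intact
```

As with the SPSS filter, the unselected cases are not deleted; they are simply left out of the count.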

When all of the analysis of the female-only data has been completed, another subset can be isolated by going through the Select Cases process again if required. Alternatively, to return to the whole data file, remember to turn the selection/filter off. To do this, choose the following from the SPSS menu (either from the Data Editor or Output window):

  • Data
  • Select cases…
  • select All cases
  • click on OK

All 80 cases are available for processing again once the selection has been turned off.

More data transformations

While the most common types of data transformations are explained in the Transformations page of this module, this section looks at two additional, specific examples.

Converting a string variable to a numeric variable

Although SPSS does allow alphabetic/string information to be entered as part of the data file, the more in-depth statistical analysis procedures require numeric data only (even if those numbers are simply codes or values representing categories).

At the questionnaire design stage, however, it can be very difficult to anticipate the responses that will be given, so a tick-box type question may be too complicated or restrictive. Allowing open-ended responses may be preferable instead, and the choice is then either to code the data numerically before keying it in or to recode the responses once they have been entered into SPSS. This section details how to do the latter using the Automatic Recode and Recode into Different/Same Variables procedures, with the ‘q13’ variable in the sample data file as an example. This variable stores participant responses to the question:

What kind of hot water system do you use at your property?

The variable is defined as String under Variable View, and is a nominal variable. A Frequency table of the responses is as follows:

This output shows only five different types of hot water system, but because of differences in spelling, terminology and the use of upper and lower case characters, twelve different responses are listed. To reduce these twelve to the actual five, the different categories need to be combined (i.e. recoded).

To complete the first part of this two-step process, from the menus choose:

  • Transform
  • Automatic Recode…

Now in the dialog box that opens:

  • move the variable (‘q13’) into the Variables box
  • enter a name for the new variable in the New Name box (for example ‘q13_autorecode’)
  • click on Add New Name
  • select Treat blank string values as user-missing (so that no category is created for these)
  • select OK

The resultant output should be as follows:

Note that the original responses have been sorted into alphabetical order and assigned a value from 1 to 12. The original data has been used to create the Value Labels for those values and all this has been put into a new variable at the end of the data file called ‘q13_autorecode’.
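
The idea behind this step (sort the distinct responses, number them from 1 upwards, keep the original text as the value labels, and treat blanks as missing) can be sketched in Python as follows. This is an illustration of the logic only, not SPSS's implementation. The responses are made up, `automatic_recode` is a name invented for the sketch, `None` stands in for a user-missing value, and the ordering rule for responses that differ only by case is an assumption.

```python
def automatic_recode(values):
    """Assign codes 1..n to the distinct non-blank strings, in sorted order.

    Returns (codes, value_labels). Blank strings are coded as None,
    which stands in here for a user-missing value.
    """
    distinct = {v for v in values if v.strip() != ""}
    # Assumed ordering rule: alphabetical ignoring case, with the lower-case
    # spelling first when two responses differ only by case.
    ordered = sorted(distinct, key=lambda s: (s.lower(), s.swapcase()))
    code_for = {text: code for code, text in enumerate(ordered, start=1)}
    value_labels = {code: text for text, code in code_for.items()}
    codes = [code_for.get(v) for v in values]
    return codes, value_labels


# Hypothetical responses (NOT the sample data file).
q13 = ["Solar", "gas storage", "", "solar", "Gas storage", "Heat pump", "solar"]
q13_autorecode, labels = automatic_recode(q13)

print(q13_autorecode)  # one numeric code per case, None for the blank response
print(labels)          # code -> original response text
```

Note that ‘solar’ and ‘Solar’ still receive different codes at this point, which is why a second recoding step is needed.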

The second step of the process is then to reduce these 12 categories to the 5 required ones, using the standard Recode into Different Variables command (or you could use the Recode into Same Variables command in this instance if preferred). Instructions on how to do this are provided in the Transformations page of this module. In this case, the existing and new categories could be as follows:

Existing category            New category
1 (electric)                 1 (Electric)
2 (electric)                 1 (Electric)
3 (gas instant)              2 (Instantaneous gas)
4 (Gas instant)              2 (Instantaneous gas)
5 (Gas instantaneous)        2 (Instantaneous gas)
6 (gas storage)              3 (Gas storage)
7 (Gas storage)              3 (Gas storage)
8 (Heat pump)                4 (Heat pump)
9 (Hot water heat pump)      4 (Heat pump)
10 (solar)                   5 (Solar)
11 (Solar)                   5 (Solar)
12 (Solar hot water)         5 (Solar)
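
The recoding in the table above is a simple many-to-one lookup, which can be sketched in Python as follows. The mapping and the five new labels come from the table; the case values, the `recode` helper and the variable name `q13_recoded` are hypothetical, and `None` again stands in for a missing value.

```python
# Mapping from the 12 automatic-recode values to the 5 combined categories,
# taken from the table above.
recode_map = {
    1: 1, 2: 1,
    3: 2, 4: 2, 5: 2,
    6: 3, 7: 3,
    8: 4, 9: 4,
    10: 5, 11: 5, 12: 5,
}

# Value labels for the new categories, also from the table.
new_labels = {
    1: "Electric",
    2: "Instantaneous gas",
    3: "Gas storage",
    4: "Heat pump",
    5: "Solar",
}


def recode(values, mapping):
    """Look up each value in the mapping; missing values (None) stay missing."""
    return [mapping.get(v) for v in values]


# Hypothetical q13_autorecode values for a handful of cases.
q13_autorecode = [12, 3, None, 7, 1, 9]
q13_recoded = recode(q13_autorecode, recode_map)

print(q13_recoded)
print([new_labels.get(v) for v in q13_recoded])
```

Writing the result to a new list mirrors Recode into Different Variables, since the original values remain available for checking.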

Computing a new variable by adding or averaging existing variables

Part of the aim of the energy consumption questionnaire is to determine how satisfied the participants are with their electricity provider. Rather than asking this as a single question though, the information is collected through four questions relating to different aspects of the service. As these questions all use the same rating system (measured on a scale of 1 to 5, with 1 indicating ‘Very unsatisfied’ and 5 indicating ‘Very satisfied’), the four variables representing these questions (‘q9’ through to ‘q12’) can be combined to come up with an overall satisfaction score.

One way of doing this is by adding all the variables together to create a score out of 20, which can be done using the Compute Variable procedure as described in the Transformations page of this module. This time, the new variable name and numeric expression could be as follows:

  • Target Variable: Overall_satisfaction
  • Numeric Expression: q9 + q10 + q11 + q12

After clicking OK, the new variable should appear at the end of the data file.

If you run the Frequencies procedure on this new variable (as described in the Descriptive statistics page of this module) you will see that there are only 78 satisfaction scores, whereas there are 80 cases in the data file. Looking at the actual data reveals why: the data in row 30 is missing for all four of the variables ‘q9’ to ‘q12’, and the data in row 31 is missing for variables ‘q10’ and ‘q12’. Since a numeric expression that simply adds the variables together only calculates new values for cases with complete data, the new variable has not been computed for rows 30 and 31.

Sometimes this will be what you want, but at other times you will require a value for the new variable even if some of the data is missing (note that if the data is missing for all of the variables, the new variable will still be missing). This simply requires a different numeric expression within the Compute Variable procedure, namely one that uses the sum function. You could alter the numeric expression for the variable you have just created to sum(q9 to q12). The word ‘to’ can be used between the variables in this case because they occur one after the other in the data file; if this isn’t the case, you would need to list the variable names separated by commas instead, i.e. sum(q9, q10, q11, q12).

With this new numeric expression, there is now a value for the ‘Overall_satisfaction’ variable in row 31.
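
The difference between adding the variables with + and using the sum function comes down to how missing values are handled. The following Python sketch mimics the two behaviours described above. The ratings are made up, the function names are invented for the illustration, and `None` stands in for a missing value.

```python
def add_with_plus(values):
    """Mimic q9 + q10 + q11 + q12: the result is missing if ANY value is missing."""
    if any(v is None for v in values):
        return None
    return sum(values)


def sum_function(values):
    """Mimic sum(q9 to q12): add the values present; missing only if ALL are missing."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# Hypothetical ratings for q9 to q12 (each on the 1 to 5 scale).
complete = [4, 5, 3, 4]
partly_missing = [4, None, 3, None]      # like row 31: q10 and q12 missing
all_missing = [None, None, None, None]   # like row 30: all four missing

print(add_with_plus(complete), sum_function(complete))              # 16 16
print(add_with_plus(partly_missing), sum_function(partly_missing))  # None 7
print(add_with_plus(all_missing), sum_function(all_missing))        # None None
```

Bear in mind that a total built from only two of the four questions is no longer a score out of 20, so such cases may need to be interpreted with care.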

You may also like to experiment with other formulas. For example, if you wanted to calculate an average overall satisfaction score instead, you could try either of two similar numeric expressions:

  • The numeric expression (q9 + q10 + q11 + q12)/4 will again have missing data for both rows 30 and 31.
  • The numeric expression mean(q9 to q12) (or mean(q9, q10, q11, q12) if the variables are not in order) will have missing data only for row 30. For row 31, the average will be calculated by dividing by 2 instead of 4, since there are only two variables with data.
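
The two averaging expressions in the list above differ in their divisor as well as in their handling of missing values. A minimal Python sketch of the two behaviours, using made-up ratings, invented function names and `None` for a missing value:

```python
def mean_by_division(values):
    """Mimic (q9 + q10 + q11 + q12)/4: missing if any value is missing."""
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


def mean_function(values):
    """Mimic mean(q9 to q12): average over the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to average when every value is missing
    return sum(present) / len(present)


# Hypothetical ratings for q9 to q12.
complete = [4, 5, 3, 4]
partly_missing = [4, None, 3, None]  # like row 31: only two values present

print(mean_by_division(complete), mean_function(complete))              # 4.0 4.0
print(mean_by_division(partly_missing), mean_function(partly_missing))  # None 3.5
```

For the partly missing case the mean function divides 7 by 2 rather than by 4, matching the behaviour described for row 31.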

Regardless of which formula you choose to use, the new variable can then be analysed in the usual way.