Getting started - Introduction to Stata - UniSkills

When using Stata to help with your statistical analysis, it is highly likely that you will already have your data in some other electronic format, for example in Excel or in an online survey tool such as Qualtrics. If this is the case, you do not have to enter your data manually into Stata; you only need to import it and ensure that your variables are set up correctly. If you do not already have it in an electronic format, you will need to enter the data as well as set up the variables. Either way, it is recommended that you write your commands in a do-file.

In brief, this page covers the following:

The key components of the Stata user interface
How to use Stata syntax to interact with the software, either by writing commands in the Command window or in a do-file
How to check and set your working directory
How to open an existing dataset in Stata, including by importing from other software
How to enter and save data in Stata
How to set up variables in Stata
How to save your dataset and results in Stata

The Stata user interface

When you open Stata you should see a user interface with four labelled windows: the History, Command, Variables and Properties windows. In addition, the large window in the centre of the interface where the results are displayed is the Results window.

In addition to the five windows displayed, you can also open the Data Editor and Do-file Editor windows:

The Data Editor window is used either to edit or browse the data in memory. To open it using the menu, you can select the Data tab, select Data Editor and then choose Data Editor (Edit) or Data Editor (Browse). Alternatively, you can type either the edit or browse commands in the Command window.
The Do-file Editor is used to create a do-file. You can open it using the menu by selecting the Window tab, selecting Do-file Editor and New Do-file Editor, or by typing the doedit command in the Command window.

Using Stata syntax

Although you can interact with Stata using the menus and dialog boxes, this module focuses on writing and using syntax - that is, entering and running commands. When doing this, some important things to keep in mind about the commands are as follows:

They are case sensitive, and need to be written all in lower case
They are written using American English spelling
You can use * or // to add single-line comments and /* */ to add multi-line comments as required
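
As a brief illustration of the comment styles listed above, the following sketch could be saved in a do-file and run. The display commands are placeholders only. Note that, to the best of our knowledge, the // and /* */ styles are intended for do-files rather than for typing interactively in the Command window, whereas a line beginning with * works in both.

```stata
* This whole line is a comment
display "Hello from Stata"   // comment placed after a command

/* This comment
   spans several lines */
display 2 + 2
```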

You can type your commands directly into the Command window, but a do-file enables you to keep a permanent record of everything you have done. In addition, it allows you to run multiple commands at once, to easily reproduce your results, and to share your workflow with others. To create one, open the Do-file Editor as detailed above, select File and Save as…, give the file an appropriate name and save it in a suitable location (for example, in your working directory). You can then write any commands in the do-file in the same way as you would in the Command window.

When you are ready to run the commands in your do-file, click the ‘Execute’ icon at the top of the screen (the image of a document with a play icon). Alternatively, you can run it using the do command in the Command window as follows, replacing the name of the do-file with the appropriate name. The do-file needs to be in your working directory; if it is not, either change the working directory or specify the full file path:

do "do-fileName.do"

Note that if you only want to run a certain command or commands in the do-file instead, rather than the whole thing, you can do so by highlighting the relevant line or lines and clicking the ‘Execute’ icon.

Managing your working directory

Your working directory is the default location for any files you read into or save in Stata. You can read and save files in locations other than your working directory, but you will need to specify the file path each time in this case.

You can check what your current working directory is by typing the pwd command in the Command window. If you find you need to change it you can use the cd command as follows, replacing the file path with the correct file path (note this does not need to be in the C drive):

cd "C:\Folder Name\Folder Name"

Note that if you are working with a do-file, as detailed above, it is recommended that you save it in your working directory. This way you should only need to set your working directory the first time you use the do-file, as after that Stata will automatically set the working directory to the location of the do-file when you open it directly (that is, by double clicking on the file to open it rather than opening Stata first). If you don’t open the do-file directly though, you will likely need to set your working directory each time (in which case, you may like to include the command to do this at the top of your do-file).

If you would like to create a new folder within your working directory you can do this using the mkdir command. For example, you could create a new folder within your working directory called Results, to save your results, as follows:

mkdir Results
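
A minimal sketch of how these commands might be combined is shown below. The capture prefix is an addition not covered above; here it simply suppresses the error that Stata would otherwise report if the Results folder already exists, so the lines can be run more than once.

```stata
* Check the current working directory
pwd

* Create the Results folder; capture suppresses the error if it already exists
capture mkdir Results

* List the contents of the working directory to confirm the folder is there
dir
```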

Opening a dataset

To open an existing dataset in Stata, you can utilise the use command. Do so as follows, replacing the name of the dataset with the correct name (and ensuring it is in your working directory - or changing the working directory or specifying the file path if not):

use "datasetName.dta", clear

Note that the clear option in this command tells Stata to remove any data already in memory before loading the new dataset. By default Stata holds only one dataset in memory at a time and will refuse to load a new one if the current data has unsaved changes, so this helps avoid errors. You can also use clear on its own at any time to empty the memory.

If you wish to import data from another source (such as Excel, SPSS or SAS) you will need to use one of the commands below instead (again, after ensuring it is in your working directory or changing the working directory or specifying the file path if not).

Importing from Excel
You can import an Excel dataset using the following command, replacing the name of the Excel workbook with the correct name:

import excel "workbookName.xlsx"

In addition, you can add one or more of the below options after a comma to modify the behaviour of this command.

Option 1. Add the following to specify the name of a particular sheet to be imported, replacing the name of the sheet with the correct name (noting that this is only required when there is more than one sheet in the workbook, and the one you want to import is not the first one):

sheet("sheetName")

Option 2. Add the following to specify a particular range of cells to import, replacing the range of cells with the correct range (noting that this is only required when you do not want to import all of the data):

cellrange(A2:F20)

Option 3. Add the following to use the header row of the Excel spreadsheet as the variable names and labels in Stata:

firstrow
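
To show how these options might be combined in a single command, here is a hedged sketch. So that it can be run without any existing files, it first builds a throwaway workbook from the auto dataset that ships with Stata, using export excel (a command not covered on this page). The workbook name, sheet name and cell range are placeholders, and the clear option is added so that the import replaces any data already in memory.

```stata
* Build a small example workbook (illustration only)
sysuse auto, clear
export excel make price mpg using "workbookName.xlsx", sheet("sheetName") firstrow(variables) replace

* Import the header row plus the first 10 data rows from the named sheet
import excel "workbookName.xlsx", sheet("sheetName") cellrange(A1:C11) firstrow clear

* Check the variable names and types that were imported
describe
```

Note that when firstrow is used together with cellrange, the header is taken from the first row of the specified range, which is why the range here starts at A1.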

Importing from SPSS
You can import an SPSS dataset using the following command, replacing the name of the dataset with the correct name:

import spss "datasetName.sav"

Importing from SAS
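You can import a SAS dataset using the following command, replacing the name of the dataset with the correct name. If the value labels are stored in a separate SAS format catalog file, also replace the name of the value labels file in the bcat option with the correct name (otherwise, omit this option):

import sas "datasetName.sas7bdat", bcat("valueLabelsFile.sas7bcat")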

Finally, if you wish to import data from an online survey tool such as Qualtrics, you will need to export it from the survey tool. While some tools may have the functionality to export to a Stata data file, Qualtrics does not, and therefore you will need to export it to an Excel file or SPSS file before importing as described.

Entering and saving data

If you don’t already have your data in an electronic format you can enter it directly in Stata instead. The data we will enter here has also been used for the examples in the next few pages of this module, and comes from the following simple survey:

SAMPLE QUESTIONNAIRE
Please complete this questionnaire by circling your response or by writing your answer on the line provided. Thank you for your co-operation in providing the information.

1. How old are you? ________
2. Which gender do you identify as?
   Male / Female / Non-binary / Prefer not to say / Prefer to self-describe: ________
3. Do you have any children?
   Yes / No
4. What is your household’s average daily energy consumption (in kWh) in summer? ________
5. What is your household’s average daily energy consumption (in kWh) in winter? ________
6. How many people live in your household? ________
7. Would you like to reduce your household’s energy consumption?
   Strongly disagree / Disagree / Neutral / Agree / Strongly agree

Note that as each of these questions only allows for one response, there will be one variable per question and hence seven variables. However, any question that allows multiple responses will need one variable for each possible response.

To enter data for these seven variables, or any others, in Stata, you can use the input command. When you do this, you will need to specify a name for each of the variables. These should be representative of the data, and each one has to be unique. Whilst variable names can be anything you choose, there are some rules to follow:

The maximum number of characters in each name is 32
The characters can only be letters, numbers, or underscores (spaces or other characters are not allowed), and the first character cannot be a number
The variable names are case sensitive, so for example Age, age and AGE are all different variable names

For this example, and for any other data you enter, these names could be related to the questions (for example ‘Age’ for the first variable), they could be the question numbers (for example ‘Q1’; if you export data from Qualtrics the variable names will be the question numbers), or they could be anything else that makes sense to you. Here we will name the variables ‘age’, ‘gender’, ‘children’, ‘summer_consumption’, ‘winter_consumption’, ‘household_size’ and ‘consumption_reduction’.

We will enter the following data for 10 (fictional) individuals who completed the survey:

age	gender	children	summer_consumption	winter_consumption	household_size	consumption_reduction
21	1	2	17	18	2	4
35	1	1	26	22	4	5
29	2	1	25	28	5	3
46	3	1	19	20	3	4
32	2	2	12	13	1	2
25	2	2	16	18	2	4
41	2	1	26	27	6	2
	2	2	15	16	2	3
26	1	2	32	29	5	5
48	2	1	40	33	7	5

To enter this data, and specify the variable names, type the following in the Command window. When you do this, note that the clear command clears any data that is already stored in memory, while the end command signifies that you have finished entering data. Furthermore, the . indicates that there is missing numeric data (for missing string data, use "" instead):

clear
input age gender children summer_consumption winter_consumption household_size consumption_reduction
21 1 2 17 18 2 4
35 1 1 26 22 4 5
29 2 1 25 28 5 3
46 3 1 19 20 3 4
32 2 2 12 13 1 2
25 2 2 16 18 2 4
41 2 1 26 27 6 2
. 2 2 15 16 2 3
26 1 2 32 29 5 5
48 2 1 40 33 7 5
end

Once you have done this, you should see the names of the variables appear in the Variables window at the top right of the screen. In addition, you can view or edit the data using the Data Editor window as required.
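
The example above contains only numeric variables. As a hedged sketch of how string data might be entered, the following uses a hypothetical string variable called city (not part of the survey). The str10 before its name tells the input command to store it as a string of up to 10 characters.

```stata
clear
* str10 declares city as a string variable of up to 10 characters
input age str10 city
21 "Perth"
35 ""
.  "Albany"
end

* View the data and count the missing values in each variable
list
count if missing(age)
count if missing(city)
```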

Setting up variables

Regardless of whether you import data from another source or enter it from scratch, you will need to make sure that your variables are set up correctly by specifying their properties. The following sections detail how to do this.

Naming variables

If you need to rename a variable or variables, you can do so using the rename command. To rename a single variable, follow this command with the name of the old variable and then the new variable. For example, to rename the ‘consumption_reduction’ variable to ‘Consumption_reduction’ (recalling that Stata is case sensitive) you would type the following:

rename consumption_reduction Consumption_reduction

To rename more than one variable at a time, follow the rename command with the names of the old and new variables in brackets instead. For example, to rename the ‘summer_consumption’, ‘winter_consumption’ and ‘household_size’ variables to ‘Summer_consumption’, ‘Winter_consumption’ and ‘Household_size’ respectively, type the following:

rename (summer_consumption winter_consumption household_size) (Summer_consumption Winter_consumption Household_size)

Alternatively, note that you can also change the case of variables by following the rename command with the old variable names and then either the upper (to change to all uppercase), lower (to change to all lowercase) or proper (to capitalise the first letter of each word) option after a comma. For example, to capitalise the first letter of the ‘age’, ‘gender’ and ‘children’ variables, type the following:

rename age gender children, proper

You should see the results of these variable name changes in the Variables window at the top right of the screen.

Assigning labels to variables

Because of the limit on the number of characters used in a variable name, the names can sometimes be rather cryptic. For example, the (newly renamed) ‘Summer_consumption’ and ‘Winter_consumption’ variables don’t provide the complete picture about the nature of the variables in the example above. Adding some extra labelling, though, can make all the difference. To add this extra information you can make use of the label variable command. For example, to add labels to these two variables type the following:

label variable Summer_consumption "Household energy consumption in kWh (summer)"
label variable Winter_consumption "Household energy consumption in kWh (winter)"

You should then see these labels next to the corresponding variable names in the Variables window at the top right of the screen.

Assigning labels to values for categorical variables

When entering data into Stata, the convention is to use numbers to represent categories for categorical data (visit the Data and variable types page of the Introduction to statistics module for more information on categorical data). This avoids any potential typos when entering category names, and makes the statistical analysis more straightforward. These numbers can be anything that make sense to you, but it is important that you let Stata know what category each number represents. You can do this by adding value labels for each categorical variable, which is a two-step process in Stata.

The first step requires using the label define command to define a value label set (that is, a list of the different numerical values to be used, and their corresponding category names). For example, you could define a value label set called ‘Gender_label’ (to be used with the ‘Gender’ variable in this example) as follows:

label define Gender_label 1 "Male" 2 "Female" 3 "Non-binary" 4 "Prefer not to say"

Once you have defined the value label set, you can apply it to a variable (or to multiple variables separately) using the label values command. For example, you can apply the ‘Gender_label’ set to the ‘Gender’ variable as follows:

label values Gender Gender_label

Once you have set up the value labels for the ‘Gender’ variable, do a similar thing for the other two categorical variables in this example (‘Children’ and ‘Consumption_reduction’) by running the following commands:

label define Children_label 1 "Yes" 2 "No"
label values Children Children_label
label define Consumption_reduction_label 1 "Strongly disagree" 2 "Disagree" 3 "Neutral" 4 "Agree" 5 "Strongly agree"
label values Consumption_reduction Consumption_reduction_label

Note that you can view the name of the value label set applied to each variable by selecting the name of the variable in the Variables window at the top right of the screen, and then viewing the variable properties in the Properties window at the bottom right of the screen. Or if preferred, you can use the ‘describe’ command as follows:

describe Gender

Either way, once you know the name of a value label set you can use the label list command to view the values and corresponding labels for that set as required. For example:

label list Gender_label

Alternatively, you can just use the label list command with no value label sets specified to list the details of all the label sets that have been created.
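
One way to check that value labels have been applied as intended is to tabulate the variable with and without its labels. The sketch below is self-contained, so it re-enters a few illustrative Gender values rather than relying on the dataset above; the tabulate command and its nolabel option are additions not otherwise covered on this page.

```stata
clear
input Gender
1
2
2
3
end

* Define the value label set and attach it to the variable
label define Gender_label 1 "Male" 2 "Female" 3 "Non-binary" 4 "Prefer not to say"
label values Gender Gender_label

* Frequencies shown with category names, then with the underlying numbers
tabulate Gender
tabulate Gender, nolabel
```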

Setting the data type

Another important change to make for each variable is to tell the software what type of data it contains. For a numeric variable the default type is a float, but there are also four other options to choose from, namely byte, int, long and double. These types have the following properties:

Type	Stores	Minimum	Maximum	Bytes
byte	integers	\(-127\)	\(100\)	\(1\)
int	integers	\(-32\,767\)	\(32\,740\)	\(2\)
long	integers	\(-2\,147\,483\,647\)	\(2\,147\,483\,620\)	\(4\)
float	decimals (up to 7 digits)	\(-1.70141173319 \times 10^{38}\)	\(1.70141173319 \times 10^{38}\)	\(4\)
double	decimals (up to 16 digits)	\(-8.9884656743 \times 10^{307}\)	\(8.9884656743 \times 10^{307}\)	\(8\)

While the float type can continue to be used for all the variables in this dataset, it does use up additional unnecessary memory by allowing for larger numbers with decimal places. Therefore, you may wish to change some or all of the variables to type byte or int, which you can do with the recast command. For example, you can recast the ‘Gender’, ‘Children’, ‘Household_size’ and ‘Consumption_reduction’ variables to bytes and the ‘Age’, ‘Summer_consumption’ and ‘Winter_consumption’ variables to ints (just in case larger numbers are required) as follows:

recast byte Gender Children Household_size Consumption_reduction
recast int Age Summer_consumption Winter_consumption

Note that you can view the type of each variable by selecting the name of the variable in the Variables window at the top right of the screen, and then viewing the variable properties in the Properties window at the bottom right of the screen. Alternatively, you can use the describe command again.

Furthermore, note that string variables are stored as type str# (for example, str1, str10 or str50), where the number represents the maximum length of the string (alternatively, using strL allows for up to 2 000 000 000 characters). So you would specify this value as appropriate in the case of a string variable, again being mindful of using up unnecessary memory.
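
As an alternative to choosing each type by hand with recast, Stata also provides the compress command, which is not covered above. It moves each variable to the smallest storage type that holds its values without loss of information. The following sketch uses a few illustrative values rather than the full dataset:

```stata
clear
input Age Household_size
21 2
35 4
48 7
end

* Both variables start out as the default float type
describe

* Let Stata choose the smallest type that loses no information
compress
describe
```

If you want a specific type, for example int to leave room for larger values later, recast remains the more direct choice.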

Formatting a variable

You can change how a variable is displayed, both in the Data Editor window and in output relating to that variable, using the format command. The default format is %9.0g, which indicates that:

the format is general numeric (that is, it automatically chooses between fixed and scientific notation as appropriate, and removes trailing zeros after the decimal point)
there are no commas in large numbers (for example 100000)
the field width is nine (that is, there are nine spaces reserved for displaying each value)

However, you can change this to:

be of a fixed numeric format (that is, a format that always shows numbers in standard decimal notation, and where you choose the number of decimal places), by replacing the g with an f
include commas in large numbers (for example 1,000,000), by adding a c on the end
adjust the field width, by changing the nine to a different value

In addition, if you are using a fixed numeric format you can also edit the number after the decimal point in the format to indicate how many decimal places to include in the output (when using a general numeric format, this number refers to the number of significant figures instead, and it is usually kept at zero to allow Stata to decide how many are appropriate).

For example, the following command formats the ‘Summer_consumption’ and ‘Winter_consumption’ variables so that they have a fixed numeric format, with commas in large numbers, a field width of six and two decimal places:

format Summer_consumption Winter_consumption %6.2fc

Note that you can view the format of each variable by selecting the name of the variable in the Variables window at the top right of the screen, and then viewing the variable properties in the Properties window at the bottom right of the screen. Alternatively, you can use the describe command again.
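
If you would like to preview what a format does before applying it to a variable, one option is to pass it to the display command. The sketch below shows the same illustrative value (1234.5, chosen arbitrarily) under three formats:

```stata
* General numeric format (the default)
display %9.0g 1234.5

* Fixed numeric format with two decimal places
display %9.2f 1234.5

* Fixed numeric format with two decimal places and commas
display %9.2fc 1234.5
```

A field width of nine is used here so that there is room for the comma and decimal places.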

Adding notes for a variable

Finally, you can also add notes for a variable using the notes command if you want to provide more detail than a label allows. For example, you could use the following command to add a note to the ‘Consumption_reduction’ variable:

notes Consumption_reduction : "Participants were asked whether they would like to reduce their household’s energy consumption."

To view the notes for this variable you can then use the notes command as follows:

notes Consumption_reduction

Alternatively, you can just use the notes command without any variable names to view all the notes in the dataset.

Saving

Once you have set up your dataset it can be saved using the save command (when you do this it will be automatically saved in your working directory, unless you specify a different file path):

save "datasetName.dta"

Note that if you have already saved the dataset and want to overwrite it, you will need to add the replace option after a comma in the above command.
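
For instance, a minimal sketch of saving with the replace option and then reopening the file might look as follows (the dataset name is a placeholder, and a single illustrative value is entered so that the example can be run on its own):

```stata
clear
input age
21
end

* Save the dataset, overwriting any existing file of the same name
save "datasetName.dta", replace

* Reopen it to confirm it was saved
use "datasetName.dta", clear
```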

If you would like to save your output (that is, the statistics, tables and messages that appear in the Results window) to a file called a log file, you can use the log command. You just need to ensure that you run this command before you obtain your results, either in the Command window or in your do-file. The following version of this command will save the results to a file called ‘results.log’ which is saved in the Results folder in the working directory (as created previously):

log using "Results/results.log", replace

Note that the addition of the replace option in this command tells Stata to overwrite an existing log file with the new output. If you want to add new results to the end of an existing log file instead, use the append option instead of replace.

Either way, if you do create a log file make sure to include a command to close it once you’re finished, as follows:

log close
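
Putting these steps together, a do-file that logs its results might be sketched as follows. The capture prefix and the auto example dataset that ships with Stata are additions for illustration only: capture log close closes any log left open by an earlier run without reporting an error if none is open, and capture mkdir avoids an error if the Results folder already exists.

```stata
* Close any log that is still open from a previous run
capture log close

* Make sure the Results folder exists
capture mkdir Results

* Start logging, overwriting any previous log file
log using "Results/results.log", replace

* Commands whose output should be recorded (illustrative)
sysuse auto, clear
summarize price mpg

* Stop logging
log close
```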