This tutorial will go over some basics to get you started using IBM SPSS Statistics, or SPSS. We will cover reading in data, understanding variable view vs. data view, creating and recoding variables, creating graphs, and performing basic analyses. For a more involved approach to analysis with SPSS see our other tutorials. Everything in this tutorial is done using SPSS version 26.
The data used is pulled from the General Social Survey (GSS) dataset for the year 2016.
SPSS may take a minute to load when you first start it up, so be patient. Two windows will open: the “Welcome to IBM SPSS Statistics” window and an “IBM SPSS Statistics Data Editor” window.
The welcome window includes some quick links that are often useful:
- Create a new file
- Open a recent file
- See what’s new in recent updates
- Get quick access to IBM help and support
- Go to IBM SPSS tutorials
However, we’d like to learn how to do all of this without the use of the welcome window, so we will close it for now.
Now we can see the data editor.
It is blank since we don’t currently have any data loaded. Notice that there are two tabs in the bottom left corner: data view and variable view. We will come back to those later. First, let’s load some data.
Reading in Data
The native data format for SPSS is .sav or .zsav, but SPSS can import data from Excel, CSV, SAS, Stata, and more. We will cover loading data from a .sav file, and loading Excel/CSV files.
Opening a .sav file is very simple. Go to File \(\rightarrow\) Open \(\rightarrow\) Data…
Select the file that you want to open (in our case spss-basics-data.sav) and click Open.
Two things will happen. The data will open in the data editor:
and an output window will open:
The output window is a running log of everything you have done in your current session. If you do anything in SPSS, it will update the output window to reflect what you did. This is also where the output for any figures or analyses will appear. You can just minimize this tab for now.
Next, let’s go over how to open .xlsx and .csv data.
For Excel data, go to File \(\rightarrow\) Import Data \(\rightarrow\) Excel…
Find your file and click Open.
The following window will open.
Under Worksheet you can select which tab you want to import if the file has multiple tabs. We only have one tab (sheet1). Then, you can select a custom range; the default is to import the entire spreadsheet. SPSS will default to reading in variable names from the first row of data, but you can uncheck the box if that is not the case. Leave everything else as is and click OK.
Opening CSV data is similar. Go to File \(\rightarrow\) Import Data \(\rightarrow\) CSV Data… Find your file and click Open.
The read CSV file window will open.
Again, you can specify whether the first row contains variable names. You can also remove leading/trailing spaces from string values (not relevant right now). The delimiter for this dataset is a comma, but you could also specify semicolons or tabs. We will leave everything as the default and click OK. The data will open in a new window.
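To make the import options concrete, here is a minimal Python sketch (not SPSS syntax) of what "first row contains variable names", the delimiter choice, and trimming leading/trailing spaces each do to a delimited text file. The sample text, the function name `read_csv_text`, and the fallback names `V1`, `V2`, ... are assumptions made for illustration only.

```python
import csv
import io

# Made-up sample text standing in for a CSV file; not the tutorial's data.
raw = "AGE,SEX,RACE\n47, 1 ,1\n61,2,2\n"

def read_csv_text(text, delimiter=",", first_row_names=True, strip_spaces=True):
    """Parse delimited text into a list of {variable name: value} records."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if strip_spaces:
        # Remove leading/trailing spaces from every value.
        rows = [[cell.strip() for cell in row] for row in rows]
    if first_row_names:
        names, data = rows[0], rows[1:]
    else:
        # Hypothetical placeholder names when the file has no header row.
        names = [f"V{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(names, row)) for row in data]

records = read_csv_text(raw)
no_header = read_csv_text(raw, first_row_names=False)
```

With the header option on, the first row supplies the names and two records remain. With it off, all three rows are treated as data, which is why that checkbox matters.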
Understanding Variable View
Now that we have the data open, let’s go over what the different views are. Looking at the data editor, we see that we are in variable view (remember the tab at the bottom left corner?). Variable view is exactly what it sounds like: a view of all the variables in the dataset. There are always 11 columns, and the number of rows is equal to the number of variables, i.e. one row per variable.
The columns, from left to right, are as follows:
- Name: Gives the variable name
- Type: Specifies if the variable is numeric, string, date, etc.
- Width: Upper limit of how many characters are in each entry
- Decimals: How many decimal places are displayed for numeric entries
- Label: A descriptive label for the variable
- Values: User-defined value labels
- Missing: Whether any values are set to missing
- Columns: The width of the column
- Align: Specifies whether data are left, right or center aligned
- Measure: Indicates if the variable is scale, ordinal, or nominal
- Role: An optional setting to indicate how the variable will be used in analysis
Changing a variable name is very easy: simply double-click on the cell with the name you want to change and type in the new name. Note that SPSS does not allow spaces or most special characters in variable names.
Variable labels can be added the same way. Unlike names, labels can contain spaces, so they are helpful for making the output easy to read.
In addition to variable labels, value labels can also be very useful when dealing with categorical data. For example, the SEX variable is coded as 1s and 2s, where 1 represents male and 2 represents female. We can add this as a value label, which will show up on any tables or figures that we create.
For missing values, there is an automatic “System Missing” value of “.”, but some files use numeric values, e.g. -999, to represent missing responses. Declaring these codes as missing allows SPSS to treat them correctly as missing in the analysis. Let’s look at the RACE variable. This has possible values of 0 (inapplicable), 1 (white), 2 (black), 3 (other). We want SPSS to treat zeros as missing.
Let’s add variable labels to the following:
- AGE: Age
- SEX: Sex
- RACE: Race
- RELIGID: Religious Identity
Then, add value labels for race:
- 1: White
- 2: Black
- 3: Other
and value labels for RELIGID:
- 1: Fundamentalist
- 2: Evangelical
- 3: Mainline
- 4: Liberal
- 5: None
- 6: Other
Finally, specify missing values for RELIGID as 0, 8, and 9, and for EDUC as 98 and 99.
Your data should look like this.
These labels and missing-value settings will come in handy when we make our figures and tables.
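As a rough illustration of what value labels and user-defined missing values do, the Python sketch below (not SPSS syntax) stores both as lookup tables. The dictionary layout and the helper names `clean` and `label` are hypothetical; the codes and labels are the ones listed above.

```python
# Metadata mirroring the settings described above (layout is hypothetical).
value_labels = {
    "RACE": {1: "White", 2: "Black", 3: "Other"},
    "RELIGID": {1: "Fundamentalist", 2: "Evangelical", 3: "Mainline",
                4: "Liberal", 5: "None", 6: "Other"},
}
missing_codes = {"RELIGID": {0, 8, 9}, "EDUC": {98, 99}}

def clean(variable, value):
    """Return None for a user-missing code, otherwise the value unchanged."""
    return None if value in missing_codes.get(variable, set()) else value

def label(variable, value):
    """Return the descriptive label for a code, or the code itself if unlabeled."""
    return value_labels.get(variable, {}).get(value, value)

educ_raw = [12, 16, 99, 98, 14]          # made-up responses
educ_clean = [clean("EDUC", v) for v in educ_raw]
```

The point of the sketch is that a code such as 99 is still stored in the data; it is only excluded from calculations once it has been declared missing.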
Understanding Data View
Now, let’s take a look at the data view.
If you have used Excel before, this view should look familiar to you. In data view, each variable is its own column, and each row represents one entry. The 17 variables that we saw in variable view are all here, along with their corresponding values. Currently, we can see the numeric values, rather than the descriptive labels we provided. We can go to View \(\rightarrow\) Value labels, and the value labels we set will show instead.
We can also sort the data in ascending or descending order by one or more variables. Go to Data \(\rightarrow\) Sort Cases.
We can select the variable to sort by - let’s go with age - and specify whether it should be descending or ascending. We’ll select ascending.
Then click OK.
You can see the data is now sorted by age.
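For readers who think in code, the sort just performed corresponds to ordering rows by one column. This is a small Python sketch with made-up cases; the function name `sort_cases` is an assumption for illustration and is not an SPSS command.

```python
# Made-up cases for illustration.
cases = [{"id": 1, "AGE": 52}, {"id": 2, "AGE": 19}, {"id": 3, "AGE": 34}]

def sort_cases(rows, variable, descending=False):
    """Return a new list of rows ordered by one variable."""
    return sorted(rows, key=lambda row: row[variable], reverse=descending)

ascending = sort_cases(cases, "AGE")
descending = sort_cases(cases, "AGE", descending=True)
```

Note that whole rows move together, so each respondent's other values stay attached to their age.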
Creating New Variables
There are multiple ways to create new variables in SPSS. The ones we will cover are Compute Variable, Recode into Same Variables, and Recode into Different Variables.
To use the Compute Variable window, go to Transform \(\rightarrow\) Compute Variable.
The following window will open:
Type the name of the variable you wish to create under Target Variable. Let’s create a new variable called age_std, which will be defined as age minus 18 years, so that 18 becomes zero, 19 becomes one, and so on. Select Age and use the arrow to move it into the Numeric Expression box, then type - 18. Your window should look something like this.
Click OK. In variable view, confirm that the age_std variable was created.
We wish to create an interaction variable between age_std and SEX. Again, go to Transform \(\rightarrow\) Compute Variable. Name the target variable age_sex. For the numeric expression, select age_std from the list and click the arrow to move it over. Use an asterisk to denote multiplication, then click SEX and use the arrow to move it over. Your window should look like this.
Click OK. Again, confirm that the age_sex variable was created in the variable view. Use the data view to make sure the values are computed correctly.
We have successfully created an interaction variable. However, note that sex is coded 1 = Male and 2 = Female. When creating an interaction with a categorical variable, interpretation is easier when the variable is coded zero and one. The next section will show how to do that for sex.
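The two computed variables can be checked by hand. Here is a minimal Python sketch of the same arithmetic on three made-up cases (the data are invented; only the variable names come from the tutorial):

```python
# Made-up cases; SEX uses the original coding (1 = Male, 2 = Female).
rows = [{"AGE": 18, "SEX": 1}, {"AGE": 30, "SEX": 2}, {"AGE": 19, "SEX": 1}]

for row in rows:
    # age_std shifts age so that 18 becomes 0, 19 becomes 1, and so on.
    row["age_std"] = row["AGE"] - 18
    # age_sex is the product of the two variables (the interaction term).
    row["age_sex"] = row["age_std"] * row["SEX"]

age_std = [row["age_std"] for row in rows]
age_sex = [row["age_sex"] for row in rows]
```

With the 1/2 coding, the product equals age_std for males and twice age_std for females. Under a 0/1 coding it would be zero for one group and age_std for the other, which is why the 0/1 version is easier to interpret.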
Recoding Variables
Sometimes you wish to create a new variable based on the values of another variable, or to recode those values. There are two options here:
- Recode into Same Variables
- Recode into Different Variables
Let’s recode the SEX variable from 1’s and 2’s to 0’s and 1’s. Specifically, we want \(1 \rightarrow 0\) and \(2 \rightarrow 1\). Go to Transform \(\rightarrow\) Recode into Same Variables.
Select the SEX variable and use the arrow to move it into the Numeric Variables box.
Click Old and New Values. Under old value, type “1”, and under new value, type “0”, then click Add. Then, repeat this process to set old value 2 to be new value 1 and click Add.
Click Continue, then OK.
In data view you can see the recoding. However, the labels need to be updated to match the recoded values.
Go to Variable View, then click on the Values box for SEX. Change 1 = "Male" to 0 = "Male", and 2 = "Female" to 1 = "Female".
Then click OK. Your new labels should now show up in the data view.
It’s generally a good idea to recode into a different variable so that you can always go back to the original coding if you need to.
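To see why keeping the original is safer, here is a small Python sketch of the same old-to-new mapping written to a separate variable. The name `sex01` and the helper `recode` are hypothetical, and passing unlisted values through unchanged is a choice made for this sketch.

```python
def recode(value, mapping):
    """Return the new code for a value; values not in the mapping pass through."""
    return mapping.get(value, value)

sex_map = {1: 0, 2: 1}                       # old value -> new value
cases = [{"SEX": 1}, {"SEX": 2}, {"SEX": 2}]  # made-up cases

for case in cases:
    # Write to a new variable so the original coding is still available.
    case["sex01"] = recode(case["SEX"], sex_map)

original = [case["SEX"] for case in cases]
recoded = [case["sex01"] for case in cases]
```

Because `SEX` is untouched, a mistake in the mapping can be corrected by simply recoding again from the original column.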
Recode into Different Variables takes a similar approach to Recode into Same Variables. Consider the race variable. It is often necessary to recode categorical variables into dummy variables. We can do this for race=white and race=black using Recode into Different Variables.
Go to Transform \(\rightarrow\) Recode into different variables.
Select RACE and use the arrow to shift it over to the Input Variable box. Under Output Variable change the name to race_white, and the label to “Race = White”. Then click Change.
Next click Old and New Values…
We want the value of our new variable to be one if race is white, and zero if it is anything else. Recall that the race variable is coded as:
- 1: White
- 2: Black
- 3: Other
So, under Old value, we set Value to 1, and under New Value, we set Value to 1. Then click Add. Under Old value, select All other values, and under New Value set Value to 0. Then click Add. Your window should look like this.
Click Continue, then click OK. Next, we’ll create the race=black dummy variable. Go to Transform \(\rightarrow\) Recode into different variables.
Select RACE and use the arrow to shift it over to the Input Variable box. Under Output Variable change the name to race_black, and the label to “Race = Black”. Then click Change.
Under Old value, we set Value to 2 (since black is coded as 2 in the original race variable), and under New Value, set Value to 1. Then click Add. Under Old value, select All other values, and then under New Value set Value to 0. Then click Add. Your window should look like this.
Click Continue, then click OK. You can see we have created two new variables (race_white and race_black) in the data editor window.
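The dummy-coding rule used in both recodes ("this value becomes 1, all other values become 0") can be sketched in a few lines of Python. The helper name `dummy` and the sample codes are made up for illustration.

```python
def dummy(value, target):
    """Return 1 when value equals the target code, otherwise 0."""
    return 1 if value == target else 0

races = [1, 2, 3, 1]                          # made-up codes: 1 White, 2 Black, 3 Other
race_white = [dummy(r, 1) for r in races]
race_black = [dummy(r, 2) for r in races]
```

A respondent coded 3 (Other) gets 0 on both dummies, so "Other" acts as the reference category. Note that in this sketch a missing value would also fall under "all other values" and receive 0, so it is worth checking how missing cases should be handled before relying on the dummies.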
Descriptive Statistics
Now that we have our variables coded with variable and value labels, we may wish to look at some descriptive statistics. For categorical variables (i.e. variables with distinct groups, or categories, such as race) we will look at frequencies. For interval, or continuous, variables (such as age), we will look at the minimum, maximum, mean, and standard deviation.
To create a frequency table, go to Analyze \(\rightarrow\) Descriptive Statistics \(\rightarrow\) Frequencies…
Let’s create a frequency table for race. Select the RACE variable from the list and use the arrow to move it to the Variable(s) box.
Click OK. The frequency table will open in the output window.
The first table provides the total number of valid and missing responses, if any exist; there are 2,867 responses to the race variable.
The next table provides the frequencies of each response.
- Frequency is the number of responses for that category
- Percent is the number of responses in that category divided by the total number of cases (valid + missing), times 100
- Valid Percent is the percentage based on non-missing responses (in this case, percent and valid percent are the same because there were no missing observations)
- Cumulative Percent is the valid percent for that category plus the valid percents of all previous categories
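The four columns can be reproduced by hand. The Python sketch below computes them for a small made-up sample that includes two missing responses (`None`), so that Percent and Valid Percent differ; the function name `frequency_table` is an assumption for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, frequency, percent, valid percent, cumulative percent) rows."""
    total = len(values)                       # valid + missing
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    table = []
    cumulative = 0.0
    for category in sorted(counts):
        freq = counts[category]
        percent = 100 * freq / total          # out of all cases
        valid_percent = 100 * freq / len(valid)  # out of non-missing cases
        cumulative += valid_percent
        table.append((category, freq, round(percent, 1),
                      round(valid_percent, 1), round(cumulative, 1)))
    return table

sample = [1, 1, 1, 2, 2, 3, None, None]      # made-up codes, None = missing
table = frequency_table(sample)
```

Here category 1 makes up 37.5% of all eight cases but 50% of the six valid ones, and the cumulative column reaches 100 at the last category.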
Now let’s take a look at age. Go to Analyze \(\rightarrow\) Descriptive Statistics \(\rightarrow\) Descriptives…
Select AGE and use the arrow to move it into the Variable(s) box.
You can also add other statistics (e.g. skew, kurtosis for evaluating whether a distribution is normal) by using the Options… button.
Leave the defaults checked for now and click Continue. Then click OK.
- N provides a count of the responses
- Minimum is the smallest response
- Maximum is the largest response
- Mean gives the average value and is used to measure central tendency
- Std. Deviation is the standard deviation, which is a measure of dispersion
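As a cross-check on these definitions, the same five statistics can be computed with Python's standard library. The ages below are made up, and the sketch assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import statistics

def describe(values):
    """Summarize the non-missing values of one variable."""
    valid = [v for v in values if v is not None]
    return {
        "N": len(valid),
        "Minimum": min(valid),
        "Maximum": max(valid),
        "Mean": statistics.mean(valid),
        "Std. Deviation": statistics.stdev(valid),  # sample SD (n - 1)
    }

ages = [18, 25, 40, 55, 62, None]            # made-up ages, None = missing
summary = describe(ages)
```

Missing values are dropped before anything is computed, so N counts only the valid responses.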
Creating Graphs
There are many different graphs that SPSS can create, enough to fill multiple tutorials. However, we will focus on just two: histograms and bar graphs.
Histograms are used to visualize continuous data by dividing the range of values into “bins” and showing the frequency of data points that fall in each bin. Bar graphs are used to visualize categorical data by generating a bar for each category whose height is proportional to the frequency of that category.
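The binning idea behind a histogram can be sketched in plain Python. The bin start, width, and ages below are made up, and SPSS chooses its own bins automatically, so this only illustrates the counting step.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:               # ignore values outside the bins
            counts[index] += 1
    return counts

ages = [18, 22, 29, 30, 31, 45, 55, 56, 58, 89]   # made-up ages
counts = histogram_counts(ages, start=10, width=10, n_bins=8)
```

Each entry of `counts` is the height of one histogram bar, from ages 10-19 up to ages 80-89.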
There are two main ways SPSS can create these visualizations: through the Chart Builder and through Legacy Dialogs. First let’s cover the legacy dialogs.
Say we wish to create a histogram of Age. Go to Graphs \(\rightarrow\) Legacy Dialogs \(\rightarrow\) Histogram…
The Histogram window will open.
Select Age and use the arrow to move it into the Variable box.
Then click OK.
We can see the data appears to be bimodal with a peak at 30 years and another at approximately 55 years. The mean age is 49.33 years with a standard deviation of 17.905 years.
Now, let’s create a bar graph of race. Go to Graphs \(\rightarrow\) Legacy Dialogs \(\rightarrow\) Bar…
Select Simple, and Summaries for groups of cases. Then click Define.
The following window will open.
We can specify what we wish the bars to represent.
- N of cases is the number of cases in each category
- Cumulative N will add the previous categories to each subsequent bar
- % of cases is the number of cases out of the total times 100%
- Cumulative percent adds the previous category percents to each subsequent bar
- Other statistic allows you to specify another value (such as mean, minimum, etc.)
Select N of cases.
Category axis is the variable we wish to graph. Select Race and use the arrow to move it into the Category Axis box.
The window should look like this:
Then click OK. The following figure will be created:
Most respondents to the survey were white, followed by black, and the fewest respondents were other.
Next, let’s create the same figures using the chart builder. The benefit of the chart builder is that it is a lot more flexible than the legacy dialogs.
Go to Graphs \(\rightarrow\) Chart Builder…
When the chart builder window first opens, it will be blank.
There are four main sections in the chart builder window.
- Section A provides the variables in your dataset
- Section B is the chart preview, which will be used to build your chart
- Section C allows you to edit the chart properties, appearance, and options (these will change depending on the type of chart you build)
- Section D is the Gallery, where you select the chart template you are starting with, basic elements (where you can edit the axes and other elements), Groups/Point ID (which can be used to add clustering/paneling/etc), and titles/footnotes (which can be used to add title/footnote elements to your chart)
In the gallery, select Histogram, then click and drag the Simple Histogram to the chart preview section.
This will automatically insert the simple histogram template into the chart preview window. You can see there are three values that can be edited: Y-axis?, X-Axis?, and Filter?. Click on Age in the variables window and drag it to the X-Axis? box.
Setting a variable in the y-axis allows you to set histogram values rather than having SPSS calculate them. This is not relevant for us, so we will leave it as is. The filter value allows you to filter the data by some other variable; again, we will leave this blank.
We can customize the color under Chart appearance, change axis labels and chart titles under Element Properties and more. However, let’s leave it as the defaults for now and click OK.
This creates the same graph as the legacy dialogs did.
Now, let’s create the bar chart using the chart builder. Again, go to Graphs \(\rightarrow\) Chart Builder…
This time, select Bar in the gallery, then click and drag Simple Bar to the chart preview section.
Click Race in the variables window, drag it to the X-axis? box, and then click OK.
Again, we get the same chart as we did in the legacy dialogs. It is possible to create most charts using either method; the trade-off is simplicity (legacy dialogs) versus flexibility (chart builder).
Performing Analyses
Finally, we will go over how to do some basic analyses with SPSS: specifically, calculating correlations and running a linear regression model.
Let’s take a look at the correlations between age and income. Go to Analyze \(\rightarrow\) Correlate \(\rightarrow\) Bivariate…
The Bivariate Correlations window will open. We will select Age and INCOME and use the arrow to move them into the Variables box. Then select which type of correlation coefficient to calculate; we will stick with Pearson’s r. The test of significance defaults to Two-tailed, which is standard.
Click OK. You will get the following correlations table:
The values on the diagonal are 1’s, since the correlation of any variable with itself is 1. We can see that age and income have a Pearson correlation of \(r = 0.021\), a very weak association that is not significant at the \(\alpha = 0.05\) level (\(p = 0.251\)).
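For reference, Pearson's r can be computed directly from its definition. The Python sketch below uses made-up numbers (not the GSS data) and assumes that a case is dropped whenever either value is missing; the function name `pearson_r` is chosen for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists, skipping incomplete pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4]                             # made-up values
y = [1, 3, 2, 4]
r_xy = pearson_r(x, y)
r_xx = pearson_r(x, x)
```

Correlating a variable with itself returns 1, which is exactly what the diagonal of the correlations table shows.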
Let’s run a regression model to predict years of education based on age, sex, and the race dummy variables we created above. Go to Analyze \(\rightarrow\) Regression \(\rightarrow\) Linear…
The dependent variable is EDUC, so use the right arrow to move that into the Dependent box. The independent variables are Age, SEX, race_white, and race_black, so use the arrow to move those into the Independent(s) box.
Then click OK. We get the following output:
The first table lists the variables used in the model, and the method used to enter them into the model (default = Enter).
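The next table provides the model summary, which gives R (the multiple correlation coefficient), R Square (the proportion of variance in the dependent variable explained by the model), Adjusted R Square (R Square adjusted downward for the number of predictors, and so a more conservative estimate of model fit), and the standard error of the estimate (the typical size of the residuals).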
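The relationship between R Square and Adjusted R Square follows the standard formula \(1 - (1 - R^2)(n - 1)/(n - k - 1)\), where \(n\) is the number of cases and \(k\) the number of predictors. The Python sketch below applies it to made-up numbers (not the values from this model); the function name is chosen for illustration.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Made-up example: R-square of 0.5 from 101 cases and 4 predictors.
adjusted = adjusted_r_squared(0.5, n_cases=101, n_predictors=4)
```

The adjusted value is always a little lower than R Square when there is at least one predictor, and the gap shrinks as the number of cases grows relative to the number of predictors.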
The ANOVA table tests whether the model as a whole is significant. Here Sig. is shown as 0.000, which means \(p < 0.001\), so we conclude that at least one of the predictors is significantly related to years of education.
Finally, the coefficients table provides the regression model coefficients, along with their significance. Unstandardized coefficients are provided with their standard errors, as well as standardized betas. A t-test is conducted for each coefficient, and the t-value is given along with its significance (i.e. p-value). We can see that age is a significant factor (\(p=0.009\)), as is race = white (\(p<0.001\)), while sex and race = black are not.
The purpose of this tutorial has been to provide a starting point for using SPSS. For a more in-depth look at specific analyses, see our other SPSS tutorials here.