This tutorial will go over some basics to get you started using IBM SPSS Statistics, or SPSS. We will cover reading in data, understanding variable view vs. data view, creating and recoding variables, creating graphs, and performing basic analyses. For a more involved approach to analysis with SPSS see our other tutorials. Everything in this tutorial is done using SPSS version 26.

The data used is pulled from the General Social Survey (GSS) dataset for the year 2016.

SPSS may take a minute to load when you first start it up, so be patient. Two windows will open: the “Welcome to IBM SPSS Statistics” window and the “IBM SPSS Statistics Data Editor”.

The welcome window includes some quick links that are often useful:

- You can create a new file
- Open a recent file
- See what’s new in recent updates
- Get quick access to IBM help and support
- Go to IBM SPSS tutorials

However, we’d like to learn how to do all of this without the use of the welcome window, so we will close it for now.

Now we can see the data editor.

It is blank since we don’t currently have any data loaded. Notice that there are two tabs in the bottom left corner: data view and variable view. We will come back to those later. First, let’s load some data.

### Reading in Data

The native data format for SPSS is `.sav` or `.zsav`, but SPSS can import data from Excel, CSV, SAS, Stata, and more. We will cover loading data from a `.sav` file, and loading Excel/CSV files.

Opening a `.sav` file is very simple. Go to **File \(\rightarrow\) Open \(\rightarrow\) Data…**

Select the file that you want to open (in our case `spss-basics-data.sav`) and click **Open**.

Two things will happen. The data will open in the data editor:

and an output window will open:

The output window is a running log of everything you have done in your current session. If you do anything in SPSS, it will update the output window to reflect what you did. This is also where the output for any figures or analyses will appear. You can just minimize this tab for now.

Next, let’s go over how to open `.xlsx` and `.csv` data.

For Excel data, go to **File \(\rightarrow\) Import Data \(\rightarrow\) Excel…**

Find your file and click **Open**.

The following window will open.

Under **Worksheet** you can select which tab you want to import if the file has multiple tabs. We only have one tab (sheet1). Then, you can select a custom range; the default is to import the entire spreadsheet. SPSS will default to reading in variable names from the first row of data, but you can uncheck the box if that is not the case. Leave everything else as is and click **OK**.

Opening CSV data is similar. Go to **File \(\rightarrow\) Import Data \(\rightarrow\) CSV Data…** Find your file and click **Open**.

The read CSV file window will open.

Again, you can specify whether the first row contains variable names. You can also remove leading/trailing spaces from string values (not relevant right now). The delimiter for this dataset is a comma, but you could also specify semicolons or tabs. We will leave everything as the default and click **OK**. The data will open in a new window.
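To make those import options concrete, here is a minimal Python sketch of the same ideas outside SPSS: a header row that supplies variable names, a comma delimiter, and stripping of leading/trailing spaces. The CSV text is made up for illustration and is not the tutorial's dataset.

```python
import csv
import io

# Made-up CSV text standing in for a real file; the first row holds variable names.
raw = "AGE,SEX,RACE\n47, 1 ,1\n61,2,2\n"

# delimiter="," matches a comma-separated file; use ";" or "\t" for other files.
reader = csv.DictReader(io.StringIO(raw), delimiter=",")

# Strip leading/trailing spaces from every value.
rows = [{name: value.strip() for name, value in row.items()} for row in reader]

print(rows[0])
```

Every value arrives as text here; SPSS, by contrast, assigns a variable type during import.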

### Understanding Variable View

Now that we have the data open, let’s go over what the different views are. Looking at the data editor, we see that we are in variable view (remember the tab at the bottom left corner?). Variable view is exactly as it sounds; it is a view of all the variables in the dataset. There are always 11 columns, and the number of rows is equal to the number of variables; i.e. one row per variable.

The columns, from left to right, are as follows:

- Name: Gives the variable name
- Type: Specifies if the variable is numeric, string, date, etc.
- Width: Upper limit of how many characters are in each entry
- Decimals: How many decimals to round numeric entries to
- Label: A descriptive label for the variable
- Values: User-defined value labels
- Missing: Whether any values are set to missing
- Columns: The width of the column
- Align: Specifies whether data are left, right or center aligned
- Measure: Indicates if the variable is scale, ordinal, or nominal
- Role: An optional setting to indicate how the variable will be used in analysis

Changing a variable name is very easy; simply double click on the cell with the name you want to change, and type in the new name.

Adding variable labels can be done similarly. SPSS does not allow spaces or special characters in variable names. Variable labels are helpful so that the output is easy to read.

In addition to variable labels, *value* labels can also be very useful when dealing with categorical data. For example, the `SEX` variable is coded as 1s and 2s, where 1 represents male and 2 represents female. We can add this as a value label, which will show up on any tables or figures that we create.

For missing values, there is an automatic “System Missing” value of “.”, but some files use numeric values, e.g. -999, to represent missing responses. Setting these values will allow SPSS to correctly treat these values as missing in the analysis. Let’s look at the `RACE` variable. This has possible values of 0 (inapplicable), 1 (white), 2 (black), 3 (other). We want SPSS to treat zeros as missing.

Let’s add variable labels to the following:

- `AGE`: Age
- `SEX`: Sex
- `RACE`: Race
- `RELIGID`: Religious Identity

Then, add value labels for race:

- 1: White
- 2: Black
- 3: Other

and value labels for `RELIGID`:

- 1: Fundamentalist
- 2: Evangelical
- 3: Mainline
- 4: Liberal
- 5: None
- 6: Other

Finally, specify missing values for `RELIGID` as 0, 8, and 9 and `EDUC` as 99 and 98.

Your data should look like this.

These labels and missing-value settings will come in handy when we make our figures and tables.
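As a rough illustration of what value labels and user-defined missing values do, the following Python sketch maps numeric codes to labels and treats the flagged codes as missing. The labels and missing codes mirror the `RELIGID` settings above; the responses and the `decode` helper are made up for this example.

```python
# Value labels and user-missing codes, mirroring the RELIGID settings above.
value_labels = {
    1: "Fundamentalist",
    2: "Evangelical",
    3: "Mainline",
    4: "Liberal",
    5: "None",
    6: "Other",
}
missing_codes = {0, 8, 9}


def decode(code):
    """Return the label for a valid code, or None when the code is flagged as missing."""
    if code in missing_codes:
        return None
    return value_labels.get(code, code)


responses = [3, 9, 5, 0, 1]  # made-up responses
decoded = [decode(code) for code in responses]

print(decoded)
```

Note that the label `"None"` (no religious identity) is a valid response, while Python's `None` marks a missing one.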

### Understanding Data View

Now, let’s take a look at the data view.

If you have used Excel before, this view should look familiar to you. In data view, each variable is its own column, and each row represents one entry. The 17 variables that we saw in variable view are all here, along with their corresponding values. Currently, we can see the numeric values, rather than the descriptive labels we provided. We can go to **View \(\rightarrow\) Value labels**, and the value labels we set will show instead.

We can also sort the data in ascending or descending order by one or more variables. Go to **Data \(\rightarrow\) Sort Cases…**

We can select the variable to sort by - let’s go with age - and specify whether it should be descending or ascending. We’ll select ascending.

Then click **OK**.

You can see the data is now sorted by age.
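For comparison, the same ascending sort can be sketched in a few lines of Python. The cases below are made up; each dictionary stands for one row of the data view.

```python
# Made-up cases; each dict is one row of the data view.
cases = [{"id": 1, "AGE": 52}, {"id": 2, "AGE": 23}, {"id": 3, "AGE": 38}]

# Ascending sort by AGE; pass reverse=True for descending order.
by_age = sorted(cases, key=lambda case: case["AGE"])

print([case["AGE"] for case in by_age])
```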

### Creating New Variables

There are multiple ways to create new variables in SPSS. The ones we will cover are *Compute Variable*, *Recode into Same Variables*, and *Recode into Different Variables*.

To use the *Compute Variable* window, go to **Transform \(\rightarrow\) Compute Variable**.

The following window will open:

Type the name of the variable you wish to create under **Target Variable**. Let’s create a new variable called `age_std` which will be defined as age minus 18 years, so that 18 becomes zero, 19 becomes one, and so on. Select `Age` and use the arrow to move it into the numeric expression box, then type `- 18`. Your window should look something like this.

Click **OK**. In variable view, confirm that the `age_std` variable was created.

We wish to create an interaction variable between `age_std` and `SEX`. Again, go to **Transform \(\rightarrow\) Compute Variable**. Name the target variable `age_sex`. For the numeric expression, select `age_std` from the list and click the arrow to move it over. Use an asterisk to denote multiplication, then click `SEX` and use the arrow to move it over. Your window should look like this.

Click **OK**. Again, confirm that the `age_sex` variable was created in the variable view. Use the data view to make sure the values are computed correctly.

We have successfully created an interaction variable. However, note that sex is coded 1 = Male and 2 = Female. When creating an interaction with a categorical variable, interpretation is easier when the variable is coded zero and one. The next section will show how to do that for sex.

### Recoding Variables

Sometimes you wish to create a new variable based on the values of another variable, or to recode those values. There are two options here:

- Recode into same variables
- Recode into different variables

Let’s recode the `sex` variable from 1’s and 2’s to 0’s and 1’s. Specifically, we want \(1 \rightarrow 0\) and \(2 \rightarrow 1\). Go to **Transform \(\rightarrow\) Recode into same variables**.

Select the `SEX` variable and use the arrow to move it into the *Numeric Variables* box.

Click **Old and New Values**. Under old value, type “1”, and under new value, type “0”, then click **Add**. Then, repeat this process to set old value 2 to be new value 1 and click **Add**.

Click **Continue**, then **OK**.

In data view you can see the recoding. However, the labels need to be updated to match the recoded values.

Go to **Variable View**, then click on the **Values** box for `SEX`. Change `1 = "Male"` to `0 = "Male"`, and `2 = "Female"` to `1 = "Female"`.

Then click **OK**. Your new labels should now show up in the data view.

It’s generally a good idea to recode into a different variable so that you can always go back to the original coding if you need to.
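The recode itself is just a lookup from old values to new values. Here is a minimal Python sketch with made-up data; it writes the result to a new list, in the spirit of recoding into a different variable, and leaves any value that has no rule unchanged.

```python
# Old value -> new value, matching the rules entered in the dialog above.
recode_map = {1: 0, 2: 1}

# Value labels updated to match the new codes.
sex_labels = {0: "Male", 1: "Female"}

sex = [1, 2, 2, 1]  # made-up original codes

# Values without a rule are passed through unchanged.
sex_recoded = [recode_map.get(value, value) for value in sex]

print(sex_recoded)
print([sex_labels[value] for value in sex_recoded])
```

Because `sex` is left untouched, the original coding is still available if the recode turns out to be wrong.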

Recode into different variables takes a similar approach to recode into same variables. Consider the race variable. It is often necessary to recode categorical variables into dummy variables. We can do this for race=white and race=black using recode into different variables.

Go to **Transform \(\rightarrow\) Recode into different variables**.

Select `RACE` and use the arrow to shift it over to the *Input Variable* box. Under *Output Variable* change the name to `race_white`, and the label to “Race = White”. Then click **Change**.

Next click **Old and New Values…**

We want the value of our new variable to be one if race is white, and zero if it is anything else. Recall that the race variable is coded as:

- 1: White
- 2: Black
- 3: Other

So, under *Old value*, we set *Value* to 1, and under *New Value*, we set *Value* to 1. Then click **Add**. Under *Old value*, select **All other values**, and under *New Value* set *Value* to 0. Then click **Add**. Your window should look like this.

Click **Continue**, then click **OK**. Next, we’ll create the race=black dummy variable. Go to **Transform \(\rightarrow\) Recode into different variables**.

Select `RACE` and use the arrow to shift it over to the *Input Variable* box. Under *Output Variable* change the name to `race_black`, and the label to “Race = Black”. Then click **Change**.

Under *Old value*, we set *Value* to 2 (since black is coded as 2 in the original `race` variable), and under *New Value*, set *Value* to 1. Then click **Add**. Under *Old value*, select **All other values**, and then under *New Value* set *Value* to 0. Then click **Add**. Your window should look like this.

Click **Continue**, then click **OK**. You can see we have created two new variables (`race_white` and `race_black`) in the data editor window.
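The logic of the two dummy variables can be sketched in Python as follows. The race codes are made up; the rule is the same as in the dialogs: the target category becomes 1 and all other values become 0.

```python
race = [1, 2, 3, 1, 2]  # made-up codes: 1 = white, 2 = black, 3 = other

# 1 when the case is in the target category, 0 for all other values.
race_white = [1 if value == 1 else 0 for value in race]
race_black = [1 if value == 2 else 0 for value in race]

print(race_white)
print(race_black)
```

Cases coded “other” are 0 on both dummies, so “other” acts as the reference category when both variables are used together in a model.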

### Descriptive Statistics

Now that we have our variables coded with variable and value labels, we may wish to look at some descriptive statistics. For categorical variables (i.e. variables with distinct groups, or categories, such as race) we will look at frequencies. For interval, or continuous, variables (such as age), we will look at the minimum, maximum, mean, and standard deviation.

To create a frequency table, go to **Analyze \(\rightarrow\) Descriptive Statistics \(\rightarrow\) Frequencies…**

Let’s create a frequency table for race. Select the `RACE` variable from the list and use the arrow to move it to the *Variable(s)* box.

Click **OK**. The frequency table will open in the output doc.

The first table provides the total number of valid and missing responses, if any exist; there are 2,867 responses to the race variable.

The next table provides the frequencies of each response.

- *Frequency* is the number of responses for that category
- *Percent* is the number out of the total responses (valid + missing) times 100%
- *Valid Percent* is the percentage based on non-missing responses (in this case, percent and valid percent are the same because there were no missing observations)
- *Cumulative Percent* is the percent of each response plus the percentage from previous categories
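To see how the four columns relate, here is a small Python sketch that builds them from made-up responses, including one missing value so that *Percent* and *Valid Percent* differ. The cumulative column is accumulated from the valid percentages.

```python
from collections import Counter

responses = [1, 1, 2, 3, 1, None, 2, 1]  # made-up codes; None marks a missing response

valid = [r for r in responses if r is not None]
counts = Counter(valid)
total = len(responses)  # valid + missing
n_valid = len(valid)

table = []
cumulative = 0.0
for category in sorted(counts):
    frequency = counts[category]
    percent = 100 * frequency / total
    valid_percent = 100 * frequency / n_valid
    cumulative += valid_percent
    table.append(
        (category, frequency, round(percent, 1), round(valid_percent, 1), round(cumulative, 1))
    )

for row in table:
    print(row)
```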

Now let’s take a look at age. Go to **Analyze \(\rightarrow\) Descriptive Statistics \(\rightarrow\) Descriptives…**

Select `AGE` and use the arrow to move it into the *Variable(s)* box.

You can also add other statistics (e.g. skew, kurtosis for evaluating whether a distribution is normal) by using the **Options…** button.

Leave the defaults checked for now and click **Continue**. Then click **OK**.

- *N* provides a count of the responses
- *Minimum* is the smallest response
- *Maximum* is the largest response
- *Mean* gives the average value and is used to measure central tendency
- *Std. Deviation* is the standard deviation, which is a measure of dispersion
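The same five statistics can be reproduced with Python's standard library. The ages are made up, and `statistics.stdev` is used because it computes the sample standard deviation (dividing by \(n - 1\)).

```python
import statistics

ages = [23, 35, 41, 58, 62, 29]  # made-up ages

summary = {
    "N": len(ages),
    "Minimum": min(ages),
    "Maximum": max(ages),
    "Mean": statistics.mean(ages),
    "Std. Deviation": statistics.stdev(ages),  # sample SD (n - 1 in the denominator)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```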

### Creating Graphs

There are many different graphs that SPSS can create - enough to fill multiple tutorials. However, we will focus on just two: histograms and bar graphs.

Histograms are used to visualize continuous data by dividing the range of values into “bins” and drawing a bar for each bin whose height is the number of data points that fall in it. Bar graphs are used to visualize categorical data by generating a bar for each category whose height is proportional to the frequency of that category.
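The binning idea behind a histogram can be sketched in Python with a text-only chart. The ages and the 10-year bin width are assumptions chosen for illustration; SPSS picks its own bin width by default.

```python
from collections import Counter

ages = [18, 22, 25, 31, 34, 38, 47, 52, 55, 59]  # made-up ages
bin_width = 10  # assumed bin width for this sketch

# Map each age to the lower edge of its bin, then count cases per bin.
bins = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```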

There are two main ways SPSS can create these visualizations: through the Chart Builder, and through Legacy Dialogs. First let’s cover the legacy dialogs.

Say we wish to create a histogram of `Age`. Go to **Graphs \(\rightarrow\) Legacy Dialogs \(\rightarrow\) Histogram…**

The *Histogram* window will open.

Select `Age` and use the arrow to move it into the *Variable* box.

Then click **OK**.

We can see the data appears to be bimodal with a peak at 30 years and another at approximately 55 years. The mean age is 49.33 years with a standard deviation of 17.905 years.

Now, let’s create a bar graph of race. Go to **Graphs \(\rightarrow\) Legacy Dialogs \(\rightarrow\) Bar…**

Select *Simple*, and *Summaries for groups of cases*. Then click **Define**.

The following window will open.

We can specify what we wish the bars to represent.

- N of cases is the number of cases in each category
- Cumulative N will add the previous categories to each subsequent bar
- % of cases is the number of cases out of the total times 100%
- Cumulative percent adds the previous category percents to each subsequent bar
- Other statistic allows you to specify another value (such as mean, minimum, etc.)

Select *N of cases*.

Category axis is the variable we wish to graph. Select `Race` and use the arrow to move it into the category axis box.

The window should look like this:

Then click **OK**. The following figure will be created:

Most respondents to the survey were white, followed by black, and the fewest respondents were other.

Next, let’s create the same figures using the chart builder. The benefit of the chart builder is that it is a lot more flexible than the legacy dialogs.

Go to **Graphs \(\rightarrow\) Chart Builder…**

When the chart builder window first opens, it will be blank.

There are four main sections in the chart builder window.

- Section A provides the variables in your dataset
- Section B is the chart preview, which will be used to build your chart
- Section C allows you to edit the chart properties, appearance, and options (these will change depending on the type of chart you build)
- Section D is the Gallery, where you select the chart template you are starting with, basic elements (where you can edit the axes and other elements), Groups/Point ID (which can be used to add clustering/paneling/etc), and titles/footnotes (which can be used to add title/footnote elements to your chart)

In the gallery, select **Histogram**, then click and drag the **Simple Histogram** to the chart preview section.

This will automatically insert the simple histogram template into the chart preview window. You can see there are three values that can be edited: *Y-axis?*, *X-Axis?*, and *Filter?*. Click on `Age` in the variables window and drag it to the *X-axis?* box.

Setting a variable in the y-axis allows you to set histogram values rather than having SPSS calculate them. This is not relevant for us so we will leave it as is. The filter value allows you to filter the data by some other variable; again, we will leave this blank.

We can customize the color under *Chart appearance*, change axis labels and chart titles under *Element Properties* and more. However, let’s leave it as the defaults for now and click **OK**.

This creates the same graph as the legacy dialogs did.

Now, let’s create the bar chart using the chart builder. Again, go to **Graphs \(\rightarrow\) Chart Builder…**

This time, select **Bar** in the gallery, then click and drag **Simple Bar** to the chart preview section.

This time, click `Race` in the variables window and drag it to the *X-axis?* box. Then click **OK**.

Again, we get the same chart as we did in the legacy dialogs. It is possible to create most charts using either method; the trade-off is simplicity (legacy dialogs) versus flexibility (chart builder).

### Performing Analyses

Finally, we will go over how to do some basic analyses with SPSS: specifically, calculating correlations and running a linear regression model.

Let’s take a look at the correlations between age and income. Go to **Analyze \(\rightarrow\) Correlate \(\rightarrow\) Bivariate…**

The *Bivariate Correlations* window will open. We will select `Age` and `INCOME` and use the arrow to move them into the *Variables* box. Then select what type of correlation coefficients we want calculated; we will stick with Pearson’s *r*. The test of significance defaults to *Two-tailed*, which is standard.

Click **OK**. You will get the following correlations table:

The values on the diagonal are 1’s, since any variable correlated with itself will be 1. We can see that age and income have a Pearson correlation of \(r = 0.021\), which is not significant at the \(\alpha = 0.05\) level (\(p = 0.251\)).
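For readers who want to see what Pearson’s *r* measures, here is a minimal Python sketch of the formula. The `pearson_r` helper and the data are made up for illustration and do not reproduce the GSS result above; the first call shows why the diagonal of the table is always 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


age = [25, 32, 47, 51, 62]     # made-up values
income = [10, 12, 9, 12, 11]   # made-up values

print(round(pearson_r(age, age), 3))     # a variable correlated with itself
print(round(pearson_r(age, income), 3))  # correlation between the two variables
```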

Let’s run a regression model to predict years of education based on age, sex, and the race dummy variables we created above. Go to **Analyze \(\rightarrow\) Regression \(\rightarrow\) Linear…**

The dependent variable is `EDUC`, so use the right arrow to move that into the *Dependent* box. The independent variables are `Age`, `SEX`, `race_white`, and `race_black` so use the arrow to move those into the *Independent(s)* box.
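Written out, the model specified in this dialog has the following form. The \(\beta\) and \(\varepsilon_i\) symbols are introduced here for illustration; they do not appear in the SPSS dialog.

\[
\text{EDUC}_i = \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{SEX}_i + \beta_3\,\text{race\_white}_i + \beta_4\,\text{race\_black}_i + \varepsilon_i
\]

Because both race dummies are included, the “other” category serves as the reference group: \(\beta_3\) and \(\beta_4\) compare white and black respondents, respectively, with respondents coded “other”.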

Then click **OK**. We get the following output:

The first table lists the variables used in the model, and the method used to enter them into the model (default = Enter).

The next table provides the model summary, which gives the *R*, *R-square* (which is a measure of model fit), *Adjusted R Square* (which is a more conservative estimate of model fit), and *standard error of the estimate* (which is a measure of dispersion).
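The relationship between *R-square* and *Adjusted R Square* can be sketched with the standard adjustment formula, which penalizes the fit for the number of predictors. The function name and the input values below are made up for illustration and are not taken from the output above.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjustment: shrink R-square according to the number of predictors."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


# Made-up inputs: R-square of 0.10 from 100 cases and 4 predictors.
print(round(adjusted_r_squared(0.10, 100, 4), 4))
```

The adjusted value is always at or below the unadjusted one, and the gap shrinks as the number of cases grows relative to the number of predictors.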

The ANOVA table tests whether the model as a whole is significant. SPSS displays *Sig = 0.000*, which means \(p < 0.001\), so we conclude that at least one of the predictors is significantly related to years of education.

Finally, the coefficients table provides the regression model coefficients, along with their significance. Unstandardized coefficients are provided with their standard errors, as well as standardized betas. A t-test is conducted for each coefficient, and the t-value is given. Then the significance is determined (i.e. p-value). We can see that age is a significant factor (\(p=0.009\)), as is race = white (\(p<0.001\)), while sex and race = black are not.

The purpose of this tutorial has been to provide a starting point for using SPSS. For a more in-depth look at specific analyses, see our other SPSS tutorials.