Introduction to Data Visualization

Jacob Culver

Data Visualization, Histograms, Bar Charts, Grouped Bar Charts, Scatter Plots, Kernel Density Plots, Boxplots, LOESS Smoothing

There are many ways to communicate data, but perhaps none is more important than data visualization. It’s not only an important tool for clearly communicating facts and figures; it’s also big business. In June 2019, Salesforce announced that it would acquire the data visualization firm Tableau Software in a deal reported at \$15.3 billion.

In this series of tutorials, we will explore common ways that data is visualized, the benefits and shortcomings of certain visualizations, and how to implement the visualizations in R, SAS, SPSS, and Stata.

Understanding the Different Types of Data

To select the best data visualization for a situation, it is important to first understand the different types of variables. The most important distinction is between quantitative and qualitative (also called categorical) data.

Qualitative or categorical data capture characteristics that are difficult to quantify but can be observed and/or separated into discrete categories. These categories may have no clear ordering (e.g., colors, shapes), in which case the variable is nominal, or they may be ordered (e.g., educational degree), in which case the variable is ordinal. Note that, although ordinal categories have a clear ranking, the distance between these categories is not captured.

Quantitative data are captured on a numeric scale, and the distance between values (or between categories, if the values are grouped) is meaningful.

Continuous data are data that can be made more precise with more precise measurement. For example, we could say that someone is 30 years old, but this measurement would also make sense (and be more precise) if we said the person was 30.25 years old, or 30.258, or 30.2589012… Most variables we graph and model, however, are not truly continuous. Instead, we tend to measure variables using a common interval (age in years, not in microseconds), and hence we typically refer to these variables as interval rather than continuous. A variable measured on a ratio scale is an interval variable with a true zero point, so that ratios of values are meaningful (e.g., a height of 2 meters is twice a height of 1 meter). Such variables typically cannot take negative values.

Univariate Data Visualizations

The main way to visualize a single variable is to look at the variable’s distribution. This can be done using a coordinate plane, where the variable is placed on one axis (generally the $$x$$-axis) and the variable’s count or proportion is measured on the other axis (generally the $$y$$-axis).

Categorical Variables

When we wish to visualize the distribution of a single categorical variable, we turn to bar graphs. A bar graph can depict both nominal and ordinal data. Generally, the categories of the variable are placed on the $$x$$-axis and the count is placed on the $$y$$-axis. Using the mpg dataset from R's ggplot2 package, Figures 1a and 1b show the categories of the variable class on the $$x$$-axis, and the height of each bar depicts the number of cars in the data that fall into that class. In Figure 1b, the top of each bar has been labeled with the count to add clarity to the visualization.

Often it is more helpful to represent the $$y$$-axis as a proportion or percentage (see Figures 1c and 1d) rather than as a raw count. Here the number of observations in each category is divided by the total number of observations to get the proportion (and then multiplied by 100 to get the percent) of the observations that are in that category. In Figure 1d, the count has been added to the top of the bars so the visualization can communicate both the raw N and the relative percentage in the category.
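The arithmetic behind these bar heights is simple enough to sketch in a few lines. The example below is only an illustration in Python (not one of the packages covered in this series), and the list of vehicle classes is made up rather than taken from the mpg data.

```python
from collections import Counter

# Hypothetical vehicle classes (invented values, not the real mpg data).
classes = ["suv", "compact", "suv", "midsize", "compact", "suv", "pickup", "suv"]

counts = Counter(classes)        # raw count per category (bar height in a count plot)
total = sum(counts.values())     # total number of observations

# proportion = count / total; percent = proportion * 100
proportions = {category: n / total for category, n in counts.items()}
percents = {category: 100 * p for category, p in proportions.items()}

for category, n in counts.most_common():
    print(f"{category:8s} n={n}  proportion={proportions[category]:.3f}  "
          f"percent={percents[category]:.1f}%")
```

Whichever scale is plotted, the relative heights of the bars are the same; only the labeling of the $$y$$-axis changes.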

Continuous Variables

Histograms

The equivalent of a bar graph for a continuous variable is a histogram. Like a bar graph, a histogram uses bars to display the number, proportion, or percent of observations in a group. Unlike a bar graph, the bars of a histogram touch because the variable is continuous, and the “groups” simply represent particular ranges of that continuous variable. These ranges are referred to as bins, and each bar depicts the number of observations that fall into its bin. A histogram can also depict the density of each bin. The density is calculated by dividing the proportion of observations in the bin by the width of the bin, so that the areas of the bars sum to 1. Changing the width of the bins (that is, the range of values in each grouping of the variable) can dramatically change the visualization.

The histograms below show the distribution of the variable Sepal.Width from the iris dataset in R. Each figure shows the $$y$$-axis in a slightly different way: as a count, proportion, percentage, or density. To demonstrate how changing the number (and therefore the width) of the bins can change the visualization, each histogram is given twice: the first figure cuts the data into 20 bins, and the second cuts the data into 10 bins.
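To make the binning step concrete, here is a minimal Python sketch that assumes equal-width bins and uses invented measurements rather than the iris data. The helper name `histogram` is ours. Density is taken here as the proportion of observations in a bin divided by the bin width, so that the bar areas sum to 1.

```python
def histogram(values, n_bins):
    """Return (edges, counts, proportions, densities) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value sits on the last edge; keep it in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(values)
    proportions = [c / n for c in counts]
    densities = [p / width for p in proportions]   # bar area = proportion
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts, proportions, densities

# Hypothetical measurements (not the real Sepal.Width values).
measurements = [2.0, 2.2, 2.5, 2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.5, 3.8, 4.4]

for n_bins in (3, 6):
    edges, counts, proportions, densities = histogram(measurements, n_bins)
    bin_width = edges[1] - edges[0]
    area = sum(d * bin_width for d in densities)
    print(f"{n_bins} bins: counts={counts}, total area={area:.3f}")
```

Running the loop with different values of `n_bins` shows how the same data can look smoother or more jagged depending on the bin width.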

Kernel Density Plots

Another helpful way to visualize the distribution of a single continuous variable is to use a kernel density plot. A density plot is a smoothed histogram, and the amount of smoothing can be controlled via a bandwidth setting. The kernel density estimator is given by:

$\hat{f}_{h}(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h}(x-x_{i}) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-x_{i}}{h}\right)$

where $$K$$ is the kernel, $$n$$ is the number of observations, and $$h$$ is a smoothing parameter known as the bandwidth. The kernel is a function, and most software packages offer a list of kernels to choose from. In the examples below we used the “Gaussian” kernel, i.e., the normal pdf. The bandwidth, $$h$$, has a large impact on the estimate and therefore a large impact on the visualization. Figure 3a below shows kernel density plots for a sample of 500 random draws from a normal distribution. The black curve shows the true distribution. By comparison, the red curve, which uses a bandwidth of $$h=0.1$$, is undersmoothed. The blue curve, which uses a bandwidth of $$h=2$$, is oversmoothed. There are a few “rules of thumb” for choosing an optimal bandwidth, and most software will select the bandwidth for you using one of these rules.
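The estimator can be written almost directly from the formula. The sketch below is an illustration in Python with a Gaussian kernel. The sample is simulated from a standard normal distribution with an arbitrary seed, and the function names are ours, so it will not reproduce Figure 3a exactly.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal pdf, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n h)) * sum of K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

random.seed(0)  # arbitrary seed, only for repeatability
sample = [random.gauss(0, 1) for _ in range(500)]

# Compare a small, a moderate, and a large bandwidth at a few points.
for h in (0.1, 0.4, 2.0):
    estimates = [round(kde(x, sample, h), 3) for x in (-2.0, 0.0, 2.0)]
    print(f"h={h}: density at x=-2, 0, 2 -> {estimates}")
```

With a large bandwidth the estimate at the center is pulled down and the tails are pulled up, which is the flattening seen in an oversmoothed curve.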

It is also common to show a histogram of a variable overlaid by a kernel density plot. Figure 3b shows what this might look like using our histogram of Sepal.Width from earlier.

Bivariate Data

Two Categorical Variables

As we discussed earlier, data on categorical variables are typically summarized with counts, percentages, or proportions on the $$y$$-axis and the discrete categories on the $$x$$-axis. When we introduce a second categorical variable, there are multiple options for where to put it.

Let’s use an example to help us visualize where the second categorical variable comes from and the options we have for visualizing both variables simultaneously. Figure 4a shows a simple bar graph of the number of cars in the mtcars data that have 4-, 6-, or 8-cylinder motors.

Now what happens if we want to introduce a second categorical variable, e.g., transmission type? Figure 4b shows the previous bar graph, but now with the bars divided by transmission type. This is a Stacked Bar Graph.

Stacked bar graphs give an intuitive sense of whether the proportion of one variable is consistent throughout the categories of another variable. In this case we can see that the proportion of cars with manual transmissions is much higher for 4-cylinder motors than it is for either of the other two types of motors. The drawback of the stacked bar graph is that it doesn’t communicate the raw numbers the way that we might want it to. For example, the answer to the question “how many 4-cylinder cars have automatic transmissions?” is not readily apparent on this graph.

We can remedy this ambiguity by adding counts to each section of each bar (see Figure 4b: Stacked Bar Graphs with Counts). The benefit is the clarity that the labels bring, but we should be careful as we build data visualizations not to overcrowd the graphic with too much information if it can be avoided. In the end, it is up to the creator to decide if a visualization communicates the necessary information in the clearest and most aesthetically pleasing way.

Figure 4c shows a Grouped Bar Graph (both with and without count notations). The grouped bar graph simply “unstacks” the segments of the stacked bar graph and puts them next to one another on the $$x$$-axis in a “group”.

Now the visualization more readily conveys exact numbers, and while the proportional make-up of each group is not quite as intuitive as it was in the stacked bar graph, it is still fairly apparent that 4-cylinder motors have a larger proportion of manual transmissions than the other types of motors.
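Both the stacked and the grouped bar graph are drawn from the same two-way table of counts. As a rough sketch in Python, the records below are invented (cylinders, transmission) pairs, not rows from mtcars.

```python
from collections import Counter

# Hypothetical (cylinders, transmission) records; not the real mtcars rows.
cars = [
    (4, "manual"), (4, "manual"), (4, "automatic"),
    (6, "manual"), (6, "automatic"), (6, "automatic"),
    (8, "automatic"), (8, "automatic"), (8, "automatic"), (8, "manual"),
]

cell_counts = Counter(cars)                     # one count per segment / grouped bar
bar_totals = Counter(cyl for cyl, _ in cars)    # full height of each stacked bar

for cyl in sorted(bar_totals):
    for transmission in ("automatic", "manual"):
        n = cell_counts[(cyl, transmission)]
        share = n / bar_totals[cyl]             # share of the stacked bar
        print(f"{cyl}-cylinder, {transmission:9s}: n={n}, share of bar={share:.2f}")
```

A stacked graph emphasizes the `share` column, while a grouped graph emphasizes the raw `n` for each cell.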

A challenge can arise when the variable we place on the $$x$$-axis has drastically different counts across its categories. To illustrate this problem, consider this question: Can knowing whether or not someone is wearing a hat help predict that person’s sex? Figure 4d shows a grouped bar chart of 2,000 students, 1,000 boys and 1,000 girls, and whether they were observed wearing a hat. In this plot it is clear that far fewer students wore a hat than did not, and this imbalance in our independent variable makes the bars for the “Hat” group hard to compare.

The bars in Figure 4e show what percent of the “No Hat” group were girls and what percent were boys, and similarly what percent of the “Hat” group were girls and what percent were boys. The bars in each group of our independent variable (Hat vs. No Hat) sum to 100%. From this graph we can tell that, of the people wearing hats, far more were boys. Our visualization lacks the specificity of some of the other bar graphs, but it now gives us insight into our original question while helping to balance a visualization of unbalanced data.
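The within-group percentages can be sketched as follows. This Python example keeps the 1,000 boys and 1,000 girls from the text, but the split between “Hat” and “No Hat” is invented for illustration and does not come from the figures.

```python
# Hypothetical counts: 1,000 boys and 1,000 girls in total; the hat numbers are invented.
counts = {
    "Hat":    {"boys": 150, "girls": 50},
    "No Hat": {"boys": 850, "girls": 950},
}

def percent_within_group(table):
    """Convert each group's counts to percentages that sum to 100 within the group."""
    result = {}
    for group, cells in table.items():
        group_total = sum(cells.values())
        result[group] = {key: 100 * n / group_total for key, n in cells.items()}
    return result

percents = percent_within_group(counts)
for group, cells in percents.items():
    summary = ", ".join(f"{key}: {pct:.1f}%" for key, pct in cells.items())
    print(f"{group:6s} (n={sum(counts[group].values())}) -> {summary}")
```

Printing the group size next to the percentages keeps the imbalance between the groups visible even after normalizing.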

Taking the visualization a step further, we can add the explicit counts to the top of each bar to add specificity.

Two Continuous Variables

When we visualize two continuous variables, scatterplots are the go-to graphing technique. Scatterplots allow us to visualize what type of relationship, if any, exists between the two variables. Figure 5a below shows the relationship between a car’s weight and its fuel efficiency, measured in miles per gallon. From this plot we can clearly see there is a relationship between these two variables. Very generally, a car that is heavier is less fuel efficient.

Smoothing techniques can be used with scatterplots to help us better understand the nature of the relationship between the two variables. Data smoothing attempts to capture and visualize important patterns by diminishing the noise in the data. When we use a smoothing technique, a curve of our choosing is fit over the scatterplot to help us judge whether it describes the data well. The curve may optionally be surrounded by a confidence interval. Figure 5b below shows the same scatterplot from 5a, but this time fit with a linear, quadratic, and cubic smoother, each with a 95% confidence interval.
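As a small illustration of the simplest of these smoothers, the Python sketch below fits a straight line by ordinary least squares. The weight and mpg values are invented, the function name `fit_line` is ours, and the confidence band shown in the figures is not computed here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical car weights (in 1,000 lbs) and fuel efficiency (mpg).
weights = [1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.4, 5.2]
mpg = [33.0, 30.1, 25.8, 22.4, 19.9, 17.5, 15.0, 11.2]

intercept, slope = fit_line(weights, mpg)
residuals = [y - (intercept + slope * x) for x, y in zip(weights, mpg)]

print(f"fitted line: mpg = {intercept:.2f} + ({slope:.2f}) * weight")
print("residuals:", [round(r, 2) for r in residuals])
```

A pattern in the residuals (for example, positive at both ends and negative in the middle) is a hint that a curved smoother may describe the data better than a straight line.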

The three fits in Figure 5b are considered parametric smoothing strategies because the shape of the relationship was assumed, and the curves were fit according to those assumptions. LOESS, which stands for locally estimated scatterplot smoothing, is a nonparametric smoothing technique. LOESS marries least squares regression with the flexibility of nonlinear regression. It does this by estimating the least squares fit to local subsets of the data to build a function that describes the relationship as a whole. Using LOESS, an analyst does not have to define a relationship for the entire range of the data, hence it is a nonparametric technique.

A user-defined input to the LOESS technique known as the “bandwidth” (often called the span) determines the size of the subsets of data used for each least squares fit. The bandwidth is the fraction of the total number of data points to be used in each local fitting. An algorithm is applied that determines which observations are used and how much they are weighted when generating the line at each point along the range of data. A smaller bandwidth uses fewer points in each fitting and therefore creates a wigglier line. Figure 5c shows the same scatterplot (car weight vs. mpg), this time using the LOESS technique. The blue curve was fit using a bandwidth of 0.75 and the red curve with a bandwidth of 0.5.
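The local fitting step can be sketched in Python as below. This is a simplified illustration, not the exact algorithm any particular package uses: it assumes a local straight-line fit with tricube weights on the nearest points, omits the robustness iterations some implementations add, and uses invented weight and mpg values. The function name `loess_at` and the parameter name `span` are ours.

```python
import math

def loess_at(x0, xs, ys, span):
    """Weighted local linear fit evaluated at x0 (one simplified LOESS step)."""
    n = len(xs)
    k = max(2, min(n, math.ceil(span * n)))        # number of points in the local subset
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
    # Tricube weights: close points count most, the farthest neighbour gets 0.
    weights = [(1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
    total = sum(weights)
    if total == 0:                                  # all neighbours equally far away
        weights, total = [1.0] * k, float(k)
    lx = [xs[i] for i in nearest]
    ly = [ys[i] for i in nearest]
    mean_x = sum(w * x for w, x in zip(weights, lx)) / total
    mean_y = sum(w * y for w, y in zip(weights, ly)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, lx))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, lx, ly))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# Hypothetical car weights (in 1,000 lbs) and mpg values.
weights_klb = [1.6, 1.9, 2.2, 2.5, 2.8, 3.2, 3.4, 3.6, 3.8, 4.1, 5.3, 5.4]
mpg = [30.4, 30.4, 32.4, 22.8, 19.7, 21.4, 18.1, 14.3, 15.5, 13.3, 10.4, 14.7]

for span in (0.75, 0.5):
    fitted = [round(loess_at(x, weights_klb, mpg, span), 1) for x in weights_klb]
    print(f"span={span}: {fitted}")
```

Comparing the two printed sequences shows how the smaller value follows local bumps in the data more closely.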

One Categorical Variable and One Continuous Variable

A boxplot, sometimes called a box and whisker plot, is great for visualizing the central tendency and spread of a continuous variable across the levels of a categorical variable. Like a bar graph, the levels of our categorical variable will be placed on one axis, and each level of the variable will have one box. Figure 6a shows the box plots of the Sepal Width across the three types of iris species in the dataset iris.

The line in the middle of the box represents the median value. The box represents the interquartile range (IQR) of the data, with the horizontal boundaries of the box representing the 25th and 75th percentiles (known as Q1 and Q3, respectively). The whisker extending below the box connects it to the smallest data point that is not less than $$Q1 - 1.5 \times IQR$$; if no point falls below that limit, this is simply the minimum. The whisker extending above the box connects it to the largest data point that is not greater than $$Q3 + 1.5 \times IQR$$; if no point falls above that limit, this is simply the maximum. Any points beyond the whiskers can be considered outliers.

The boxplot can also be turned horizontally to show the categorical variable on the $$y$$-axis and the continuous variable on the $$x$$-axis. This can be useful if your dependent variable is categorical and your independent variable is measured on an interval scale. Figure 6b shows this version of the boxplot with the species of iris again being the categorical variable and this time Sepal Length as the continuous variable. In both boxplots we can clearly see that there is a difference in the central tendencies and even the spread of the data across different species of irises. If we were to conduct an ANOVA to test whether mean sepal width (or sepal length) differs among the species of irises, the boxplot would help us illustrate the conclusion of the test.
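The pieces of a boxplot can be computed with a short Python sketch. The data values are invented, and the quartiles use the standard library's "inclusive" method, which is one of several quartile conventions, so other software may report slightly different values of Q1 and Q3 for the same data.

```python
import statistics

def boxplot_stats(values):
    """Median, quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),     # smallest point not below the lower fence
        "whisker_high": max(inside),    # largest point not above the upper fence
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical measurements with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 20]
stats = boxplot_stats(data)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this example the upper whisker stops at 8 and the value 20 is drawn as a separate point, because it lies above Q3 plus 1.5 times the IQR.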

Trivariate Data

While we live in a three-dimensional world, the majority of the ways that we get information are two-dimensional (i.e. screens and print). This presents a bit of a challenge when trying to visualize data that has three variables (trivariate data). We can overcome this challenge in a number of different ways.

Figures 7a and 7b show two ways to represent the relationship between the continuous variables of sepal width and sepal length while considering the categorical variable of species. In Figure 7a all data points are represented on the same coordinate grid with a scatterplot of the two continuous variables and each level of species represented by a different color. In Figure 7b the data are faceted by species, with each species shown in its own scatterplot of the two continuous variables.

In both figures we can compare how sepal width and sepal length relate to one another across species of irises. Setosa irises, for instance, have a grouping that suggests they have sepals that are generally shorter and wider than the other two types of irises.

When creating a data visualization, consideration should be given to readers who are colorblind. This article from the website “towards data science” goes further in depth about the issue and has specific and simple steps you can take to make a visualization easier to comprehend for someone who is colorblind. One of the simple steps we can take to make the previous example easier to interpret is to include both colors and shapes.

Here is another look at Figure 7a. This time we have used one of the colorblind friendly palettes suggested in the article and we have used different shapes to distinguish the species of irises.

To show three continuous variables in one plot, a common strategy is to convert one of the variables into a categorical variable. Figures 7c and 7d show the relationship between the variables sepal width, sepal length, and petal length. To create the plots, the variable petal length was converted into a categorical variable by using quartiles. Again we used colors/shapes and facets to distinguish the third variable.
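Converting a continuous variable into quartile groups can be sketched as follows. This Python example uses invented petal lengths rather than the iris data, the function name `quartile_groups` is ours, and the cut points use the standard library's "inclusive" quartile method, so they may differ slightly from what other software produces.

```python
import statistics
from bisect import bisect_left
from collections import Counter

def quartile_groups(values):
    """Label each value Q1..Q4 according to the quartile interval it falls in."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")
    # bisect_left counts the cut points strictly below v, giving 0..3.
    return [f"Q{bisect_left(cuts, v) + 1}" for v in values]

# Hypothetical petal lengths (not the real iris measurements).
petal_length = [1.4, 1.5, 4.5, 4.7, 5.1, 5.9, 1.3, 6.1]
groups = quartile_groups(petal_length)

for value, group in zip(petal_length, groups):
    print(f"petal length {value} -> {group}")
print(Counter(groups))
```

The resulting labels can then be mapped to colors, shapes, or facets in the same way as any other categorical variable.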

Conclusion

Great data visualization is both a skill and an art, requiring both expertise and creativity. While many decisions are a matter of the creator's preference, there are also many well-established practices that can make data visualization easier. Understanding the data and the information you wish to convey is at the heart of making clear and concise visualizations. Before you begin, it is essential to know how many variables you want to visualize and whether those variables are quantitative or qualitative, continuous or discrete, ordinal or nominal. The answers to these questions will guide you to an appropriate type of graph. From there it is up to you to make the decisions that communicate the information in the best possible way.