Hierarchical linear models — also known as mixed models, multilevel models, and random effects models — are now common in the social sciences. Their popularity stems from the frequency with which analysts encounter data that are hierarchically structured in some manner. Employees may be nested within firms, students within schools, or voters within districts. There may even be multiple observations taken on a single individual that can be considered to be nested within that person. When data are measured at different levels, standard assumptions of independence and homoskedasticity are violated. Hence, there is need for a more sophisticated modeling strategy.

Unfortunately, hierarchical linear models (HLMs) are also among the most misunderstood statistical methods researchers commonly employ. Multilevel modeling grew out of developments in the analysis of experiments, in which researchers incorporate random effects (to be defined below) to account for interventions whose treatment categories are not exhaustive. Many textbooks that demonstrate how to estimate multilevel models with general-purpose software therefore use examples from experimental designs. Unfortunately, this confuses things for those who approach nested data from a background in observational studies. Thus a political scientist analyzing time-series cross-sectional data (for example, data taken on a set of countries repeatedly over many years) may not realize she is using exactly the same method as the psychologist studying a treatment administered by different therapists or the sociologist studying state-level differences in attitudes toward sexuality.

This brief tutorial thus outlines the motivation for multilevel models in a manner that seeks to clarify the relevant terminology for researchers whose background is in observational studies.

### Between-Subjects Designs

##### Fixed Effects Models

In a between-subjects experimental design, participants are assigned to different treatment groups such that each individual is exposed to only one level of the manipulation. The idealized case is known as a completely randomized (CR) design, because subjects are assigned completely at random to a specific treatment level. For example, subjects with high blood pressure may be randomly assigned to receive an experimental drug, a drug already on the market, or a placebo. The model for this one-way ANOVA (an ANOVA with only one treatment) is:

\[y_{ij} = \mu + \alpha_{j} + e_{ij}\] where the score on the dependent variable for individual i in the jth treatment group is equal to the grand mean of the sample (\(\mu\)), the treatment effect (\(\alpha_j\)), and an individual error term (\(e_{ij}\)). This notation is different from regression in that there are no beta weights to estimate. Instead, each \(\alpha_j\) represents the expected change in the group mean for treatment *j*, and the \(e_{ij}\) represent additional subject-specific deviations from the expected outcome. In general, some kind of constraint is put on the alpha values, such as that they sum to zero, so that the model is identified. In addition, the investigator assumes that the errors are independent and normally distributed with constant variance. (Note that an equivalent way of estimating the model would be to drop the \(\alpha_j\) notation, add dummy variables for J – 1 treatments, and utilize least squares regression.)
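The equivalence between the one-way ANOVA and dummy-variable regression noted above can be checked directly. The sketch below uses simulated, purely illustrative blood-pressure data for the three groups; the group means and sample sizes are assumptions for the demo, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative blood-pressure change for three arms (assumed values):
# experimental drug, existing drug, placebo; 20 subjects each.
groups = [rng.normal(loc=mu, scale=5.0, size=20) for mu in (-8.0, -5.0, 0.0)]
y = np.concatenate(groups)

# Classical one-way ANOVA omnibus F test.
f_anova, p_anova = stats.f_oneway(*groups)

# Equivalent least-squares regression: intercept plus J - 1 dummy variables.
n, J = len(y), 3
X = np.column_stack([
    np.ones(n),
    np.repeat([1, 0, 0], 20),   # dummy for group 1
    np.repeat([0, 1, 0], 20),   # dummy for group 2 (group 3 is the reference)
])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# F statistic recovered from the regression sums of squares.
ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum(resid ** 2)
f_reg = ((ss_total - ss_resid) / (J - 1)) / (ss_resid / (n - J))

print(f_anova, f_reg)  # the two F statistics agree
```

The regression's model sum of squares is exactly the ANOVA between-groups sum of squares, which is why the two F statistics coincide.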

The primary test asks whether there is any difference among the group means:

\(H_{0}: \mu_1 = \mu_2 = ... = \mu_j\)

\(H_{1}: \mu_j \ne \mu_{j'}\) for at least one pair \(j \ne j'\)

where \(\mu_{j} = \mu + \alpha_{j}\). The researcher will first examine an omnibus F-test to determine whether any of the group means differ. If the null hypothesis of no differences is rejected, the next step is to carry out pre-planned or post-hoc contrasts to determine which specific group means are not equal. Estimating contrasts requires comparing (usually) two means while adjusting alpha levels to account for the fact that the researcher is conducting multiple tests.
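One common (and simple) alpha adjustment for such follow-up contrasts is the Bonferroni correction, which divides the nominal alpha by the number of comparisons; Tukey's HSD is a frequently used alternative. A minimal sketch, with assumed illustrative data:

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(7)

# Assumed illustrative data: three treatment arms, 20 subjects each.
data = {
    "new_drug": rng.normal(-8.0, 5.0, 20),
    "old_drug": rng.normal(-5.0, 5.0, 20),
    "placebo":  rng.normal(0.0, 5.0, 20),
}

pairs = list(combinations(data, 2))
alpha = 0.05
alpha_adj = alpha / len(pairs)   # Bonferroni: divide alpha by the number of tests

for a, b in pairs:
    t, p = stats.ttest_ind(data[a], data[b])
    print(f"{a} vs {b}: t={t:.2f}, p={p:.4f}, "
          f"reject at adjusted alpha={alpha_adj:.4f}: {p < alpha_adj}")
```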

The one-way ANOVA is easily extended to the case in which there is a second treatment. For the blood pressure example, subjects may first be assigned to a pill treatment and then assigned to either engage in regular supervised exercise or not. When the researcher tests for the individual effects of each factor as well as their interaction, the design is said to be **fully factorial**. When the additional criterion of random assignment to each treatment is met, the experiment is said to be a **completely randomized factorial** (CRF) design.

CR and CRF designs represent the most basic and easily analyzed experiments. The treatments in these examples are considered to be **fixed effects** because the researcher is interested in the effect of the specific levels of the factors. If the experiment were replicated, the very same manipulations would appear. More complicated designs, those that are the building blocks for multilevel regression models, also incorporate random effects.

##### Random Effects and Mixed Models

In many situations, the investigator may wish to acknowledge a possible effect coming from a factor whose specific, fixed values are not of interest. Instead, the levels that are present in the experiment represent a random sample from a larger population. For example, in a study looking for the effect of a new drug on blood pressure, different doctors may prescribe the pill to different patients. The effect of a specific physician is not of theoretical interest, yet the investigator may suspect that different health care providers can contribute to a patient’s outcome. Because the doctors prescribing the drugs are drawn randomly from a larger population, their impact on the outcome is considered to be a **random effect**. In a replication of the experiment, different physicians will likely be involved. The equations in the previous section are called **fixed effects models** because they do not contain any random effects. A model that contains only random effects is a **random effects model**. Often when random effects are present there are also fixed effects, yielding what is called a **mixed** or **mixed effects model**. Thus software procedures for estimating models with random effects — including multilevel models — generally incorporate the word MIXED into their names.

One convention when writing mixed effects ANOVA models is to use Greek letters for the fixed factors and Latin characters for random effects. A mixed model that includes an interaction between the fixed and random effects would be:

\[y_{ijk} = \mu + \alpha_j + b_k + (\alpha b)_{jk} + e_{ijk}\]

Note that random effects are not directly estimated. Rather, they are treated as random variables with a mean of zero and unknown variance \(\sigma^2\), which is referred to as a **variance component**. When multiple random effects are present, the assumption is that they are distributed multivariate normal with a mean of zero and a covariance matrix *G*. The elements along the diagonal correspond to the variance components of each random effect, and the off-diagonals correspond to their covariances. Because the purpose of including random effects is to generalize to a larger population, the specific group means observed in the experiment are not of interest. Instead, the null hypothesis corresponding to a particular random effect is that its variance component equals zero:

\(H_{0}: {\sigma_b}^2 = 0\)

\(H_{1}: {\sigma_b}^2 > 0\)
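A variance component can be illustrated by simulation. The sketch below uses the classical expected-mean-squares (method of moments) estimator for a balanced one-way random-effects design, \(\hat{\sigma}_b^2 = (MSB - MSW)/n\); modern software instead uses maximum likelihood or REML, and all the numbers here are assumed for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced random-effects design: K doctors (random effect), n patients each.
# True variance components (assumed for the demo):
K, n = 30, 10
sigma_b2, sigma_e2 = 4.0, 1.0

b = rng.normal(0.0, np.sqrt(sigma_b2), size=K)              # doctor effects
y = b[:, None] + rng.normal(0.0, np.sqrt(sigma_e2), (K, n)) # patient outcomes

group_means = y.mean(axis=1)
grand_mean = y.mean()

# Expected mean squares: E[MSB] = n*sigma_b^2 + sigma_e^2, E[MSW] = sigma_e^2.
msb = n * np.sum((group_means - grand_mean) ** 2) / (K - 1)    # between MS
msw = np.sum((y - group_means[:, None]) ** 2) / (K * (n - 1))  # within MS

sigma_e2_hat = msw
sigma_b2_hat = (msb - msw) / n

print(sigma_b2_hat, sigma_e2_hat)  # estimates near the true values 4.0 and 1.0
```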

One type of design where random effects are often used occurs when investigators employ **blocking**, which is the experimental analog to stratification in survey research. Experimenters identify homogeneous groupings within the sample and separate subjects into these categories. The treatment of interest is then applied to individuals in a completely randomized manner so that every treatment level appears in each block. The purpose is to minimize nuisance variation that may be due to the grouping variable, thereby producing a more powerful test of the treatment. The blocks may be included in the model as a fixed effect or a random effect, depending on whether all possible levels of the blocking variable are present. If the experimenter first blocked on gender, for example, the blocking factor would be fixed because all possible levels are present. If the experimenter blocked on city of birth, the factor would be random because a replication could plausibly include other towns. Experiments with one treatment and one block are called **randomized block** (RB) designs.

### Within-Subjects Designs

##### Univariate ANOVA Approach

Oftentimes researchers in the lab find it advantageous to expose the same subject to all treatment levels. For example, subjects in a focus group may be given pictures of six different presidential candidates and asked to rate their affective response to each. The resulting mixed model is:

\[y_{jk} = \mu + \alpha_j + b_k + e_{jk}\]

where \(\alpha_j\) refers to the fixed effect of exposure to picture *j* and \(b_k\) is a random effect representing person *k*. The investigator is not interested in the specific individuals involved in the experiment, and in a replication others would likely be present.

While within-subjects designs provide more powerful tests, there are limitations to analyzing them with ANOVA. It can be shown that, for the univariate ANOVA approach to be valid, differences between treatment levels must be equally variable (an assumption known as sphericity). Because this assumption is rarely met in practice, the literature on repeated measures experiments focuses on finding appropriate corrections, either through adjusting the degrees of freedom of the F-test or by turning to a multivariate analysis of variance (MANOVA).

Despite the complications that within-subjects designs pose, the most common experimental setup includes both within- and between-subjects factors. This approach is sometimes erroneously called a mixed design; a preferable name is *split-plot* design (because in early applications the repeated measures were taken on plots of land rather than individuals). The repeated measures can be thought of as being nested within an individual in the same manner that voters, for example, can be thought of as being nested within a country. The same data modeling considerations — viz. error non-independence and heteroskedasticity — are present in both contexts.

##### Covariance Pattern Models

Longitudinal studies are a variation on the split-plot design, containing both a within-subjects factor (time) and a between-subjects factor (treatment). The complication is that the investigator generally wants to say something more specific about the correlations between errors.

A generic model for longitudinal data is the following:

\[y_{ijk} = \mu + \alpha_j + \beta_k + (\alpha \beta)_{jk} + e_{ijk} \]

where \(\mu\) is the overall average outcome, \(\alpha_j\) is the fixed treatment effect, \(\beta_k\) is the (in this case) fixed effect of time, and \((\alpha \beta)_{jk}\) is the time by treatment interaction. The error term \(e_{ijk}\) can be summarized by a block diagonal covariance matrix **R**, in which each block corresponds to a different individual. The structure of the blocks in **R** reflects the researcher’s assumptions about the pattern of error correlations within individuals.

The most general assumption the investigator can make about **R** is that it is unstructured, meaning that the variances and covariances are all freely estimated from the data. Assuming no structure places the fewest restrictions on the model, but the presence of many repeated measures can cause the number of estimated parameters to quickly grow quite large. Another alternative is to specify a first-order autoregressive [AR(1)] matrix, which assumes all variances to be equal and all covariances to decay exponentially as the temporal distance increases. Some of the most commonly used covariance structures are listed below; a parsimonious yet well-fitting pattern must be specified for inferences about the fixed effects to be accurate.
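The AR(1) pattern and the block-diagonal layout of **R** are easy to sketch directly. In the example below the variance, correlation, and dimensions are all assumed for illustration, and identical per-subject blocks are stacked with a Kronecker product:

```python
import numpy as np

def ar1_block(sigma2: float, rho: float, t: int) -> np.ndarray:
    """AR(1) covariance: equal variances, correlations decay as rho**|i-j|."""
    lags = np.abs(np.subtract.outer(np.arange(t), np.arange(t)))
    return sigma2 * rho ** lags

# One subject measured at 4 equally spaced occasions (assumed values).
block = ar1_block(sigma2=2.0, rho=0.5, t=4)

# For m subjects with identical blocks, R is block diagonal: zeros between
# subjects reflect the assumption that errors are independent across people.
m = 3
R = np.kron(np.eye(m), block)

print(block)
print(R.shape)  # (12, 12)
```

Note that the entire AR(1) block is governed by just two parameters (\(\sigma^2\) and \(\rho\)), whereas an unstructured 4 × 4 block would require ten.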

One drawback to the approach of modeling the covariance structure of **R** in a longitudinal design is that, for many of the candidate patterns, the spacing between observations must be constant. This is clear, for example, when an AR(1) structure is assumed: if the timing of data collection varies across individuals, the correlations will decay at different rates for different subjects. An additional problem is that subjects may vary more than the previous model allows, since it assumes a common intercept and time slope for all individuals.

The solution to these problems is to introduce a random effect representing the subject, and to additionally treat time as a random instead of a fixed effect. As in the previous mixed models, these random effects are assumed to be normally distributed with a mean of zero and covariance matrix **G**. In addition, **G** and **R** are assumed to be independent. It can be shown that including both a random intercept and a random time slope induces correlation among the repeated measurements in the model and eliminates the need to explicitly define a structure for **R**. Furthermore, treating time as a random effect allows the covariances of the repeated measures to explicitly become functions of time, which makes it possible to accurately model outcomes in longitudinal studies when observations are not equally spaced.
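The induced covariance can be written out explicitly. For a random intercept and random time slope with covariance matrix **G**, the marginal covariance of the repeated measures is \(ZGZ' + \sigma^2 I\), where *Z* holds a column of ones and the measurement times; each off-diagonal entry is then \(g_{00} + g_{01}(t + s) + g_{11}ts\), a function of the two time points. The values below are assumed for illustration:

```python
import numpy as np

# Unequally spaced measurement times for one subject (assumed).
times = np.array([0.0, 0.5, 2.0, 3.5])
Z = np.column_stack([np.ones_like(times), times])  # random intercept + slope

# Assumed G matrix (variance components and their covariance) and error variance.
G = np.array([[4.0, 0.5],
              [0.5, 1.0]])
sigma_e2 = 1.0

# Marginal covariance of the repeated measures implied by the random effects.
V = Z @ G @ Z.T + sigma_e2 * np.eye(len(times))

# Each off-diagonal covariance is an explicit function of the time points:
# Cov(y_t, y_s) = g00 + g01*(t + s) + g11*t*s   (for t != s)
t, s = times[1], times[2]
cov_formula = G[0, 0] + G[0, 1] * (t + s) + G[1, 1] * t * s

print(V)
print(V[1, 2], cov_formula)  # the matrix entry matches the formula
```

Because the covariances depend only on the recorded times, nothing in this construction requires the observations to be equally spaced.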

Mixed model commands in most statistical packages have options for specifying a structure for the **R** and **G** matrices, so it is important to know what these options are doing. In general, users will only be concerned with the structure for **R** if estimating a model for longitudinal data that does not include a random effect for time. On the other hand, many applications of mixed models assume an unstructured **G** matrix, which means all variance and covariance components are estimated from the data. This is not the default, however, and the user must tell the software that **G** is unstructured. Still, it is not unheard of for researchers to fit a hybrid model to longitudinal data that makes more specific assumptions about the structure of both matrices.

Most importantly, the utility of mixed models extends beyond controlled, randomized repeated measures designs. The inclusion of random effects can account for violations of error assumptions even in the context of cross-sectional observational designs. The canonical example is studying test scores of students who are clustered in different schools, where one student’s performance is likely correlated with the performance of another student.
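In the clustered cross-sectional case, the degree of dependence is often summarized by the intraclass correlation, the share of total variance attributable to the grouping level, which under a random-intercept model equals the correlation between two students in the same school. The variance components below are assumed for the demo:

```python
import numpy as np

# Assumed variance components for a random-intercept (school effect) model.
sigma_school2 = 2.0   # between-school variance
sigma_e2 = 6.0        # within-school (student-level) variance

# Intraclass correlation: share of total variance due to schools.
icc = sigma_school2 / (sigma_school2 + sigma_e2)
print(icc)  # 0.25

# Check against a simulation of many classmate pairs sharing a school effect:
rng = np.random.default_rng(1)
n_schools = 200_000
b = rng.normal(0.0, np.sqrt(sigma_school2), n_schools)
y1 = b + rng.normal(0.0, np.sqrt(sigma_e2), n_schools)  # student 1
y2 = b + rng.normal(0.0, np.sqrt(sigma_e2), n_schools)  # classmate
print(np.corrcoef(y1, y2)[0, 1])  # close to 0.25
```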

The next section describes the different notations used for mixed models and how to translate between them.