How to Analyze Survey Data

Jeremy Albright

When students first study statistics, they are generally taught formulas which assume that every observation had an equal probability of being selected into the sample. The reason for this assumption is that, under simple random sampling, estimation of statistics of interest (e.g. means or regression coefficients) is straightforward. In addition, assuming equal selection probabilities makes it possible to calculate a statistic’s variance (a measure of the precision of an estimate) without much difficulty. The problem is that a substantial amount of the data analyzed in the social sciences comes from surveys that are collected in a manner that intentionally violates the assumption of simple random sampling. Failure to take the survey design into account can lead to biased point estimates, and even when the point estimates are correct, the standard errors around them may be misleading.

In particular, two sampling procedures, used alone or in combination, commonly require adjustments to estimates.

  • Stratification: In order to guarantee sufficient representation of particular subgroups in a population, survey researchers often first separate elements into specific groups, called strata, and then randomly sample from each stratum's list. For example, a survey designer may stratify by race to ensure a sufficient representation of minority cases. This would involve creating one stratum for white respondents, one for Black respondents, and one for Hispanic respondents before randomly drawing observations from each list. Doing so guarantees that Black and Hispanic respondents will be present in the sample even if their number is small in the population. One added benefit of stratification is that it usually leads to smaller standard errors (more precise estimates).

  • Clustering: Cost considerations often lead to the use of cluster sampling. If the sampling frame (the set of units from which a sample will be drawn) consists of all adults in the United States, and the data collection method will involve face-to-face interviewing, it will not be feasible to send interviewers to all corners of the country. Instead, the survey team will identify clusters of proximate observations (such as counties or city blocks) and either interview all constituent elements or randomly sample observations from within each grouping. Although cluster sampling is used to save resources, it comes with the analytic cost of larger standard errors (less precise estimates).
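
The opposite effects of the two procedures on precision can be sketched in a few lines of Python. This is a minimal illustration rather than a substitute for survey software. The function names and the toy numbers are hypothetical, the stratified standard error assumes simple random sampling within each stratum and ignores the finite population correction, and the clustering formula is the common approximation 1 + (m - 1) * rho for equal-sized clusters, where m is the cluster size and rho is the intraclass correlation.

```python
from math import sqrt
from statistics import mean, variance


def stratified_mean_se(strata):
    """Estimate a population mean and its standard error from a stratified sample.

    `strata` is a list of (population_size, sample_values) pairs, one per stratum.
    Assumes simple random sampling within each stratum and ignores the fpc.
    """
    total = sum(pop_size for pop_size, _ in strata)
    estimate = 0.0
    var = 0.0
    for pop_size, values in strata:
        share = pop_size / total  # stratum's share of the population
        estimate += share * mean(values)
        var += share ** 2 * variance(values) / len(values)
    return estimate, sqrt(var)


def cluster_design_effect(cluster_size, icc):
    """Approximate variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical toy data: a large, homogeneous stratum and a small, distinct one.
toy_strata = [(900, [1, 2, 3]), (100, [10, 12, 14])]
estimate, se = stratified_mean_se(toy_strata)
print(estimate, se)

# Ten interviews per cluster with a modest intraclass correlation of 0.05.
print(cluster_design_effect(10, 0.05))
```

Because each stratum contributes only its own within-stratum variance, the stratified standard error shrinks when strata are internally homogeneous. The clustering factor, by contrast, grows with both the cluster size and the similarity of respondents within a cluster.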

Two adjustments must be made to correct for unequal selection probabilities in the sampling design. First, weights are constructed to account for the fact that some observations had a higher probability of being sampled than their share of the population would imply; each observation is typically weighted by the inverse of its selection probability. Second, estimates of a statistic’s variance are adjusted to account for weighting and the use of stratification and/or clustering. Microdata files which contain sampling weights and information on strata/clusters are typically given the intimidating label of complex survey data, and specialized software is necessary to appropriately estimate statistics of interest.
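
The first adjustment can be illustrated with a small sketch. It assumes a hypothetical design with two strata sampled at very different rates, and the names design_weights and weighted_mean are made up for the example. Each weight is the stratum's population count divided by its sample count, which is the inverse of the selection probability under simple random sampling within strata.

```python
def design_weights(pop_sizes, sample_sizes):
    """Inverse-probability weight for each stratum: N_h / n_h."""
    return {h: pop_sizes[h] / sample_sizes[h] for h in pop_sizes}


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


# Hypothetical design: stratum "b" is 10% of the population but half the sample.
pop_sizes = {"a": 900, "b": 100}
sample_sizes = {"a": 50, "b": 50}
weights_by_stratum = design_weights(pop_sizes, sample_sizes)

# Toy responses: everyone in "a" answers 2.0, everyone in "b" answers 12.0.
sample = [("a", 2.0)] * 50 + [("b", 12.0)] * 50
values = [y for _, y in sample]
weights = [weights_by_stratum[h] for h, _ in sample]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

The unweighted mean is pulled toward the oversampled stratum, while the weighted mean restores each stratum to its population share. The sketch covers only the point estimate. The second adjustment, to the variance, is what the specialized software handles.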

In addition to weights, clusters, and strata, some other terms that commonly arise in complex survey analysis include:

  • Finite population correction (fpc): This is an additional correction to the variance estimate that becomes important as the size of the sample relative to the size of the population increases. When the sample is small compared to the number of elements in the population, the fpc approaches one and has very little effect on the estimation. For most large-scale public opinion surveys, the fpc can be safely disregarded.
  • Design Effect: This is the ratio of a statistic’s variance under the complex design to the variance that would have been obtained from a simple random sample of the same size. It quantifies the extent to which the sampling design has inflated or reduced the variance of estimates. It can also be used in the context of power analysis. Because sample size calculations are typically made assuming simple random sampling, the design effect can be used to translate the results from a traditional power analysis to the context of complex sampling.
  • Poststratification: This is an additional weighting adjustment made after all observations have been collected to ensure that the weighted sample reflects the population distribution. Using the existing weights, the sample distribution on a set of variables is compared to the population. An adjustment factor is then created to bring the weighted distribution in line with the known population distribution. That factor is finally multiplied by the original weights to create the final sampling weight to be used in estimation. In addition to further reducing bias in the estimates, poststratification also generally leads to smaller standard errors.
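
The arithmetic behind these three terms can be sketched as follows. The function names and numbers are hypothetical, and the fpc is written in one common form, 1 - n/N, as a multiplier on the variance of an estimate from a simple random sample drawn without replacement. The poststratification step assumes a single categorical variable with known population totals.

```python
from math import ceil


def fpc_variance_factor(sample_size, pop_size):
    """One common form of the fpc: multiply the variance by 1 - n/N."""
    return 1 - sample_size / pop_size


def design_effect(var_complex, var_srs):
    """Ratio of the variance under the complex design to the SRS variance."""
    return var_complex / var_srs


def inflate_sample_size(n_srs, deff):
    """Translate an SRS-based sample size into the complex-design equivalent."""
    return ceil(n_srs * deff)


def poststratify(weights, groups, pop_totals):
    """Rescale weights so weighted group totals match known population totals."""
    weighted_totals = {}
    for w, g in zip(weights, groups):
        weighted_totals[g] = weighted_totals.get(g, 0.0) + w
    # Adjustment factor per group, multiplied by each original weight.
    return [w * pop_totals[g] / weighted_totals[g] for w, g in zip(weights, groups)]


# A sample of 1,000 from a population of one million: the fpc is close to one.
print(fpc_variance_factor(1_000, 1_000_000))

# A power analysis calling for 400 cases under SRS, with a design effect of 1.5.
print(inflate_sample_size(400, design_effect(3.0, 2.0)))

# Hypothetical sample that overrepresents group "m" relative to the population.
final_weights = poststratify([10, 10, 10, 10], ["m", "m", "m", "f"], {"m": 20, "f": 20})
print(final_weights)
```

After the adjustment, the weights within each group sum to that group's known population total, which is the sense in which the weighted sample is brought in line with the population distribution.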

Several software options are now available to survey researchers. Stata provides some of the most user-friendly and comprehensive syntax for analyzing complex survey data. After declaring the design with the svyset command, one simply adds the svy: prefix to most familiar estimation commands. Stata then makes all the appropriate adjustments to point estimates and variances. R’s survey package (written by Thomas Lumley) offers similar capabilities, with the user declaring the design as an object in one function and then referencing that object in subsequent calls to other survey-specific functions. SAS offers limited support for complex survey data through its SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC procedures. The PASW (SPSS) base package allows for weighting, but it is necessary to purchase the Complex Samples add-on module to make the correct adjustments to variance estimates.

In practice it can be quite tricky to determine how to correctly specify the precise design in even the most user-friendly software. The sampling documentation can be dense and arcane, with many surveys employing multi-stage designs based on some combination of cluster sampling, stratification, and random draws. What makes the problem even more complicated is that many organizations will not release complete information on clusters or strata in order to protect the confidentiality of survey participants. The fear is that a malevolent data user will take advantage of the design information to identify the geographic location of a respondent and, ultimately, that respondent’s identity. This is particularly a risk if the survey collects information on sensitive topics such as drug use, health, or sexual histories. In order to avoid this possibility, survey organizations may release, at most, highly aggregated design information.

Users who are struggling to determine which variables describe the sampling design should contact the survey organization directly. Alternatively, they may contact us.