Introduction to the Bootstrap in Stata

Michael Battaglia

Posted on
stata bootstrap

The bootstrap is a statistical procedure that resamples a dataset (with replacement) to create many simulated samples. You can calculate a statistic of interest on each of the bootstrap samples and use these estimates to approximate the distribution of the statistic. The bootstrap is most commonly used to estimate confidence intervals. This tutorial demonstrates how to use bootstrapping to calculate confidence intervals in Stata. See our blog post on bootstrapping for more specifics on the formulas used for the different types of bootstrap confidence intervals.

The data used in this tutorial are from Efron and Tibshirani’s (1993) text on bootstrapping (page 19). We have 15 paired observations of student LSAT scores and GPAs. We want to estimate the correlation between LSAT and GPA scores. The data are the following:

student lsat gpa
1 576 3.39
2 635 3.30
3 558 2.81
4 578 3.03
5 666 3.44
6 580 3.07
7 555 3.00
8 661 3.43
9 651 3.36
10 605 3.13
11 653 3.12
12 575 2.74
13 545 2.76
14 572 2.88
15 594 2.96

The sample correlation coefficient between LSAT and GPA is 0.776. We would like to also calculate the confidence interval for the correlation.

Without a transformation, the correlation coefficient is bounded in the interval \([-1, 1]\). This makes the usual formula for a confidence interval, \(r \pm t_{\alpha/2} SE_r\), problematic because it may possibly generate a range outside of this boundary. The bootstrap provides an easy solution to this problem that can be easily applied to most statistics. There are a variety of approaches to calculating confidence intervals based on the bootstrap, and Stata provides four: normal, percentile, bias corrected (BC), and bias corrected and accelerated (BCa) intervals.

Bootstrap Confidence Intervals

Normal Confidence Interval

The normal confidence interval starts with the usual formula for calculating a confidence interval, using the normal distribution value of 1.96 as the multiplier of the standard error

\[ \text{95% CI} = \theta \pm 1.96 SE_{\theta} \]

where \(\theta\) is any statistic, here Pearson’s \(r\). Note that the \(SE_{\theta}\) is calculated on the basis of the bootstrap results. That is, we take the standard deviation of our estimates across the \(b\) different bootstrap samples.

The problem with this method is that it assumes the distribution of the statistic follows a normal distribution. If this were true, we probably don’t need to bootstrap, and so we generally won’t use this method.

Percentile Confidence Interval

The percentile confidence interval uses the distribution of the bootstrap estimates to calculate the confidence interval. For a 95% confidence interval, the percentile confidence interval uses the 2.5% and 97.5% percentiles of your bootstrap estimates.

Bias Corrected (BC) and Bias Corrected and Accelerated (BCa) Confidence Interval

The bias corrected (BC) and bias corrected and accelerated (BCa) confidence intervals use percentiles of the bootstrap distribution, but they do not necessarily use the 2.5% and 97.5% percentiles. The percentiles used depend on the bias-correction factor (BC and BCa) and the acceleration parameter (BCa only). BC intervals adjust for bias in the bootstrap distribution, where bias is estimated as the difference between the statistic calculated using all of the data and the mean value of the statistic across the bootstrap samples. The BCa distribution further takes into account both bias and skewness. If the bias and acceleration terms are zero, these intervals reduce to the percentile confidence interval.

In general, BCa is the recommended bootstrap CI method. It is “second-order accurate”, meaning that it converges faster to the correct coverage. However, the accuracy of the bias and acceleration terms require a large number of bootstrap samples and, especially when using the jackknife to get the acceleration parameter, this can be computationally intensive.

The bootstrap command in Stata

Now, let’s return to our example and calculate the confidence interval of the correlation.

First we use the correlate command to calculate the correlation for the entire sample.

correlate lsat gpa
(obs=15)

             |     lsat      gpa
-------------+------------------
        lsat |   1.0000
         gpa |   0.7764   1.0000

The correlation is 0.7764, we will need to extract this value in our bootstrap program. After running a Stata command the results are stored in either r() or e(). Stata commands that perform estimation (regressions, factor analysis, anova, etc.) are e-class commands and the results are stored in e(). Other commands (summarize, correlate, etc.) are r-class commands and the results are stored in r(). After running a command you can run return list (r() objects) or ereturn list (e() objects) to see the values that were stored by the command. To see where the correlation is stored we use return list.

return list
scalars:
                  r(N) =  15
                r(rho) =  .7763744244963867

matrices:
                  r(C) :  2 x 2

The value that we need is stored in the variable r(rho).

To bootstrap the confidence interval, we will use the bootstrap command.

The syntax for the bootstrap command is a little complicated, so let’s break it down.

bootstrap {bootstrap_statisics}, {options}: {command}

The bootstrap command requires you to specify the bootstrap_statistics that you are interested in bootstrapping along with the command that calculates these statistics. You can also specify additional options. In our case, the bootstrap_statistics is r(rho) which is calculated by the command correlate lsat gpa. For options, we set the number of bootstrap replications to 1000 with reps(1000) and tell Stata that we are interested in calculating the BCa interval by adding the bca option. The bca option won’t appear to do anything in the initial call, but it is necessary so that the postestimation command estat bootstrap, all will print the BCa interval.

bootstrap r(rho), reps(1000) bca: cor lsat gpa
Bootstrap results                               Number of obs     =         15
                                                Replications      =      1,000

      command:  correlate lsat gpa
        _bs_1:  r(rho)

             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .7763744   .1306152     5.94   0.000     .5203733    1.032376

By default, the output only displays the normal confidence interval. To get Stata to display the alternative confidence intervals use the postestimation command estat bootstrap, all. It will present the normal, percentile, BC, and BCa confidence intervals. Note that the BCa interval is only included if you specified the bca option in the initial bootstrap command.

estat bootstrap, all
Bootstrap results                               Number of obs     =         15
                                                Replications      =       1000

      command:  correlate lsat gpa
        _bs_1:  r(rho)

             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .77637442   .0014095   .13050844    .5205826   1.032166   (N)
             |                                       .4964501   .9629022   (P)
             |                                       .4021074     .94435  (BC)
             |                                       .3073767   .9353622 (BCa)

(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval

The BCa CI is [0.307, 0.935].

An additional note is that most estimation commands allow you to request bootstrap CIs without using the bootstrap command. To do this, you simply add vce(bootstrap, {options}) as an option to the command. For instance, if you want bootstrap CIs for the slope when regressing LSAT onto GPA, then you can do the following.

reg lsat gpa, vce(bootstrap, reps(1000) bca)
estat bootstrap, all
Linear regression                               Number of obs     =         15
                                                Replications      =       1000

             |    Observed               Bootstrap
        lsat |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gpa |   133.25087  -.7776537   30.816018    72.85259   193.6492   (N)
             |                                       66.07135   182.6677   (P)
             |                                       64.51045   182.0473  (BC)
             |                                       49.10582   177.9859 (BCa)
       _cons |   187.89964   1.892627   91.972659    7.636535   368.1627   (N)
             |                                       36.40686   390.0063   (P)
             |                                       38.13368   390.4037  (BC)
             |                                       50.42141   437.7433 (BCa)

(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval

Writing a Bootstrap Program in Stata

We were able to easily bootstrap the correlation, because the correlate command stored the variable that we needed r(rho). In cases where you are interested in bootstrapping a statistic that isn’t directly stored by a command, you need to create a program that stores the statistic that you need.

Writing a program will be necessary when your statistic of interest takes multiple Stata commands to calculate. Some examples are statistics that are:

  • returned by a postestimation command
  • a function of two returned statistics (ratio or difference)
  • a transformation of a returned statistic

For an example, we will create a program to calculate the Fisher-\(z\) transformation of the correlation coefficient. When applied to the correlation coefficient, the Fisher-\(z\) transformation removes the [-1, 1] boundaries and results in a variable that is approximately normal.

\[ z = \frac{1}{2}\text{ln}\left(\frac{1+r}{1-r} \right) \]

We need the program to calculate the correlation, extract the value from r(rho), calculate the Fisher-\(z\) transformation, and return that value to r().

program define fisher, rclass
    correlate lsat gpa
    scalar rho = r(rho)
    return scalar fisher_cor = 0.5 * log((1+rho)/(1-rho))
end

The three lines of the program do the following:

  • calculate the correlation
  • save the correlation to a local variable
  • apply fisher-\(z\) transformation to the correlation and output the value

We first test this program on the whole dataset.

fisher

return list
scalars:
         r(fisher_cor) =  1.036178305294536

We can see that the program works and stores the result in r(fisher_cor)

We then run the bootstrap command using r(fisher_cor) as the statistic and fisher as the command.

bootstrap r(fisher_cor), reps(1000): fisher
estat bootstrap, all
Bootstrap results                               Number of obs     =         15
                                                Replications      =       1000

      command:  fisher
        _bs_1:  r(fisher_cor)

             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   1.0361783   .0764191   .36877182    .3133988   1.758958   (N)
             |                                       .5146473   1.963126   (P)
             |                                       .4691008   1.881243  (BC)

(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval

Still have questions? Contact us!