The bootstrap is a statistical procedure that resamples a dataset (with replacement) to create many simulated samples. You can calculate a statistic of interest on each of the bootstrap samples and use these estimates to approximate the distribution of the statistic. The bootstrap is most commonly used to estimate confidence intervals. This tutorial demonstrates how to use bootstrapping to calculate confidence intervals in Stata. See our blog post on bootstrapping for more specifics on the formulas used for the different types of bootstrap confidence intervals.
The data used in this tutorial are from Efron and Tibshirani’s (1993) text on bootstrapping (page 19). We have 15 paired observations of student LSAT scores and GPAs. We want to estimate the correlation between LSAT and GPA scores. The data are the following:
student | lsat | gpa |
---|---|---|
1 | 576 | 3.39 |
2 | 635 | 3.30 |
3 | 558 | 2.81 |
4 | 578 | 3.03 |
5 | 666 | 3.44 |
6 | 580 | 3.07 |
7 | 555 | 3.00 |
8 | 661 | 3.43 |
9 | 651 | 3.36 |
10 | 605 | 3.13 |
11 | 653 | 3.12 |
12 | 575 | 2.74 |
13 | 545 | 2.76 |
14 | 572 | 2.88 |
15 | 594 | 2.96 |
The sample correlation coefficient between LSAT and GPA is 0.776. We would like to also calculate the confidence interval for the correlation.
Without a transformation, the correlation coefficient is bounded in the interval \([-1, 1]\). This makes the usual formula for a confidence interval, \(r \pm t_{\alpha/2} SE_r\), problematic because it may possibly generate a range outside of this boundary. The bootstrap provides an easy solution to this problem that can be easily applied to most statistics. There are a variety of approaches to calculating confidence intervals based on the bootstrap, and Stata provides four: normal, percentile, bias corrected (BC), and bias corrected and accelerated (BCa) intervals.
Bootstrap Confidence Intervals
Normal Confidence Interval
The normal confidence interval starts with the usual formula for calculating a confidence interval, using the normal distribution value of 1.96 as the multiplier of the standard error
\[ \text{95% CI} = \theta \pm 1.96 SE_{\theta} \]
where \(\theta\) is any statistic, here Pearson’s \(r\). Note that the \(SE_{\theta}\) is calculated on the basis of the bootstrap results. That is, we take the standard deviation of our estimates across the \(b\) different bootstrap samples.
The problem with this method is that it assumes the distribution of the statistic follows a normal distribution. If this were true, we probably don’t need to bootstrap, and so we generally won’t use this method.
Percentile Confidence Interval
The percentile confidence interval uses the distribution of the bootstrap estimates to calculate the confidence interval. For a 95% confidence interval, the percentile confidence interval uses the 2.5% and 97.5% percentiles of your bootstrap estimates.
Bias Corrected (BC) and Bias Corrected and Accelerated (BCa) Confidence Interval
The bias corrected (BC) and bias corrected and accelerated (BCa) confidence intervals use percentiles of the bootstrap distribution, but they do not necessarily use the 2.5% and 97.5% percentiles. The percentiles used depend on the bias-correction factor (BC and BCa) and the acceleration parameter (BCa only). BC intervals adjust for bias in the bootstrap distribution, where bias is estimated as the difference between the statistic calculated using all of the data and the mean value of the statistic across the bootstrap samples. The BCa distribution further takes into account both bias and skewness. If the bias and acceleration terms are zero, these intervals reduce to the percentile confidence interval.
In general, BCa is the recommended bootstrap CI method. It is “second-order accurate”, meaning that it converges faster to the correct coverage. However, the accuracy of the bias and acceleration terms require a large number of bootstrap samples and, especially when using the jackknife to get the acceleration parameter, this can be computationally intensive.
The bootstrap command in Stata
Now, let’s return to our example and calculate the confidence interval of the correlation.
First we use the correlate
command to calculate the correlation for the entire sample.
correlate lsat gpa
(obs=15)
| lsat gpa
-------------+------------------
lsat | 1.0000
gpa | 0.7764 1.0000
The correlation is 0.7764, we will need to extract this value in our bootstrap program. After running a Stata command the results are stored in either r()
or e()
. Stata commands that perform estimation (regressions, factor analysis, anova, etc.) are e-class commands and the results are stored in e()
. Other commands (summarize, correlate, etc.) are r-class commands and the results are stored in r()
. After running a command you can run return list
(r()
objects) or ereturn list
(e()
objects) to see the values that were stored by the command. To see where the correlation is stored we use return list
.
return list
scalars:
r(N) = 15
r(rho) = .7763744244963867
matrices:
r(C) : 2 x 2
The value that we need is stored in the variable r(rho)
.
To bootstrap the confidence interval, we will use the bootstrap
command.
The syntax for the bootstrap command is a little complicated, so let’s break it down.
bootstrap {bootstrap_statisics}, {options}: {command}
The bootstrap command requires you to specify the bootstrap_statistics that you are interested in bootstrapping along with the command that calculates these statistics. You can also specify additional options. In our case, the bootstrap_statistics is r(rho)
which is calculated by the command correlate lsat gpa
. For options, we set the number of bootstrap replications to 1000 with reps(1000)
and tell Stata that we are interested in calculating the BCa interval by adding the bca
option. The bca
option won’t appear to do anything in the initial call, but it is necessary so that the postestimation command estat bootstrap, all
will print the BCa interval.
bootstrap r(rho), reps(1000) bca: cor lsat gpa
Bootstrap results Number of obs = 15
Replications = 1,000
command: correlate lsat gpa
_bs_1: r(rho)
| Observed Bootstrap Normal-based
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_bs_1 | .7763744 .1306152 5.94 0.000 .5203733 1.032376
By default, the output only displays the normal confidence interval. To get Stata to display the alternative confidence intervals use the postestimation command estat bootstrap, all
. It will present the normal, percentile, BC, and BCa confidence intervals. Note that the BCa interval is only included if you specified the bca
option in the initial bootstrap
command.
estat bootstrap, all
Bootstrap results Number of obs = 15
Replications = 1000
command: correlate lsat gpa
_bs_1: r(rho)
| Observed Bootstrap
| Coef. Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
_bs_1 | .77637442 .0014095 .13050844 .5205826 1.032166 (N)
| .4964501 .9629022 (P)
| .4021074 .94435 (BC)
| .3073767 .9353622 (BCa)
(N) normal confidence interval
(P) percentile confidence interval
(BC) bias-corrected confidence interval
(BCa) bias-corrected and accelerated confidence interval
The BCa CI is [0.307, 0.935].
An additional note is that most estimation commands allow you to request bootstrap CIs without using the bootstrap
command. To do this, you simply add vce(bootstrap, {options})
as an option to the command. For instance, if you want bootstrap CIs for the slope when regressing LSAT onto GPA, then you can do the following.
reg lsat gpa, vce(bootstrap, reps(1000) bca)
estat bootstrap, all
Linear regression Number of obs = 15
Replications = 1000
| Observed Bootstrap
lsat | Coef. Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
gpa | 133.25087 -.7776537 30.816018 72.85259 193.6492 (N)
| 66.07135 182.6677 (P)
| 64.51045 182.0473 (BC)
| 49.10582 177.9859 (BCa)
_cons | 187.89964 1.892627 91.972659 7.636535 368.1627 (N)
| 36.40686 390.0063 (P)
| 38.13368 390.4037 (BC)
| 50.42141 437.7433 (BCa)
(N) normal confidence interval
(P) percentile confidence interval
(BC) bias-corrected confidence interval
(BCa) bias-corrected and accelerated confidence interval
Writing a Bootstrap Program in Stata
We were able to easily bootstrap the correlation, because the correlate
command stored the variable that we needed r(rho)
. In cases where you are interested in bootstrapping a statistic that isn’t directly stored by a command, you need to create a program that stores the statistic that you need.
Writing a program will be necessary when your statistic of interest takes multiple Stata commands to calculate. Some examples are statistics that are:
- returned by a postestimation command
- a function of two returned statistics (ratio or difference)
- a transformation of a returned statistic
For an example, we will create a program to calculate the Fisher-\(z\) transformation of the correlation coefficient. When applied to the correlation coefficient, the Fisher-\(z\) transformation removes the [-1, 1] boundaries and results in a variable that is approximately normal.
\[ z = \frac{1}{2}\text{ln}\left(\frac{1+r}{1-r} \right) \]
We need the program to calculate the correlation, extract the value from r(rho)
, calculate the Fisher-\(z\) transformation, and return that value to r()
.
program define fisher, rclass
correlate lsat gpa
scalar rho = r(rho)
return scalar fisher_cor = 0.5 * log((1+rho)/(1-rho))
end
The three lines of the program do the following:
- calculate the correlation
- save the correlation to a local variable
- apply fisher-\(z\) transformation to the correlation and output the value
We first test this program on the whole dataset.
fisher
return list
scalars:
r(fisher_cor) = 1.036178305294536
We can see that the program works and stores the result in r(fisher_cor)
We then run the bootstrap
command using r(fisher_cor)
as the statistic and fisher
as the command.
bootstrap r(fisher_cor), reps(1000): fisher
estat bootstrap, all
Bootstrap results Number of obs = 15
Replications = 1000
command: fisher
_bs_1: r(fisher_cor)
| Observed Bootstrap
| Coef. Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
_bs_1 | 1.0361783 .0764191 .36877182 .3133988 1.758958 (N)
| .5146473 1.963126 (P)
| .4691008 1.881243 (BC)
(N) normal confidence interval
(P) percentile confidence interval
(BC) bias-corrected confidence interval
Still have questions? Contact us!