A Quick Look Into Bootstrapping

By Ted Musemwa

23 August 2024

Executive Summary

As a resampling method, bootstrapping allows us to generate statistical inferences about the population from a single sample.
Learn to bootstrap in R.
Bootstrapping lays the foundation for several machine learning methods, such as bagging, which I'll explain in a follow-up post.

Big Questions:

After an A/B test, to what extent can we trust that our small sample represents the entire population of our customers?
If we repeatedly draw samples of the same size, how would our estimates vary?
If we obtain different estimates after repeated sampling, can we gauge the distribution of the population?
If we don’t know the distribution of our variables, what solutions do we have?

What is bootstrapping?

Bootstrapping is a resampling method that allows us to approximate the sampling distribution of an estimator from a single sample. For example, we can estimate the variance of an estimator in the following steps:

Draw N data points from the sample with replacement; the same observation can be resampled multiple times.
Recompute the statistic (or refit the statistical model) on the resampled data.
Repeat the two steps above many times and calculate the variance of the resulting estimates.
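
The steps above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function name bootstrap_variance, its arguments, the seed, and the simulated sample x are all hypothetical and not part of the original example.

```r
# Minimal sketch of the bootstrap procedure (all names here are hypothetical).
bootstrap_variance <- function(x, statistic = mean, B = 1000) {
  n <- length(x)
  # Resample n points with replacement and recompute the statistic, B times
  estimates <- replicate(B, statistic(x[sample.int(n, n, replace = TRUE)]))
  # The variance of the B estimates approximates the variance of the estimator
  var(estimates)
}

set.seed(42)                       # arbitrary seed, only for reproducibility
x <- rnorm(100, mean = 5, sd = 2)  # made-up sample for illustration
bootstrap_variance(x)              # should land near var(x) / length(x)
```

For the sample mean, the result can be checked against the textbook value var(x) / length(x); for statistics without a simple formula, such as the median, the same function applies by passing statistic = median.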

Why bootstrap?

As data scientists, we have to make statistical inferences about the population distribution from a small sample.

For example, we conduct an A/B test, collect a sample of 100 customers, and find that Version A generates more website traffic. The question is, can we conclude from these results that all customers will find Version A more appealing?

It is possible that what works for the customers in the sample does not work for the customers in the population.

This is a critical question because it’s not feasible to survey the entire population for our research questions.

To derive valid statistical inference, we can rely on bootstrapping. Any point estimate computed from a sample varies from one sample to the next, and with a small sample that variation can be large. A point estimate alone is therefore not enough; we also need to quantify its uncertainty by estimating the standard deviation of the estimator, that is, its standard error.

As a nonparametric method, bootstrapping comes in handy and allows us to estimate the uncertainty of an estimator without assuming a particular distribution for the data.

How to bootstrap in R?

Hypothetically, we flip a biased coin with two outcomes: head and tail. There is a 60% chance of getting a head on each flip. After 50 flips, we obtain the following sample, drawn from a binomial distribution.

# Simulate 50 flips of a biased coin (1 = head, 0 = tail)
# No seed is set, so you may get slightly different results
n <- 50
coin_flips <- rbinom(n, 1, prob = 0.6)
phat <- mean(coin_flips)                # sample proportion of heads
sd_hat <- sqrt(phat * (1 - phat) / n)   # binomial standard error of phat
print(sprintf("Mean = %f, SD = %f", phat, sd_hat))
[1] "Mean = 0.600000, SD = 0.069282"

Following the classical approach, we calculate the sample proportion and its standard error using the binomial formula. The estimated proportion is 0.6 and the standard error is 0.069.
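
For reference, the classical calculation can be written out as follows. The notation (x_i for a single flip, n for the number of flips) is introduced here only for illustration; the numbers are the ones printed above.

```latex
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i = 0.6,
\qquad
\widehat{\mathrm{SE}}(\hat{p})
  = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
  = \sqrt{\frac{0.6 \times 0.4}{50}}
  = \sqrt{0.0048}
  \approx 0.0693
```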

Now, let's create bootstrapped samples and compare the results of the two methods.

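# Bootstrap 1000 times
B <- 1000
# Resample n flips with replacement and record the mean of each resample
bootstrap_samples <- sapply(1:B, function(i) mean(coin_flips[sample(n, replace = TRUE)]))
# Plot the bootstrapped estimator
hist(bootstrap_samples, freq = FALSE, breaks = 20, main = "Bootstrap estimates of phat")
# Overlay the normal approximation based on the classical estimates
curve(dnorm(x, phat, sd_hat), add = TRUE, col = "red", lwd = 2)
# Mark 0.6, the probability used to simulate the flips
abline(v = 0.6, col = "black", lwd = 4)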
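
Besides the plot, the two methods can also be compared numerically. The sketch below is self-contained and regenerates the coin flips with an arbitrary seed (an assumption added here for reproducibility), so its numbers will not match the run above exactly. The names se_classical, se_bootstrap, and boot_means are hypothetical.

```r
set.seed(123)  # arbitrary seed, an assumption added for reproducibility
n <- 50
B <- 1000
coin_flips <- rbinom(n, 1, prob = 0.6)
phat <- mean(coin_flips)

# Classical standard error from the binomial formula
se_classical <- sqrt(phat * (1 - phat) / n)

# Bootstrap standard error: standard deviation of the B resampled means
boot_means <- replicate(B, mean(sample(coin_flips, size = n, replace = TRUE)))
se_bootstrap <- sd(boot_means)

print(c(classical = se_classical, bootstrap = se_bootstrap))
```

The two values should be close to each other, which is what the overlap between the histogram and the red normal curve suggests visually.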

Let’s play with the bootstrapped data a little bit.

As explained above, it's possible to sample the same observation repeatedly. So, how many distinct observations end up in a bootstrap sample, and how many are left out?

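set.seed(1)
n <- 1000
# Draw one bootstrap sample of indices and count the distinct observations in it
included_obs <- length(unique(sample(1:n, replace = TRUE)))
included_obs
# Observations that never appear in the bootstrap sample
missing_obs <- n - included_obs
missing_obs
missing_obs / n
[1] 639
[1] 361
[1] 0.361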

As can be seen, of the 1000 original observations, 639 appear at least once in the bootstrap sample, and 361 (or 36.1%) are missing from it.
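
The 36.1% figure is close to what a short probability argument predicts. Each of the n draws misses a given observation with probability 1 - 1/n, and the draws are independent, so (using n for the sample size, as in the code):

```latex
P(\text{observation } i \text{ is never drawn})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow[n \to \infty]{}\; e^{-1} \approx 0.368,
\qquad
\left(1 - \frac{1}{1000}\right)^{1000} \approx 0.3677
```

So, on average, roughly 63% of the original observations show up in any one bootstrap sample and roughly 37% are left out.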

How about the confidence interval?

set.seed(1)
RC_shots <- c(rep(1, 50), rep(0, 51))   # 50 ones and 51 zeros
n <- length(RC_shots)                   # 101 observations
B <- 1000                               # number of bootstrap resamples
bootstrap_samples <- sapply(1:B, function(i) mean(RC_shots[sample(n, replace = TRUE)]))
hist(bootstrap_samples, freq = FALSE, breaks = 20, main = "Bootstrap Estimates of Sample Mean")
# 95% C.I. end points
quantile(bootstrap_samples, c(.025, .975))
     2.5%     97.5% 
0.4059406 0.5940594

The 95% bootstrap percentile confidence interval for the sample mean is [0.4059406, 0.5940594], or roughly [0.41, 0.59].
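
The same percentile approach can be wrapped in a small helper so that the statistic and the confidence level are easy to change. This is a sketch under assumptions: percentile_ci and its arguments are hypothetical names, and because it draws its own resamples the endpoints may differ slightly from the ones shown above.

```r
# Percentile bootstrap confidence interval (hypothetical helper, not from a package)
percentile_ci <- function(x, statistic = mean, B = 1000, level = 0.95) {
  n <- length(x)
  # Recompute the statistic on B resamples drawn with replacement
  estimates <- replicate(B, statistic(x[sample.int(n, n, replace = TRUE)]))
  alpha <- 1 - level
  # The interval runs between the alpha/2 and 1 - alpha/2 quantiles
  quantile(estimates, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(1)
RC_shots <- c(rep(1, 50), rep(0, 51))
percentile_ci(RC_shots)                # 95% interval
percentile_ci(RC_shots, level = 0.90)  # 90% interval, typically narrower
```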

Happy reading and learning!
