More notebooks for Think Stats
As I mentioned in the previous post, I am getting ready to teach Data Science in the spring, so I am going back through Think Stats and updating the Jupyter notebooks. I am done with Chapters 1 through 6 now.
If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer. Or you can read (but not run) the notebooks on GitHub:
Chapter 4 Notebook (Chapter 4 Solutions)
Chapter 5 Notebook (Chapter 5 Solutions)
Chapter 6 Notebook (Chapter 6 Solutions)
I’ll post the next batch soon; in the meantime, here are some of the examples from Chapter 5, including one of my favorites: How tall is the tallest person in Pareto World?
In order to join Blue Man Group, you have to be male between 5’10” and 6’1” (see http://bluemancasting.com). What percentage of the U.S. male population is in this range? Hint: use
scipy.stats.norm.cdf
.scipy.stats
contains objects that represent analytic distributionsimport scipy.stats
mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)
type(dist)
dist.mean(), dist.std()
dist.cdf(mu-sigma)
# Solution
low = dist.cdf(177.8) # 5'10"
high = dist.cdf(185.4) # 6'1"
low, high, high-low
Plot this distribution. What is the mean human height in Pareto world? What fraction of the population is shorter than the mean? If there are 7 billion people in Pareto world, how many do we expect to be taller than 1 km? How tall do we expect the tallest person to be?
scipy.stats.pareto
represents a pareto distribution. In Pareto world, the distribution of human heights has parameters alpha=1.7 and xmin=1 meter. So the shortest person is 100 cm and the median is 150.alpha = 1.7
xmin = 1 # meter
dist = scipy.stats.pareto(b=alpha, scale=xmin)
dist.median()
# Solution
dist.mean()
# Solution
dist.cdf(dist.mean())
# Solution
(1 - dist.cdf(1000)) * 7e9, dist.sf(1000) * 7e9
# Solution
# One way to solve this is to search for a height that we
# expect one person out of 7 billion to exceed.
# It comes in at roughly 600 kilometers.
dist.sf(600000) * 7e9
# Solution
# Another way is to use `ppf`, which evaluates the "percent point function", which
# is the inverse CDF. So we can compute the height in meters that corresponds to
# the probability (1 - 1/7e9).
dist.ppf(1 - 1/7e9)
$mathrm{CDF}(x) = 1 − exp[−(x / λ)^k]$
Can you find a transformation that makes a Weibull distribution look like a straight line? What do the slope and intercept of the line indicate?
Use
random.weibullvariate
to generate a sample from a Weibull distribution and use it to test your transformation.Generate a sample from a Weibull distribution and plot it using a transform that makes a Weibull distribution look like a straight line.
thinkplot.Cdf
provides a transform that makes the CDF of a Weibull distribution look like a straight line.sample = [random.weibullvariate(2, 1) for _ in range(1000)]
cdf = thinkstats2.Cdf(sample)
thinkplot.Cdf(cdf, transform='weibull')
thinkplot.Config(xlabel='Weibull variate', ylabel='CCDF')
n
, we don’t expect an empirical distribution to fit an analytic distribution exactly. One way to evaluate the quality of fit is to generate a sample from an analytic distribution and see how well it matches the data.For example, in Section 5.1 we plotted the distribution of time between births and saw that it is approximately exponential. But the distribution is based on only 44 data points. To see whether the data might have come from an exponential distribution, generate 44 values from an exponential distribution with the same mean as the data, about 33 minutes between births.
Plot the distribution of the random values and compare it to the actual distribution. You can use random.expovariate to generate the values.