Chapter 7 Sampling

One of the foundational ideas in statistics is that we can make inferences about an entire population based on a relatively small sample of individuals from that population. In this chapter we will introduce the concept of statistical sampling and discuss why it works.

Anyone living in the United States will be familiar with the concept of sampling from the political polls that have become a central part of our electoral process. In some cases, these polls can be incredibly accurate at predicting the outcomes of elections. The best known example comes from the 2008 and 2012 US Presidential elections, when the pollster Nate Silver correctly predicted electoral outcomes for 49/50 states in 2008 and for all 50 states in 2012. Silver did this by combining data from 21 different polls, which vary in the degree to which they tend to lean towards either the Republican or Democratic side. Each of these polls included data from about 1000 likely voters – meaning that Silver was able to almost perfectly predict the pattern of votes of more than 125 million voters using data from only 21,000 people, along with other knowledge (such as how those states have voted in the past).

7.1 How do we sample?

Our goal in sampling is to determine some feature of the full population of interest, using just a small subset of the population. We do this primarily to save time and effort – why go to the trouble of measuring every individual in the population when just a small sample is sufficient to accurately estimate the variable of interest?

In the election example, the population is all voters, and the sample is the set of 1000 individuals selected by the polling organization. The way in which we select the sample is critical to ensuring that the sample is representative of the entire population, which is a main goal of statistical sampling. It’s easy to imagine a non-representative sample; if a pollster only called individuals whose names they had received from the local Democratic party, then it would be unlikely that the results of the poll would be representative of the population as a whole. In general, we would define a representative poll as being one in which every member of the population has an equal chance of being selected. When this fails, then we have to worry about whether the statistic that we compute on the sample is biased - that is, whether its value is systematically different from the population value (which we refer to as a parameter). Keep in mind that we generally don’t know this population parameter, because if we did then we wouldn’t need to sample! But we will use examples where we have access to the entire population, in order to explain some of the key ideas.
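
To make the idea of bias concrete, here is a minimal sketch using a simulated population rather than real polling data. The population size, the `party` and `support` variables, and the 80%/20% support rates are all made up for illustration. It compares a sample in which every voter has an equal chance of selection with one drawn only from a single party's list.

```r
# Hypothetical population of 100,000 voters (simulated, not real polling data)
set.seed(1)
popSize <- 100000
party <- sample(c("D", "R"), popSize, replace = TRUE)

# assumption for illustration: 80% of D voters and 20% of R voters
# support the candidate
support <- rbinom(popSize, 1, ifelse(party == "D", 0.8, 0.2))

# population parameter: overall proportion supporting the candidate
popProp <- mean(support)

# representative sample: every voter has an equal chance of being selected
randomSample <- sample(support, 1000)

# non-representative sample: only voters on the (hypothetical) D party list
biasedSample <- sample(support[party == "D"], 1000)

sprintf("Population proportion: %.3f", popProp)
sprintf("Random sample proportion: %.3f", mean(randomSample))
sprintf("Party-list sample proportion: %.3f", mean(biasedSample))
```

The random sample should land close to the population proportion, while the party-list sample should overestimate it no matter how many people are sampled. A larger sample does not fix a biased selection procedure.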

It’s important to also distinguish between two different ways of sampling: with replacement versus without replacement. In sampling with replacement, after a member of the population has been sampled, they are put back into the pool so that they can potentially be sampled again. In sampling without replacement, once a member has been sampled they are not eligible to be sampled again. It’s most common to use sampling without replacement, but there will be some contexts in which we will use sampling with replacement, as when we discuss a technique called bootstrapping in Chapter 8.
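
The difference between the two schemes can be sketched with the base R function `sample()`, whose `replace` argument switches between them. The toy population of ten ID numbers is an assumption made for illustration.

```r
set.seed(2)
population <- 1:10   # a toy population of 10 member IDs

# without replacement: each member can appear at most once
withoutRepl <- sample(population, 10, replace = FALSE)

# with replacement: the same member can be drawn more than once
withRepl <- sample(population, 10, replace = TRUE)

print(withoutRepl)
print(withRepl)

# count how many distinct members ended up in each sample
length(unique(withoutRepl))
length(unique(withRepl))
```

Sampling all ten members without replacement always returns every member exactly once, in shuffled order. Sampling with replacement will usually return some members more than once and miss others entirely.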

7.2 Sampling error

Regardless of how representative our sample is, it’s likely that the statistic that we compute from the sample is going to differ at least slightly from the population parameter. We refer to this as sampling error. The value of our statistical estimate will also vary from sample to sample; we refer to this distribution of our statistic across samples as the sampling distribution.

Sampling error is directly related to the quality of our measurement of the population. Clearly we want the estimates obtained from our sample to be as close as possible to the true value of the population parameter. However, even if our statistic is unbiased (that is, in the long run we expect it to have the same value as the population parameter), the value for any particular estimate will differ from the population parameter, and those differences will be greater when the sampling error is greater. Thus, reducing sampling error is an important step towards better measurement.

We will use the NHANES dataset as an example; we are going to assume that NHANES is the entire population, and then we will draw random samples from the population. We will have more to say in the next chapter about exactly how the generation of “random” samples works in a computer.

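# load the NHANES data library, plus the tidyverse
# (needed for %>%, distinct(), filter(), sample_n() and tibble())
library(NHANES)
library(tidyverse)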

# create a NHANES dataset without duplicated IDs 
NHANES <-
  NHANES %>%
  distinct(ID, .keep_all = TRUE) 

#create a dataset of only adults
NHANES_adult <- 
  NHANES %>%
  filter( 
    !is.na(Height), 
    Age >= 18
  )

#print the NHANES population mean and standard deviation of adult height
sprintf(
  "Population height: mean = %.2f",
  mean(NHANES_adult$Height)
)
## [1] "Population height: mean = 168.35"
sprintf(
  "Population height: std deviation = %.2f",
  sd(NHANES_adult$Height)
)
## [1] "Population height: std deviation = 10.16"

In this example, we know the adult population mean and standard deviation for height because we are assuming that the NHANES dataset contains the entire population of adults. Now let’s take a single sample of 50 individuals from the NHANES population, and compare the resulting statistics to the population parameters.

# sample 50 individuals from NHANES dataset
exampleSample <- 
  NHANES_adult %>% 
  sample_n(50)

#print the sample mean and standard deviation of adult height
sprintf(
  'Sample height: mean = %.2f',
  mean(exampleSample$Height)
  )
## [1] "Sample height: mean = 169.46"
sprintf(
  'Sample height: std deviation = %.2f',
  sd(exampleSample$Height)
)
## [1] "Sample height: std deviation = 10.07"

The sample mean and standard deviation are similar but not exactly equal to the population values. Now let’s take a large number of samples of 50 individuals, compute the mean for each sample, and look at the resulting sampling distribution of means. We have to decide how many samples to take in order to do a good job of estimating the sampling distribution – in this case, let’s take 5000 samples so that we are really confident in the answer. Note that simulations like this one can sometimes take a few minutes to run, and might make your computer huff and puff. The histogram in Figure 7.1 shows that the means estimated for each of the samples of 50 individuals vary somewhat, but that overall they are centered around the population mean.

# compute sample means across 5000 samples from NHANES data
sampSize <- 50 # size of sample
nsamps <- 5000 # number of samples we will take

# set up variable to store all of the results
sampMeans <- array(NA, nsamps)

# Loop through and repeatedly sample and compute the mean
for (i in 1:nsamps) {
  NHANES_sample <- sample_n(NHANES_adult, sampSize)
  sampMeans[i] <- mean(NHANES_sample$Height)
}

sprintf(
  "Average sample mean = %.2f",
  mean(sampMeans)
)
## [1] "Average sample mean = 168.33"
sampMeans_df <- tibble(sampMeans = sampMeans)

Figure 7.1: The blue histogram shows the sampling distribution of the mean over 5000 random samples from the NHANES dataset. The histogram for the full dataset is shown in gray for reference.

7.3 Standard error of the mean

Later in the course it will become essential to be able to characterize how variable our samples are, in order to make inferences about the sample statistics. For the mean, we do this using a quantity called the standard error of the mean (SEM), which one can think of as the standard deviation of the sampling distribution. If we know the population standard deviation, then we can compute the standard error using:

\[ SEM = \frac{\sigma}{\sqrt{n}} \]

where \(n\) is the size of the sample. We don’t usually know \(\sigma\) (the population standard deviation), so instead we would usually plug in our estimate of \(\sigma\), which is the standard deviation computed on the sample (\(\hat{\sigma}\)):

\[ SEM = \frac{\hat{\sigma}}{\sqrt{n}} \]

However, we have to be careful about computing SEM using the estimated standard deviation if our sample is small (less than about 30).
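
One reason for this caution is that the sample standard deviation tends to underestimate \(\sigma\), and the underestimate is larger for small samples. The following is a minimal sketch using simulated normal data rather than NHANES. The values of `sigma` and `nReps` and the helper `meanSampleSD()` are chosen purely for illustration.

```r
set.seed(3)
sigma <- 10        # population standard deviation (chosen for illustration)
nReps <- 10000     # number of simulated samples per sample size

# average sample SD across many samples of size n from a normal population
meanSampleSD <- function(n) {
  mean(replicate(nReps, sd(rnorm(n, mean = 0, sd = sigma))))
}

sdSmall <- meanSampleSD(5)
sdLarge <- meanSampleSD(50)

sprintf("Average sample SD with n = 5: %.2f", sdSmall)
sprintf("Average sample SD with n = 50: %.2f", sdLarge)
```

With samples of 50 the average estimate should sit very close to the true value of 10, whereas with samples of 5 it should fall noticeably short, so an SEM built on it would be too small on average.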

Because we have many samples from the NHANES population and we actually know the population parameter, we can confirm that the SEM computed using the population standard deviation is very close to the observed standard deviation of the sample means that we obtained from the NHANES dataset.

# compare standard error based on population to standard deviation 
# of sample means

sprintf(
  'Estimated standard error based on population SD: %.2f',
  sd(NHANES_adult$Height)/sqrt(sampSize)
)
## [1] "Estimated standard error based on population SD: 1.44"
sprintf(
  'Standard deviation of sample means = %.2f',
  sd(sampMeans)
)
## [1] "Standard deviation of sample means = 1.43"

The formula for the standard error of the mean says that the quality of our measurement involves two quantities: the population variability, and the size of our sample. Because the square root of the sample size is the denominator in the formula for SEM, a larger sample size will yield a smaller SEM when holding the population variability constant. We have no control over the population variability, but we do have control over the sample size. Thus, if we wish to improve our sample statistics (by reducing their sampling variability) then we should use larger samples. However, the formula also tells us something very fundamental about statistical sampling – namely, that the utility of larger samples diminishes with the square root of the sample size. This means that doubling the sample size will not double the quality of the statistics; rather, it will improve it by a factor of \(\sqrt{2}\). In Section 10.3 we will discuss statistical power, which is intimately tied to this idea.
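
The diminishing return can be seen by plugging a few sample sizes into the SEM formula. This sketch uses the population standard deviation of adult height reported earlier (10.16). The particular sample sizes are arbitrary choices for illustration.

```r
sigma <- 10.16                 # population SD of adult height reported above
n <- c(50, 100, 200, 400)      # each sample size doubles the previous one
sem <- sigma / sqrt(n)

semTable <- data.frame(
  n = n,
  sem = round(sem, 2),
  # ratio of each SEM to the one for the previous (half as large) sample
  ratioToPrevious = c(NA, round(sem[-1] / sem[-length(sem)], 3))
)
print(semTable)
```

Each doubling of the sample size multiplies the SEM by \(1/\sqrt{2} \approx 0.707\), so it takes a fourfold increase in sample size to cut the SEM in half.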

7.4 The Central Limit Theorem

The Central Limit Theorem tells us that as sample sizes get larger, the sampling distribution of the mean will become approximately normally distributed, even if the data within each sample are not normally distributed.
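
We can get a feel for this with a small simulation. The sketch below is an illustration under assumed settings, not part of the NHANES analysis: it draws samples from an exponential distribution, which is strongly right-skewed, and measures the skewness of the resulting sample means for several sample sizes. The helper functions `skewness()` and `sampleMeans()` are defined here for the example.

```r
set.seed(4)
nSamples <- 5000   # number of samples drawn for each sample size

# sample skewness: third standardized moment (0 for a symmetric distribution)
skewness <- function(x) {
  mean((x - mean(x))^3) / (mean((x - mean(x))^2)^1.5)
}

# means of nSamples samples of size n from an exponential distribution
sampleMeans <- function(n) {
  replicate(nSamples, mean(rexp(n, rate = 1)))
}

skewBySize <- sapply(c(1, 5, 50, 500), function(n) skewness(sampleMeans(n)))
names(skewBySize) <- c("n=1", "n=5", "n=50", "n=500")
print(round(skewBySize, 2))
```

For an exponential distribution the skewness of the sample mean is \(2/\sqrt{n}\) in theory, so the printed values should shrink towards zero (the skewness of a normal distribution) as the sample size grows.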

We can also see this in real data. Let’s work with the variable AlcoholYear in the NHANES dataset, which is highly skewed, as shown in Figure 7.2.

Figure 7.2: Distribution of the variable AlcoholYear in the NHANES dataset, which reflects the number of days that the individual drank in a year.

This distribution is, for lack of a better word, funky – and definitely not normally distributed. Now let’s look at the sampling distribution of the mean for this variable. Figure 7.3 shows the sampling distribution for this variable, which is obtained by repeatedly drawing samples of size 50 from the NHANES dataset and taking the mean. Despite the clear non-normality of the original data, the sampling distribution is remarkably close to the normal.

Figure 7.3: The sampling distribution of the mean for AlcoholYear in the NHANES dataset, obtained by drawing repeated samples of size 50, in blue. The normal distribution with the same mean and standard deviation is shown in red.

7.5 Confidence intervals

Most people are familiar with the idea of a “margin of error” for political polls. These polls usually try to provide an answer that is accurate within +/- 3 percent. For example, when a candidate is estimated to win an election by 9 percentage points with a margin of error of 3, the percentage by which they will win is estimated to fall within 6-12 percentage points. In statistics we refer to this range of values as the confidence interval, which provides a measure of our degree of uncertainty about how close our estimate is to the population parameter. The wider the confidence interval, the greater our uncertainty.

We saw in the previous section that with sufficient sample size, the sampling distribution of the mean is normally distributed, and that the standard error describes the standard deviation of this sampling distribution. Using this knowledge, we can ask: What is the range of values within which we would expect to capture 95% of all estimates of the mean? To answer this, we can use the normal distribution, for which we know the values between which we expect 95% of all sample means to fall. Specifically, we use the quantile function for the normal distribution (qnorm() in R) to determine the values of the normal distribution that fall at the 2.5% and 97.5% points in the distribution. We choose these points because we want to find the 95% of values in the center of the distribution, so we need to cut off 2.5% on each end in order to end up with 95% in the middle. Figure 7.4 shows that these points fall at \(Z = \pm 1.96\).
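
The cutoffs can be obtained directly with `qnorm()`, and checked with the cumulative distribution function `pnorm()`. The variable names `cutoffs` and `coverage` are ours.

```r
# quantiles of the standard normal that bound the central 95%
cutoffs <- qnorm(c(0.025, 0.975))
print(round(cutoffs, 2))

# check: the probability mass that lies between the two cutoffs
coverage <- pnorm(cutoffs[2]) - pnorm(cutoffs[1])
print(coverage)
```

The two cutoffs are symmetric around zero at approximately -1.96 and 1.96, and the area between them is 0.95.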

Figure 7.4: Normal distribution, with the orange section in the center denoting the range in which we expect 95 percent of all values to fall. The green sections show the portions of the distribution that are more extreme, which we would expect to occur less than 5 percent of the time.

Using these cutoffs, we can create a confidence interval for the estimate of the mean:

\[ CI_{95\%} = \bar{X} \pm 1.96*SEM \]

Let’s compute the confidence interval for the NHANES height data:

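# compute confidence intervals
library(pander) # for printing the summary table

# draw a sample of sampSize (50) individuals, so that the n used
# in the standard error matches the size of the sample
NHANES_sample <- sample_n(NHANES_adult, sampSize)

sample_summary <- NHANES_sample %>%
  summarize(
    mean = mean(Height),
    sem = sd(Height) / sqrt(sampSize)
  ) %>%
  mutate(
    CI_lower = mean - 1.96 * sem,
    CI_upper = mean + 1.96 * sem
  )
pander(sample_summary)
   mean     sem   CI_lower   CI_upper
166.869   1.446    164.036    169.702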

Confidence intervals are notoriously confusing, primarily because they don’t mean what we would hope they mean. It seems natural to think that the 95% confidence interval tells us that there is a 95% chance that the population mean falls within the interval. However, as we will see throughout the course, concepts in statistics often don’t mean what we think they should mean. In the case of confidence intervals, we can’t interpret them in this way because the population parameter has a fixed value – it either is or isn’t in the interval. The proper interpretation of the 95% confidence interval is that, if we were to repeat the sampling procedure many times and compute an interval from each sample, about 95% of those intervals would capture the true population mean. We can confirm this by resampling the NHANES data repeatedly and counting how often the interval contains the true population mean.

# compute how often the confidence interval contains the true population mean
nsamples <- 2500
sampSize <- 100

ci_contains_mean <- array(NA,nsamples)

for (i in 1:nsamples) {
  NHANES_sample <- sample_n(NHANES_adult, sampSize)
  sample_summary <- 
    NHANES_sample %>%
    summarize(
      mean = mean(Height),
      sem = sd(Height) / sqrt(sampSize)
    ) %>%
    mutate(
      CI_upper = mean + 1.96 * sem,
      CI_lower = mean - 1.96 * sem
    )
  ci_contains_mean[i] <- 
    (sample_summary$CI_upper > mean(NHANES_adult$Height)) & 
    (sample_summary$CI_lower < mean(NHANES_adult$Height))
}

sprintf(
  'proportion of confidence intervals containing population mean: %.3f',
  mean(ci_contains_mean)
)
## [1] "proportion of confidence intervals containing population mean: 0.953"

This confirms that the confidence interval does indeed capture the population mean about 95% of the time.

7.6 Suggested readings

  • The Signal and the Noise: Why So Many Predictions Fail - But Some Don’t, by Nate Silver