Chapter 12 Modeling categorical relationships

So far we have discussed the general concepts of statistical modeling and hypothesis testing, and applied them to some simple analyses. In this chapter we will focus on the modeling of categorical relationships, by which we mean relationships between variables that are measured on a nominal (and sometimes ordinal) scale. These data are usually expressed in terms of counts; that is, for each value of the variable (or combination of values of multiple variables), how many observations take that value? For example, when we count how many people from each major are in our class, we are fitting a categorical model to the data.

12.1 Example: Candy colors

Let’s say that I have purchased a bag of 100 candies, which are labeled as having 1/3 chocolates, 1/3 licorices, and 1/3 gumballs. When I count the candies in the bag, I get the following numbers:

candyDf <-
  tibble(
    candyType = c("chocolate", "licorice", "gumball"),
    count = c(30, 33, 37)
  )
pander(candyDf)
candyType    count
----------   -----
chocolate       30
licorice        33
gumball         37

Because I like chocolate much more than licorice or gumballs, I feel slightly ripped off. What I would like to know is: What is the likelihood that the counts would come out this way if the true probability of each candy type were the labeled proportion of 1/3 each?

12.2 Pearson’s chi-squared test

The Pearson chi-squared test provides us with a way to test whether observed count data differs from some specific expected values that define the null hypothesis:

\[ \chi^2 = \sum_i\frac{(observed_i - expected_i)^2}{expected_i} \]

In the case of our candy example, the null hypothesis is that the proportion of each type of candy is equal. We can compute the chi-squared statistic for our observed counts of candy as follows:

# compute chi-squared statistic

nullExpectation <- c(1 / 3, 1 / 3, 1 / 3) * sum(candyDf$count)

chisqVal <- 
  sum(
    ((candyDf$count - nullExpectation)**2) / nullExpectation
  )

The chi-squared statistic for this analysis comes out to 0.74, which on its own is not interpretable, since it depends on the number of different values that were added together. However, we can take advantage of the fact that the chi-squared statistic is distributed according to a specific distribution under the null hypothesis, which is known as the chi-squared distribution. This distribution is defined as the sum of squares of a set of standard normal random variables; it has a number of degrees of freedom that is equal to the number of variables being added together. The shape of the distribution depends on the number of degrees of freedom. Figure 12.1 shows examples of the distribution for several different degrees of freedom.

Figure 12.1: Examples of the chi-squared distribution for various degrees of freedom.

Let’s verify that the chi-squared distribution accurately describes the sum of squares of a set of standard normal random variables, using simulation.

# simulate 50,000 sums of 8 standard normal random variables and compare
# to theoretical chi-squared distribution

# create a matrix with 50k columns of 8 rows of squared normal random variables
d <- replicate(50000, rnorm(n = 8, mean = 0, sd = 1)**2) 
# sum the 8 squared variables in each column
dSum <- apply(d, 2, sum)

# create a data frame of the theoretical chi-square distribution 
# with 8 degrees of freedom
csDf <-
  data.frame(x = seq(0.01, 30, 0.01)) %>%
  mutate(chisq = dchisq(x, 8))

Figure 12.2 shows that the theoretical distribution matches closely with the results of a simulation that repeatedly added together the squares of a set of random normal variables.

Figure 12.2: Simulation of sum of squared random normal variables. The histogram is based on the sum of squares of 50,000 sets of 8 random normal variables; the blue line shows the values of the theoretical chi-squared distribution with 8 degrees of freedom.
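
The code that draws Figure 12.2 is not shown above; a minimal sketch along the following lines (assuming ggplot2 is loaded along with the rest of the tidyverse used in this chapter) would produce a similar comparison:

# sketch: overlay the simulated sums of squares on the theoretical
# chi-squared density with 8 degrees of freedom
data.frame(sumOfSquares = dSum) %>%
  ggplot(aes(x = sumOfSquares)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100, fill = "gray") +
  geom_line(data = csDf, aes(x = x, y = chisq), color = "blue") +
  labs(x = "Sum of squares of 8 standard normal variables", y = "Density")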

For the candy example, we can compute the likelihood of our observed chi-squared value of 0.74 under the null hypothesis of equal frequency across all candies. We use a chi-squared distribution with degrees of freedom equal to k - 1 (where k = the number of categories): because the expected counts are constrained to add up to the total number of candies, only k - 1 of the counts are free to vary, and so one degree of freedom is lost.

pval <- pchisq(chisqVal, df = 2, lower.tail = FALSE) #df = degrees of freedom
sprintf("p-value = %0.3f", pval)
## [1] "p-value = 0.691"

This shows that the observed counts of candies are not particularly surprising based on the proportions printed on the bag of candy, and we would not reject the null hypothesis of equal proportions.

12.3 Contingency tables and the two-way test

Another way that we often use the chi-squared test is to ask whether two categorical variables are related to one another. As a more realistic example, let’s take the question of whether a black driver is more likely to be searched when they are pulled over by a police officer, compared to a white driver. The Stanford Open Policing Project (https://openpolicing.stanford.edu/) has studied this, and provides data that we can use to analyze the question. We will use the data from the State of Connecticut since that dataset is fairly small. These data were first cleaned up to remove all unnecessary variables (see code/process_CT_data.py).

# load police stop data
stopData <-
  read_csv("data/CT_data_cleaned.csv") %>%
  rename(searched = search_conducted)

The standard way to represent data from a categorical analysis is through a contingency table, which presents the number or proportion of observations falling into each possible combination of values for each of the variables.

Let’s compute the contingency table for the police search data:

# compute and print two-way contingency table
summaryDf2way <-
  stopData %>%
  count(searched, driver_race) %>%
  arrange(driver_race, searched) 

summaryContingencyTable <-
  summaryDf2way %>%
  spread(driver_race, n)

pander(summaryContingencyTable)
searched    Black    White
--------   ------   ------
FALSE       36244   239241
TRUE         1219     3108

It can also be useful to look at the contingency table using proportions rather than raw numbers, since they are easier to compare visually.

# Compute and print contingency table using proportions 
# rather than raw frequencies
summaryContingencyTableProportion <-
  summaryContingencyTable %>%
  mutate(
    Black = Black / nrow(stopData), # count in each cell divided by the total number of stops
    White = White / nrow(stopData)
  )
pander(summaryContingencyTableProportion, round = 4)
searched    Black    White
--------   ------   ------
FALSE      0.1295    0.855
TRUE       0.0044   0.0111

The Pearson chi-squared test allows us to test whether observed frequencies are different from expected frequencies, so we need to determine what frequencies we would expect in each cell if searches and race were unrelated – which we can define as being independent. Remember from the chapter on probability that if X and Y are independent, then:

\[ P(X \cap Y) = P(X) * P(Y) \] That is, the joint probability under the null hypothesis of independence is simply the product of the marginal probabilities of each individual variable. The marginal probabilities are simply the probabilities of each event occurring regardless of other events. We can compute those marginal probabilities, and then multiply them together to get the expected proportions under independence.

                 Black         White
Not searched     P(NS)*P(B)    P(NS)*P(W)    P(NS)
Searched         P(S)*P(B)     P(S)*P(W)     P(S)
                 P(B)          P(W)

We can use a linear algebra trick known as the “outer product” (via the outer() function) to compute this easily.
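
To see concretely what outer() does, here is a toy example with made-up proportions (not taken from the stop data): it multiplies every element of its first argument by every element of its second, returning a matrix of all pairwise products.

# toy illustration of outer(): each element of the first vector is
# multiplied by each element of the second (made-up numbers)
outer(c(0.1, 0.9), c(0.3, 0.7))
##      [,1] [,2]
## [1,] 0.03 0.07
## [2,] 0.27 0.63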

# first, compute the marginal probabilities

# probability of being each race
summaryDfRace <-
  stopData %>%
  count(driver_race) %>% #count the number of drivers of each race
  mutate(
    prop = n / sum(n) #compute the proportion of each race out of all drivers
  )

# probability of being searched 
summaryDfStop <-
  stopData %>%
  count(searched) %>% #count the number of searched vs. not searched
  mutate(
    prop = n / sum(n) # compute proportion of each outcome out of all traffic stops
  )
# second, multiply outer product by n (all stops) to compute expected frequencies
expected <- outer(summaryDfRace$prop, summaryDfStop$prop) * nrow(stopData)

# create a data frame of expected frequencies for each race 
expectedDf <- 
  data.frame(expected, driverRace = c("Black", "White")) %>% 
  rename(
    NotSearched = X1,
    Searched = X2
  )

# tidy the data frame
expectedDfTidy <-
  gather(expectedDf, searched, n, -driverRace) %>%
  arrange(driverRace, searched)
# third, add expected frequencies to the original summary table
# and fourth, compute the standardized squared difference between 
# the observed and expected frequencies

summaryDf2way <-
  summaryDf2way %>%
  mutate(expected = expectedDfTidy$n)

summaryDf2way <-
  summaryDf2way %>%
  mutate(stdSqDiff = (n - expected)**2 / expected)

pander(summaryDf2way)
searched   driver_race        n   expected   stdSqDiff
--------   -----------   ------   --------   ---------
FALSE      Black          36244   36883.67       11.09
TRUE       Black           1219     579.33      706.31
FALSE      White         239241   238601.3        1.71
TRUE       White           3108    3747.67      109.18
# finally, compute chi-squared statistic by 
# summing the standardized squared differences
chisq <- sum(summaryDf2way$stdSqDiff)
sprintf("Chi-squared value = %0.2f", chisq)
## [1] "Chi-squared value = 828.30"

Having computed the chi-squared statistic, we now need to compare it to the chi-squared distribution in order to determine how extreme it is compared to our expectation under the null hypothesis. The degrees of freedom for this distribution are \(df = (nRows - 1) * (nColumns - 1)\) - thus, for a 2X2 table like the one here, \(df = (2-1)*(2-1)=1\). The intuition here is that computing the expected frequencies requires us to use three values: the total number of observations and the marginal probability for each of the two variables. Thus, once those values are computed, there is only one number that is free to vary, and thus there is one degree of freedom. Given this, we can compute the p-value for the chi-squared statistic:

pval <- pchisq(chisq, df = 1, lower.tail = FALSE)
sprintf("p-value = %e", pval)
## [1] "p-value = 3.795669e-182"

The p-value of \(3.79 \times 10^{-182}\) is exceedingly small, showing that the observed data would be highly unlikely if there were truly no relationship between race and police searches, and thus we should reject the null hypothesis of independence.

We can also perform this test easily using the chisq.test() function in R:

# first need to rearrange the data into a 2x2 table
summaryDf2wayTable <-
  summaryDf2way %>%
  dplyr::select(-expected, -stdSqDiff) %>%
  spread(searched, n) %>%
  dplyr::select(-driver_race)

chisqTestResult <- chisq.test(summaryDf2wayTable, correct = FALSE)
chisqTestResult
## 
##  Pearson's Chi-squared test
## 
## data:  summaryDf2wayTable
## X-squared = 800, df = 1, p-value <2e-16

12.4 Standardized residuals

When we find a significant effect with the chi-squared test, this tells us that the data are unlikely under the null hypothesis, but it doesn’t tell us how the data differ. To get a deeper insight into how the data differ from what we would expect under the null hypothesis, we can examine the residuals from the model, which reflect the deviation of the data (i.e., the observed frequencies) from the model (i.e., the expected frequencies) in each cell. Rather than looking at the raw residuals (which will vary simply depending on the number of observations in the data), it’s more common to look at the standardized residuals, which are computed as:

\[ standardized\ residual_{ij} = \frac{observed_{ij} - expected_{ij}}{\sqrt{expected_{ij}}} \] where \(i\) and \(j\) are the indices for the rows and columns respectively. We can compute these for the police stop data:

# compute standardized residuals
summaryDf2way <- 
  summaryDf2way %>% 
  mutate(stdRes = (n - expected)/sqrt(expected))

pander(summaryDf2way)
searched   driver_race        n   expected   stdSqDiff   stdRes
--------   -----------   ------   --------   ---------   ------
FALSE      Black          36244   36883.67       11.09    -3.33
TRUE       Black           1219     579.33      706.31    26.58
FALSE      White         239241   238601.3        1.71     1.31
TRUE       White           3108    3747.67      109.18   -10.45

These standardized residuals can be interpreted as Z scores – in this case, we see that the number of searches for black individuals is substantially higher than expected based on independence, and the number of searches for white individuals is substantially lower than expected. This provides us with the context that we need to interpret the significant chi-squared result.

12.5 Odds ratios

We can also represent the relative likelihood of different outcomes in the contingency table using the odds ratio that we introduced earlier, in order to better understand the size of the effect. First, we represent the odds of being searched for each race:

\[ odds_{searched|black} = \frac{N_{searched\cap black}}{N_{not\ searched\cap black}} = \frac{1219}{36244} = 0.034 \]

\[ odds_{searched|white} = \frac{N_{searched\cap white}}{N_{not\ searched\cap white}} = \frac{3108}{239241} = 0.013 \] \[ odds\ ratio = \frac{odds_{searched|black}}{odds_{searched|white}} = 2.59 \]

The odds ratio shows that the odds of being searched are 2.59 times higher for black versus white drivers, based on this dataset.
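
We can reproduce this calculation directly in R; the following minimal sketch simply uses the raw counts from the contingency table shown above:

# compute the odds of being searched for each group, and their ratio,
# using the raw counts from the contingency table above
oddsBlack <- 1219 / 36244    # searched / not searched, Black drivers
oddsWhite <- 3108 / 239241   # searched / not searched, White drivers
sprintf("odds ratio = %0.2f", oddsBlack / oddsWhite)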

12.6 Bayes factor

We discussed Bayes factors in the earlier chapter on Bayesian statistics – you may remember that the Bayes factor represents the ratio of the likelihood of the data under each of the two hypotheses: \[ K = \frac{P(data|H_A)}{P(data|H_0)} = \frac{P(H_A|data)/P(H_A)}{P(H_0|data)/P(H_0)} \] Bayes factors are similar to p-values and effect sizes in one way, which is that their interpretation is somewhat subjective. There are various guidelines for their interpretation – here is one from Kass and Raftery (1995):

BF              Interpretation
-------------   --------------------
1 to 3          barely worth mention
3 to 20         positive
20 to 150       strong
150 and above   very strong

We can compute the Bayes factor for the police search data using the contingencyTableBF() function from the BayesFactor package:

# compute Bayes factor
# using an independent multinomial sampling plan, in which one margin of the
# table passed to the function (here fixedMargin = "cols") is treated as fixed

bf <- 
  contingencyTableBF(as.matrix(summaryDf2wayTable),
  sampleType = "indepMulti",
  fixedMargin = "cols"
)
bf
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 1.8e+142 ±0%
## 
## Against denominator:
##   Null, independence, a = 1 
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial

This shows that the evidence in favor of a relationship between driver race and police searches in this dataset is exceedingly strong.

12.7 Categorical analysis beyond the 2 X 2 table

Categorical analysis can also be applied to contingency tables where there are more than two categories for each variable.

For example, let’s look at the NHANES data and examine the variable Depressed, which denotes the “self-reported number of days where participant felt down, depressed or hopeless”. This variable is coded as None, Several, or Most. Let’s test whether this variable is related to the SleepTrouble variable, which reports whether the individual has reported sleeping problems to a doctor.

# summarize depression as a function of sleep trouble
depressedSleepTrouble <-
  NHANES_adult %>%
  drop_na(SleepTrouble, Depressed) %>%
  count(SleepTrouble, Depressed) %>%
  arrange(SleepTrouble, Depressed)

depressedSleepTroubleTable <-
  depressedSleepTrouble %>%
  spread(SleepTrouble, n) %>% 
  rename(
    NoSleepTrouble = No,
    YesSleepTrouble = Yes
  )

pander(depressedSleepTroubleTable)
Depressed   NoSleepTrouble   YesSleepTrouble
---------   --------------   ---------------
None                  2614               676
Several                418               249
Most                   138               145

Simply by looking at these data, we can tell that it is likely that there is a relationship between the two variables; notably, while the total number of people with sleep trouble is much smaller than the number without, for people who report being depressed most days the number with sleep problems is greater than the number without. We can quantify this directly using the chi-squared test; if our data frame contains only two variables, then we can simply provide the data frame as the argument to the chisq.test() function:

# need to remove the column with the label names
depressedSleepTroubleTable <-
  depressedSleepTroubleTable %>%
  dplyr::select(-Depressed)

depressedSleepChisq <- chisq.test(depressedSleepTroubleTable)
depressedSleepChisq
## 
##  Pearson's Chi-squared test
## 
## data:  depressedSleepTroubleTable
## X-squared = 200, df = 2, p-value <2e-16

This test shows that there is a strong relationship between depression and sleep trouble. We can also compute the Bayes factor to quantify the strength of the evidence in favor of the alternative hypothesis:

# compute bayes factor, using a joint multinomial sampling plan
bf <-
  contingencyTableBF(
    as.matrix(depressedSleepTroubleTable),
    sampleType = "jointMulti"
  )
bf
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 1.8e+35 ±0%
## 
## Against denominator:
##   Null, independence, a = 1 
## ---
## Bayes factor type: BFcontingencyTable, joint multinomial

Here we see that the Bayes factor is exceedingly large, showing that the evidence in favor of a relation between depression and sleep problems is very strong.

12.8 Beware of Simpson’s paradox

The contingency tables presented above represent summaries of large numbers of observations, but summaries can sometimes be misleading. Let’s take an example from baseball. The table below shows the batting data (hits/at bats and batting average) for Derek Jeter and David Justice over the years 1995-1997:

Player          1995           1996           1997           Combined
-------------   ------------   ------------   ------------   -------------
Derek Jeter     12/48   .250   183/582 .314   190/654 .291   385/1284 .300
David Justice   104/411 .253   45/140  .321   163/495 .329   312/1046 .298

If you look closely, you will see that something odd is going on: In each individual year Justice had a higher batting average than Jeter, but when we combine the data across all three years, Jeter’s average is actually higher than Justice’s! This is an example of a phenomenon known as Simpson’s paradox, in which a pattern that is present in a combined dataset may not be present in any of the subsets of the data. This occurs when there is another variable that may be changing across the different subsets – in this case, the number of at-bats varies across years, with Justice batting many more times in 1995 (when batting averages were low). We refer to this as a lurking variable, and it’s always important to be attentive to such variables whenever one examines categorical data.
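
We can verify this reversal directly from the hits and at-bats in the table above; here is a quick sketch in R using those numbers:

# hits and at-bats for each player in each year, taken from the table above
jeter <- tibble(year = c(1995, 1996, 1997), hits = c(12, 183, 190), atBats = c(48, 582, 654))
justice <- tibble(year = c(1995, 1996, 1997), hits = c(104, 45, 163), atBats = c(411, 140, 495))

# batting average within each year: Justice is higher in every single year
jeter$hits / jeter$atBats
justice$hits / justice$atBats

# combined batting average across the three years: Jeter comes out higher
sum(jeter$hits) / sum(jeter$atBats)
sum(justice$hits) / sum(justice$atBats)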