# Chapter 11 Probability in R (with Lucy King)

In this chapter we will go over probability computations in R.

## 11.1 Basic probability calculations

Let’s create a vector of outcomes from one to 6, using the seq() function to create such a sequence:

outcomes <- seq(1, 6)
outcomes
##  1 2 3 4 5 6

Now let’s create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that == tests for equality of each element in a vector:

outcome1isTrue <- outcomes == 1
outcome1isTrue
##   TRUE FALSE FALSE FALSE FALSE FALSE

Rememer that the simple probability of an outcome is number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that TRUE/FALSE are equivalent to 1/0 in R. The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector.

p1isTrue <- mean(outcome1isTrue)
p1isTrue
##  0.17

## 11.2 Empirical frequency (Section 10.2.2)

Let’s walk through how we computed empirical frequency of rain in San Francisco.

# we will remove the STATION and NAME variables
# since they are identical for all rows
dplyr::select(-STATION, -NAME)

glimpse(SFrain)
## Observations: 365
## Variables: 2
## $DATE <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01… ##$ PRCP <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82, 1.…

We see that the data frame contains a variable called PRCP which denotes the amount of rain each day. Let’s create a new variable called rainToday that denotes whether the amount of precipitation was above zero:

SFrain <-
SFrain %>%
mutate(rainToday = as.integer(PRCP > 0))

glimpse(SFrain)
## Observations: 365
## Variables: 3
## $DATE <date> 2017-01-01, 2017-01-02, 2017-01-03, 20… ##$ PRCP      <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.8…
## $rainToday <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, … Now we will summarize the data to compute the probability of rain: pRainInSF <- SFrain %>% summarize( pRainInSF = mean(rainToday) ) %>% pull() pRainInSF ##  0.2 ## 11.3 Conditional probability (Section 10.4) Let’s determine the conditional probability of someone being unhealthy, given that they are over 70 years of age, using the NHANES dataset. Let’s create a new data frame that healthDataFrame <- NHANES %>% mutate( Over70 = Age > 70, Unhealthy = DaysPhysHlthBad > 0 ) %>% dplyr::select(Unhealthy, Over70) %>% drop_na() glimpse(healthDataFrame) ## Observations: 4,891 ## Variables: 2 ##$ Unhealthy <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,…
## \$ Over70    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

First, what’s the probability of being over 70?

pOver70 <-
healthDataFrame %>%
summarise(pOver70 = mean(Over70)) %>%
pull()

# to obtain the specific value, we need to extract it from the data frame

pOver70
##  0.11

Second, what’s the probability of being unhealthy?

pUnhealthy <-
healthDataFrame %>%
summarise(pUnhealthy = mean(Unhealthy)) %>%
pull()

pUnhealthy
##  0.36

What’s the probability for each combination of unhealthy/healthly and over 70/ not? We can create a new variable that finds the joint probability by multiplying the two individual binary variables together; since anything times zero is zero, this will only have the value 1 for any case where both are true.

pBoth <- healthDataFrame %>%
mutate(
both = Unhealthy*Over70
) %>%
summarise(
pBoth = mean(both)) %>%
pull()

pBoth
##  0.043

Finally, what’s the probability of someone being unhealthy, given that they are over 70 years of age?

pUnhealthyGivenOver70 <-
healthDataFrame %>%
filter(Over70 == TRUE) %>% # limit to Over70
summarise(pUnhealthy = mean(Unhealthy)) %>%
pull()

pUnhealthyGivenOver70
##  0.38
# compute the opposite:
# what the probability of being over 70 given that
# one is unhealthy?
pOver70givenUnhealthy <-
healthDataFrame %>%
filter(Unhealthy == TRUE) %>% # limit to Unhealthy
summarise(pOver70 = mean(Over70)) %>%
pull()

pOver70givenUnhealthy
##  0.12