Chapter 11 Probability in R (with Lucy King)
In this chapter we will go over probability computations in R.
11.1 Basic probability calculations
Let’s create a vector of outcomes from one to 6, using the
seq() function to create such a sequence:
##  1 2 3 4 5 6
Now let’s create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that
== tests for equality of each element in a vector:
##  TRUE FALSE FALSE FALSE FALSE FALSE
Remember that the simple probability of an outcome is number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that TRUE/FALSE are equivalent to 1/0 in R. The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector.
##  0.1666667
11.2 Empirical frequency (Section 10.2.2)
Let’s walk through how we computed empirical frequency of rain in San Francisco.
First we load the data:
## Observations: 365 ## Variables: 2 ## $ DATE <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01… ## $ PRCP <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82, 1.…
We see that the data frame contains a variable called
PRCP which denotes the amount of rain each day. Let’s create a new variable called
rainToday that denotes whether the amount of precipitation was above zero:
## Observations: 365 ## Variables: 3 ## $ DATE <date> 2017-01-01, 2017-01-02, 2017-01-03, 20… ## $ PRCP <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.8… ## $ rainToday <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
Now we will summarize the data to compute the probability of rain:
##  0.2
11.3 Conditional probability (Section 10.4)
Let’s determine the conditional probability of someone being unhealthy, given that they are over 70 years of age, using the NHANES dataset. Let’s create a new data frame that only contains people over 70 years old.
## Observations: 4,891 ## Variables: 2 ## $ Unhealthy <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,… ## $ Over70 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
First, what’s the probability of being over 70?
##  0.1106113
Second, what’s the probability of being unhealthy?
##  0.3567778
What’s the probability for each combination of unhealthy/healthy and over 70/ not? We can create a new variable that finds the joint probability by multiplying the two individual binary variables together; since anything times zero is zero, this will only have the value 1 for any case where both are true.
##  0.04252709
Finally, what’s the probability of someone being unhealthy, given that they are over 70 years of age?
##  0.3844732
##  0.1191977