21 Adding more layers
GOALS
Explore how to model grouped data when we have more than one grouping variable.
RECOMMENDED READING
Bayes Rules! Chapter 19.
21.1 Warm-up
A key question for a mountain climber is: will I successfully make it to the top? Our goal will be to build a model that can help answer this question, using the following data:
# Load packages
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(bayesplot)
# Import, rename, & clean data
data(climbers_sub)
<- climbers_sub %>%
climbers select(peak_name, height_metres, expedition_id, season,
member_id, success, age, expedition_role, oxygen_used)
WARM-UP EXERCISE 1
There are two grouping variables in this data. Identify the two variables and draw a diagram that represents the grouping structure of this data. NOTE: You’ll need to check out the data in RStudio.
WARM-UP EXERCISE 2
Define notation for our response variable \(Y_{ijk}\), whether or not a climber makes it to the top.
WARM-UP EXERCISE 3
Check out the potential predictors of success
in the climbers
data: height_metres
, season
, age
, oxygen_used
. For each, identify whether it tells us about one of the grouping variables or the individual climbers.
names(climbers)
## [1] "peak_name" "height_metres" "expedition_id" "season"
## [5] "member_id" "success" "age" "expedition_role"
## [9] "oxygen_used"
WARM-UP EXERCISE 4
Write out Layer 1 of our hierarchical regression model of success \(Y\) by age \(X\).
- Assume random intercepts but equal slopes.
- Use the tweak notation. For example:
\[\begin{split} Y_{ij} | b_{0j}, \mu, \sigma_y & \sim N(\mu_j, \sigma_y^2) \; \text{ with } \; \mu_j = \mu + b_{0j} \\ Y_{ij} | b_{0j}, \beta_0, \beta_1, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \; \text{ with } \; \mu_{ij} = (b_{0j} + \beta_0) + \beta_1 X_{ij} \\ \end{split}\]
THE MODEL
The following diagram represents the nested group structure of our data. Climbers are grouped by expedition, and expeditions are grouped by peak:
Our model notation requires 3 subscripts. For climber \(i\) in expedition \(j\) to peak \(k\), \(Y_{ijk}\) is their success and \(X_{ijk}\) their age. Then an appropriate hierarchical model of the relationship between success and age is:
\[\begin{split} Y_{ijk}|\beta_0,\beta_1, b_{0j},p_{0k} & \sim \text{Bern}(\pi_{ijk}) \;\; \\ \text{ with } \; & \log\left(\frac{\pi_{ijk}}{1 - \pi_{ijk}}\right) = (\beta_0 + b_{0j} + p_{0k}) + \beta_1 X_{ijk} \\ b_{0j} | \sigma_b & \stackrel{ind}{\sim} N(0, \sigma_b^2) \\ p_{0k} | \sigma_p & \stackrel{ind}{\sim} N(0, \sigma_p^2) \\ \beta_{0c} & \sim N(m_0, s_0^2) \\ \beta_1 & \sim N(m_1, s_1^2) \\ \sigma_b & \sim \text{Exp}(l_b) \\ \sigma_p & \sim \text{Exp}(l_p). \\ \end{split}\]
where
\(\sigma_b\) = variability in success rates from expedition to expedition within a peak; and
\(\sigma_{p}\) = variability in success rates from peak to peak.
21.2 Exercises
In the exercises below, you’ll explore 3 models of success
:
climb_model_1
is a complete pooled model that completely ignores the grouping structure of our data, thus incorrectly assumes that the climbing outcomes are independent.climb_model_2
is a hierarchical model that acknowledges the groups related topeak_name
, but ignores theexpedition_id
grouping variable, hence incorrectly assumes that climbing outcomes within the same expedition are independent.climb_model_3
is a hierarchical model that acknowledges the groups related to bothpeak_name
andexpedition_id
.
Import the simulation results:
<- readRDS(url("https://www.macalester.edu/~ajohns24/data/stat454/climb_model_1.rds"))
climb_model_1 <- readRDS(url("https://www.macalester.edu/~ajohns24/data/stat454/climb_model_2.rds"))
climb_model_2 <- readRDS(url("https://www.macalester.edu/~ajohns24/data/stat454/climb_model_3.rds")) climb_model_3
The simulation syntax is included for reference, but don’t run it. It takes too long within the class context.
<- stan_glm(
climb_model_1 ~ height_metres + season + age + oxygen_used,
success data = climbers,
family = binomial,
chains = 4, iter = 5000*2, seed = 84735, refresh = 0
)<- stan_glmer(
climb_model_2 ~ height_metres + season + age + oxygen_used + (1 | peak_name),
success data = climbers,
family = binomial,
chains = 4, iter = 5000*2, seed = 84735, refresh = 0
)<- stan_glmer(
climb_model_3 ~ height_metres + season + age + oxygen_used + (1 | peak_name) + (1 | expedition_id),
success data = climbers,
family = binomial,
chains = 4, iter = 5000*2, seed = 84735, refresh = 0
)
Classification accuracy
The three models of climbingsuccess
all utilize the same data and predictors, yet make different assumptions about the structure of this data. This impacts the quality of their posterior predictions. If our priority is to correctly anticipate when a climber might not be successful (hence might not take the risk), which model is the best? NOTE: For point of comparison, utilize a 0.5 probability cut-off.classification_summary(climb_model_1, data = climbers) classification_summary(climb_model_2, data = climbers) classification_summary(climb_model_3, data = climbers)
Understanding variability
In terms of understanding the variability in success from person to person, the complete pooledclimb_model_1
just lumps everyone together. The hierarchical models provide more insight. For example, check out the posterior summary for \(\sigma_p\), the variability in baseline success rate from peak to peak, fromclimb_model_2
:tidy(climb_model_2, effects = "ran_pars")
climb_model_3
further partitions the variability in success into that explained by differences in peaks (\(\sigma_p\)) and that explained by differences in expeditions within peaks (\(\sigma_e\)):tidy(climb_model_3, effects = "ran_pars")
Why did the posterior median estimate of the peak-to-peak variability \(\sigma_p\) decrease from
climb_model_2
toclimb_model_3
?As revealed by
climb_model_3
, what better explains the variability in success from climber to climber: differences between the peaks that they climb or differences between the expeditions in which they participate?
Coefficient interpretation
Let’s narrow our focus toclimb_model_3
. Check out the posterior summaries of the global regression coefficients:tidy(climb_model_3, effects = "fixed", conf.int = TRUE, conf.level = 0.8)
- Which factors / predictors are significantly positively associated with climbing success? Which are significantly negatively associated with climbing success?
- Interpret the posterior median estimate of the
oxygen_usedTRUE
coefficient. Do so on the non-logged scale.
- Could but won’t
We could but won’t…- dig into specific peaks and expeditions
- use our model to predict the success of a climber based on the predictors in this model