21 Adding more layers

GOALS

Explore how to model grouped data when we have more than one grouping variable.

RECOMMENDED READING

21.1 Warm-up

A key question for a mountain climber is: will I successfully make it to the top? Our goal will be to build a model that can help answer this question, using the following data:

# Load packages
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(bayesplot)

# Import, rename, & clean data
data(climbers_sub)
climbers <- climbers_sub %>%
  select(peak_name, height_metres, expedition_id, season,
             member_id, success, age, expedition_role, oxygen_used)

WARM-UP EXERCISE 1

There are two grouping variables in this data. Identify the two variables and draw a diagram that represents the grouping structure of this data. NOTE: You’ll need to check out the data in RStudio.

WARM-UP EXERCISE 2

Define notation for our response variable \(Y_{ijk}\), whether or not a climber makes it to the top.

WARM-UP EXERCISE 3

Check out the potential predictors of success in the climbers data: height_metres, season, age, oxygen_used. For each, identify whether it tells us about one of the grouping variables or the individual climbers.

names(climbers)
## [1] "peak_name"       "height_metres"   "expedition_id"   "season"         
## [5] "member_id"       "success"         "age"             "expedition_role"
## [9] "oxygen_used"

WARM-UP EXERCISE 4

Write out Layer 1 of our hierarchical regression model of success \(Y\) by age \(X\).

Assume random intercepts but equal slopes.
Use the tweak notation. For example:
\[\begin{split} Y_{ij} | b_{0j}, \mu, \sigma_y & \sim N(\mu_j, \sigma_y^2) \; \text{ with } \; \mu_j = \mu + b_{0j} \\ Y_{ij} | b_{0j}, \beta_0, \beta_1, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \; \text{ with } \; \mu_{ij} = (b_{0j} + \beta_0) + \beta_1 X_{ij} \\ \end{split}\]

THE MODEL

The following diagram represents the nested group structure of our data. Climbers are grouped by expedition, and expeditions are grouped by peak:

Our model notation requires 3 subscripts. For climber \(i\) in expedition \(j\) to peak \(k\), \(Y_{ijk}\) is their success and \(X_{ijk}\) their age. Then an appropriate hierarchical model of the relationship between success and age is:

\[\begin{split} Y_{ijk}|\beta_0,\beta_1, b_{0j},p_{0k} & \sim \text{Bern}(\pi_{ijk}) \;\; \\ \text{ with } \; & \log\left(\frac{\pi_{ijk}}{1 - \pi_{ijk}}\right) = (\beta_0 + b_{0j} + p_{0k}) + \beta_1 X_{ijk} \\ b_{0j} | \sigma_b & \stackrel{ind}{\sim} N(0, \sigma_b^2) \\ p_{0k} | \sigma_p & \stackrel{ind}{\sim} N(0, \sigma_p^2) \\ \beta_{0c} & \sim N(m_0, s_0^2) \\ \beta_1 & \sim N(m_1, s_1^2) \\ \sigma_b & \sim \text{Exp}(l_b) \\ \sigma_p & \sim \text{Exp}(l_p). \\ \end{split}\]

where

\(\sigma_b\) = variability in success rates from expedition to expedition within a peak; and
\(\sigma_{p}\) = variability in success rates from peak to peak.

21.2 Exercises

In the exercises below, you’ll explore 3 models of success:

climb_model_1 is a complete pooled model that completely ignores the grouping structure of our data, thus incorrectly assumes that the climbing outcomes are independent.
climb_model_2 is a hierarchical model that acknowledges the groups related to peak_name, but ignores the expedition_id grouping variable, hence incorrectly assumes that climbing outcomes within the same expedition are independent.
climb_model_3 is a hierarchical model that acknowledges the groups related to both peak_name and expedition_id.

Import the simulation results:

climb_model_1 <- readRDS(url("https://www.macalester.edu/~ajohns24/data/stat454/climb_model_1.rds"))
climb_model_2 <- readRDS(url("https://www.macalester.edu/~ajohns24/data/stat454/climb_model_2.rds"))
climb_model_3 <- readRDS(url("https://www.macalester.edu/~ajohns24/data/stat454/climb_model_3.rds"))

The simulation syntax is included for reference, but don’t run it. It takes too long within the class context.

climb_model_1 <- stan_glm(
  success ~ height_metres + season + age + oxygen_used,
  data = climbers,
  family = binomial,
  chains = 4, iter = 5000*2, seed = 84735, refresh = 0
)
climb_model_2 <- stan_glmer(
  success ~ height_metres + season + age + oxygen_used + (1 | peak_name),
  data = climbers,
  family = binomial,
  chains = 4, iter = 5000*2, seed = 84735, refresh = 0
)
climb_model_3 <- stan_glmer(
  success ~ height_metres + season + age + oxygen_used + (1 | peak_name) + (1 | expedition_id),
  data = climbers,
  family = binomial,
  chains = 4, iter = 5000*2, seed = 84735, refresh = 0
)

Classification accuracy
The three models of climbing success all utilize the same data and predictors, yet make different assumptions about the structure of this data. This impacts the quality of their posterior predictions. If our priority is to correctly anticipate when a climber might not be successful (hence might not take the risk), which model is the best? NOTE: For point of comparison, utilize a 0.5 probability cut-off.
```
classification_summary(climb_model_1, data = climbers)
classification_summary(climb_model_2, data = climbers)
classification_summary(climb_model_3, data = climbers)
```

Understanding variability
In terms of understanding the variability in success from person to person, the complete pooled climb_model_1 just lumps everyone together. The hierarchical models provide more insight. For example, check out the posterior summary for \(\sigma_p\), the variability in baseline success rate from peak to peak, from climb_model_2:
```
tidy(climb_model_2, effects = "ran_pars")
```
climb_model_3 further partitions the variability in success into that explained by differences in peaks (\(\sigma_p\)) and that explained by differences in expeditions within peaks (\(\sigma_e\)):
```
tidy(climb_model_3, effects = "ran_pars")
```
1. Why did the posterior median estimate of the peak-to-peak variability \(\sigma_p\) decrease from climb_model_2 to climb_model_3?
2. As revealed by climb_model_3, what better explains the variability in success from climber to climber: differences between the peaks that they climb or differences between the expeditions in which they participate?

Coefficient interpretation
Let’s narrow our focus to climb_model_3. Check out the posterior summaries of the global regression coefficients:
```
tidy(climb_model_3, effects = "fixed", conf.int = TRUE, conf.level = 0.8)
```
1. Which factors / predictors are significantly positively associated with climbing success? Which are significantly negatively associated with climbing success?
2. Interpret the posterior median estimate of the oxygen_usedTRUE coefficient. Do so on the non-logged scale.

Could but won’t
We could but won’t…
- dig into specific peaks and expeditions
- use our model to predict the success of a climber based on the predictors in this model