4 Visualizing & modeling variability
4.1 Getting started
Find today’s notes at https://ajohns24.github.io/IMA_bootcamp_2020/visualizing-modeling-variability.html
Complete the following survey.
Download the “visualizing_modeling_variability_notes.Rmd” document from Canvas (Files > Statistics Material). Open in RStudio.
Open the group solutions document. This is where we’ll track solutions.
Knit your document. It’s helpful to do this early and often. If you get any errors, it’s likely because you haven’t yet installed the packages at the top of the document. If this is the case, return to Exercise 3.4 of the pre-bootcamp prep.
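If packages are the problem, a one-time install usually fixes the knit. For example (a hypothetical list; use the packages actually loaded at the top of your Rmd):

# Hypothetical example: install the packages loaded at the top of your Rmd
install.packages(c("tidyverse", "choroplethr", "choroplethrMaps"))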
Overall workshop plan
The statistics & data science module, which will explore data analysis in the regression setting, can be centered within the broader machine learning workflow:
REGRESSION vs CLASSIFICATION vs CLUSTERING

| APPROACH | TASK | DATA | GOAL |
|---|---|---|---|
| Supervised | regression | input: \(x = (x_1, x_2,..., x_k)\); output: quantitative \(y\) | Model \(y\) by \(x\): \(y = f(x) + \varepsilon\). Use this model to (1) predict \(y\) for a new case; (2) explain variability in \(y\); (3) explore relationships between \(y\) and \(x\) |
| Supervised | classification | input: \(x = (x_1, x_2,..., x_k)\); output: categorical \(y\) | Same goals as regression, but for a categorical response \(y\) |
| Unsupervised | clustering | input: \(x = (x_1, x_2,..., x_k)\); output: none | Examine similarities among cases with respect to their \(x_i\) values. That is, explore the structure in the data. |
NOTE: \(f(x)\) captures the trend and \(\varepsilon\) captures the error (deviations from trend)
Today’s plan
- Welcome / intros
- A motivating example with a pre-bootcamp recap
- Visualizing relationships
- Modeling relationships with linear regression
4.2 Pre-Bootcamp review
Statistics is the practice of using data from a sample to make inferences about some broader population of interest. Data is a key word here. In the pre-bootcamp prep, you explored the first simple steps of a statistical analysis: familiarizing yourself with the structure of your data and conducting some simple univariate analyses. We’ll build upon this foundation while getting to know each other a little better. To this end, read in data on the survey you just took:
survey <- read_csv("https://www.macalester.edu/~ajohns24/data/IMA/about_us_ima.csv")
names(survey) <- c("miles_away", "sleep", "activity", "star_wars", "birth_month")
Explore the structure of the data
Explore activity (categorical). Notice the eval = FALSE syntax in your Rmd. If you knit the Rmd now, this chunk will not be evaluated (if it were, you'd get an error since the code is incomplete). Once you complete / fix this code, remove the eval = FALSE syntax or replace it with eval = TRUE.
Explore miles_away (quantitative). (Remember to change eval = FALSE once you complete this code.)
Numerically summarize the trend & variability in miles_away
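If you'd like a nudge, here's a minimal sketch of these univariate explorations (one option among many; it assumes the tidyverse packages from the prep are loaded):

# Check out the structure of the survey data
dim(survey)
head(survey)

# Explore the categorical activity variable with a bar chart
ggplot(survey, aes(x = activity)) +
  geom_bar()

# Explore the quantitative miles_away variable with a histogram
ggplot(survey, aes(x = miles_away)) +
  geom_histogram(color = "white")

# Numerically summarize the trend & variability in miles_away
survey %>%
  summarize(mean(miles_away), sd(miles_away))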
4.3 Visualizing relationships
There are more than 3000 counties in the U.S.
Since the 2016 election, people have examined the following questions:
- How did Trump’s support vary from county to county?
- How does this compare to trends in past elections?
- In what ways was Trump’s support associated with county demographics?
We’ll explore these questions using county-level election outcomes made available by Tony McGovern on GitHub, combined with county demographics from the df_county_demographics data set within the choroplethr package:
elect <- read.csv("https://www.macalester.edu/~ajohns24/Data/electionDemographics16.csv")
# There are more than 3000 counties in the U.S.
dim(elect)
## [1] 3112 34
head(elect)
## county total_2008 dem_2008 gop_2008 oth_2008 total_2012 dem_2012
## 1 Walker County 28652 7420 20722 510 28497 6551
## 2 Bullock County 5415 4011 1391 13 5318 4058
## 3 Calhoun County 49242 16334 32348 560 46240 15500
## 4 Barbour County 11630 5697 5866 67 11459 5873
## 5 Fayette County 7957 1994 5883 80 7912 1803
## 6 Baldwin County 81413 19386 61271 756 84988 18329
## gop_2012 oth_2012 total_2016 dem_2016 gop_2016 oth_2016 perdem_2016
## 1 21633 313 29243 4486 24208 549 15.34042
## 2 1250 10 4701 3530 1139 32 75.09041
## 3 30272 468 47376 13197 32803 1376 27.85588
## 4 5539 47 10390 4848 5431 111 46.66025
## 5 6034 75 8196 1358 6705 133 16.56906
## 6 65772 887 94090 18409 72780 2901 19.56531
## perrep_2016 winrep_2016 perdem_2012 perrep_2012 winrep_2012 perdem_2008
## 1 82.78220 TRUE 22.98838 75.91325 TRUE 25.89697
## 2 24.22889 FALSE 76.30688 23.50508 FALSE 74.07202
## 3 69.23970 TRUE 33.52076 65.46713 TRUE 33.17087
## 4 52.27141 TRUE 51.25229 48.33755 FALSE 48.98538
## 5 81.80820 TRUE 22.78817 76.26390 TRUE 25.05970
## 6 77.35147 TRUE 21.56657 77.38975 TRUE 23.81192
## perrep_2008 winrep_2008 region total_population percent_white percent_black
## 1 72.32305 TRUE 1127 66622 90 5
## 2 25.68790 FALSE 1011 10746 22 71
## 3 65.69189 TRUE 1015 117714 73 21
## 4 50.43852 TRUE 1005 27321 46 46
## 5 73.93490 TRUE 1057 17110 86 11
## 6 75.25948 TRUE 1003 187114 83 9
## percent_asian percent_hispanic per_capita_income median_rent median_age
## 1 0 2 20698 402 41.8
## 2 0 6 18628 276 39.6
## 3 1 3 20828 422 38.7
## 4 0 5 16829 382 38.3
## 5 0 1 18494 287 43.5
## 6 1 4 26766 693 41.5
## polyname abb StateColor
## 1 alabama AL red
## 2 alabama AL red
## 3 alabama AL red
## 4 alabama AL red
## 5 alabama AL red
## 6 alabama AL red
This data illuminates the following trends in Trump support from county to county:
# Numerical summary
elect %>%
summarize(mean(perrep_2016), sd(perrep_2016))
## mean(perrep_2016) sd(perrep_2016)
## 1 63.62441 15.64131
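To see (rather than just summarize) this variability, a quick histogram works (a sketch; any univariate viz of perrep_2016 would do):

# Visual summary: county-to-county variability in Trump's 2016 support
ggplot(elect, aes(x = perrep_2016)) +
  geom_histogram(color = "white")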
The next natural question is this: what other variables (ie. county features) might explain some of this variability? Answering this question is one goal of statistical modeling. Specifically, statistical models illuminate relationships between a response variable and one or more predictors:

- response variable: the variable whose variability we would like to explain or predict (eg: Trump’s vote percent)
- predictors: the variable(s) that might explain some of the variability in the response (eg: per capita income, state color, etc)
Before building any models, the first crucial step in understanding relationships is to build informative visualizations.
Basic Rules for Constructing Visualizations
Instead of memorizing which plot is appropriate for which situation, it’s best to recognize patterns in constructing viz:
Each quantitative variable requires a new axis. If we run out of axes, we can illustrate the scale of a quantitative variable using color or discretize it into groups & treat it as categorical.
Each categorical variable requires a new way to “group” the graphic (eg: using colors, shapes, separate facets, etc to capture the grouping)
In breakout rooms:
Run through the following exercises which introduce different approaches to visualizing relationships. In doing so, for each plot:
- examine what the plot tells us about relationship trends & strength (degree of variability from the trend) as well as outliers or deviations from the trend.
- think: what’s the take-home message?
- where relevant, provide a descriptive comment (#) to indicate the result of the syntax.
We’ll do the first exercise together.
Sketches
How might we visualize the relationships among the following pairs of variables? Checking out the structure of just 4 counties might help:

elect %>%
  filter(region %in% c(8039, 28003, 40129, 29119)) %>%
  select(c(perrep_2016, perrep_2012, median_rent, StateColor, abb))

##   perrep_2016 perrep_2012 median_rent StateColor abb
## 1    73.52922    72.52198         948       blue  CO
## 2    80.15304    72.84124         420     purple  MO
## 3    79.94723    75.10557         391        red  MS
## 4    87.94084    83.75149         367        red  OK
- perrep_2016 vs perrep_2012
- perrep_2016 vs StateColor
- perrep_2016 vs perrep_2012 and StateColor (in 1 plot)
- perrep_2016 vs perrep_2012 and median_rent (in 1 plot)
Scatterplots of 2 quantitative variables
By default, the response variable (\(Y\)) goes on the y-axis and the predictor (\(X\)) goes on the x-axis.
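A basic sketch (one option; tweak the transparency or add geom_smooth() as you like):

# Scatterplot of 2016 GOP support (response) vs 2012 GOP support (predictor)
ggplot(elect, aes(x = perrep_2012, y = perrep_2016)) +
  geom_point(alpha = 0.25)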
Side-by-side plots of 1 quantitative variable vs 1 categorical variable
# ggplot(elect, aes(x = perrep_2016, fill = StateColor)) +
#   geom_density()
# ggplot(elect, aes(x = perrep_2016, fill = StateColor)) +
#   geom_density(alpha = 0.5)
# ggplot(elect, aes(x = perrep_2016, fill = StateColor)) +
#   geom_density() +
#   scale_fill_manual(values = c("blue", "purple", "red")) +
#   facet_wrap(~ StateColor)
# ggplot(elect, aes(x = StateColor, y = perrep_2016)) +
#   geom_boxplot()

# Practice: construct side-by-side boxplots of the perrep_2016 across counties in each state
Scatterplots of 1 quantitative variable vs 1 categorical & 1 quantitative variable
If percent_white and StateColor both explain some of the variability in perrep_2016, why not include both in our analysis?! Let’s.
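One way (a sketch): keep the quantitative variables on the axes and use color to group by the categorical StateColor:

# perrep_2016 vs percent_white, grouped by StateColor via color
ggplot(elect, aes(x = percent_white, y = perrep_2016, color = StateColor)) +
  geom_point(alpha = 0.25) +
  scale_color_manual(values = c("blue", "purple", "red"))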
Plots of 3 quantitative variables
Think back to our sketches. How might we include information about median_rent in these plots?
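Per the viz rules above, one option (a sketch) is to illustrate the scale of median_rent using color:

# perrep_2016 vs perrep_2012, with color reflecting median_rent
ggplot(elect, aes(x = perrep_2012, y = perrep_2016, color = median_rent)) +
  geom_point(alpha = 0.25)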
Extra: Maps!
There is, of course, a geographical component to these data. Though we won’t cover spatial models in bootcamp, the visuals still help us tell a story. If the required choroplethrMaps package isn’t working for you, work with a classmate (don’t spend too much time with this picky package).
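If the packages cooperate, a minimal map sketch might look like the following (it assumes county_choropleth()'s convention of a data frame with region and value columns):

# Map Trump's 2016 support by county
library(choroplethr)
library(choroplethrMaps)
elect %>%
  select(region, perrep_2016) %>%
  rename(value = perrep_2016) %>%
  county_choropleth()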
4.4 Linear regression models
Just as when exploring single variables, there are limitations in relying solely on visualizations to analyze relationships among 2+ variables. Statistical models provide rigorous numerical summaries of relationship trends.
Example
Before going into details, examine the plots below and draw a model that captures the trend of the relationships being illustrated.
Linear regression can be used to model each of these relationships. “Linear” here indicates that the linear regression model of a response variable is a linear combination of explanatory variables. It does not mean that the relationship itself is linear!! In general, let \(Y\) be our response variable and (\(X_1, X_2, ..., X_k\)) be \(k\) explanatory variables. Then the (population) linear regression model of \(Y\) vs the \(X_i\) is
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon\]
where

- \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k\) describes the trend in the relationship
- \(\varepsilon\) captures individual deviations from the trend (ie. residuals)
- \(\beta_0\) = intercept coefficient: the average \(Y\) value when \(X_1=X_2=\cdots=X_k=0\)
- \(\beta_i\) = \(X_i\) coefficient: when holding constant all other \(X_j\), the change in \(Y\) when we increase \(X_i\) by 1
In RStudio, we construct sample estimates of linear regression models using the lm()
(linear model) function. Consider a simple example:
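(A sketch; the variables below are chosen for illustration and might differ from the chunk in your Rmd.)

# Estimate the linear regression model of perrep_2016 by median_rent
simple_model <- lm(perrep_2016 ~ median_rent, data = elect)

# Check out the estimated coefficients
summary(simple_model)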
Today we’ll focus on visualizing, constructing, and interpreting models. We’ll talk more tomorrow about model quality. IMPORTANT: Be sure to interpret the coefficients in a contextually meaningful way that tells the audience about the relationships of interest (as opposed to simply providing a definition).
Models with 1 quantitative predictor: perrep_2016 vs median_age
To begin, visualize the relationship:

ggplot(elect, aes(x = median_age, y = perrep_2016)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm")
Build and summarize the regression model:
Write out the estimated model trend and interpret all coefficients: perrep_2016 = ___ + ___ median_age
What’s the model’s prediction of Trump’s vote percent in a county with a median age of 40? A median age of 22? Plug these into part b and check your work using a shortcut function in R:
Solution
# a
model_1 <- lm(perrep_2016 ~ median_age, data = elect)
summary(model_1)

## 
## Call:
## lm(formula = perrep_2016 ~ median_age, data = elect)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.976  -9.131   2.596  11.028  33.472 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.48636    2.12404   11.53   <2e-16 ***
## median_age   0.96485    0.05195   18.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.84 on 3110 degrees of freedom
## Multiple R-squared:  0.09984, Adjusted R-squared:  0.09955
## F-statistic: 344.9 on 1 and 3110 DF,  p-value: < 2.2e-16

# b
# perrep_2016 = 24.48636 + 0.96485 median_age
# The intercept doesn't have contextual meaning since a median_age value of 0 doesn't make sense!
# For every extra year of median_age in a county, Trump's support tends to increase by 0.96485 percentage points

# c
new_county <- data.frame(median_age = c(22, 40))
predict(model_1, new_county)

##        1        2 
## 45.71302 63.08029
Models with 1 categorical predictor: perrep_2016 vs StateColor
Visualize the relationship between these 2 variables and construct the model of this relationship.
Write out the estimated model formula:
perrep_2016 = ___ + ___ StateColorpurple + ___ StateColorred
Huh?! RStudio splits categorical predictors up into a reference group (the first alphabetically) and indicators for the other groups. Here, blue states are the reference group and

\[\text{StateColorpurple} = \begin{cases} 1 & \text{ if purple} \\ 0 & \text{ otherwise} \\ \end{cases} \;\;\;\; \text{ and } \;\;\;\; \text{StateColorred} = \begin{cases} 1 & \text{ if red} \\ 0 & \text{ otherwise} \\ \end{cases}\]

In other words, the StateColor variable is turned into 3 “dummy variables”:

\[\left(\begin{array}{c} \text{red} \\ \text{purple} \\ \text{purple} \\ \text{blue} \\ \text{red} \end{array}\right) \;\;\; \to \;\;\; \text{...blue} = \left(\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{array}\right), \;\; \text{...purple} = \left(\begin{array}{c} 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{array}\right), \;\; \text{...red} = \left(\begin{array}{c} 1 \\ 0 \\ 0 \\ 0 \\ 1 \end{array}\right)\]

Since these sum to 1, we only need to put 2 into our model and leave the other out as a reference level. With these ideas in mind, interpret all coefficients in your model. HINT: Plug in 0’s and 1’s to obtain 3 separate models for the blue, purple, and red states.

How are the model coefficients related to the features of Trump’s support across counties in blue, purple, and red states? (That is, what’s the dual meaning of these coefficients?)
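To see this dummy coding in action before checking the solution, you can inspect the design matrix R builds behind the scenes (a sketch using base R's model.matrix()):

# First few rows of the design matrix: an intercept column plus
# 0/1 indicators for purple and red (blue is the reference)
model.matrix(perrep_2016 ~ StateColor, data = elect) %>%
  head()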
Solution
# a
model_2 <- lm(perrep_2016 ~ StateColor, data = elect)
summary(model_2)

## 
## Call:
## lm(formula = perrep_2016 ~ StateColor, data = elect)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.613  -7.385   3.299  10.266  29.446 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       55.6653     0.5034 110.579   <2e-16 ***
## StateColorpurple   6.4764     0.7247   8.937   <2e-16 ***
## StateColorred     13.2692     0.6304  21.047   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.62 on 3109 degrees of freedom
## Multiple R-squared:  0.1274, Adjusted R-squared:  0.1268
## F-statistic: 226.9 on 2 and 3109 DF,  p-value: < 2.2e-16

# b
# perrep_2016 = 55.6653 + 6.4764 StateColorpurple + 13.2692 StateColorred

# c
# Based on the calculations below:
# - The typical Trump support in blue state counties was 55.6653%
# - The typical Trump support in purple state counties was 6.4764 percentage points HIGHER than in blue state counties
# - The typical Trump support in red state counties was 13.2692 percentage points HIGHER than in blue state counties

# blue state prediction
55.6653 + 6.4764*0 + 13.2692*0

## [1] 55.6653

# purple state prediction
55.6653 + 6.4764*1 + 13.2692*0

## [1] 62.1417

# red state prediction
55.6653 + 6.4764*0 + 13.2692*1

## [1] 68.9345

# d
# These are related to the MEAN Trump support in blue, purple, and red states
Models with 1 quantitative predictor & 1 categorical predictor: perrep_2016 vs median_age and StateColor
First, build and visualize the model.
Interpret all model coefficients by applying the techniques you learned above. These interpretations will differ from those in model_1 and model_2; coefficients are defined / interpreted differently depending upon what other predictors are in the model. HINT: First write out the estimated model formula:

perrep_2016 = ___ + ___ median_age + ___ StateColorpurple + ___ StateColorred

Then, notice that the presence of a categorical variable results in the separation of the model formula by group. To this end, plug in 0’s and 1’s to obtain 3 separate model formulas for the blue, purple, and red states:

perrep_2016 = ___ + ___ median_age
Solution
# a
model_3 <- lm(perrep_2016 ~ median_age + StateColor, data = elect)
ggplot(elect, aes(x = median_age, y = perrep_2016, color = StateColor)) +
  geom_point(alpha = 0.25) +
  geom_line(aes(y = model_3$fitted.values)) +
  scale_color_manual(values = c("blue", "purple", "red"))
# b
summary(model_3)

## 
## Call:
## lm(formula = perrep_2016 ~ median_age + StateColor, data = elect)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.207  -7.179   2.295   9.409  35.252 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      11.02443    2.02889   5.434 5.95e-08 ***
## median_age        1.07768    0.04767  22.609  < 2e-16 ***
## StateColorpurple  7.67020    0.67376  11.384  < 2e-16 ***
## StateColorred    14.57953    0.58719  24.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.55 on 3108 degrees of freedom
## Multiple R-squared:  0.2506, Adjusted R-squared:  0.2499
## F-statistic: 346.5 on 3 and 3108 DF,  p-value: < 2.2e-16

# trend for blue states:   perrep_2016 = 11.02443 + 1.07768 median_age
# trend for purple states: perrep_2016 = (11.02443 + 7.67020) + 1.07768 median_age
# trend for red states:    perrep_2016 = (11.02443 + 14.57953) + 1.07768 median_age

# 11.02443 = intercept for blue states
# 1.07768 = shared slope = no matter the state color, Trump support tends to increase by 1.07768 percentage points for every extra 1 year in median_age
# 7.67020 = how we CHANGE the intercept for purple states = when controlling for median_age, Trump support tends to be 7.67020 percentage points HIGHER in purple states than in blue states
# 14.57953 = how we CHANGE the intercept for red states = when controlling for median_age, Trump support tends to be 14.57953 percentage points HIGHER in red states than in blue states
Models with 2 quantitative predictors: perrep_2016 vs median_age and median_rent
Visualize & construct a model of this relationship.
If you were to draw this model, what would it look like?
Interpret the median_age coefficient.
Solution
# a
model_4 <- lm(perrep_2016 ~ median_age + median_rent, data = elect)
summary(model_4)

## 
## Call:
## lm(formula = perrep_2016 ~ median_age + median_rent, data = elect)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.411  -6.969   1.904   8.843  41.137 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.160793   2.175080   26.28   <2e-16 ***
## median_age   0.625687   0.047275   13.23   <2e-16 ***
## median_rent -0.036905   0.001244  -29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.11 on 3109 degrees of freedom
## Multiple R-squared:  0.2984, Adjusted R-squared:  0.2979
## F-statistic: 661.1 on 2 and 3109 DF,  p-value: < 2.2e-16

ggplot(elect, aes(x = median_age, y = perrep_2016, color = median_rent)) +
  geom_point(alpha = 0.25)

# b
# With 2 quantitative predictors (and no interaction), the model trend is a plane in 3 dimensions:
# perrep_2016 over the (median_age, median_rent) plane

# c
# When holding median_rent constant, Trump support tends to increase by 0.625687 percentage points for every extra 1 year in median_age
Models with 2 categorical predictors: perrep_2016 vs StateColor and income_bracket
For illustration’s sake, let’s split county per_capita_income (in $) into 2 income brackets: above or below $25,000:

# Define income brackets
elect <- elect %>%
  mutate(income_bracket = cut(per_capita_income,
                              breaks = c(0, 25000, 65000),
                              labels = c("low", "high")))
Visualize the relationship.
Construct the model & interpret all coefficients.
How many possible combinations are there of these two explanatory variables? Which combination has the highest predicted perrep_2016? The lowest?
Solution
# a (two of many options)
ggplot(elect, aes(y = perrep_2016, x = StateColor, color = income_bracket)) +
  geom_boxplot()

ggplot(elect, aes(x = perrep_2016, fill = StateColor)) +
  geom_density(alpha = 0.5) +
  facet_grid(~ income_bracket)
# b
model_5 <- lm(perrep_2016 ~ StateColor + income_bracket, data = elect)
summary(model_5)

## 
## Call:
## lm(formula = perrep_2016 ~ StateColor + income_bracket, data = elect)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.925  -7.388   3.076  10.154  30.146 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         58.3134     0.5772 101.036  < 2e-16 ***
## StateColorpurple     5.1963     0.7294   7.124  1.3e-12 ***
## StateColorred       11.9331     0.6398  18.651  < 2e-16 ***
## income_brackethigh  -5.1200     0.5673  -9.025  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.43 on 3108 degrees of freedom
## Multiple R-squared:  0.1497, Adjusted R-squared:  0.1488
## F-statistic: 182.3 on 3 and 3108 DF,  p-value: < 2.2e-16

# 58.3134 = typical Trump support in counties that are in BLUE states and the LOWER income bracket
# 5.1963: When controlling for income bracket, the typical Trump support in purple state counties tends to be 5.1963 percentage points higher than in blue state counties.
# 11.9331: When controlling for income bracket, the typical Trump support in red state counties tends to be 11.9331 percentage points higher than in blue state counties.
# -5.1200: When controlling for state color, the typical Trump support in counties with higher avg income tends to be 5.1200 percentage points lower than in counties with lower avg income.

# c
# There are 6 combos.
# Trump's support tends to be highest among counties in RED states that have LOWER average income
58.3134 + 11.9331

## [1] 70.2465

# Trump's support tends to be lowest among counties in BLUE states that have HIGHER average income
58.3134 - 5.1200

## [1] 53.1934
4.5 Wrapping up
If you finish early:
The elect data set was actually compiled from 3 different sources:

- the county_election_results data at https://www.macalester.edu/~dshuman1/ima/bootcamp/data/county_election_results.csv contains the 2016 county-level election results
- the df_county_demographics data set within the choroplethr package contains county-level demographic data
- the red_blueStates data at https://www.macalester.edu/~ajohns24/Data/red_bluePurple.csv categorizes each county as belonging to a blue/red/purple state based on state categorizations at http://www.270towin.com/

It’s common to have to combine multiple data sources for an analysis. Let’s practice here:

# Load the 3 separate data sources
politics <- read.csv("https://www.macalester.edu/~dshuman1/ima/bootcamp/data/county_election_results.csv")
data("df_county_demographics")
red_blue <- read.csv("https://www.macalester.edu/~ajohns24/Data/RedBluePurple.csv")

# Check them out
dim(politics)
head(politics, 3)
dim(df_county_demographics)
head(df_county_demographics, 3)
dim(red_blue)
head(red_blue, 3)

# Join politics and df_county_demographics by their shared region variable
elect <- left_join(df_county_demographics, politics, by = c("region" = "region"))

# Join elect with red_blue
elect <- left_join(elect, red_blue, by = c("region" = "region"))

# Check it out
dim(elect)
head(elect, 3)
Start Homework 1. You should attempt to complete this homework before we meet again tomorrow.