5 Data wrangling I: dplyr verbs
Getting started
Download the “data_wrangling_notes.Rmd” and “model_quality_notes.Rmd” documents from Canvas and open them in RStudio.
Open the group solutions document. This is where we’ll track our notes.
Check to see whether you have the boot, caret, and gridExtra packages installed:
library(boot)
library(caret)
library(gridExtra)
If not, then install them from the console. For example:
install.packages("boot", dependencies = TRUE)
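To check all three packages in one pass rather than loading each by hand, something like the following sketch may help. It relies on base R's `requireNamespace()`; the object names `needed`, `installed`, and `missing_pkgs` are illustrative and not part of the course materials.

```r
# Packages used in today's notes
needed <- c("boot", "caret", "gridExtra")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
  # Then, from the console:
  # install.packages(missing_pkgs, dependencies = TRUE)
}
```

If nothing is printed, all three packages are already available.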
Today’s plan:
- Discuss yesterday’s homework & tie up any loose ends
- Explore two more important steps of the machine learning workflow:
  - Data wrangling
  - Evaluating model quality
5.1 dplyr verbs
Though models and algorithms often get all the glory in machine learning, statistics, and data science, data wrangling is just as important, if not more so. We can know all the theory in the world, but without facility in working with data, we won’t get anywhere. Today we’ll explore how to wrangle tidy data (see the footnote at the end of this section) using the dplyr package, which is part of the broader tidyverse that also includes ggplot2. In the dplyr grammar, there are 6 main data transformation verbs (actions):
verb | action | example |
---|---|---|
select() | take a subset of columns | select(x, y), select(-x) |
mutate() | create a new variable, i.e. column | mutate(x = ___, y = ___) |
arrange() | reorder the rows | arrange(x), arrange(desc(x)) |
filter() | take a subset of rows | filter(x == ___, y > ___) |
summarize() | calculate a numerical summary of a variable | summarize(mean(x), median(y)) |
group_by() | group the rows by a specified column | group_by(x) %>% summarize(mean(y)) |
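To make the table concrete, here is a minimal sketch that applies each verb once. The `pets` data frame and its columns are hypothetical, invented purely for illustration; only the dplyr functions themselves come from the notes.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only (weight in kg)
pets <- data.frame(
  name   = c("Ada", "Bo", "Cy", "Di"),
  type   = c("cat", "dog", "cat", "dog"),
  weight = c(4, 20, 5, 12)
)

# select(): keep only the name and weight columns
select(pets, name, weight)

# mutate(): add a new column computed from an existing one
mutate(pets, weight_g = weight * 1000)

# arrange(): sort rows from heaviest to lightest
arrange(pets, desc(weight))

# filter(): keep only rows that meet both conditions
cats_over_4 <- filter(pets, type == "cat", weight > 4)

# summarize(): collapse all rows into a one-row summary
summarize(pets, mean_weight = mean(weight))

# group_by() + summarize(): one summary row per type
by_type <- summarize(group_by(pets, type), mean_weight = mean(weight))
by_type
```

Note that none of these calls modify `pets` itself. Each returns a new data frame, which is why the results of interest are stored under new names.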
We can apply one or more verbs through a series of pipes (%>%):
# Apply 1 verb to my_data
my_data %>%
  verb_1(___)
# Apply 2 verbs to my_data
my_data %>%
  verb_1(___) %>%
  verb_2(___)
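The pipe passes whatever is on its left in as the first argument of the function on its right, so `x %>% f(y)` is equivalent to `f(x, y)`. The sketch below shows the same two steps written as nested calls and as a pipe. The `scores` data frame is hypothetical, made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
scores <- data.frame(
  student = c("a", "b", "c", "d", "e"),
  section = c(1, 1, 2, 2, 2),
  score   = c(90, 70, 85, 60, 95)
)

# Nested calls must be read from the inside out
nested <- arrange(filter(scores, score >= 70), desc(score))

# The same two steps as a pipe read from top to bottom
top_scores <- scores %>%
  filter(score >= 70) %>%
  arrange(desc(score))
top_scores
```

Both versions return the same rows in the same order. The piped version tends to stay readable as more verbs are added.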
5.2 Exercises
The fivethirtyeight article The Ultimate Halloween Candy Power Ranking analyzed data from an experiment that presented subjects with a series of head-to-head candy match-ups and asked them to indicate which candy they preferred. You can load these data from the fivethirtyeight package:
# Load data
library(fivethirtyeight)
data("candy_rankings")
# Examine the codebook
?candy_rankings
# Store under a shorter name
candy <- candy_rankings
We’ll wrangle these data below. In each exercise, first identify which dplyr verbs will be useful.
Sort the candies
# Show the 6 least popular candies

# Show the 6 most popular candies
Find the candies
# Show the data for Snickers

# Create a data set with only the chocolate candies
# Show the first 6 entries

# Create a data set of the candies that won less than 45% of their match-ups
# Show the first 6 entries

# Show the chocolate candies that won less than 45% of their match-ups (do this 2 ways)
Define new variables
Create a new data set, candy_new, with the following new variables:
- Redefine sugarpercent from the 0-1 scale to the 0-100 scale
- Define a choc_peanut variable which identifies candies that contain both chocolate & peanuts/almonds
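As a hint rather than a solution, here is a sketch of the two mutate() patterns this exercise calls for: overwriting an existing column, and building a TRUE/FALSE column from two others with `&`. The `snacks` data frame and all of its column names are hypothetical and unrelated to the candy data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
snacks <- data.frame(
  snack    = c("w", "x", "y", "z"),
  salty    = c(TRUE, TRUE, FALSE, FALSE),
  crunchy  = c(TRUE, FALSE, TRUE, FALSE),
  fat_prop = c(0.10, 0.25, 0.05, 0.40)
)

snacks_new <- snacks %>%
  mutate(
    fat_prop      = fat_prop * 100,   # reusing the name overwrites the column
    salty_crunchy = salty & crunchy   # TRUE only when both conditions hold
  )
snacks_new
```

Reusing an existing column name on the left of `=` replaces that column, while a new name appends a column to the right.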
Calculate some stats
# Calculate the minimum, median, and maximum winpercent

# Calculate the minimum, median, and maximum winpercent for chocolate vs non-chocolate candies
Pipe series
# Arrange the crispedricewafer candies from most to least popular

# Among chocolate candies, what's the median winpercent?

# Among chocolate candies, how does the median winpercent compare if the candy does / doesn't also have peanuts?
Each column is a variable and each row is a case. See Wickham, Tidy Data and Wickham and Grolemund, Tidy Data for more.