5 Data wrangling I: dplyr verbs

Getting started

  • Download the “data_wrangling_notes.Rmd” and “model_quality_notes.Rmd” documents from Canvas and open them in RStudio.

  • Open the group solutions document. This is where we’ll track our notes.

  • Check to see whether you have the boot, caret, and gridExtra packages installed:

    library(boot)
    library(caret)
    library(gridExtra)

    If not, then install them from the console. For example:

    install.packages("boot", dependencies = TRUE)
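If you would rather not check each package by hand, the following is a minimal sketch of one way to automate it. It relies only on base R (requireNamespace() and install.packages()); the helper name find_missing and the vector names are illustrative, not part of the course materials, and the install step assumes a CRAN mirror is reachable.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in today's notes
needed <- c("boot", "caret", "gridExtra")
not_installed <- find_missing(needed)

# Install only what is missing, and only when working interactively
# (i.e. from the console, not while knitting a document)
if (interactive() && length(not_installed) > 0) {
  install.packages(not_installed, dependencies = TRUE)
}
```

After running this, not_installed lists any packages that still need attention.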



Today’s plan:

  1. Discuss yesterday’s homework & tie up any loose ends
  2. Explore two more important steps of the machine learning workflow:
    • Data wrangling
    • Evaluating model quality



5.1 dplyr verbs

Though models and algorithms often get all the glory in machine learning, statistics, and data science, data wrangling is just as important, if not more so. We can know all the theory in the world, but without facility in working with data, we won’t get anywhere. Today we’ll explore how to wrangle tidy data¹ using the dplyr package, which is part of the broader tidyverse (which also includes ggplot2). In the dplyr grammar, there are 6 main data transformation verbs (actions):



verb         action                                         example
select()     take a subset of columns                       select(x, y), select(-x)
mutate()     create a new variable, i.e. a column           mutate(x = ___, y = ___)
arrange()    reorder the rows                               arrange(x), arrange(desc(x))
filter()     take a subset of rows                          filter(x == ___, y > ___)
summarize()  calculate a numerical summary of a variable    summarize(mean(x), median(y))
group_by()   group the rows by a specified column           group_by(x) %>% summarize(mean(y))
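To see what each verb does before turning to real data, here is a small sketch on a made-up data frame. The pets data and its column names are invented purely for illustration; only the dplyr functions themselves come from the package.

```r
library(dplyr)

# Toy data, made up for illustration (weight is in kilograms)
pets <- data.frame(
  name   = c("Ada", "Bo", "Cy", "Di"),
  type   = c("cat", "dog", "dog", "cat"),
  weight = c(4.0, 20.5, 9.0, 5.5)
)

# select(): keep only the name and weight columns
select(pets, name, weight)

# mutate(): add a column with the weight converted to grams
mutate(pets, weight_g = weight * 1000)

# arrange(): sort the rows from heaviest to lightest
arrange(pets, desc(weight))

# filter(): keep only the dogs that weigh more than 10 kg
filter(pets, type == "dog", weight > 10)

# summarize(): collapse all rows into a single row of summaries
summarize(pets, mean_weight = mean(weight))

# group_by() + summarize(): one summary row per type
pets %>%
  group_by(type) %>%
  summarize(mean_weight = mean(weight))
```

Note that none of these calls modify pets itself. Each returns a new data frame, so store the result under a name if you want to keep it.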



We can apply one or more verbs through a series of pipes (%>%):

# Apply 1 verb to my_data
my_data %>% 
  verb_1(___)

# Apply 2 verbs to my_data
my_data %>% 
  verb_1(___) %>% 
  verb_2(___)
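It can help to know that x %>% f(y) is evaluated as f(x, y), so a chain of pipes is just a left-to-right way of writing nested function calls. The sketch below compares the two styles on a hypothetical scores data frame (the data and names are made up for illustration).

```r
library(dplyr)

# Toy data, made up for illustration
scores <- data.frame(
  student = c("a", "b", "c", "d"),
  section = c(1, 1, 2, 2),
  score   = c(80, 90, 70, 100)
)

# Nested calls: must be read from the inside out
nested <- summarize(
  group_by(filter(scores, score >= 75), section),
  mean_score = mean(score)
)

# Piped: the same three steps, read from top to bottom
piped <- scores %>%
  filter(score >= 75) %>%
  group_by(section) %>%
  summarize(mean_score = mean(score))

# Compare the two results
all.equal(nested, piped)
```

The piped version performs the same computation but lists the verbs in the order they are applied, which is why these notes use it throughout.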





5.2 Exercises

The fivethirtyeight article “The Ultimate Halloween Candy Power Ranking” analyzed data from an experiment that presented subjects with a series of head-to-head candy match-ups and asked them to indicate which candy they preferred. You can load these data from the fivethirtyeight package:

# Load data
library(fivethirtyeight)
data("candy_rankings")

# Examine the codebook
?candy_rankings

# Store under a shorter name
candy <- candy_rankings


We’ll wrangle these data below. In each exercise, first identify which dplyr verbs will be useful.



  1. Sort the candies

    # Show the 6 least popular candies
    
    # Show the 6 most popular candies



  2. Find the candies

    # Show the data for Snickers
    
    
    # Create a data set with only the chocolate candies
    # Show the first 6 entries
    
    
    # Create a data set of the candies that won less than 45% of their match-ups
    # Show the first 6 entries
    
    
    # Show the chocolate candies that won less than 45% of their match-ups (do this 2 ways)



  3. Define new variables
    Create a new data set, candy_new, with the following new variables:
    • Rescale sugarpercent from the 0-1 scale to the 0-100 scale
    • Define a choc_peanut variable which identifies candies that contain both chocolate & peanuts/almonds



  4. Calculate some stats

    # Calculate the minimum, median, and maximum winpercent
    
    # Calculate the minimum, median, and maximum winpercent for chocolate vs non-chocolate candies



  5. Pipe series

    # Arrange the crispedricewafer candies from most to least popular
    
    # Among chocolate candies, what's the median winpercent?
    
    # Among chocolate candies, how does the median winpercent compare if the candy does / doesn't also have peanuts?






  1. Each column is a variable and each row is a case. See Wickham, Tidy Data, and Wickham and Grolemund, Tidy Data, for more.