7 Data wrangling II: changing cases


Getting started:

install.packages("infer", dependencies = TRUE)
install.packages("broom", dependencies = TRUE)
install.packages("gsheet", dependencies = TRUE)
install.packages("babynames", dependencies = TRUE)



Today’s plan:

  1. Discuss weekend progress & tie up any other loose ends

  2. More data wrangling: changing cases

  3. Sampling variability / sampling distributions

  4. Confidence intervals



7.1 Spread, gather, and wide and narrow data formats

Additional reading:

As we are transforming data, it is important to keep in mind what constitutes each case (row) of the data. For example, in the initial BabyName data below, each case is a single name-sex-year combination. So if we have the same name and sex but a different year, that would be a different case.
Table 7.1: Each case is one name-sex-year combination.
year sex name count prop
1880 F Mary 7065 0.0723836
1880 F Anna 2604 0.0266790
1880 F Emma 2003 0.0205215
1880 F Elizabeth 1939 0.0198658
1880 F Minnie 1746 0.0178884
1880 F Margaret 1578 0.0161672

It is often necessary to rearrange your data in order to create visualizations, run statistical analysis, etc. We have already seen some ways to rearrange the data to change the case. For example, what is the case after performing the following command?

BabyNamesTotal<-BabyNames %>%
  group_by(name,sex) %>%
  summarise(total=sum(count))

Each case now represents one name-sex combination:

Table 7.2: Narrow format where each case is one name-sex combination.
name sex total
Aaban M 107
Aabha F 35
Aabid M 10
Aabir M 5
Aabriella F 32
Aada F 5

In this activity, we are going to learn two new operations to reshape and reorganize the data: spread() and gather().



7.1.1 Spread (pivot wider)


Example 7.1 We want to find the common names that are the most gender neutral (used roughly equally for males and females). How should we rearrange the data? Well, one nice way would be to have a single row for each name, and then have separate variables for the number of times that name is used for males and females. Using these two columns, we can then compute a third column that gives the ratio between these two columns. That is, we’d like to transform the data into a wide format with each of the possible values of the sex variable becoming its own column. The operation we need to perform this transformation is spread(). It takes a value (total in this case) representing the variable to be divided into multiple new variables, and a key (the original variable sex in this case) that identifies the variable in the initial narrow format data whose values should become the names of the new variables in the wide format data. The entry fill=0 specifies that if there are, e.g., no females named Aadam, we should include a zero in the corresponding entry of the wide format table.

BabyWide<-BabyNamesTotal %>%
  spread(key=sex,value=total,fill=0)
Table 7.3: A wide format with one case per name enables us to examine gender balance.
name F M
Aaban 0 107
Aabha 35 0
Aabid 0 10
Aabir 0 5
Aabriella 32 0
Aada 5 0

Now we can choose common names with frequency greater than 25,000 for both males and females, and sort by the ratio to identify gender-neutral names.

Neutral<-BabyWide %>%
  filter(M>25000,F>25000) %>%
  mutate(ratio = pmin(M/F,F/M)) %>%
  arrange(desc(ratio))
Table 7.4: The most gender-neutral common names, in wide format.
name F M ratio
Kerry 48534 49596 0.9785870
Riley 100881 92789 0.9197867
Quinn 28283 31230 0.9056356
Jackie 90604 78405 0.8653591
Frankie 33236 40552 0.8195897
Jaime 49673 67582 0.7350034
Casey 76020 110165 0.6900558
Peyton 69256 47682 0.6884891
Pat 40123 26731 0.6662264
Jessie 167010 110027 0.6588049
Kendall 58026 33821 0.5828594
Jody 55691 31206 0.5603419



Note: In Wickham and Grolemund on pivoting, you’ll see that a new version of spread(), called pivot_wider(), has recently been introduced in tidyr. The basic functionality is the same, but you might find the syntax easier to remember. The equivalent of

BabyWide<-BabyNamesTotal %>%
  spread(key=sex,value=total,fill=0)

is

BabyWide<-BabyNamesTotal %>% 
  pivot_wider(names_from=sex,values_from=total,values_fill=0)

Feel free to use pivot_wider instead of spread if you find that syntax easier to remember.

7.1.2 Gather (pivot longer)

Next, let’s filter these names to keep only those with a ratio of 0.5 or greater (no more than 2 to 1), and then switch back to narrow format. We can do this with the following gather() operation. It gathers the columns listed (F,M) at the end into a single column whose name is given by the key (sex), and includes the values in a column called total.

NeutralNarrow<-Neutral %>%
  filter(ratio>=.5) %>%
  gather(key=sex,value=total,F,M)%>%
  select(name,sex,total)%>%
  arrange(name)
Table 7.5: Narrow format for the most gender-neutral common names.
name sex total
Casey F 76020
Casey M 110165
Frankie F 33236
Frankie M 40552
Jackie F 90604
Jackie M 78405



Note: In Wickham and Grolemund on pivoting, you’ll see that a new version of gather(), called pivot_longer(), has recently been introduced in tidyr. The basic functionality is the same, but you might find the syntax easier to remember. The equivalent of

NeutralNarrow<-Neutral %>%
  filter(ratio>=.5) %>%
  gather(key=sex,value=total,F,M)%>%
  select(name,sex,total)%>%
  arrange(name)

is

NeutralNarrow<-Neutral %>%
  filter(ratio>=.5) %>%
  pivot_longer(c('F','M'),names_to="sex",values_to="total")%>%
  select(name,sex,total)%>%
  arrange(name)

Feel free to use pivot_longer instead of gather if you find that syntax easier to remember.

7.2 Summary graphic

Here is a nice summary graphic of gather and spread from the RStudio cheat sheet on data import:



7.3 The Daily Show guests

The data associated with this article is available in the fivethirtyeight package, and is loaded into Daily below. It includes a list of every guest to ever appear on Jon Stewart’s The Daily Show. Note that when multiple people appeared together, each person receives their own line.

Daily<-daily_show_guests
year google_knowledge_occupation show group raw_guest_list
1999 singer 1999-07-26 Musician Donny Osmond
1999 actress 1999-07-27 Acting Wendie Malick
1999 vocalist 1999-07-28 Musician Vince Neil
1999 film actress 1999-07-29 Acting Janeane Garofalo
1999 comedian 1999-08-10 Comedy Dom Irrera
1999 actor 1999-08-11 Acting Pierce Brosnan
1999 director 1999-08-12 Media Eduardo Sanchez and Daniel Myrick
1999 film director 1999-08-12 Media Eduardo Sanchez and Daniel Myrick
1999 american television personality 1999-08-16 Media Carson Daly
1999 actress 1999-08-17 Acting Molly Ringwald
1999 actress 1999-08-18 Acting Sarah Jessica Parker



7.3.2 Recreating a graphic

The original data has 18 different entries for the group variable:

unique(Daily$group)
##  [1] "Acting"         "Comedy"         "Musician"       "Media"         
##  [5] NA               "Politician"     "Athletics"      "Business"      
##  [9] "Advocacy"       "Political Aide" "Misc"           "Academic"      
## [13] "Government"     "media"          "Clergy"         "Science"       
## [17] "Consultant"     "Military"

In order to help you recreate the first figure from the article, I have added a new variable with three broader groups: (i) entertainment; (ii) politics, business, and government, and (iii) commentators. We will learn in the next activity what the inner_join in this code chunk is doing.

DailyGroups<-read_csv("https://www.macalester.edu/~dshuman1/data/112/daily-group-assignment.csv")
Daily<-Daily%>%
  inner_join(DailyGroups,by=c("group"="group"))
year google_knowledge_occupation show group raw_guest_list broad_group
1999 actor 1999-01-11 Acting Michael J. Fox Entertainment
1999 comedian 1999-01-12 Comedy Sandra Bernhard Entertainment
1999 television actress 1999-01-13 Acting Tracey Ullman Entertainment
1999 film actress 1999-01-14 Acting Gillian Anderson Entertainment
1999 actor 1999-01-18 Acting David Alan Grier Entertainment
1999 actor 1999-01-19 Acting William Baldwin Entertainment


Exercise 7.2 Using the group assignments contained in the broad_group variable, recreate the graphic from the article, with three different lines showing the fraction of guests in each group over time. Hint: first think about what your case should be for the glyph-ready form.



7.4 Gathering practice

A typical situation that requires a gather command is when the columns represent the possible values of a variable. Table 7.6 shows example data set from opendataforafrica.org with different years in different columns.

Lesotho<-read_csv("https://www.macalester.edu/~dshuman1/data/112/Lesotho.csv")
Table 7.6: Financial statistics about Lesotho.
Category 2010 2011 2012 2013 2014
Total Population 2.01 2.03 2.05 2.07 2.10
Gross Domestic Product 2242.30 2560.99 2494.60 2267.96 1929.28
Average Interest Rate on Loans 11.22 10.43 10.12 9.92 10.34
Inflation Rate 3.60 4.98 6.10 5.03 4.94
Average Interest Rate on Deposits 3.68 2.69 2.85 2.85 2.72


Exercise 7.3 (Gathering practice) Make a side-by-side bar chart with the year on the horizontal axis, and three side-by-side vertical columns for average interest rate on deposits, average interest rate on loans, and inflation rate for each year. In order to get the data into glyph-ready form, you’ll need to use gather. Hint: gather uses the dplyr::select() notation, so you can, e.g., list the columns you want to select, use colon notation, or use contains(a string). See Wickham and Grolemund for more information.