RStudio Resources

Installing R/RStudio

If you have a laptop that you plan to bring to class, you’re strongly encouraged to install RStudio on it. This takes two steps: first install R, then install RStudio.

Alternatively, if you don’t plan to use RStudio after this course, you can use Macalester’s RStudio server at rstudio.macalester.edu.



tidyverse: ggplot2 & dplyr

The tidyverse is a “collection of R packages designed for data science” which “share an underlying design philosophy, grammar, and data structures” [4]. The ggplot2 package that we use for visualizations and the dplyr package that we use for wrangling are part of the tidyverse. Both have “grammars” that are intuitive and generalizable once mastered (though it certainly takes practice).


Cheat sheets



More on dplyr

In the dplyr grammar, there are 5 verbs (actions):

verb          action
arrange()     reorder the rows
filter()      take a subset of rows
select()      take a subset of columns
mutate()      create a new variable, i.e. column
summarize()   calculate a numerical summary of a variable, i.e. column


The general syntax for applying these verbs is below, where we call “%>%” a “pipe”:
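A minimal sketch of that pattern follows. The data frame my_data and its columns are hypothetical, made up only to show the shape of a piped call.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
my_data <- data.frame(x = c(3, 1, 2), y = c("a", "b", "c"))

# General pattern: data %>% verb(arguments)
# The pipe passes my_data in as the first argument of the verb,
# so the call below is equivalent to arrange(my_data, x)
sorted_data <- my_data %>%
    arrange(x)
```

Reading the pipe aloud as "then" can help: take my_data, then arrange it by x.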

Just as we can add layers to a ggplot using “+”, we can chain sequential data transformations using “%>%”:
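As a sketch of such a chain, the code below filters and then sorts a toy stand-in for bike_data. The six rows and the reduced set of columns are copied from output printed elsewhere in this section; treating them as the whole dataset is an assumption made for illustration.

```r
library(dplyr)

# Toy stand-in for bike_data (assumption: six rows, a few columns only)
bike_data <- data.frame(
    date         = c("1/1/11", "1/2/11", "1/3/11", "1/8/11", "1/9/11", "1/10/11"),
    weekend      = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
    temp_actual  = c(57.39952, 58.82468, 46.49166, 44.17700, 42.20898, 43.13148),
    riders_total = c(985, 801, 1349, 959, 822, 1321)
)

# Step 1: keep only weekend days
# Step 2: sort the remaining rows from fewest to most riders
weekend_sorted <- bike_data %>%
    filter(weekend == TRUE) %>%
    arrange(riders_total)
```

Each verb receives the output of the step before it, so the order of the steps matters.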


Consider some examples using the Capital Bikeshare data:

  1. arrange()
    We can rearrange the rows of a dataset to be in some meaningful order.
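A small sketch of arrange(), in ascending and descending order. The bike_data below is a toy stand-in built from six rows printed elsewhere in this section, with only a few columns kept; that reduction is an assumption for illustration.

```r
library(dplyr)

# Toy stand-in for bike_data (assumption: six rows, a few columns only)
bike_data <- data.frame(
    date         = c("1/1/11", "1/2/11", "1/3/11", "1/8/11", "1/9/11", "1/10/11"),
    temp_actual  = c(57.39952, 58.82468, 46.49166, 44.17700, 42.20898, 43.13148),
    riders_total = c(985, 801, 1349, 959, 822, 1321)
)

# Sort from coldest to warmest day (ascending is the default)
by_temp <- bike_data %>%
    arrange(temp_actual)

# Sort from most to fewest riders by wrapping the variable in desc()
by_riders_desc <- bike_data %>%
    arrange(desc(riders_total))
```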


  1. filter()
    We’re not always interested in all rows of a dataset. filter() allows us to keep only certain rows that meet a given criterion. To write these criteria, we must specify the variable by which we want to filter the data and the value(s) of that variable that we want to keep. Here are some general rules:

    • If variable x is quantitative:
      x == 1, x < 1, x <= 1, x > 1, x >= 1, x != 1

    • If variable x is categorical / factor:
      x == "a", x != "a", x %in% c("a","b")

    • If variable x is logical (TRUE / FALSE):
      x == TRUE, x == FALSE

    # Only keep weekend data
    weekend_data <- bike_data %>% 
        filter(weekend == TRUE)
    head(weekend_data, 3)
    ##     date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
    ## 3 1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.805833  10.74988      categ2           331               654          985
    ## 2 0.696087  16.65211      categ2           131               670          801
    ## 3 0.535833  17.87587      categ2            68               891          959
    
    # Only keep January data
    jan_data <- bike_data %>% 
        filter(month == "Jan") %>% 
        mutate(month = droplevels(month))  # this gets rid of the other month labels being carried along
    head(jan_data, 3)
    ##     date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
    ## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.805833  10.74988      categ2           331               654          985
    ## 2 0.696087  16.65211      categ2           131               670          801
    ## 3 0.437273  16.63670      categ1           120              1229         1349
    
    # Only keep January - March data
    winter_data <- bike_data %>% 
        filter(month %in% c("Jan", "Feb", "Mar")) %>% 
        mutate(month = droplevels(month))  # this gets rid of the other month labels being carried along
    head(winter_data, 3)
    ##     date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
    ## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.805833  10.74988      categ2           331               654          985
    ## 2 0.696087  16.65211      categ2           131               670          801
    ## 3 0.437273  16.63670      categ1           120              1229         1349
    
    # Only keep days that were colder than 45 (actual) degrees 
    cold_data <- bike_data %>% 
        filter(temp_actual < 45)
    head(cold_data, 3)
    ##      date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1  1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
    ## 2  1/9/11 winter 2011   Jan         Sun    TRUE      no    42.20898  42.45575
    ## 3 1/10/11 winter 2011   Jan         Mon   FALSE      no    43.13148  45.57992
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.535833  17.87587      categ2            68               891          959
    ## 2 0.434167  24.25065      categ1            54               768          822
    ## 3 0.482917  14.95889      categ1            41              1280         1321
    
    
    # Only keep data for January weekends that were colder than 45 degrees
    # Two equivalent approaches:
    # Approach 1: list all criteria in one filter(), separated by commas
    cold_jan_wknds <- bike_data %>% 
        filter(weekend == TRUE, month == "Jan", temp_actual < 45)
    # Approach 2: chain one filter() per criterion
    cold_jan_wknds <- bike_data %>% 
        filter(weekend == TRUE) %>% 
        filter(month == "Jan") %>% 
        filter(temp_actual < 45)


  1. select()
    There are often more variables (columns) in a dataset than we’re interested in. Removing the superfluous variables can make data analysis more computationally efficient and less overwhelming.
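A short sketch of select(), keeping columns by name and dropping one with a minus sign. The bike_data below is a toy stand-in built from rows printed elsewhere in this section; the choice of columns is an assumption for illustration.

```r
library(dplyr)

# Toy stand-in for bike_data (assumption: three rows, a few columns only)
bike_data <- data.frame(
    date              = c("1/1/11", "1/2/11", "1/3/11"),
    temp_actual       = c(57.39952, 58.82468, 46.49166),
    riders_casual     = c(331, 131, 120),
    riders_registered = c(654, 670, 1229),
    riders_total      = c(985, 801, 1349)
)

# Keep only the named columns
small_data <- bike_data %>%
    select(date, temp_actual, riders_total)

# Keep every column except riders_casual
no_casual <- bike_data %>%
    select(-riders_casual)
```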


  1. mutate()
    We can create new variables (columns) by mutating existing columns. This is helpful when we want to change the scale of a variable, combine variables into new measurements, etc.
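A sketch of both uses of mutate(): rescaling one variable and combining two into a new measurement. The bike_data below is a toy stand-in built from rows printed elsewhere in this section, and the new column names riders_thousands and casual_share are hypothetical.

```r
library(dplyr)

# Toy stand-in for bike_data (assumption: three rows, a few columns only)
bike_data <- data.frame(
    date          = c("1/1/11", "1/2/11", "1/3/11"),
    riders_casual = c(331, 131, 120),
    riders_total  = c(985, 801, 1349)
)

bike_data_new <- bike_data %>%
    mutate(
        # Change of scale: ridership in thousands of riders
        riders_thousands = riders_total / 1000,
        # Combination of variables: fraction of riders who were casual
        casual_share = riders_casual / riders_total
    )
```

The original columns are kept; mutate() appends the new ones to the end of the data frame.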

 

  1. summarize() (and group_by())
    We can calculate numerical summaries of the variables (columns).
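A sketch of summarize() collapsing a dataset to a single row of summaries. The bike_data below is a toy stand-in built from six rows printed elsewhere in this section, and the summary names mean_riders and max_temp are hypothetical.

```r
library(dplyr)

# Toy stand-in for bike_data (assumption: six rows, a few columns only)
bike_data <- data.frame(
    temp_actual  = c(57.39952, 58.82468, 46.49166, 44.17700, 42.20898, 43.13148),
    riders_total = c(985, 801, 1349, 959, 822, 1321)
)

# Each argument names a summary and gives the calculation for it
ride_summary <- bike_data %>%
    summarize(
        mean_riders = mean(riders_total),
        max_temp    = max(temp_actual)
    )
```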

    We can also calculate summaries by groups defined by other variables (columns).
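A sketch of a grouped summary: group_by() marks the groups, and summarize() then returns one row per group. The bike_data below is a toy stand-in built from six rows printed elsewhere in this section, and the summary names mean_riders and n_days are hypothetical.

```r
library(dplyr)

# Toy stand-in for bike_data (assumption: six rows, a few columns only)
bike_data <- data.frame(
    weekend      = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
    riders_total = c(985, 801, 1349, 959, 822, 1321)
)

# One row of summaries for weekdays and one for weekends
riders_by_weekend <- bike_data %>%
    group_by(weekend) %>%
    summarize(
        mean_riders = mean(riders_total),
        n_days      = n()   # n() counts the rows in each group
    )
```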