RStudio Resources

Installing R/RStudio

If you have a laptop that you plan to bring to class, you’re strongly encouraged to install R and RStudio on it. Follow these two steps:

STEP 1: Download & install R from https://mirror.las.iastate.edu/CRAN/
STEP 2: Download & install RStudio from https://www.rstudio.com/products/rstudio/download/
Be sure to download the free version!!

Alternatively, if you don’t plan to use RStudio after this course, you can use Macalester’s RStudio server at rstudio.macalester.edu.

Getting started in RStudio

RMarkdown

tidyverse: ggplot2 & dplyr

The tidyverse is a “collection of R packages designed for data science” which “share an underlying design philosophy, grammar, and data structures”⁴. The ggplot2 package that we use for visualizations and the dplyr package that we use for wrangling are part of the tidyverse. Both have “grammars” that are intuitive and generalizable once mastered (though it certainly takes practice).

Cheat sheets

More on dplyr

In the dplyr grammar, there are 5 main verbs (actions):

verb	action
`arrange()`	reorder the rows
`filter()`	take a subset of rows
`select()`	take a subset of columns
`mutate()`	create a new variable, i.e. column
`summarize()`	calculate a numerical summary of a variable, i.e. column

The general syntax for applying these verbs is below, where we call “%>%” a “pipe”:

my_dataset %>% 
  verb(___)

Just as we can add layers to a ggplot using “+”, we can chain sequential data transformations using “%>%”:

my_dataset %>% 
  verb1(___) %>% 
  verb2(___)
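
To make the chaining concrete, here is a minimal sketch on a small made-up data frame (`toy_rides` and its values are invented for illustration, not taken from the bikeshare data). It suggests that a piped sequence does the same work as nesting the function calls, but reads from top to bottom:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_rides <- data.frame(
  day    = c("Mon", "Tue", "Wed", "Thu"),
  riders = c(120, 80, 200, 150)
)

# Piped version: read the steps from top to bottom
piped <- toy_rides %>% 
  filter(riders > 100) %>% 
  arrange(desc(riders))

# Nested version: the same steps, read from the inside out
nested <- arrange(filter(toy_rides, riders > 100), desc(riders))

piped
identical(piped, nested)
```

Each `%>%` passes the result on its left as the first argument of the function on its right, which is why both versions apply `filter()` first and `arrange()` second.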

Consider some examples using the Capital Bikeshare data:

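# Load dplyr (provides the verbs and the %>% pipe), then load the data
library(dplyr)
# stringsAsFactors = TRUE keeps text columns as factors (no longer the default as of R 4.0); the droplevels() calls below need this
bike_data <- read.csv("https://www.macalester.edu/~ajohns24/data/bike_share.csv", stringsAsFactors = TRUE)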

arrange()
We can rearrange the rows of a dataset to be in some meaningful order.

# arrange the rows by riders_total in ascending order
bike_data %>% 
  arrange(riders_total) %>% 
  head(3)
##       date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 10/29/12   fall 2012   Oct         Mon   FALSE      no    64.47200  71.54600
## 2  1/27/11 winter 2011   Jan         Thu   FALSE      no    46.39100  51.77300
## 3 12/26/12 winter 2012   Dec         Wed   FALSE      no    49.95798  51.82997
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.880000 23.999400      categ3             2                20           22
## 2 0.687500  7.627079      categ1            15               416          431
## 3 0.823333 21.208582      categ3             9               432          441

# arrange the rows by riders_total in descending order
bike_data %>% 
  arrange(desc(riders_total)) %>% 
  head(3)
##      date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 9/15/12 summer 2012   Sep         Sat    TRUE      no    76.89498  84.72803
## 2 9/29/12   fall 2012   Sep         Sat    TRUE      no    72.03650  79.72664
## 3 9/22/12 summer 2012   Sep         Sat    TRUE      no    79.97000  86.94392
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.501667  16.58391      categ1          3160              5554         8714
## 2 0.542917  15.24947      categ1          2589              5966         8555
## 3 0.646667  19.00006      categ1          2512              5883         8395

# store the arranged data
arranged_data <- bike_data %>% 
  arrange(desc(riders_total))
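
arrange() also accepts more than one variable, in which case later variables break ties in earlier ones. A minimal sketch, assuming a small made-up data frame (`toy_rides` is hypothetical and not part of the bikeshare data):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_rides <- data.frame(
  season = c("fall", "winter", "fall", "winter"),
  riders = c(300, 150, 500, 90)
)

# Sort by season (alphabetically), then by riders (largest first) within each season
sorted <- toy_rides %>% 
  arrange(season, desc(riders))
sorted
```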

filter()
We’re not always interested in all rows of a dataset. filter() allows us to keep only certain rows that meet a given criterion. To write these criteria, we must specify the variable by which we want to filter the data and the value(s) of that variable that we want to keep. Here are some general rules:

If variable x is quantitative:
x == 1, x < 1, x <= 1, x > 1, x >= 1, x != 1
If variable x is categorical / factor:
x == "a", x != "a", x %in% c("a","b")
If variable x is logical (TRUE / FALSE):
x == TRUE, x == FALSE

# Only keep weekend data
weekend_data <- bike_data %>% 
    filter(weekend == TRUE)
head(weekend_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.535833  17.87587      categ2            68               891          959

# Only keep January data
jan_data <- bike_data %>% 
    filter(month == "Jan") %>% 
    mutate(month = droplevels(month))  # this gets rid of the other month labels being carried along
head(jan_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.437273  16.63670      categ1           120              1229         1349

# Only keep January - March data
winter_data <- bike_data %>% 
    filter(month %in% c("Jan", "Feb", "Mar")) %>% 
    mutate(month = droplevels(month))  # this gets rid of the other month labels being carried along
head(winter_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.437273  16.63670      categ1           120              1229         1349

# Only keep days that were colder than 45 (actual) degrees 
cold_data <- bike_data %>% 
    filter(temp_actual < 45)
head(cold_data, 3)
##      date season year month day_of_week weekend holiday temp_actual temp_feel
## 1  1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
## 2  1/9/11 winter 2011   Jan         Sun    TRUE      no    42.20898  42.45575
## 3 1/10/11 winter 2011   Jan         Mon   FALSE      no    43.13148  45.57992
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.535833  17.87587      categ2            68               891          959
## 2 0.434167  24.25065      categ1            54               768          822
## 3 0.482917  14.95889      categ1            41              1280         1321


# Only keep data for January weekends that were colder than 45 degrees
# Two approaches
cold_jan_wknds <- bike_data %>% 
    filter(weekend == TRUE, month == "Jan", temp_actual < 45)
cold_jan_wknds <- bike_data %>% 
    filter(weekend == TRUE) %>% 
    filter(month == "Jan") %>% 
    filter(temp_actual < 45)
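
The two approaches should keep the same rows, since conditions separated by commas inside filter() must all hold at once. A minimal sketch that checks this on a small made-up data frame (`toy_days` is hypothetical; only its column names mirror the bikeshare data):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_days <- data.frame(
  month       = c("Jan", "Jan", "Feb", "Jan"),
  weekend     = c(TRUE, FALSE, TRUE, TRUE),
  temp_actual = c(40, 38, 42, 55)
)

# Approach 1: all conditions in a single filter()
one_step <- toy_days %>% 
  filter(weekend == TRUE, month == "Jan", temp_actual < 45)

# Approach 2: one condition per filter()
three_steps <- toy_days %>% 
  filter(weekend == TRUE) %>% 
  filter(month == "Jan") %>% 
  filter(temp_actual < 45)

# Compare the two results
all.equal(one_step, three_steps)
```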

select()
There are often more variables (columns) in a dataset than we’re interested in. Removing the superfluous variables can make data analysis more computationally efficient and less overwhelming.

# Keep only riders_total and temp_actual
small_data <- bike_data %>% 
    select(riders_total, temp_actual)
head(small_data, 3)
##   riders_total temp_actual
## 1          985    57.39952
## 2          801    58.82468
## 3         1349    46.49166

# Keep everything BUT riders_total and temp_actual
other_data <- bike_data %>% 
    select(-riders_total, -temp_actual)
head(other_data, 3)
##     date season year month day_of_week weekend holiday temp_feel humidity
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no  64.72625 0.805833
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no  63.83651 0.696087
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no  49.04645 0.437273
##   windspeed weather_cat riders_casual riders_registered
## 1  10.74988      categ2           331               654
## 2  16.65211      categ2           131               670
## 3  16.63670      categ1           120              1229

mutate()
We can create new variables (columns) by mutating existing columns. This is helpful when we want to change the scale of a variable, combine variables into new measurements, etc.

# Convert temp_actual from Fahrenheit to Celsius
new_bike_data <- bike_data %>% 
    mutate(temp_celsius = (temp_actual - 32)*5/9)
new_bike_data %>% 
    select(temp_actual, temp_celsius) %>% 
    head(3)
##   temp_actual temp_celsius
## 1    57.39952    14.110847
## 2    58.82468    14.902598
## 3    46.49166     8.050924

# Make a variable that indicates whether the temp was below 50 degrees Fahrenheit
new_bike_data <- bike_data %>% 
    mutate(is_cold = (temp_actual < 50))
new_bike_data %>% 
    select(temp_actual, is_cold) %>% 
    head(3)
##   temp_actual is_cold
## 1    57.39952   FALSE
## 2    58.82468   FALSE
## 3    46.49166    TRUE
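
The other use mentioned above, combining variables into a new measurement, might look like the following sketch. The data frame `toy_riders` and the new column name `prop_casual` are invented for illustration; the two rider columns mirror those in the bikeshare data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_riders <- data.frame(
  riders_casual     = c(100, 50, 300),
  riders_registered = c(400, 450, 300)
)

# Proportion of each day's riders who were casual riders
toy_riders <- toy_riders %>% 
  mutate(prop_casual = riders_casual / (riders_casual + riders_registered))
toy_riders
```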

summarize() (and group_by())
We can calculate numerical summaries of the variables (columns).

# Calculate the average and median total ridership
bike_data %>% 
    summarize(mean(riders_total), median(riders_total))
##   mean(riders_total) median(riders_total)
## 1           4504.349                 4548

We can also calculate summaries by groups defined by other variables (columns).

# Calculate the average total ridership on weekends vs weekdays
bike_data %>% 
    group_by(weekend) %>% 
    summarize(mean(riders_total))
## # A tibble: 2 x 2
##   weekend `mean(riders_total)`
##   <lgl>                  <dbl>
## 1 FALSE                  4551.
## 2 TRUE                   4390.
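
The default column names above, such as `mean(riders_total)`, are awkward to refer to later. One option is to name each summary inside summarize(). A minimal sketch on a made-up data frame (`toy_days`, `mean_riders`, and `n_days` are illustrative names, not from the original analysis):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_days <- data.frame(
  weekend      = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  riders_total = c(1000, 2000, 4000, 3000, 3000)
)

# name = summary gives each summary column a readable name;
# n() counts the number of rows in each group
by_weekend <- toy_days %>% 
  group_by(weekend) %>% 
  summarize(mean_riders = mean(riders_total), n_days = n())
by_weekend
```

Reporting the group sizes alongside the means makes it easier to judge how much data each average rests on.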

⁴ https://www.tidyverse.org/