RStudio Resources

Installing R/RStudio

There are two options for working with RStudio.

  • Option 1: install the software on your machine
    You are strongly encouraged to download the FREE RStudio so that you can have control over your own software. Take the following two steps in the given order. Even if you already have R/RStudio, you should update to the most recent versions: 4.0.3 for R and 1.3.1093 for RStudio.

    • STEP 1: Download & install R
      Choose your preferred CRAN “mirror” (I use Iowa State University), then click “Download R for …” for the kind of machine you have. R is the engine behind RStudio, so it must be installed first.
    • STEP 2: Download & install RStudio
      Download the free version of “RStudio Desktop” from here. RStudio is essentially a more user-friendly dashboard for R.


  • Option 2: use Macalester’s RStudio server
    If you do not plan to use a personal machine or would rather not download software, you can access Macalester’s RStudio server at www.rstudio.macalester.edu. Once there, log in with your Macalester username and password. WARNING: If you use the server, it is very important that you frequently export/download your files to your own computer, because the server is occasionally scrubbed without warning.
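Whichever option you choose, you can check which version of R you are running from the console. The lines below are a minimal sketch: `getRversion()` is a base R function, and the comparison uses the 4.0.3 version named above. The object names `installed_version` and `up_to_date` are only illustrative.

```r
# Version of R that is currently running
installed_version <- getRversion()
print(installed_version)

# TRUE if the running R is at least the version recommended above
up_to_date <- installed_version >= "4.0.3"
print(up_to_date)
```

If `up_to_date` is FALSE, repeating STEP 1 with a fresh download should bring R up to date.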



tidyverse: ggplot2 & dplyr

The tidyverse is a “collection of R packages designed for data science” which “share an underlying design philosophy, grammar, and data structures”5. The ggplot2 package that we use for visualizations and the dplyr package that we use for wrangling are part of the tidyverse. Both have “grammars” that are intuitive and generalizable once mastered (though it certainly takes practice).
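
Before any of the ggplot2 or dplyr functions can be used, the packages have to be installed once and then loaded in each new R session. A minimal sketch, assuming you are on your own machine and have not yet installed the tidyverse (the install line is commented out so that it is not re-run by accident):

```r
# One-time installation (uncomment and run once per machine)
# install.packages("tidyverse")

# Load the two packages used in this section
library(ggplot2)  # visualization
library(dplyr)    # wrangling
```

Loading the packages individually, as here, makes it clear which package each function comes from; `library(tidyverse)` would load both along with the other core tidyverse packages.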


22.3.2 Demo videos

ggplot series



dplyr series


22.3.3 Written dplyr examples

In the dplyr grammar, there are 5 main verbs (actions):

verb          action
arrange()     reorder the rows
filter()      take a subset of rows
select()      take a subset of columns
mutate()      create a new variable, i.e. column
summarize()   calculate a numerical summary of a variable, i.e. column



The general syntax for applying these verbs is below, where we call “%>%” a “pipe”:

my_dataset %>% 
  verb(___)

Just as we can add layers to a ggplot using “+”, we can chain sequential data transformations using “%>%”:

my_dataset %>% 
  verb1(___) %>% 
  verb2(___)
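
To make the pipe concrete, here is a small sketch using R's built-in `mtcars` data (chosen only for illustration; it is not part of the bikeshare data). Writing `x %>% f(y)` is shorthand for `f(x, y)`, so a piped chain and the equivalent nested call produce the same result. The object names `piped` and `nested` are hypothetical.

```r
library(dplyr)

# Piped version: steps are read top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(mpg) %>%
  head(3)

# Nested version: the same steps, read from the inside out
nested <- head(arrange(filter(mtcars, cyl == 4), mpg), 3)

# Both approaches give the same 3 rows
identical(piped, nested)
```

The piped version is usually easier to read because the order of the code matches the order in which the transformations happen.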


Consider some examples using the Capital Bikeshare data:

# Load the dplyr package, which provides the verbs and the %>% pipe
library(dplyr)

# Load data
#bike_data <- read.csv("https://www.macalester.edu/~ajohns24/data/bike_share.csv")
bike_data <- read.csv("https://www.dropbox.com/s/08smmgkaoj1ulhx/bike_share.csv?dl=1")

  1. arrange()
    We can rearrange the rows of a dataset to be in some meaningful order.

    # arrange the rows by riders_total in ascending order
    bike_data %>% 
      arrange(riders_total) %>% 
      head(3)
    ##       date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 10/29/12   fall 2012   Oct         Mon   FALSE      no    64.47200  71.54600
    ## 2  1/27/11 winter 2011   Jan         Thu   FALSE      no    46.39100  51.77300
    ## 3 12/26/12 winter 2012   Dec         Wed   FALSE      no    49.95798  51.82997
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.880000 23.999400      categ3             2                20           22
    ## 2 0.687500  7.627079      categ1            15               416          431
    ## 3 0.823333 21.208582      categ3             9               432          441
    
    # arrange the rows by riders_total in descending order
    bike_data %>% 
      arrange(desc(riders_total)) %>% 
      head(3)
    ##      date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 9/15/12 summer 2012   Sep         Sat    TRUE      no    76.89498  84.72803
    ## 2 9/29/12   fall 2012   Sep         Sat    TRUE      no    72.03650  79.72664
    ## 3 9/22/12 summer 2012   Sep         Sat    TRUE      no    79.97000  86.94392
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.501667  16.58391      categ1          3160              5554         8714
    ## 2 0.542917  15.24947      categ1          2589              5966         8555
    ## 3 0.646667  19.00006      categ1          2512              5883         8395
    
    # store the arranged data
    arranged_data <- bike_data %>% 
      arrange(desc(riders_total)) 
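
    arrange() also accepts more than one variable, in which case later variables break ties in earlier ones. The sketch below uses a tiny made-up data frame (`toy`, with hypothetical values) rather than the full bikeshare data so that the result is easy to check by eye.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the bikeshare file
    toy <- data.frame(
      season = c("fall", "winter", "fall", "winter"),
      riders_total = c(300, 150, 120, 480)
    )

    # Sort by season (alphabetical), then by riders_total
    # within each season, largest first
    toy_sorted <- toy %>%
      arrange(season, desc(riders_total))
    toy_sorted
    ```

    Here the two fall rows come first (300, then 120), followed by the two winter rows (480, then 150).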


  2. filter()
    We’re not always interested in all rows of a dataset. filter() allows us to keep only certain rows that meet a given criterion. To write these criteria, we must specify the variable by which we want to filter the data and the value(s) of that variable that we want to keep. Here are some general rules:

    • If variable x is quantitative:
      x == 1, x < 1, x <= 1, x > 1, x >= 1, x != 1

    • If variable x is categorical / factor:
      x == "a", x != "a", x %in% c("a","b")

    • If variable x is logical (TRUE / FALSE):
      x == TRUE, x == FALSE

    # Only keep weekend data
    weekend_data <- bike_data %>% 
        filter(weekend == TRUE)
    head(weekend_data, 3)
    ##     date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
    ## 3 1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.805833  10.74988      categ2           331               654          985
    ## 2 0.696087  16.65211      categ2           131               670          801
    ## 3 0.535833  17.87587      categ2            68               891          959
    
    # Only keep January data
    jan_data <- bike_data %>% 
        filter(month == "Jan") %>% 
        mutate(month = droplevels(as.factor(month)))  # this gets rid of the other month labels being carried along
    head(jan_data, 3)
    ##     date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
    ## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.805833  10.74988      categ2           331               654          985
    ## 2 0.696087  16.65211      categ2           131               670          801
    ## 3 0.437273  16.63670      categ1           120              1229         1349
    
    # Only keep January - March data
    winter_data <- bike_data %>% 
        filter(month %in% c("Jan", "Feb", "Mar")) %>% 
        mutate(month = droplevels(as.factor(month)))  # this gets rid of the other month labels being carried along
    head(winter_data, 3)
    ##     date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
    ## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.805833  10.74988      categ2           331               654          985
    ## 2 0.696087  16.65211      categ2           131               670          801
    ## 3 0.437273  16.63670      categ1           120              1229         1349
    
    # Only keep days that were colder than 45 (actual) degrees 
    cold_data <- bike_data %>% 
        filter(temp_actual < 45)
    head(cold_data, 3)
    ##      date season year month day_of_week weekend holiday temp_actual temp_feel
    ## 1  1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
    ## 2  1/9/11 winter 2011   Jan         Sun    TRUE      no    42.20898  42.45575
    ## 3 1/10/11 winter 2011   Jan         Mon   FALSE      no    43.13148  45.57992
    ##   humidity windspeed weather_cat riders_casual riders_registered riders_total
    ## 1 0.535833  17.87587      categ2            68               891          959
    ## 2 0.434167  24.25065      categ1            54               768          822
    ## 3 0.482917  14.95889      categ1            41              1280         1321
    
    
    # Only keep data for January weekends that were colder than 45 degrees
    # Two equivalent approaches
    # Approach 1: list all criteria in one filter(), separated by commas
    cold_jan_wknds <- bike_data %>% 
        filter(weekend == TRUE, month == "Jan", temp_actual < 45)
    # Approach 2: apply one filter() per criterion
    cold_jan_wknds <- bike_data %>% 
        filter(weekend == TRUE) %>% 
        filter(month == "Jan") %>% 
        filter(temp_actual < 45) 
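
    The examples above combine criteria with "and". As a further sketch, criteria can also be combined with "or" (`|`) or negated (`!`). The data frame `toy` below is made up for illustration, with column names borrowed from the bikeshare data.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the bikeshare file
    toy <- data.frame(
      month = c("Jan", "Feb", "Jul", "Dec"),
      weekend = c(TRUE, FALSE, TRUE, FALSE),
      temp_actual = c(40, 52, 85, 38)
    )

    # AND: inside filter(), & behaves like a comma
    both <- toy %>% filter(weekend == TRUE & temp_actual < 45)

    # OR: keep rows that are weekends or colder than 45 degrees
    either <- toy %>% filter(weekend == TRUE | temp_actual < 45)

    # NOT: keep every month except Jan and Feb
    not_jan_feb <- toy %>% filter(!(month %in% c("Jan", "Feb")))
    ```

    With these values, `both` keeps only the Jan row, `either` keeps the Jan, Jul, and Dec rows, and `not_jan_feb` keeps the Jul and Dec rows.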


  3. select()
    There are often more variables (columns) in a dataset than we’re interested in. Removing the superfluous variables can make data analysis more computationally efficient and less overwhelming.

    # Keep only riders_total and temp_actual
    small_data <- bike_data %>% 
        select(riders_total, temp_actual)
    head(small_data, 3)
    ##   riders_total temp_actual
    ## 1          985    57.39952
    ## 2          801    58.82468
    ## 3         1349    46.49166
    
    # Keep everything BUT riders_total and temp_actual
    other_data <- bike_data %>% 
        select(-riders_total, -temp_actual)
    head(other_data, 3)
    ##     date season year month day_of_week weekend holiday temp_feel humidity
    ## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no  64.72625 0.805833
    ## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no  63.83651 0.696087
    ## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no  49.04645 0.437273
    ##   windspeed weather_cat riders_casual riders_registered
    ## 1  10.74988      categ2           331               654
    ## 2  16.65211      categ2           131               670
    ## 3  16.63670      categ1           120              1229
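
    When many columns share a naming pattern, listing them one by one gets tedious. As a sketch, select() also accepts helper functions such as `starts_with()` and column ranges written with `:`. The data frame `toy` is a made-up, cut-down version of the bikeshare columns.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the bikeshare file
    toy <- data.frame(
      date = c("1/1/11", "1/2/11"),
      temp_actual = c(57.4, 58.8),
      temp_feel = c(64.7, 63.8),
      riders_casual = c(331, 131),
      riders_registered = c(654, 670)
    )

    # Keep every column whose name starts with "riders"
    riders_only <- toy %>% select(starts_with("riders"))

    # Keep a contiguous range of columns, from date through temp_feel
    first_three <- toy %>% select(date:temp_feel)

    names(riders_only)
    names(first_three)
    ```

    `riders_only` contains riders_casual and riders_registered, and `first_three` contains date, temp_actual, and temp_feel.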


  4. mutate()
    We can create new variables (columns) by mutating existing columns. This is helpful when we want to change the scale of a variable, combine variables into new measurements, etc.

    # Convert temp_actual from Fahrenheit to Celsius
    new_bike_data <- bike_data %>% 
        mutate(temp_celsius = (temp_actual - 32)*5/9)
    new_bike_data %>% 
        select(temp_actual, temp_celsius) %>% 
        head(3)
    ##   temp_actual temp_celsius
    ## 1    57.39952    14.110847
    ## 2    58.82468    14.902598
    ## 3    46.49166     8.050924
    
    # Make a variable that indicates whether the temp was below 50 (Fahrenheit)
    new_bike_data <- bike_data %>% 
        mutate(is_cold = (temp_actual < 50))
    new_bike_data %>% 
        select(temp_actual, is_cold) %>% 
        head(3)
    ##   temp_actual is_cold
    ## 1    57.39952   FALSE
    ## 2    58.82468   FALSE
    ## 3    46.49166    TRUE
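
    As a sketch of combining variables into a new measurement, a single mutate() can create several columns at once, and later columns can use ones created earlier in the same call. The toy data below reuses the first three rider counts printed above; the name `prop_casual` is hypothetical.

    ```r
    library(dplyr)

    # Toy data: rider counts from the first three rows shown above
    toy <- data.frame(
      riders_casual = c(331, 131, 120),
      riders_registered = c(654, 670, 1229)
    )

    toy_new <- toy %>%
      mutate(
        riders_total = riders_casual + riders_registered,
        # this line uses the riders_total column created just above
        prop_casual = riders_casual / riders_total
      )
    toy_new
    ```

    The computed riders_total values (985, 801, 1349) match the riders_total column in the original data.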

 

  5. summarize() (and group_by())
    We can calculate numerical summaries of the variables (columns).

    # Calculate the average and median total ridership
    bike_data %>% 
        summarize(mean(riders_total), median(riders_total))
    ##   mean(riders_total) median(riders_total)
    ## 1           4504.349                 4548

    We can also calculate summaries by groups defined by other variables (columns).

    # Calculate the average total ridership on weekends vs weekdays
    bike_data %>% 
        group_by(weekend) %>% 
        summarize(mean(riders_total))
    ## # A tibble: 2 × 2
    ##   weekend `mean(riders_total)`
    ##   <lgl>                  <dbl>
    ## 1 FALSE                  4551.
    ## 2 TRUE                   4390.
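
    The summary columns above take their names from the code that produced them, which is awkward to refer to later. As a sketch, summaries can be given names inside summarize(), and `n()` counts the rows in each group. The data frame `toy` and its values are made up for illustration.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the bikeshare file
    toy <- data.frame(
      weekend = c(TRUE, TRUE, FALSE, FALSE, FALSE),
      riders_total = c(900, 800, 1300, 1500, 1700)
    )

    toy_summary <- toy %>%
      group_by(weekend) %>%
      summarize(
        n_days = n(),                      # number of rows in each group
        mean_riders = mean(riders_total),  # named summary columns
        max_riders = max(riders_total)
      )
    toy_summary
    ```

    With these values, the weekday (FALSE) group has 3 days and a mean of 1500 riders, and the weekend (TRUE) group has 2 days and a mean of 850.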