RStudio Resources
Installing R/RStudio
There are two options for working with RStudio.
Option 1: install the software on your machine
You are strongly encouraged to download the FREE RStudio so that you have control over your own software. Take the following two steps in the given order. Even if you already have R/RStudio, you should update to the most recent versions: 4.0.3 for R and 1.3.1093 for RStudio.
- STEP 1: Download & install R
Click your preferred “mirror” from here (I use Iowa State University) and then click “Download R for …” depending upon what kind of machine you have. R is the engine behind RStudio, so it must be installed first.
- STEP 2: Download & install RStudio
Download the free version of “RStudio Desktop” from here. RStudio is essentially a more user-friendly dashboard for R.
Option 2: Use Macalester’s RStudio server
If you do not plan to use a personal machine or would rather not download software, you can access Macalester’s RStudio server at www.rstudio.macalester.edu. Once there, you can log in with your Mac username and password. WARNING: If you use the server, it is very important that you frequently export/download your files to your own computer – the server is occasionally scrubbed without warning.
Getting started in RStudio
tidyverse: ggplot2 & dplyr
The tidyverse is a “collection of R packages designed for data science” which “share an underlying design philosophy, grammar, and data structures”5. The ggplot2 package that we use for visualizations and the dplyr package that we use for wrangling are part of the tidyverse. Both have “grammars” that are intuitive and generalizable once mastered (though it certainly takes practice).
22.3.2 Demo videos
ggplot series
complete Rmd for all videos in the series
video 1 = getting started + univariate viz
video 2 = bivariate viz
video 3 = multivariate viz
video 4 = customizing & learning more
dplyr series
complete Rmd for all videos in the series
video 1 = select & mutate
video 2 = summarize & group_by
video 3 = filter & arrange
22.3.3 Written dplyr examples
In the dplyr grammar, there are 5 verbs (actions):
verb | action |
---|---|
arrange() | reorder the rows |
filter() | take a subset of rows |
select() | take a subset of columns |
mutate() | create a new variable, i.e. column |
summarize() | calculate a numerical summary of a variable, i.e. column |
The general syntax for applying these verbs is below, where we call “%>%” a “pipe”:
my_dataset %>%
  verb(___)
Just as we can add layers to a ggplot utilizing “+”, we can implement sequential data transformations utilizing “%>%”:
my_dataset %>%
  verb1(___) %>%
  verb2(___)
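To make the pipe syntax concrete, here is a minimal sketch on a small made-up data frame. The `toy` data and its column names are hypothetical and are not part of the bikeshare data. It suggests that a piped chain gives the same result as the equivalent nested function calls, but reads from top to bottom instead of from the inside out.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  day    = c("Mon", "Tue", "Wed", "Thu"),
  riders = c(120, 80, 150, 95)
)

# Nested calls: read from the inside out
nested <- head(arrange(filter(toy, riders > 90), riders), 2)

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(riders > 90) %>%
  arrange(riders) %>%
  head(2)

# Check whether the two approaches agree
identical(nested, piped)
```

Each `%>%` passes the result on its left as the first argument of the function on its right, which is why `my_dataset` never needs to be repeated inside the verbs.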
Consider some examples using the Capital Bikeshare data:
# Load the dplyr package, which provides the verbs and the %>% pipe
library(dplyr)
# Load data
#bike_data <- read.csv("https://www.macalester.edu/~ajohns24/data/bike_share.csv")
bike_data <- read.csv("https://www.dropbox.com/s/08smmgkaoj1ulhx/bike_share.csv?dl=1")
arrange()
We can rearrange the rows of a dataset to be in some meaningful order.

# arrange the rows by riders_total in ascending order
bike_data %>%
  arrange(riders_total) %>%
  head(3)
##       date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 10/29/12   fall 2012   Oct         Mon   FALSE      no    64.47200  71.54600
## 2  1/27/11 winter 2011   Jan         Thu   FALSE      no    46.39100  51.77300
## 3 12/26/12 winter 2012   Dec         Wed   FALSE      no    49.95798  51.82997
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.880000 23.999400      categ3             2                20           22
## 2 0.687500  7.627079      categ1            15               416          431
## 3 0.823333 21.208582      categ3             9               432          441

# arrange the rows by riders_total in descending order
bike_data %>%
  arrange(desc(riders_total)) %>%
  head(3)
##      date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 9/15/12 summer 2012   Sep         Sat    TRUE      no    76.89498  84.72803
## 2 9/29/12   fall 2012   Sep         Sat    TRUE      no    72.03650  79.72664
## 3 9/22/12 summer 2012   Sep         Sat    TRUE      no    79.97000  86.94392
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.501667  16.58391      categ1          3160              5554         8714
## 2 0.542917  15.24947      categ1          2589              5966         8555
## 3 0.646667  19.00006      categ1          2512              5883         8395

# store the arranged data
arranged_data <- bike_data %>%
  arrange(desc(riders_total))
filter()
We’re not always interested in all rows of a dataset. filter() allows us to keep only certain rows that meet a given criterion. To write these criteria, we must specify the variable by which we want to filter the data and the value(s) of that variable that we want to keep. Here are some general rules:
- If variable x is quantitative: x == 1, x < 1, x <= 1, x > 1, x >= 1, x != 1
- If variable x is categorical / factor: x == "a", x != "a", x %in% c("a","b")
- If variable x is logical (TRUE/FALSE): x == TRUE, x == FALSE
# Only keep weekend data
weekend_data <- bike_data %>%
  filter(weekend == TRUE)
head(weekend_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.535833  17.87587      categ2            68               891          959

# Only keep January data
jan_data <- bike_data %>%
  filter(month == "Jan") %>%
  mutate(month = droplevels(as.factor(month)))  # this gets rid of the other month labels being carried along
head(jan_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.437273  16.63670      categ1           120              1229         1349

# Only keep January - March data
winter_data <- bike_data %>%
  filter(month %in% c("Jan", "Feb", "Mar")) %>%
  mutate(month = droplevels(as.factor(month)))  # this gets rid of the other month labels being carried along
head(winter_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.437273  16.63670      categ1           120              1229         1349

# Only keep days that were colder than 45 (actual) degrees
cold_data <- bike_data %>%
  filter(temp_actual < 45)
head(cold_data, 3)
##      date season year month day_of_week weekend holiday temp_actual temp_feel
## 1  1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
## 2  1/9/11 winter 2011   Jan         Sun    TRUE      no    42.20898  42.45575
## 3 1/10/11 winter 2011   Jan         Mon   FALSE      no    43.13148  45.57992
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.535833  17.87587      categ2            68               891          959
## 2 0.434167  24.25065      categ1            54               768          822
## 3 0.482917  14.95889      categ1            41              1280         1321

# Only keep data for January weekends that were colder than 45 degrees
# Two approaches
cold_jan_wknds <- bike_data %>%
  filter(weekend == TRUE, month == "Jan", temp_actual < 45)
cold_jan_wknds <- bike_data %>%
  filter(weekend == TRUE) %>%
  filter(month == "Jan") %>%
  filter(temp_actual < 45)
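The two approaches to combining criteria can be checked against each other on a small example. The sketch below uses a made-up `toy` data frame (hypothetical names and values, not the bikeshare data) and also adds a third variant that joins the conditions with `&`, which is an addition for illustration rather than something shown in the examples above. Separating criteria with commas inside filter() keeps only the rows that meet all of them.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Jul"),
  weekend = c(TRUE, FALSE, TRUE, TRUE),
  temp    = c(40, 30, 50, 85)
)

# Approach 1: several criteria in one filter(), separated by commas
one_call <- toy %>%
  filter(weekend == TRUE, month == "Jan", temp < 45)

# Approach 2: one filter() per criterion
chained <- toy %>%
  filter(weekend == TRUE) %>%
  filter(month == "Jan") %>%
  filter(temp < 45)

# Variant (not in the examples above): criteria joined with &
with_and <- toy %>%
  filter(weekend == TRUE & month == "Jan" & temp < 45)

# Each version should keep the same single row
nrow(one_call)
nrow(chained)
nrow(with_and)
```

Which form to use is mostly a matter of readability; a single filter() with commas is the most compact when all criteria must hold at once.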
select()
There are often more variables (columns) in a dataset than we’re interested in. Removing the superfluous variables can make data analysis more computationally efficient and less overwhelming.

# Keep only riders_total and temp_actual
small_data <- bike_data %>%
  select(riders_total, temp_actual)
head(small_data, 3)
##   riders_total temp_actual
## 1          985    57.39952
## 2          801    58.82468
## 3         1349    46.49166

# Keep everything BUT riders_total and temp_actual
other_data <- bike_data %>%
  select(-riders_total, -temp_actual)
head(other_data, 3)
##     date season year month day_of_week weekend holiday temp_feel humidity
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no  64.72625 0.805833
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no  63.83651 0.696087
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no  49.04645 0.437273
##   windspeed weather_cat riders_casual riders_registered
## 1  10.74988      categ2           331               654
## 2  16.65211      categ2           131               670
## 3  16.63670      categ1           120              1229
mutate()
We can create new variables (columns) by mutating existing columns. This is helpful when we want to change the scale of a variable, combine variables into new measurements, etc.

# Convert temp_actual from Fahrenheit to Celsius
new_bike_data <- bike_data %>%
  mutate(temp_celsius = (temp_actual - 32)*5/9)
new_bike_data %>%
  select(temp_actual, temp_celsius) %>%
  head(3)
##   temp_actual temp_celsius
## 1    57.39952    14.110847
## 2    58.82468    14.902598
## 3    46.49166     8.050924

# Make a variable that indicates whether the temp was below 50 (Fahrenheit)
new_bike_data <- bike_data %>%
  mutate(is_cold = (temp_actual < 50))
new_bike_data %>%
  select(temp_actual, is_cold) %>%
  head(3)
##   temp_actual is_cold
## 1    57.39952   FALSE
## 2    58.82468   FALSE
## 3    46.49166    TRUE
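As a small extension of the same idea, the sketch below creates two new columns in a single mutate() call, where the second column is computed from the first. The `toy` data frame, its `temp_f` column, and the 10-degree cutoff are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: three temperatures in Fahrenheit
toy <- data.frame(temp_f = c(32, 50, 212))

toy_new <- toy %>%
  mutate(
    temp_c  = (temp_f - 32) * 5/9,  # same Fahrenheit-to-Celsius conversion as above
    is_cold = temp_c < 10           # uses the temp_c column created on the previous line
  )

toy_new
```

Columns inside one mutate() are created in order, so a later column can refer to an earlier one in the same call.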
summarize() (and group_by())
We can calculate numerical summaries of the variables (columns).

# Calculate the average and median total ridership
bike_data %>%
  summarize(mean(riders_total), median(riders_total))
##   mean(riders_total) median(riders_total)
## 1           4504.349                 4548
We can also calculate summaries by groups defined by other variables (columns).
# Calculate the average total ridership on weekends vs weekdays
bike_data %>%
  group_by(weekend) %>%
  summarize(mean(riders_total))
## # A tibble: 2 × 2
##   weekend `mean(riders_total)`
##   <lgl>                  <dbl>
## 1 FALSE                  4551.
## 2 TRUE                   4390.
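The summary columns above take their names from the code that produced them, such as `mean(riders_total)`. One option, sketched here on a made-up `toy` data frame (hypothetical names and values), is to name the summaries inside summarize() and to count the rows in each group with n().

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  weekend = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  riders  = c(100, 200, 300, 400, 500)
)

toy_summary <- toy %>%
  group_by(weekend) %>%
  summarize(
    mean_riders = mean(riders),  # named summary column
    days        = n()            # number of rows in each group
  )

toy_summary
```

Named summaries are easier to refer to in later steps, for example when arranging or plotting the summarized data.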