RStudio
Handy packages & features
RStan
There are two options for installing RStan:
If you’re planning to use your own desktop version of RStudio, carefully follow these RStan install instructions. NOTE: It’s important to have R version 4.x.x before trying to install RStan.
If you’re planning to use Macalester’s RStudio server, or are having difficulties installing RStan on your own machine:
- You can use Mac’s server: rstudio.macalester.edu. Log in using the same info as your Mac email.
- Within RStudio, go to your home page by clicking the home icon at top right.
- Start a new session.
- Within that session, look at the top right of the screen to confirm that your session is in R version 4.1.0.
- The packages tab at lower right includes both a “User library”, which contains packages you’ve installed on the server, and a “System library”, which contains packages that ITS pre-installed for you. You will need to uninstall ggplot2 from your User library by clicking the x to the right of that package. Similarly, you may need to uninstall other packages that appear in both the User and System libraries, since the duplicates can interfere with RStan.
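The version check and the search for duplicated packages can also be done from the console. The following is a minimal sketch using only base R; the object names (`ip`, `pkgs`, `duplicates`) are hypothetical, and which library path counts as the “User library” on the server is an assumption you should verify with `.libPaths()`.

```r
# Confirm which R version the current session is running
R.version.string

# List every installed package along with the library it lives in
ip <- installed.packages()
pkgs <- data.frame(
  package = unname(ip[, "Package"]),
  library = unname(ip[, "LibPath"])
)

# Packages that appear in more than one library (e.g. User and System)
dup_names  <- unique(pkgs$package[duplicated(pkgs$package)])
duplicates <- pkgs[pkgs$package %in% dup_names, ]
duplicates

# Assumption: the first entry of .libPaths() is your User library.
# If so, a duplicate could be removed with, for example:
# remove.packages("ggplot2", lib = .libPaths()[1])
```

Clicking the x in the packages tab does the same job; the console version is just easier to scan when there are many duplicates.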
On RStudio cloud you can import an older version of rstanarm as follows:
remotes::install_version("rstanarm", version = "2.19.3", repos = 'https://packagemanager.rstudio.com/all/__linux__/focal/latest')
tidyverse
The tidyverse is a “collection of R packages designed for data science” which “share an underlying design philosophy, grammar, and data structures”7. The ggplot2 package that we use for visualizations and the dplyr package that we use for wrangling are part of the tidyverse. Both have “grammars” that are intuitive and generalizable once mastered (though it certainly takes practice).
Cheat sheets
More on dplyr
In the dplyr grammar, there are 5 verbs (actions):
verb | action |
---|---|
arrange() | reorder the rows |
filter() | take a subset of rows |
select() | take a subset of columns |
mutate() | create a new variable, i.e. column |
summarize() | calculate a numerical summary of a variable, i.e. column |
The general syntax for applying these verbs is below, where we call “%>%” a “pipe”:
my_dataset %>%
  verb(___)
Just as we can add layers to a ggplot using “+”, we can implement sequential data transformations using “%>%”:
my_dataset %>%
  verb1(___) %>%
  verb2(___)
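To see what the pipe is doing, it can help to compare a piped chain with the equivalent nested function calls. The sketch below uses a small made-up data frame (`toy_data`, with hypothetical `day` and `riders` columns) rather than the course data; both versions should produce the same result.

```r
library(dplyr)

# Hypothetical toy data, not part of the bike share dataset
toy_data <- data.frame(
  day    = c("Mon", "Tue", "Wed", "Thu"),
  riders = c(120, 80, 200, 150)
)

# Nested version: must be read from the inside out
nested <- head(arrange(filter(toy_data, riders > 100), desc(riders)), 2)

# Piped version: reads top to bottom, one step per line
piped <- toy_data %>%
  filter(riders > 100) %>%
  arrange(desc(riders)) %>%
  head(2)

# Check that the two approaches agree
identical(nested, piped)
```

The pipe passes the object on its left as the first argument of the function on its right, which is why each verb only needs its remaining arguments.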
Consider some examples using the Capital Bikeshare data:
# Load the dplyr package and the data
library(dplyr)
bike_data <- read.csv("https://www.macalester.edu/~ajohns24/data/bike_share.csv")
arrange()
We can rearrange the rows of a dataset to be in some meaningful order.
# arrange the rows by riders_total in ascending order
bike_data %>%
  arrange(riders_total) %>%
  head(3)
##       date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 10/29/12   fall 2012   Oct         Mon   FALSE      no    64.47200  71.54600
## 2  1/27/11 winter 2011   Jan         Thu   FALSE      no    46.39100  51.77300
## 3 12/26/12 winter 2012   Dec         Wed   FALSE      no    49.95798  51.82997
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.880000 23.999400      categ3             2                20           22
## 2 0.687500  7.627079      categ1            15               416          431
## 3 0.823333 21.208582      categ3             9               432          441
# arrange the rows by riders_total in descending order
bike_data %>%
  arrange(desc(riders_total)) %>%
  head(3)
##      date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 9/15/12 summer 2012   Sep         Sat    TRUE      no    76.89498  84.72803
## 2 9/29/12   fall 2012   Sep         Sat    TRUE      no    72.03650  79.72664
## 3 9/22/12 summer 2012   Sep         Sat    TRUE      no    79.97000  86.94392
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.501667  16.58391      categ1          3160              5554         8714
## 2 0.542917  15.24947      categ1          2589              5966         8555
## 3 0.646667  19.00006      categ1          2512              5883         8395
# store the arranged data
arranged_data <- bike_data %>%
  arrange(desc(riders_total))
filter()
We’re not always interested in all rows of a dataset. filter() allows us to keep only certain rows that meet a given criterion. To write these criteria, we must specify the variable by which we want to filter the data and the value(s) of that variable that we want to keep. Here are some general rules:
- If variable x is quantitative: x == 1, x < 1, x <= 1, x > 1, x >= 1, x != 1
- If variable x is categorical / factor: x == "a", x != "a", x %in% c("a","b")
- If variable x is logical (TRUE/FALSE): x == TRUE, x == FALSE
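As a quick illustration of these rules on something small enough to check by eye, here is a sketch using a made-up data frame (`toy_data` and its columns are hypothetical, not the bike share data). It also shows that criteria can be combined with “|” (or) and “&” (and).

```r
library(dplyr)

# Hypothetical toy data with one variable of each type
toy_data <- data.frame(
  temp    = c(30, 55, 42, 70),               # quantitative
  month   = c("Jan", "Feb", "Jan", "Jun"),   # categorical
  weekend = c(TRUE, FALSE, FALSE, TRUE)      # logical
)

# Quantitative rule: keep days at or above 42 degrees
warm_days <- toy_data %>%
  filter(temp >= 42)

# Categorical rule: keep every month except January
not_jan <- toy_data %>%
  filter(month != "Jan")

# Logical rule: keep weekdays only
weekdays <- toy_data %>%
  filter(weekend == FALSE)

# Combined rule: keep rows that are in January OR on a weekend
jan_or_weekend <- toy_data %>%
  filter(month == "Jan" | weekend == TRUE)
```

Separating criteria with commas inside filter() behaves like “&”: a row is kept only if it meets every criterion.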
# Only keep weekend data
weekend_data <- bike_data %>%
  filter(weekend == TRUE)
head(weekend_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.535833  17.87587      categ2            68               891          959
# Only keep January data
jan_data <- bike_data %>%
  filter(month == "Jan")
head(jan_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.437273  16.63670      categ1           120              1229         1349
# Only keep January - March data
winter_data <- bike_data %>%
  filter(month %in% c("Jan", "Feb", "Mar"))
head(winter_data, 3)
##     date season year month day_of_week weekend holiday temp_actual temp_feel
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no    57.39952  64.72625
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no    58.82468  63.83651
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no    46.49166  49.04645
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.805833  10.74988      categ2           331               654          985
## 2 0.696087  16.65211      categ2           131               670          801
## 3 0.437273  16.63670      categ1           120              1229         1349
# Only keep days that were colder than 45 (actual) degrees
cold_data <- bike_data %>%
  filter(temp_actual < 45)
head(cold_data, 3)
##      date season year month day_of_week weekend holiday temp_actual temp_feel
## 1  1/8/11 winter 2011   Jan         Sat    TRUE      no    44.17700  46.60286
## 2  1/9/11 winter 2011   Jan         Sun    TRUE      no    42.20898  42.45575
## 3 1/10/11 winter 2011   Jan         Mon   FALSE      no    43.13148  45.57992
##   humidity windspeed weather_cat riders_casual riders_registered riders_total
## 1 0.535833  17.87587      categ2            68               891          959
## 2 0.434167  24.25065      categ1            54               768          822
## 3 0.482917  14.95889      categ1            41              1280         1321
# Only keep data for January weekends that were colder than 45 degrees
# Two approaches
cold_jan_wknds <- bike_data %>%
  filter(weekend == TRUE, month == "Jan", temp_actual < 45)
cold_jan_wknds <- bike_data %>%
  filter(weekend == TRUE) %>%
  filter(month == "Jan") %>%
  filter(temp_actual < 45)
select()
There are often more variables (columns) in a dataset than we’re interested in. Removing the superfluous variables can make data analysis more computationally efficient and less overwhelming.
# Keep only riders_total and temp_actual
small_data <- bike_data %>%
  select(riders_total, temp_actual)
head(small_data, 3)
##   riders_total temp_actual
## 1          985    57.39952
## 2          801    58.82468
## 3         1349    46.49166
# Keep everything BUT riders_total and temp_actual
other_data <- bike_data %>%
  select(-riders_total, -temp_actual)
head(other_data, 3)
##     date season year month day_of_week weekend holiday temp_feel humidity
## 1 1/1/11 winter 2011   Jan         Sat    TRUE      no  64.72625 0.805833
## 2 1/2/11 winter 2011   Jan         Sun    TRUE      no  63.83651 0.696087
## 3 1/3/11 winter 2011   Jan         Mon   FALSE      no  49.04645 0.437273
##   windspeed weather_cat riders_casual riders_registered
## 1  10.74988      categ2           331               654
## 2  16.65211      categ2           131               670
## 3  16.63670      categ1           120              1229
mutate()
We can create new variables (columns) by mutating existing columns. This is helpful when we want to change the scale of a variable, combine variables into new measurements, etc.
# Convert temp_actual from Fahrenheit to Celsius
new_bike_data <- bike_data %>%
  mutate(temp_celsius = (temp_actual - 32)*5/9)
new_bike_data %>%
  select(temp_actual, temp_celsius) %>%
  head(3)
##   temp_actual temp_celsius
## 1    57.39952    14.110847
## 2    58.82468    14.902598
## 3    46.49166     8.050924
# Make a variable that indicates whether the temp was below 50 (Fahrenheit)
new_bike_data <- bike_data %>%
  mutate(is_cold = (temp_actual < 50))
new_bike_data %>%
  select(temp_actual, is_cold) %>%
  head(3)
##   temp_actual is_cold
## 1    57.39952   FALSE
## 2    58.82468   FALSE
## 3    46.49166    TRUE
summarize() (and group_by())
We can calculate numerical summaries of the variables (columns).
# Calculate the average and median total ridership
bike_data %>%
  summarize(mean(riders_total), median(riders_total))
##   mean(riders_total) median(riders_total)
## 1           4504.349                 4548
We can also calculate summaries by groups defined by other variables (columns).
# Calculate the average total ridership on weekends vs weekdays
bike_data %>%
  group_by(weekend) %>%
  summarize(mean(riders_total))
## # A tibble: 2 x 2
##   weekend `mean(riders_total)`
##   <lgl>                  <dbl>
## 1 FALSE                  4551.
## 2 TRUE                   4390.
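In the output above, the summary column takes its name from the code that produced it, which is awkward to refer to later. A minimal sketch of one way around this, assuming a small made-up data frame (`toy_data`, `mean_riders`, and `n_days` are hypothetical names), is to name each summary inside summarize():

```r
library(dplyr)

# Hypothetical toy data, not part of the bike share dataset
toy_data <- data.frame(
  weekend = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  riders  = c(100, 300, 400, 500, 600)
)

# Name each summary, and count the rows in each group with n()
ridership_summary <- toy_data %>%
  group_by(weekend) %>%
  summarize(mean_riders = mean(riders), n_days = n())

ridership_summary
```

The result has one row per group, so the named columns can be used directly in later steps, for example `ridership_summary$mean_riders`.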