1 Syllabus

1.1 Goals

  • Hone your statistical, data, and computing literacy.

  • Instead of covering all statistical modeling & inference techniques in 5 days (impossible!), focus on a couple of foundational & generalizable tools: linear regression & simple classification. In doing so, we’ll bypass some topics typically covered in traditional stat intros.

  • Favor applications using real data over theory so that you walk away with a sophisticated set of tools with real applications.

  • Play around with the RStudio software. In doing so, focus on the patterns in & potential of this software. Don’t worry about memorizing syntax; this will come with experience. For example, by the end of the week you’ll likely be comfortable with the ggplot() and lm() functions simply because we’ll use them a lot!

  • Do some messy stuff. Too often, stat and data science classes are taught with data that are nice and tidy. In the real world, data are messy and require cleaning/wrangling. As discussed in this New York Times article, “Data scientists…spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” Though data wrangling isn’t the focus of this bootcamp, it will be a useful and necessary part of it. Don’t worry about, or get too distracted by, the extra coding this requires. The goal is simply for you to start recognizing the messiness of real world data and to build up confidence in dealing with it.
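For a first taste of the lm() and ggplot() functions mentioned above, here is a minimal sketch. It assumes the mtcars data set that ships with R and, for the plot, that the ggplot2 package is installed; the choice of variables (mpg and wt) is only for illustration and is not taken from the bootcamp materials.

```r
# Fit a simple linear regression of fuel efficiency (mpg) on weight (wt).
# mtcars is a small data set built into R; using it here is an assumption.
fit <- lm(mpg ~ wt, data = mtcars)

# Least squares estimates of the intercept and slope
coef(fit)

# Draw the scatterplot with the fitted line, but only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  print(p)
}
```

Note the shared pattern: both functions take a data set plus a description of which variables play which role.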





1.2 Schedule

The schedule is in flux and likely to change throughout the week.

  • Pre-bootcamp prep
    Get up and running with RStudio.


  • Day 1: How can we model/explain the variability in our sample data?
    Exploring and explaining variability using visualizations and regression models. Topics will include:
    • multivariate visualizations
    • multivariate regression models with covariate, categorical, interaction, and transformation terms
    • estimation of model parameters via least squares


  • Day 2
    • Warm-up: data wrangling
    • How do we select a model? How good is the model?
      Selecting a statistical model using subset selection & measuring model quality. Discussion will include:
      • residual analysis
      • \(R^2\) & mean squared error (MSE)
      • cross validation
      • overfitting
      • bias-variance trade-off


  • Day 3
    • Warm-up: More data wrangling
    • What does this sample model tell us about trends in the broader population?
      Using data from a sample to make inferences about a broader population of interest. Topics will include:
      • sampling distributions & the Central Limit Theorem
      • standard error
      • bootstrapping
      • prediction & confidence intervals


  • Day 4: What does this sample model tell us about trends in the population? How can we carefully interpret & communicate our conclusions?
    Continuing inferential techniques (hypothesis testing) and discussing common pitfalls in statistical analyses:
    • Simpson’s Paradox
    • multicollinearity
    • multiple testing
    • statistical vs practical significance
    • errors in hypothesis testing


  • Day 5: How can I navigate a new analysis outside the nice bootcamp setting? What are the first steps to take when starting a project or getting ahold of a new data set? What if our linear regression tools aren’t appropriate for this particular analysis?
    Topics might include:
    • the iterative process of “getting to know” a dataset, and trying to identify meaningful insights within it
    • exploratory data analysis
    • group mini-project
    • logistic regression for modeling categorical response variables
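To make a few of the Day 2 topics above concrete, the following is a rough sketch of how \(R^2\), in-sample MSE, and cross-validated MSE might be computed in base R. The data set (mtcars), the model formula, the number of folds, and the seed are all assumptions chosen for illustration; the bootcamp itself may use different data and tools.

```r
set.seed(1)  # arbitrary seed so the random fold assignment is reproducible

# In-sample model quality: R^2 and mean squared error (MSE)
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2 <- summary(fit)$r.squared
mse_in <- mean(residuals(fit)^2)

# k-fold cross validation: every row is held out exactly once
k <- 4
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_sq_err <- numeric(nrow(mtcars))
for (i in 1:k) {
  held_out <- folds == i
  # Refit the model without the held-out fold, then predict that fold
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
  pred <- predict(fit_i, newdata = mtcars[held_out, ])
  cv_sq_err[held_out] <- (mtcars$mpg[held_out] - pred)^2
}
mse_cv <- mean(cv_sq_err)

c(r_squared = r2, mse_in_sample = mse_in, mse_cv = mse_cv)
```

Comparing the in-sample MSE with the cross-validated MSE is one simple way to spot overfitting: a model that only looks good on the data it was trained on will tend to show a large gap between the two.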





1.3 Software

Statistical applications utilize data. Working with modern (large, messy) data sets requires statistical software. We’ll exclusively use RStudio. Why?

  • it’s free
  • it’s open source
  • it has a huge online community
  • it’s widely used in both industry and academia
  • it can be used to create reproducible and elegant documents (e.g., your bootcamp materials!)

To get started, take the following two steps in the given order. Further, if you already have R/RStudio, make sure to update to the most recent versions before the bootcamp.

  1. Download & install the latest version of R: https://mirror.las.iastate.edu/CRAN/

  2. Download & install the latest version of RStudio Desktop (Open Source License): https://www.rstudio.com/products/rstudio/download/
    Be sure to download the free version!!

What’s the difference between R and RStudio? R is the underlying statistical programming language, and RStudio is an interface built on top of it. RStudio requires R, so it does everything R does and more. We will be using RStudio exclusively. You’ll take a quick tour in your pre-bootcamp homework.
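Once both installs are done, one quick way to confirm which version of R you have is to run the lines below in the RStudio console. This is only a suggested check, not part of the official setup steps; the ggplot2 check is included on the assumption that you will need that package during the week.

```r
# Report the version of R that RStudio is running
r_version <- R.version.string
print(r_version)

# Check whether the ggplot2 package is already installed (TRUE or FALSE)
has_ggplot2 <- "ggplot2" %in% rownames(installed.packages())
print(has_ggplot2)
```

If the second line prints FALSE, install.packages("ggplot2") will download the package from CRAN.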