16 Exploratory Data Analysis (EDA)
SETTLING IN
There are no notes to open for today!
Chat in your groups! After today, we’ll be focusing on the course project. You’ll work in groups of 3-4 on these projects. Each group will pick and analyze their own dataset. The people you’re sitting with today are NOT necessarily your project groups! BUT let’s practice some brainstorming and get to know what other people are thinking about. Specifically, share the following with each other. And don’t think too hard! Just share what’s at the top of mind today.
- What is your major / minor / concentration, declared or intended?
- What are some personal hobbies or passions or things you’ve been thinking about or things you’d like to learn more about?
Spring 2025 registration info
Registration is coming up! Courses to consider:
- More data science
- COMP/STAT 212 (Intermediate Data Science)
- Prereqs: 112, STAT 155, COMP 123
- Recommended: STAT 253
- Topics: similar themes to 112 but more advanced / in depth approaches
- More data analysis + statistical foundations in Data Science
- STAT 155 (Intro to Statistical Modeling)
- Prereqs: none
- Postreqs: 155 is required for all STAT courses beyond the 100-level
- Topics: Like 112, you’ll use data to explore relationships of interest. But unlike 112 in which this exploration is observational and restricted to lower dimensions, 155 explores how to model relationships and use these models to make inferences and predictions regarding the population outside our dataset.
- Overlap: 155 uses similar wrangling and viz tools as 112, but these are not the emphasis.
- STAT 253 (Statistical Machine Learning)
- Prereqs: STAT 155
- Topics: Like 112 and 155, 253 focuses on data analysis. It surveys a wide variety of algorithms / models beyond those in 155, and thus greatly expands the types of relationships we can study.
- More computing + computational foundations in Data Science
- COMP 123 (Core concepts in computer science)
- Prereqs: none
- Postreqs: 123 is a core requirement for all COMP courses
- Topics: This course focuses on core concepts in computer science, not on data.
- Mathematical foundations in Data Science
- MATH 236 (Linear Algebra)
- Prereqs: MATH 279 or MATH 137 or STAT 155.
- Postreqs: Linear Algebra is a core course in the Data Science major, and an important pre-req for upper level COMP/MATH/STAT courses in the Data Science major.
- Topics: This course focuses on concepts in linear algebra, not data.
- MATH 137 & 237 (Applied Multivariate Calculus II and III)
- Prereqs: Review the course placement page.
- Postreqs: Calc is an important pre-req for upper level MATH/STAT courses in the Data Science major.
- Topics: These courses focus on concepts in calculus, not data.
Other resources
- waitlist info for all MSCS classes
- MSCS registration ice cream social:
- Thursday, 11:15am - 12:30pm in the OlRi Smail Gallery (atrium on main floor of OlRi).
- Stop by to learn about MSCS spring courses, MSCS majors/minors, etc.
- Understand the first steps that should be taken when you encounter a new data set
- Develop comfort in knowing how to explore data to understand it
- Develop comfort in formulating research questions
Read:
- Exploratory Data Analysis (Wickham, Çetinkaya-Rundel, & Grolemund)
- Exploratory Data Analysis Checklist (Peng)
WHERE ARE WE?!? Starting a data project
This final, short unit will help prepare us as we launch into course projects. In order to even start these projects, we need some sense of the following:
- data import: how to find data, store data, load data into RStudio, and do some preliminary data checks & cleaning
- exploratory data analysis (EDA)
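As a rough sketch of what "preliminary data checks" might look like, the base R code below writes R's built-in `mtcars` data to a temporary CSV file (standing in for a file you found and stored yourself), loads it back in, and takes a first look. The dataset, the temporary file, and the object names `csv_path`, `cars`, and `missing_counts` are assumptions for illustration only. Your project data will differ, and tidyverse equivalents such as `read_csv()` would work just as well.

```r
# Assumption: mtcars (built into R) stands in for a dataset you found yourself.
# We save it to a temporary CSV so there is a file to import.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Load the data
cars <- read.csv(csv_path)

# Preliminary checks
dim(cars)    # how many rows (cases) and columns (variables)?
head(cars)   # what do the first few rows look like?
str(cars)    # what are the variable names and types?

# How many missing values are in each column?
missing_counts <- colSums(is.na(cars))
missing_counts
```

If any of these checks turn up something unexpected (a numeric variable stored as text, a column that is mostly missing), that is a cue for some cleaning before the analysis begins.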
16.1 Warm-up
What is EDA?!
EDA is a preliminary, exploratory, and iterative analysis of our data relative to our general research questions of interest.

How is this different from what we’ve been doing?
We’ve been focusing on various tools needed for various steps within an EDA. Now we’ll bring them all together in a more cohesive process.
EXAMPLE
EDA essentials
Start small.
We often start with lots of data – some of it useful, some of it not. To start:
- Focus on just a small set of variables of interest.
- Break down your research question into smaller pieces.
- Obtain the simplest numerical & visual summaries that are relevant to your research questions.
Ask questions.
We typically start a data analysis with at least some general research questions in mind. In obtaining numerical and graphical summaries that provide insight into these questions, we must ask:
- what questions do these summaries answer?
- what questions don’t these summaries answer?
- what’s surprising or interesting here?
- what follow-up questions do these summaries provoke?
Play! Be creative. Don’t lock yourself into a rigid idea of what should happen.
Repeat.
Repeat this iterative questioning and analysis process as necessary, letting our reflections on the previous questions inspire our next steps.
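One pass through these essentials could be sketched as follows. This is only an illustration under stated assumptions: it uses R's built-in `mtcars` data in place of a project dataset, a made-up research question (how does fuel efficiency relate to car weight?), and base R functions so that no extra packages are needed. The object names `cars_small`, `mpg_wt_cor`, and `mpg_by_cyl` are hypothetical.

```r
# Assumption: mtcars (built into R) stands in for your project data.
# Hypothetical research question: how does fuel efficiency relate to car weight?

# Start small: keep only a few variables of interest
cars_small <- mtcars[, c("mpg", "wt", "cyl")]

# Obtain the simplest relevant numerical summaries
summary(cars_small$mpg)
mpg_wt_cor <- cor(cars_small$mpg, cars_small$wt)
mpg_wt_cor

# Obtain the simplest relevant visual summary
plot(cars_small$wt, cars_small$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Ask a follow-up question, then repeat:
# does typical fuel efficiency also differ by number of cylinders?
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars_small, FUN = mean)
mpg_by_cyl
```

Each summary answers one small question and tends to provoke the next one. Here the scatterplot might prompt us to ask whether cylinder count explains some of the pattern, which is what the last step starts to explore.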
16.2 Exercises
Do the Homework 7 exercises.
16.3 Wrap-up
Upcoming events / due dates
Today: Voting day! Mobilize Mac has lots of info and opportunities to engage.
Tomorrow: Wednesday 11/6
- 11:59pm: Project Milestone 1
- This is required, and is the first “milestone” that will go toward your project grade. There are no extensions – it’s important prep for Thursday’s class.
- You can find this on Moodle and on the course schedule in the manual.
Thursday 11/7
- 1:20pm: Quiz 2 revisions. Carefully review the instructions at the Quiz 2 link on Moodle in order to earn back as many points as possible.
- 11:59pm: Homework 7.
- In class, we’ll do some project brainstorming and start thinking about project groups. Attendance is important! Roughly half the class will be work time for Homework 7.