HHS CoLab Spring 2019
This course is part of the in-person / live-streaming delivery of the HHS CoLab program, and is not meant as a standalone, asynchronous course.
Course Objectives
- Build your skills in R
- Learn how to implement powerful data science methods
- Create a capstone project to move your team forward
Week 1: Course introduction, introduction to Git and fundamentals of R
1. Introduction to Git and fundamentals of R (6/5)
a) Introduction to Git
b) Basic shell commands
c) Introduction to R and RStudio
d) An overview of data science
e) Performing basic calculations in R
f) Loading data into R
2. Fundamentals of R (6/8)
a) Understanding data types, how and when to use them
b) Read/write data
c) Evaluate and address missing values in data
d) Manipulate data types and structures using control-flow structures
Week 2: Fundamentals of R II and introduction to visualization in R
3. Fundamentals of R II (6/12)
a) Transforming and cleaning data using tidyverse’s dplyr package
b) Selecting and subsetting data using dplyr
c) Summarizing and aggregating data using tidyverse’s tidyr package
4. Introduction to visualization and base R (6/14)
a) Basic plotting in R
b) Introduction to the ‘grammar of graphics’ structure
c) Basic plotting in ggplot2
d) Customizing graphs and adjusting formats
Week 3: Advanced visualization and introduction to foundational statistics and regression
5. Advanced ggplot2 and interactive visualization with Highcharts (6/19)
a) Advanced plotting in ggplot2, incorporating many variables
b) Working with other libraries (e.g., Highcharts) for interactive visualization
c) Telling a story through data and visualizations
6. Introduction to foundational statistics (6/21)
a) Basic statistics
b) Expected value/standard deviation/variance/covariance
c) Statistical tests and significance
d) Linear regression
e) Single variable regression
f) Multiple regression
Week 4: Best practices for model building and clustering
7. Best practices for model building (6/26)
a) Introduction to the model building process
b) Splitting data into train/test/validation sets
c) Multiple regression – dealing with correlated predictors
d) Predicting
8. Clustering (6/28)
a) Unsupervised vs. supervised learning
b) Introduction to clustering
c) K-means
d) Pitfalls of clustering
— Break – week of 7/2 – Holiday week (Fourth of July) —
Week 5: Principal Component Analysis (PCA) and Capstone presentations
9. Principal Component Analysis (7/10)
a) Introduction to feature selection and engineering
b) Curse of dimensionality
c) Introduction to PCA and other related techniques
10. Midterm capstone outlines and ideas presented (7/13)
a) Capstone ideas presented
b) Introduction to README.md files
c) Findings so far
d) Data being used
e) Cleansing steps
f) Steps to be completed in the next 4 weeks
Week 6: Introduction to working with text in R and text mining
11. Processing and working with text in R (7/17)
a) Introduction and first steps
b) Working with word counts
c) Text cleaning and pre-processing
12. Text mining in R (7/19)
a) Applications of text mining at scale
b) Word distribution in a corpus and its applications
Week 7: Advanced text mining and introduction to classification
13. Text mining in R – continued (7/26)
a) Summary metrics of corpora
b) Visualizing text data
14. Introduction to classification and k-nearest neighbors (7/27)
a) Understanding the difference between classification and regression
b) Introduction to kNN
c) Overview of classification performance metrics
Week 8: Supervised learning methods – Classification
15. Logistic regression (7/31)
a) Introduction to logistic regression
b) Data transformation for logistic regression
c) Predicting and measuring error with logistic regression
16. Decision trees / Random Forests (8/2)
a) Introduction to decision trees
b) Performance metrics for decision trees
c) Introduction to ensemble methods
d) Visualizing and presenting outcomes of a random forest
Week 9: Presentation
Final Presentations – Capstones (8/7)