Data Science 101


Syllabus: Data Science 101

This five-day intensive training turns employees into savvy data analysts with a solid foundation in data cleaning, visualization, and basic modeling. Students will become comfortable with R, an open-source tool that is widely used by professional statisticians and analysts. R is designed to analyze data powerfully and effectively, create predictive models, and build beautiful visualizations. The online component contains assessments used during the training session, as well as additional resources, trainings, and support beyond the classroom.

By the end of this course, students will be able to:
  1. Understand how data science can be used effectively in industry
  2. Program proficiently in R
  3. Visualize findings effectively
  4. Build basic models and find patterns in data
The online component includes:
  1. Concept reviews: these consist of quizzes that cover the most important concepts and ideas in each lesson. They encourage holistic understanding and use varied question types (e.g. drag and drop, fill-in-the-blanks, matching).
  2. Exercises: these are additional videos that cover the coding functions from the instructional videos in more depth. They are project-based and include coding templates so that students can strengthen their skills outside of the course.
Materials provided:
  1. Accompanying workbooks to use as reference materials
  2. R code templates to implement as frameworks
  3. Data sets used in the instructional videos and exercises

Course Outline

1. Data science fundamentals:

What is data science?
A data scientist’s approach
Commercial applications of data science

2. Introduction to R programming:

Installing R and RStudio
Introduction to RStudio
Performing basic calculations in R
Loading data into R
Working with multiple data types
Data wrangling and cleaning in R

3. Basic visualizations:

Basic plotting in R
Basic plotting in ggplot2
Customizing graphs and adjusting formats

4. Introduction to clustering and unsupervised machine learning:

What is unsupervised machine learning?
Commercial applications of data mining

5. Implementation of clustering:

k-means clustering on multi-dimensional data
Evaluating the quality of clustering
Determining the right number of clusters to use
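
As a taste of what this module builds toward, here is a minimal sketch of k-means clustering in R. It uses the built-in iris data rather than the course data sets, and the choice of k = 3 is an assumption made purely for illustration.

```r
# Minimal sketch: k-means on the four numeric columns of the built-in iris data.
set.seed(42)                        # k-means starts from randomly chosen centers
features <- scale(iris[, 1:4])      # standardize so no single column dominates

# Fit k-means for k = 1..6 and record the total within-cluster sum of squares.
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the point where adding clusters stops reducing wss by much.
print(data.frame(k = ks, wss = round(wss, 1)))

# Fit one model (k = 3 is an illustrative assumption, not a recommendation).
fit <- kmeans(features, centers = 3, nstart = 20)

# One simple quality measure: share of total variation explained by the clusters.
quality <- fit$betweenss / fit$totss
print(round(quality, 3))

# Compare the clusters found with the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))
```

Plotting wss against k and looking for the bend is one common way to pick the number of clusters; it is a heuristic, not a guarantee.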

6. Clustering multiple data:

Working with binary data – cosine distance
Clustering binary data – spherical k-means
Assessing quality of spherical k-means clustering
Interpreting clusters of binary data and making recommendations
Pitfalls of clustering
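
To illustrate the cosine distance covered in this module, the sketch below computes it by hand in base R. The purchase matrix and customer names are hypothetical and are not part of the course data sets.

```r
# Hypothetical purchase matrix: rows are customers, columns are products
# (1 = bought, 0 = not bought).
purchases <- rbind(
  alice = c(1, 1, 0, 0),
  bob   = c(1, 1, 1, 0),
  carol = c(0, 0, 1, 1)
)

# Cosine similarity: dot product of two rows divided by the product of
# their vector lengths. Computed here for every pair of rows at once.
cosine_similarity <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

sim <- cosine_similarity(purchases)

# Cosine distance is one minus the similarity: 0 means same direction,
# 1 means no products in common.
cosine_distance <- 1 - sim
print(round(cosine_distance, 3))
```

Because cosine distance ignores vector length, a customer who bought two products and one who bought three can still be close if their baskets overlap, which is why it suits binary data better than Euclidean distance.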

7. Introduction to regression and supervised machine learning:

Commercial applications of regression

8. Basic statistics and regression modeling:

Linear relationships: slope, y-intercept, variable interactions
Variance and standard deviation
Covariance and correlation
Normal distribution and bell curves
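
The quantities in this module connect directly to regression. As a minimal sketch, using the built-in mtcars data rather than the course data sets, the slope of a simple linear regression can be recovered from the covariance and variance:

```r
x <- mtcars$wt     # car weight, in 1000 lbs
y <- mtcars$mpg    # fuel efficiency, in miles per gallon

variance_x  <- var(x)       # spread of x around its mean
sd_x        <- sd(x)        # square root of the variance
covariance  <- cov(x, y)    # how x and y vary together
correlation <- cor(x, y)    # covariance rescaled to lie between -1 and 1

# For a simple linear regression of y on x:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope     <- covariance / variance_x
intercept <- mean(y) - slope * mean(x)

# lm() should report the same two numbers.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
print(c(intercept = intercept, slope = slope))
```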

9. Regression model evaluation:

Distribution of errors: Q-Q plot, heteroscedasticity
Multiple regression
R squared and adjusted R squared
p-values and t-test
F-test and F-distribution
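
Most of the evaluation measures listed above can be read off a fitted model in R. The following is a minimal sketch on the built-in mtcars data; the choice of predictors (wt and hp) is an assumption for illustration only.

```r
# Multiple regression: fuel efficiency explained by weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

r2     <- s$r.squared        # share of variance explained
adj_r2 <- s$adj.r.squared    # same, penalized for the number of predictors

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- s$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]   # one t-test p-value per coefficient

# Overall F-test: is the model better than predicting the mean?
f <- s$fstatistic                        # named vector: value, numdf, dendf
f_p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(round(c(r2 = r2, adj_r2 = adj_r2, f_p_value = f_p_value), 4))
print(round(p_values, 4))

# Residual checks: a Q-Q plot for normality of errors, and residuals
# against fitted values to look for heteroscedasticity (a funnel shape).
qqnorm(resid(fit)); qqline(resid(fit))
plot(fitted(fit), resid(fit)); abline(h = 0, lty = 2)
```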

10. Introduction to classification:

k-Nearest Neighbors
Decision trees: Gini impurity and information gain
Introduction to random forests
Confusion matrices, misclassification rates
Baseline errors and lift
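
As an illustration of how a classifier is evaluated, here is a minimal sketch of k-nearest neighbors with a confusion matrix. It assumes the `class` package (a recommended package normally installed with R) and uses the built-in iris data; the split size and k = 5 are arbitrary choices.

```r
library(class)   # provides knn(); assumed to be installed alongside R

set.seed(1)
test_rows <- sample(nrow(iris), 50)     # hold out 50 rows for testing

train_x <- iris[-test_rows, 1:4]
test_x  <- iris[test_rows, 1:4]
train_y <- iris$Species[-test_rows]
test_y  <- iris$Species[test_rows]

# Classify each test row by majority vote among its 5 nearest training rows.
predicted <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Confusion matrix: rows are true classes, columns are predicted classes.
confusion <- table(actual = test_y, predicted = predicted)
print(confusion)

# Misclassification rate: share of test rows off the diagonal.
misclassification_rate <- 1 - sum(diag(confusion)) / sum(confusion)

# Baseline error: always predict the most common class in the training data.
baseline_class <- names(which.max(table(train_y)))
baseline_error <- mean(test_y != baseline_class)

print(c(model = misclassification_rate, baseline = baseline_error))
```

A model is only useful if its error is clearly below the baseline error, which is why the two are reported side by side.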

11. Pitfalls and best practices of data science:

Understanding the limits of your data
Checking data validity
Ethical considerations
Best practices for data analysts
