Data Mining


Units: 6


Data mining is the science of discovering structure and making predictions in large, complex data sets. Nowadays, almost every organization collects data that it hopes to use to support better decision-making. Learning from data can help us detect fraud, make accurate medical diagnoses, monitor the reliability of a system, perform market segmentation, improve the success of marketing campaigns, and much more.


This course serves as an introduction to Data Mining for students in Business and Data Analytics. Students will learn about many commonly used methods for predictive and descriptive analytics tasks. They will also learn to assess the methods' predictive and practical utility.

Learning Outcomes

By the end of this class, students will:

  1. Be able to produce, comprehend and run Python code for commonly used data mining methods.
  2. Understand the advantages and disadvantages of multiple data mining methods. This involves:
    1. Generalizability
    2. Bias-variance trade-off
    3. Interpretability-flexibility trade-off
  3. Be able to compare the utility of different methods through lab exercises, homework assignments, and a final project.
  4. Understand the concepts behind feature engineering, and be able to put them into practice on different types of data.
  5. Be able to choose an appropriate model or models for a dataset and evaluate their performance and reliability.
  6. Be able to apply methods to real-world data.

Prerequisites Description

Prior to this course, students should have taken:
95-888 Data Focused Python or 90-819 Intermediate Programming with Python

Topics students should be familiar with (programming):
-    List comprehensions and dictionaries
-    Data types
-    Basic text processing (reading files, slicing data, dealing with data frames)
-    Functions (classes are also useful but not required)
-    Generators and iterators are useful
-    Mapping, zipping, and unzipping
-    Loops and conditionals
-    Lambdas (useful but not required)
Specific to Python: students should have experience with Python 3 or later and be familiar with libraries such as Pandas, NumPy, SciPy, Matplotlib, and Seaborn or Plotly.
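
As a rough self-check of the programming topics listed above, the short sketch below touches dictionaries, zipping and unzipping, a list comprehension with a conditional, a lambda, and a generator. The data and all names in it are made up for illustration and are not course material; it uses only the Python standard library.

```python
# Hypothetical toy data, invented for this illustration only.
names = ["ana", "ben", "cy"]
scores = [82, 67, 91]

# Zipping two lists into a dictionary.
score_by_name = dict(zip(names, scores))

# List comprehension with a conditional (threshold of 70 is arbitrary).
passed = [name for name, score in score_by_name.items() if score >= 70]

# Lambda used as a sort key: highest score first.
ranked = sorted(score_by_name.items(), key=lambda pair: pair[1], reverse=True)

# Unzipping the sorted pairs back into separate tuples.
ranked_names, ranked_scores = zip(*ranked)


def running_mean(values):
    """Generator that lazily yields the mean of the values seen so far."""
    total = 0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


means = list(running_mean(scores))
```

Students who can read this snippet and predict the contents of `passed`, `ranked_names`, and `means` without running it are likely comfortable with the programming topics above.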

Preferably, students will also have taken a statistics course such as 90-707, 90-711, or 90-777.

Many of the algorithms used in Data Mining require knowledge of distributions, p-values, metrics, probability theory, dimensionality reduction, sampling, and Bayesian statistics, among other concepts.