Principles and Techniques in Data Science

Instructor: Mahdi Dolati
Certificate: Official (bilingual)
Term: Summer 2025
Prerequisite: Python Programming
Schedule: Sunday and Tuesday, 16:30-18:00
Delivery: Online class

General Objective

The objective of this course is to enable students to develop data-driven solutions to a variety of problems. To this end, students will become familiar with the mathematical and statistical prerequisites for such approaches and learn the principles and steps of data-driven problem solving: data analysis and visualization, statistical and probabilistic modeling, statistical inference, and decision-making under uncertainty. By applying these methods to real-world problems, students will also encounter the challenges of implementing them in practice.

Topics

  1. Data Analysis
    • Introduction to the data science lifecycle
    • Data generation (questionnaires, census, controlled experiments)
    • Data collection and aggregation (data standardization, tabular data representation, filtering and aggregating data)
    • Data cleaning (outlier management, missing values, encoding and vector space representation)
    • Exploratory data analysis
    • Data visualization
    • Pattern recognition and hypothesis generation through data visualization
    • Understanding pitfalls in data analysis (data bias, insufficient features, confusing correlation with causation)
    • Hypothesis testing and p-value manipulation (p-hacking)
  2. Statistical Data Modeling
    • Introduction to modeling steps (cost function, parameter learning, prediction, decision theory)
    • Model generalization capability and its evaluation using cost functions
    • Training, validation, and test data separation
    • Overfitting, cross-validation, regularization
    • Optimization methods (gradient descent, Newton's method, momentum-based methods)
    • Probabilistic and Bayesian modeling
    • Statistical inference, model learning using estimation theory, prediction using trained models
    • Decision theory
    • Bias-variance tradeoff
    • Curse of dimensionality
  3. Statistical Modeling in Practice
    • High-dimensional data visualization using t-SNE
    • Feature extraction and selection
    • Feature quantization using decision trees
    • Linear classification methods
    • Classification using decision trees
    • Classifier evaluation
  4. Machine Learning Engineering in Production
    • Introduction to MLOps: end-to-end learning, continuous learning, data drift, concept drift, feature store, pipelines
    • Data lifecycle in production environments
    • Learning lifecycles and pipelines in production environments
    • Deployment of learning systems in production environments

Assessment

  • Exams: Midterm and final exams (50% of grade)
  • Assignments and Project: Three theoretical assignments and one practical project to be submitted during the semester (50% of grade)

References

  1. Principles and Techniques of Data Science, UC Berkeley, Fall 2022.
  2. J. Grus, Data Science from Scratch, O’Reilly, 2019.
  3. G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer, 2017.
  4. C. O’Neil, R. Schutt, Doing Data Science, O’Reilly, 2013.
  5. W. McKinney, Python for Data Analysis, O’Reilly, 2012.