Big Data Processing and Analysis

Instructor: Mohammad Amin Fazli
Certificate: Official (bilingual)
Term: Summer 2025
Prerequisite: Programming for Data Analysis
Schedule: –
Class Format: Online

Objective

This course familiarizes students with the concepts and challenges of infrastructure for storing and managing big data. Every concept is taught in a concrete, practical way: students are required to set up each infrastructure covered and work with it hands-on. To tie theory to practice, a specific technology has been chosen for each infrastructure concept and is taught alongside the underlying theory. For ease of use and simpler integration, all technologies are drawn from the Apache Hadoop ecosystem.

Topics

  1. Practical Review of Operating Systems
  2. Practical Review of Databases
  3. Virtual Machines and Container Technology
  4. Operating System Concepts in Big Data Analysis
    • Basic Concepts
    • Hadoop Architecture
    • Distributed File System and HDFS
    • Distributed Computing and MapReduce
    • Submitting MapReduce Jobs to YARN
  5. Workflows in Hadoop
    • Hadoop Streaming
    • Examples of MapReduce Programming with Python
    • Advanced MapReduce
  6. In-Memory Computing and Spark
    • Spark Concepts
    • Using PySpark
    • Implementing a Spark Application
  7. Data Warehouses and Data Mining
    • Data Warehousing and Data Schemas
    • Querying Structured Data with Hive
    • Column-Oriented Databases and Real-Time Data Analysis with HBase
  8. Data Integration
    • Importing Relational Data using Sqoop
    • Importing Data Streams using Flume
  9. Data Analysis with Higher-Level APIs
    • Introduction to Pig Technology
    • Introduction to Higher-Level Spark APIs such as Spark SQL and DataFrames
  10. Introduction to Distributed Machine Learning with Spark
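
To give a flavor of the MapReduce programming with Python listed under topic 5, the following is a minimal word-count sketch. It is an illustration only, not material taken from the course: the function names (mapper, reducer, run_local) and the sample input are hypothetical, and the shuffle phase that Hadoop performs across a cluster is simulated here with an in-process sort.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs sorted by key, which is the order in which a
    MapReduce framework hands them to the reducer after the shuffle.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["big data big ideas", "data pipelines"]
    print(run_local(sample))
```

Under Hadoop Streaming, the mapper and reducer would typically be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output. The single-process version above keeps the same map, sort, and reduce structure so it can be run without a cluster.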

Assessment

  • Exams: Midterm and Final Exams (50% of grade)
  • Assignments & Project: Three theoretical assignments and one practical project, submitted during the semester (50% of grade)

References

  1. Bengfort, Benjamin, and Jenny Kim. Data Analytics with Hadoop: An Introduction for Data Scientists. O'Reilly, 2016.