Big Data Processing and Analysis
Instructor: Mohammad Amin Fazli | Certificate: Official (bilingual)
Term: Summer 2025 | Prerequisite: Programming for Data Analysis
Schedule: – | Class Format: Online
Objective
This course introduces students to the concepts and practical issues involved in infrastructure for storing and managing big data. Every concept is taught hands-on: students are required to set up each infrastructure covered in class and work with it directly. To tie theory to practice, each infrastructure concept is paired with a concrete technology that is taught alongside it. For ease of use and simpler integration, all technologies are drawn from the Apache Hadoop ecosystem.
Topics
- Practical Review of Operating Systems
- Practical Review of Databases
- Virtual Machines and Container Technology
- Operating System Concepts in Big Data Analysis
  - Basic Concepts
  - Hadoop Architecture
  - Distributed File System and HDFS
  - Distributed Computing and MapReduce
  - Submitting MapReduce Jobs to YARN
- Workflows in Hadoop
  - Hadoop Streaming
  - Examples of MapReduce Programming with Python
  - Advanced MapReduce
- In-Memory Computing and Spark
  - Spark Concepts
  - Using PySpark
  - Implementing a Spark Application
- Data Warehouses and Data Mining
  - Data Warehousing and Data Schemas
  - Querying Structured Data with Hive
  - Column-Oriented Databases and Real-Time Data Analysis with HBase
- Data Integration
  - Importing Relational Data using Sqoop
  - Importing Data Streams using Flume
- Data Analysis with Higher-Level APIs
  - Introduction to Pig Technology
  - Introduction to Higher-Level Spark APIs such as Spark SQL and DataFrames
  - Introduction to Distributed Machine Learning with Spark
Assessment
- Exams: Midterm and Final Exams (50% of grade)
- Assignments & Project: Three theoretical assignments and one practical project to be submitted during the semester (50% of grade).
References
- Bengfort, Benjamin, and Jenny Kim. Data Analytics with Hadoop: An Introduction for Data Scientists. O'Reilly Media, 2016.