Applied Machine Learning using Apache Spark

Apache Spark, a cluster computing framework, is one of the most popular open source projects in the world. This hands-on course focuses on applying different Machine Learning algorithms to datasets using Apache Spark's capabilities through its Python API: PySpark. The course is divided into two main parts: the first is dedicated to the original Machine Learning library, MLlib, which is built on top of Resilient Distributed Datasets (RDDs); the second is focused on the newer Spark ML library, which is built on top of DataFrames. Both parts follow an exercise-driven approach: a new topic is introduced and the students then practice it in a Jupyter Notebook specifically designed to help them consolidate the content. The exercises are linked together so that, at the end of each part, the students will have gone through a complete Machine Learning pipeline. The course is self-contained, but it assumes the students are able to develop Python scripts and have been introduced to Apache Spark and Machine Learning algorithms.

Table of Contents:

1. Spark MLlib

  • Vectors and Labeled Points, Local and Distributed Matrices 
  • Summary Statistics, Sampling and Hypothesis Testing  
  • Data Normalization and PCA for Feature Engineering  
  • Decision Trees, Random Forests, Gradient-Boosting Trees and Linear Methods  
  • Evaluation 
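
As a taste of the RDD-based workflow listed above, the sketch below builds labeled points, computes summary statistics, and trains and evaluates a decision tree. It is a minimal sketch, assuming a running SparkContext named `sc` (as provided by a PySpark notebook) and using a tiny in-memory dataset invented purely for illustration.

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.stat import Statistics
    from pyspark.mllib.tree import DecisionTree

    # Labeled points: a numeric label plus a feature vector.
    data = sc.parallelize([
        LabeledPoint(0.0, Vectors.dense([1.0, 0.5])),
        LabeledPoint(1.0, Vectors.dense([2.0, 1.5])),
        LabeledPoint(0.0, Vectors.dense([1.2, 0.4])),
        LabeledPoint(1.0, Vectors.dense([2.3, 1.8])),
    ])

    # Summary statistics over the feature vectors.
    summary = Statistics.colStats(data.map(lambda lp: lp.features))
    print(summary.mean(), summary.variance())

    # Train a decision tree classifier; on real data you would first hold out
    # a test set, e.g. with data.randomSplit([0.8, 0.2]).
    model = DecisionTree.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={})
    predictions = model.predict(data.map(lambda lp: lp.features))
    labels_and_preds = data.map(lambda lp: lp.label).zip(predictions)
    accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())
    print("Training accuracy:", accuracy)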


2. Spark ML

  • Built-in and external Data Sources  
  • Explode, User-Defined Functions and Pivot  
  • Statistics, Random Data Generation and Sampling on DataFrames  
  • Handling Missing Data and Imputing Values  
  • Transformers and Estimators  
  • Data Normalization, Feature Vectors, Categorical Features, PCA and R Formulas  
  • Pipelines  
  • Decision Trees, Random Forests, Gradient-Boosting Trees and Linear Methods  
  • Evaluation  
  • Saving and Reloading Models
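
To show how the DataFrame-based topics above fit together, the following sketch chains two Transformers and an Estimator into a Pipeline, evaluates the fitted model, and saves and reloads it. It is a minimal sketch, assuming a SparkSession named `spark`; the toy data, column names, and save path are placeholders.

    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Tiny illustrative DataFrame; real data would come from a built-in or
    # external data source (CSV, JSON, JDBC, ...).
    df = spark.createDataFrame(
        [(1.0, 0.5, 0.0), (2.0, 1.5, 1.0), (1.2, 0.4, 0.0), (2.3, 1.8, 1.0)],
        ["x1", "x2", "label"])

    # Transformers assemble and normalize the feature vector; the Estimator
    # (logistic regression) is fitted at the end of the Pipeline.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
    lr = LogisticRegression(featuresCol="scaled_features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, scaler, lr])

    model = pipeline.fit(df)
    predictions = model.transform(df)

    evaluator = BinaryClassificationEvaluator(labelCol="label")
    print("Area under ROC:", evaluator.evaluate(predictions))

    # Saving and reloading the fitted pipeline (the path is a placeholder).
    model.write().overwrite().save("/tmp/lr_pipeline_model")
    reloaded = PipelineModel.load("/tmp/lr_pipeline_model")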

3. Additional topics

  • Model Tuning: Cross-Validation and Train-Validation Split in Spark  
  • Spark and Scikit-Learn: distributing scikit-learn Grid Search with Spark 
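
For the model tuning topic, a possible sketch uses Spark ML's CrossValidator together with a ParamGridBuilder. It reuses the SparkSession assumption from the previous sketch, and the dataset, column names, and parameter values are invented placeholders.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Toy dataset for illustration only.
    df = spark.createDataFrame(
        [(float(i), float(i % 3), float(i % 2)) for i in range(20)],
        ["x1", "x2", "label"])

    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    # Grid of hyper-parameters to search, scored with 3-fold cross-validation.
    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, [0.01, 0.1])
                  .addGrid(lr.elasticNetParam, [0.0, 0.5])
                  .build())

    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=param_grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=3)

    cv_model = cv.fit(df)            # one model per parameter combination per fold
    print(cv_model.avgMetrics)       # average metric for each combination
    best_model = cv_model.bestModel  # refitted on all the data with the best parameters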

After this course, you will be able to:

  • Load data into Spark and transform it 
  • Train models using Spark's Machine Learning libraries 
  • Make predictions and evaluate models 
  • Build a Machine Learning pipeline 
  • Tune a model's parameters with Grid Search and Spark