
Daniel Voigt Godoy


Daniel Voigt Godoy is a Data Scientist and Programmer with more than 15 years of experience in Brazil and Germany in the financial sector, including private banking and the public treasury. He has led developer teams of more than 20 members. His stack has included MATLAB, R, Statistica, Python (NumPy, SciPy, scikit-learn, pandas), Java, Apache Spark, and Scala. While serving as a data analyst at the state treasury of Rio Grande do Sul, he structured and analyzed datasets, leading to several papers that won the Brazilian National Treasury Award.
Daniel studied Computer Science and holds a Master's degree in Economics from Universidade Federal do Rio Grande do Sul, Porto Alegre. He is also an alumnus of Data Science Retreat (Batch 05).
https://www.linkedin.com/in/dvgodoy/

Applied Machine Learning using Apache Spark

Learning outcomes

After this course you will be able to:
- Load data into Spark and transform it
- Train models using Spark's Machine Learning libraries
- Make predictions and evaluate models
- Build a Machine Learning pipeline
- Tune model parameters with grid search in Spark

Content

Apache Spark, a cluster computing framework, is one of the most popular open source projects in the world.
This hands-on course focuses on applying different Machine Learning algorithms to datasets using Apache Spark's capabilities through the Spark Python API: PySpark.
The course is divided into two main parts: the first is dedicated to the original Machine Learning library, MLlib, which is built on top of Resilient Distributed Datasets (RDDs); the second focuses on the spark.ml library, which is built on top of DataFrames.
For both parts, the course follows an exercise-driven approach: a new topic is introduced and then practiced in a Jupyter Notebook especially designed to help students consolidate the content. The exercises build on one another, so by the end of each part students will have gone through a complete Machine Learning pipeline.
The course is self-contained, but it assumes that students are able to develop Python scripts and have been introduced to Apache Spark and to Machine Learning algorithms.

Table of Contents:

1) Spark MLlib
- Vectors and Labeled Points, Local and Distributed Matrices
- Summary Statistics, Sampling and Hypothesis Testing
- Data Normalization and PCA for Feature Engineering
- Decision Trees, Random Forests, Gradient-Boosting Trees and Linear Methods
- Evaluation

2) Spark.ml
- Built-in and external Data Sources
- Explode, User-Defined Functions and Pivot
- Statistics, Random Data Generation and Sampling on DataFrames
- Handling Missing Data and Imputing Values
- Transformers and Estimators
- Data Normalization, Feature Vectors, Categorical Features, PCA and R Formulas
- Pipelines
- Decision Trees, Random Forests, Gradient-Boosting Trees and Linear Methods
- Evaluation
- Saving and Reloading Models

3) Additional topics
- Model Tuning: Cross Validation and Test-Validation Split in Spark
- Spark Scikit-Learn: SKLearn Grid Search with Spark

Prerequisites

  • Solid Python programming skills, prior exposure to Apache Spark, and familiarity with Machine Learning algorithms.