Gerrit Gruben

Gerrit Gruben combines broad experience in machine learning and big data, ranging from implementing a scalable tracking system as a microservice deployed on AWS to using Apache Spark and Scala for practical machine learning tasks. He worked at Publicis Pixelpark on automating digital marketing, and then joined the fashion group New Yorker as a Sr. Data Scientist to build up its Berlin office, solving classical prediction problems such as demand prediction and the optimization of inventory and logistics.
Gerrit holds a double degree in Computer Science and in Mathematics from the Free University of Berlin. He is the organizer of the Berlin Kaggle group.



You will gain competence in streaming machine learning.


This course introduces modern big data architectures, such as SMACK and KIKS, for real-time stream processing, with a focus on Apache Spark and its machine learning capabilities. Making guarantees about throughput and latency requires an in-depth look at the underlying technology. Typical scenarios of applying machine learning to real-time streams are discussed, such as adapting to trends with streaming linear regression or adaptively clustering tweets as they arrive. Kafka, as a key technology, is discussed and worked with in depth.
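To give a feel for what "adapting to trends with streaming linear regression" can mean, here is a minimal sketch in plain Python. It is not taken from the course material and does not use Spark: the class `StreamingLinearRegression`, the helper `make_batch`, the learning rate, and the synthetic data are all assumptions chosen for illustration. The model is updated by stochastic gradient descent on each incoming mini-batch, so it can follow a relationship that changes over time.

```python
import random


class StreamingLinearRegression:
    """Linear model y ~ w.x + b, updated one mini-batch at a time with SGD.

    Hypothetical illustration class, not a Spark API.
    """

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def update(self, batch):
        """Apply one gradient step per (x, y) pair in the incoming batch."""
        for x, y in batch:
            error = self.predict(x) - y
            self.weights = [
                w - self.learning_rate * error * xi
                for w, xi in zip(self.weights, x)
            ]
            self.bias -= self.learning_rate * error


def make_batch(rng, slope, intercept, size=20):
    """Assumed data source: noisy points on the line y = slope * x + intercept."""
    batch = []
    for _ in range(size):
        x = rng.uniform(-1.0, 1.0)
        y = slope * x + intercept + rng.gauss(0.0, 0.01)
        batch.append(([x], y))
    return batch


rng = random.Random(0)
model = StreamingLinearRegression(n_features=1)

# Phase 1: the stream follows y = 2x + 1.
for _ in range(200):
    model.update(make_batch(rng, slope=2.0, intercept=1.0))
slope_before = model.weights[0]

# Phase 2: the trend changes to y = -x + 3; the model keeps updating.
for _ in range(200):
    model.update(make_batch(rng, slope=-1.0, intercept=3.0))
slope_after = model.weights[0]

print(f"slope before drift: {slope_before:.2f}, after drift: {slope_after:.2f}")
```

Because the model never stops updating, the estimated slope moves from the first trend to the second without retraining from scratch. In a real deployment the batches would come from a stream source such as a Kafka topic rather than from a random generator.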

  • Select and judge architectures to deal with specific kinds of big data problems;
  • Train models on streaming datasets;
  • Apply models to incoming streaming data;
  • Use Spark's Structured Streaming to work with streams as you would with data frames;
  • Work with Kafka and integrate a producer and consumer ecosystem with microservices;
  • Deploy ML applications in a streaming setting.


  • Use of Python and machine learning techniques