Curriculum: Machine learning

 

Introduction to Python

This session introduces the programming language Python and reviews the programming concepts that matter most for Data Science. The development environment (e.g. Jupyter notebook, Python libraries) is set up, and we make sure all participants are on board.

Mathematical Refresher

A short refresher of important mathematical concepts, mainly from linear algebra, calculus (differentiation), and gradient descent. This part is meant to strengthen the intuitive understanding of the mathematics involved and to list the most important mathematical facts useful for machine learning.
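
As a small illustration of gradient descent, the sketch below minimizes a one-dimensional function by repeatedly stepping against its derivative. The objective f(x) = (x - 3)^2, the learning rate, and the function names are assumptions chosen for this example, not course material.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Illustrative objective f(x) = (x - 3)^2 with derivative f'(x) = 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # very close to the true minimizer x = 3
```

With this step size, the distance to the minimum shrinks by a constant factor of 0.8 per step, which is why a fixed number of iterations is enough here.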

Anaconda as a Data Science environment

In this session, the Python distribution “Anaconda” is introduced together with the most important libraries for Data Science tasks. We demonstrate industrial-strength use of the mathematical and data analysis libraries it provides.

Supervised Learning I – Decision Tree Models

This block introduces the supervised learning problem of Machine Learning. The decision tree family of models, including random forests, is introduced. A practical example shows how to work with the provided machine learning libraries to solve a simple classification task.
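
To give a feel for what a decision tree does at a single node, the following from-scratch sketch searches for the threshold on one feature that minimizes the weighted Gini impurity. It is not the library code used in the session; the toy data and the helper names `gini` and `best_split` are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy data: one numeric feature and a class label per sample.
lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(lengths, species)
print(threshold, impurity)
```

A full tree applies this search recursively to each resulting subset, and a random forest averages many such trees built on resampled data.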

Data Preparation

This session shows participants how to prepare data for Machine Learning tasks. It covers plenty of practical tricks that are rarely found in textbooks.
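
Two preparation steps that come up in almost every project are filling gaps and rescaling features. The sketch below shows one simple variant of each (mean imputation and standardization) using only the standard library; the column `ages` is hypothetical, and other strategies are often preferable in practice.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Rescale to zero mean and unit (population) standard deviation.

    Assumes the column is not constant, otherwise sigma would be zero.
    """
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


ages = [20, None, 40, 60]      # hypothetical raw feature with a gap
clean = impute_mean(ages)      # the gap is filled with the observed mean
scaled = standardize(clean)
print(clean, scaled)
```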

Data Analysis

This block provides the necessary statistical toolkit. Basic notions are reviewed, the most important probability distributions are shown with visual examples, and hypothesis testing is discussed. Typical pitfalls of statistics are covered, especially in the big data context, and a quick overview of Bayesian statistics is provided.
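
As one concrete instance of hypothesis testing, the sketch below runs an exact two-sided permutation test on the difference of two group means. The choice of test and the sample values are assumptions made for illustration; the session itself may use other tests.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference of means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Enumerate every way of relabelling the pooled data into two groups.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


p_value = permutation_test([12, 14, 15, 16], [8, 9, 10, 11])
print(p_value)
```

The p-value is the share of relabellings that produce a difference at least as large as the observed one. Full enumeration is only feasible for small samples; larger ones are usually handled by random sampling of relabellings.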

Model Evaluation

This block answers the question of how the ‘goodness’ or adequacy of statistical models can be measured. Experimental setups that ensure Machine Learning algorithms generalize well are discussed in depth. Furthermore, statistical resampling techniques commonly used in ML are shown.
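
One widely used resampling scheme is k-fold cross-validation. The following minimal sketch only generates the index splits; the function name `k_fold_indices` is an assumption for this example, and shuffling is deliberately left out to keep the result easy to follow.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print(test)
```

Each sample lands in the test set exactly once, so averaging the k test scores uses all data for evaluation without ever testing on training samples.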

Supervised Learning II

Further model families for the supervised learning problem are presented. First, the linear model family is shown, which includes linear and logistic regression, and ways to extend these important models are indicated. Second, the rich class of Support Vector Machines (SVMs) is shown, and the difference between linear and non-linear SVMs is explained intuitively. The class ends with an outlook on kernel methods.
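
To make the linear model family more tangible, here is a minimal sketch of logistic regression with one feature, fitted by batch gradient descent on the log loss. The data, learning rate, and number of epochs are made-up assumptions; a library implementation would normally be used instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)
print(sigmoid(w * 1.0 + b), sigmoid(w * 4.0 + b))
```

The fitted model is still linear in its input: the decision boundary is the single point where w * x + b = 0.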

Neural Networks

Neural networks are presented as an introduction to Deep Learning. The classical construction of neural networks is explained, and the first networks are trained. Relevant techniques such as regularization and parameter sharing are covered.
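
The classical construction can be sketched as layers that each compute a weighted sum followed by a non-linearity. In the example below the weights are picked by hand (not trained) so that a two-layer network computes XOR; the helper names `dense` and `xor_net` are assumptions for illustration.

```python
def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights @ inputs + biases)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def xor_net(x1, x2):
    # Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
    hidden = dense(
        [x1, x2],
        weights=[[1.0, 1.0], [1.0, 1.0]],
        biases=[0.0, -1.0],
        activation=relu,
    )
    # Output layer: h1 - 2 * h2, with no non-linearity.
    (output,) = dense(hidden, weights=[[1.0, -2.0]], biases=[0.0], activation=lambda z: z)
    return output


outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

XOR cannot be represented by a single linear layer, which is why the hidden layer with its non-linearity is needed. Training replaces the hand-picked weights with values found by gradient descent.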

Deep Learning I

Advanced concepts of optimization are explained to enable the training of deeper neural network architectures. As an advanced type of model, convolutional networks are demonstrated, and advanced architectures and use cases are surveyed.
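
The core operation of a convolutional layer is sliding one small kernel over the input and reusing the same weights at every position. The sketch below shows this in one dimension; the signal and the kernel are made up for illustration.

```python
def conv1d(signal, kernel):
    """'Valid' 1D cross-correlation, the operation deep learning libraries usually call convolution."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A difference kernel responds where the signal rises (+1) or falls (-1).
edges = conv1d([0, 0, 1, 1, 1, 0, 0], kernel=[-1, 1])
print(edges)  # [0, 1, 0, 0, -1, 0]
```

Because the two kernel weights are shared across all positions, the layer has far fewer parameters than a fully connected layer on the same input.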

Deep Learning II

The family of recurrent neural networks is introduced, and LSTMs (long short-term memory networks) are demonstrated.
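
What makes a network recurrent is a hidden state that is fed back at every time step. The sketch below uses a plain scalar recurrent cell (not an LSTM) with hand-picked weights `w_x`, `w_h`, and `b`, which are assumptions for illustration.

```python
import math


def rnn_forward(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Run h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) over a sequence."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# A single input pulse followed by silence.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

The state stays non-zero after the pulse but fades at each step. This fading memory in plain recurrent cells is one motivation for gated cells such as the LSTM.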

Dimensionality Reduction and Feature Selection

This class introduces techniques to reduce the dimension of data, including methods such as PCA and t-SNE. Their de-noising effect and their use in visualizing high-dimensional data are explained. In the second part, some feature selection techniques are introduced.
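
The first principal component in PCA is the direction of largest variance, that is, the leading eigenvector of the covariance matrix. The sketch below approximates it with power iteration in pure Python; the data points are made up, and in practice a linear algebra library would be used.

```python
import math


def first_principal_component(points, iterations=100):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] + [0.0] * (d - 1)  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Hypothetical 2D data scattered around the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
pc1 = first_principal_component(data)
print(pc1)  # roughly the diagonal direction
```

Projecting each centered point onto `pc1` reduces the data from two dimensions to one while keeping most of its variance.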

Unsupervised Learning

The problem of unsupervised learning is introduced. In this problem setting we find ‘structure’ in a set of given data. Important clustering algorithms such as k-Means and DBSCAN are introduced. Advanced techniques such as hierarchical clustering and cluster evaluation methods are discussed. Gaussian mixture models and the EM (expectation-maximization) algorithm are studied. Graphical models are shown as a short outlook.
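
A minimal sketch of k-Means (Lloyd's algorithm) on one-dimensional data is shown below. Seeding the centroids with the first k points is a simplification made so the example is deterministic; real implementations use random or smarter initialization, and the data is made up.

```python
def k_means(points, k, iterations=20):
    """Lloyd's algorithm on 1D data; the first k points seed the centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


centers = sorted(k_means([1.0, 9.0, 1.2, 8.8, 0.8, 9.2], k=2))
print(centers)  # approximately [1.0, 9.0]
```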

Big Data

The aim of this class is to introduce 'post-Hadoop' Big Data technologies, particularly Spark. In an ideal world, we would be able to apply all the techniques we saw before to any dataset, no matter how large. In practice, even the most basic operations take considerable effort once the data no longer fits on a single machine. The challenge is getting programs on multiple machines to work together efficiently.
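
The basic pattern behind such systems can be imitated on a single machine: each worker processes its own partition of the data, and the partial results are then merged. The sketch below is plain Python, not the Spark API, and the partitions and function names are made up for illustration.

```python
from collections import Counter
from functools import reduce


def count_words(partition):
    """Map step: each worker counts words in its own chunk of lines."""
    return Counter(word for line in partition for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Two hypothetical partitions that would live on different machines.
partitions = [
    ["spark makes big data small", "big data"],
    ["data beats opinions"],
]
totals = reduce(merge_counts, map(count_words, partitions), Counter())
print(totals.most_common(2))
```

Because merging is associative, the partial counts can be combined in any order, which is what allows the work to be spread across machines.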

Visualization

The aim of this class is to introduce visualization techniques and the major visualization libraries in Python. How do you visualize large datasets? How do you prepare live visualizations that react to user input? We demonstrate how to tidy up your data before presentation.

Reinforcement Learning I

The aim is to introduce reinforcement learning as a computational approach to learning from interaction. We learn a function that estimates the reward to expect from each action. This is hard because of the credit assignment problem: when several actions precede a result, how do we know which action caused it?
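
One simple way to learn a reward estimate per action is a running average over the rewards observed so far. The sketch below ignores states and delayed effects entirely; the interaction log and the function name are assumptions for illustration.

```python
def update_action_values(experience):
    """Incremental sample-average estimate of the reward of each action."""
    values, counts = {}, {}
    for action, reward in experience:
        counts[action] = counts.get(action, 0) + 1
        old = values.get(action, 0.0)
        # Move the estimate a fraction 1/n towards the newly observed reward.
        values[action] = old + (reward - old) / counts[action]
    return values


# Hypothetical interaction log of (action, reward) pairs.
log = [("left", 0.0), ("right", 1.0), ("right", 0.0), ("left", 0.0), ("right", 1.0)]
q = update_action_values(log)
best_action = max(q, key=q.get)
print(q, best_action)
```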

Reinforcement Learning II

We take into account not only the immediate rewards produced by an action, but also the delayed rewards that may arrive several time steps later in the sequence.
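
A common way to combine immediate and delayed rewards is the discounted return, in which a reward received t steps later is weighted by gamma^t. The discount factor 0.9 and the reward sequence below are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., computed back to front."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


# The only reward arrives two steps after the first action.
g = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
print(g)  # about 0.81, i.e. 0.9 ** 2
```

The first action still receives credit for the later reward, only less of it, which gives one practical handle on crediting earlier actions for delayed outcomes.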

Data security and Privacy Law

This module introduces basic notions of data anonymization techniques and cases of de-anonymization. Relevant parts of German privacy law are shown. The main goal is to sensitize Data Scientists to the legal implications of their work in order to reduce legal risks.
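
One well-known anonymization criterion is k-anonymity: every combination of quasi-identifying attributes must be shared by at least k records. Whether the module covers this exact criterion is an assumption; the records and column names below are invented for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Smallest group size when records are grouped by the quasi-identifier columns."""
    groups = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented, already generalized records (truncated zip code, age range).
records = [
    {"zip": "101**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "101**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "102**", "age": "40-49", "diagnosis": "flu"},
]
k = k_anonymity(records, ["zip", "age"])
print(k)  # 1
```

A result of 1 means at least one person is unique in the released attributes and could be re-identified by linking with another dataset.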