David Anderson

David Anderson began his career as a research scientist at Carnegie Mellon University, Mitsubishi Electric Research Labs, and Sun Labs. A former CTO and Chief Architect, David has been leading the development of data-intensive applications for companies and startups across Europe, in areas such as agile methodologies, relational databases, distributed databases, and recommendation engines. He is also responsible for User Education at data Artisans, where he helps users learn how to get the most out of Apache Flink.
David holds a Master's degree in Computer Science from Carnegie Mellon University. When not in Berlin, David can be found in the Provence region, France.

https://www.linkedin.com/in/alpinegizmo/

 

REAL-WORLD RECOMMENDER SYSTEMS

Learning outcome

  • How recommenders work, using both content-based and collaborative filtering techniques.
  • How to build recommenders that scale. On platforms where the number of users, the number of items (such as movies, products, or job openings), or both may run into the millions, thinking about scaling is essential.
  • How to factor in business concerns such as pricing, inventory, seasonality, new items, new users, popular products, serendipity, and coverage.
  • How to tune and evaluate a recommender.
  • How to generate recommendations in real-time.
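
As a rough illustration of the collaborative filtering technique named in the first outcome, here is a minimal item-based sketch in plain Python. The toy ratings, the function names (`item_vectors`, `cosine`, `recommend`), and the choice of cosine similarity are assumptions made for this example; they are not taken from the course material, and a production recommender would need the scaling and business considerations listed above.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
RATINGS = {
    "ann": {"A": 5.0, "B": 4.0},
    "bob": {"A": 4.0, "B": 5.0, "C": 2.0},
    "cat": {"B": 1.0, "C": 5.0},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors[item][user] = rating
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=3):
    """Rank unseen items by similarity to the items the user has rated."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in seen:
            continue
        # Weight each similarity by the rating the user gave the seen item.
        scores[candidate] = sum(
            cosine(vec, vectors[item]) * rating for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Calling `recommend("ann", RATINGS)` ranks the items that "ann" has not yet rated. A content-based recommender would instead compare item attributes (genre, price, description) rather than rating patterns.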

Content

Recommendations are used in many industries, such as ecommerce, jobs, music, and social media. This course goes beyond the basics and emphasizes solutions to problems you will face when your business deploys a recommender system.

Pre-requisites

  • Knowledge of at least one programming language, ideally Python.

     

 

BIG DATA PROCESSING WITH SPARK

Learning outcome

  • Understand how to put together a data pipeline
  • Understand the bottlenecks of common data processing operations
  • Be able to deploy Spark to a cluster

Content

Apache Spark is a distributed computing system written in Scala, initially developed as a UC Berkeley research project for distributed data programming. It has grown in capabilities and recently became a top-level Apache project. Spark is now one of the fastest-growing open source projects. In this workshop, developers will use hands-on exercises to learn the principles of Spark programming.

INTRO TO BIG DATA

  • Hadoop
  • Hadoop ecosystem
  • HDFS
  • map-reduce
  • YARN
  • Zookeeper
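
To give a feel for the map-reduce model listed above, the following is a minimal single-process sketch of a word count in plain Python. It only imitates the three phases (map, shuffle, reduce) in memory; it does not use Hadoop, and the function names are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce functions run in parallel on different machines and the shuffle moves data across the network, which is where many of the bottlenecks of common data processing operations arise.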

WORKING WITH HDFS

  • Spark basics
  • RDDs
  • API
  • Spark data frames

SPARK SQL

  • Loading data
  • Simple queries

Pre-requisites

  • Knowledge of at least one programming language, ideally Python.