High-performance Spark

Spark has become the de facto standard framework for Big Data processing. With a rich set of primitive operations and an optimized execution pipeline, it helps companies speed up their data analysis pipelines.

But after such a disruptive technological shift, companies are also suffering from a lack of skilled Spark programmers.

It isn’t difficult to learn the Spark primitives and write a few Spark programs. But when things get real, your program runs on terabytes of data and things start to fail...

What do you do? Convince your boss to call a Spark consultant? Simply panic?

In this course, we will use Spark for real: execute our programs on a cluster, optimize data pipelines, tweak configuration parameters, and analyze failures. You will learn what it takes to become a seasoned Spark engineer, and your market value will grow with it.

What we will do

  • Spark / RDD API
  • Spark SQL / Datasets / Project Tungsten
  • Configuration
  • Memory (on-heap, off-heap)
  • Common tweaks
  • Current bugs and their workarounds
  • Optimization techniques
  • Dealing with lazy evaluation properly
  • Smart partitioning
  • Caching
  • Spark limitations and how to deal with them
  • On-cluster execution
  • Analysis
  • Debugging
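Lazy evaluation, one of the topics above, is worth a taste up front: Spark transformations (map, filter, ...) only describe a computation, and no work happens until an action (count, collect, reduce, ...) runs. A minimal sketch of the same principle in plain Python, using generators as a stand-in for RDDs (no Spark installation required; the names here are illustrative):

```python
# Spark-style laziness sketched with Python generators:
# building the pipeline does no work; consuming it does.

log = []  # records when elements are actually produced

def read_numbers(n):
    """Stand-in for a lazy data source (like reading an RDD partition)."""
    for i in range(n):
        log.append(i)  # side effect proves when work really happens
        yield i

# "Transformations": filter the even numbers, then double them.
# Nothing executes here -- we only build a lazy pipeline.
pipeline = (i * 2 for i in read_numbers(5) if i % 2 == 0)
assert log == []  # no element has been read yet

# "Action": summing forces the whole chain to run, element by element.
result = sum(pipeline)
assert result == 12          # 0*2 + 2*2 + 4*2
assert log == [0, 1, 2, 3, 4]  # the source was consumed only now
```

The practical consequence, which we will exploit in the course, is that Spark can fuse chained transformations into a single pass over the data, but it also means a bug in an early transformation only surfaces when a later action finally triggers execution.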