Apache Spark is the reference Big Data framework. With a rich set of primitive operations and an optimized execution engine, it helps companies speed up their data analysis pipelines.
But after such a disruptive technological shift, companies are also suffering from a lack of skilled Spark programmers.
It isn’t difficult to learn the Spark primitives and write a few Spark programs. But when things get real and your program runs over terabytes of data, things start to fail...
What do you do? Convince your boss to call a Spark consultant? Simply panic?
In this course, we will use Spark for real: execute programs on a cluster, optimize data pipelines, tweak configuration parameters, and analyze failures. You will learn what it takes to become a seasoned Spark engineer, and your market value will grow accordingly.
What we will do
- Spark / RDD API
- Spark SQL / Datasets / Project Tungsten
- Memory (on-heap, off-heap)
- Common tweaks
- Current bugs and their workarounds
- Optimization techniques
- Dealing properly with lazy evaluation
- Smart partitioning
- Spark limitations and how to deal with them
- On-cluster execution