Big data processing with Spark

Apache Spark is a distributed computing system written in Scala, initially developed as a UC Berkeley research project for distributed data programming. It has since grown in capabilities and became a top-level Apache project in 2014; it is now one of the most active open source projects in big data. In this workshop, developers will use hands-on exercises to learn the principles of Spark programming. Prerequisites: knowledge of at least one programming language, ideally Python.

Intro to Big Data 

  • Hadoop
  • Hadoop ecosystem
  • HDFS
  • MapReduce
  • YARN
  • Zookeeper
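MapReduce, the processing model Hadoop popularized, splits a job into a map phase that emits key-value pairs and a reduce phase that aggregates the values for each key. The classic word-count job can be sketched in plain Python; the function names here are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reducer: group pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job the mapper and reducer run on different machines, with a shuffle step between them that routes all pairs for a given key to the same reducer.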

Working with Spark

  • Spark basics
  • RDDs
  • The Spark API
  • Spark DataFrames
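Spark's RDD API builds a lazy chain of transformations (such as `map` and `filter`) that is only executed when an action (such as `collect`) is called. As a rough mental model, the pattern can be sketched with a toy class in plain Python (this is not the real `pyspark` RDD, just an illustration of its lazy-evaluation behavior):

```python
class ToyRDD:
    """A toy stand-in for Spark's RDD: lazy transformations, eager actions."""

    def __init__(self, data):
        self._compute = lambda: iter(data)

    def map(self, fn):
        # Transformation: record the step, but compute nothing yet.
        prev = self._compute
        rdd = ToyRDD([])
        rdd._compute = lambda: (fn(x) for x in prev())
        return rdd

    def filter(self, pred):
        prev = self._compute
        rdd = ToyRDD([])
        rdd._compute = lambda: (x for x in prev() if pred(x))
        return rdd

    def collect(self):
        # Action: materializes the whole transformation chain.
        return list(self._compute())

result = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

The real PySpark version reads almost identically: `sc.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()`, except that the work is partitioned across a cluster.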

Spark SQL

  • Loading data
  • Simple queries
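Spark SQL lets you register a DataFrame as a temporary view and query it with standard SQL. The query side of that workflow can be previewed with the stdlib `sqlite3` module; the table name and columns below are illustrative examples, not part of the workshop dataset:

```python
import sqlite3

# Stand-in for a Spark SQL session: same SQL, different engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ada", 36), ("Grace", 45), ("Alan", 41)])

# A "simple query" of the kind covered in this section.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 40 ORDER BY name").fetchall()
print(rows)  # [('Alan',), ('Grace',)]
```

In Spark the same query would run distributed over a DataFrame loaded from HDFS or another source, but the SQL itself is unchanged.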

After participating in this workshop you should be able to:

  • Put together a data pipeline
  • Identify the bottlenecks of common data processing operations
  • Deploy Spark to a cluster