Making Big Data Processing Simple with Spark, with Matei Zaharia

As data volumes grow, we need programming tools for parallel applications that are as easy to use and versatile as those for single machines. The Spark project started at UC Berkeley to meet these goals. Spark is based on two main ideas. First, it has a language-integrated API in Python, Java, Scala and R, based on functional programming, that makes it easy to build applications out of functions to run on a cluster. Second, it offers a general engine that can support streaming, batch, and interactive computations, as well as advanced analytics such as machine learning, and lets users combine them in one program. Since its release in 2010, Spark has become a highly active open source project, with over 900 contributors and a broad set of built-in libraries. This talk will cover the main ideas behind the Spark programming model, and recent additions to the project.

Matei Zaharia

Matei Zaharia is an assistant professor of computer science at MIT and CTO of Databricks, the company commercializing Apache Spark. He started the Spark project during his Ph.D. work at UC Berkeley. He is broadly interested in large-scale computer systems and networks, and has also contributed to projects including Mesos, Hadoop, Tachyon and Shark. Matei received the ACM Doctoral Dissertation Award in 2014 for his research, as well as best paper awards at NSDI and SIGCOMM.