Apache Spark – Welcome to the CDH family

The best part of working in our market is the rapid rate of innovation we experience. Ideas from a variety of sources (industry, academia, and sometimes industry spawned from academia, as in the case of our partner Databricks) regularly become mainstream and create entirely new ways to interact with and analyze our data sets. One such innovation that we at Cloudera have been excited about for some time is Apache Spark.

Combining best-of-breed ideas from prior systems with unique innovations that leverage memory effectively, Spark has a rich academic pedigree (for more information, see the papers Spark: Cluster Computing with Working Sets and Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing). The project has also been under active development for more than three years. Spark is now mature enough to be used daily as a production system by many companies, ranging in size from small startups to large technology firms with deep experience in big data problems.

We are proud to announce that Apache Spark is the latest framework to be added to the CDH platform! CDH 4.4 and onwards will support and run Spark, based on the recently released Spark 0.9 (get installation instructions here).

The technology behind Spark

At its core, Spark brings a few key properties to the table:

  • Cluster-wide memory usage: By leveraging memory intelligently to store working sets, Spark can deliver performance improvements of several orders of magnitude over traditional MapReduce jobs for certain classes of problems.
  • Complex execution graphs: The Spark execution engine converts a user program into a complex, multi-stage execution graph that it can intelligently parallelize. Spark can also recognize when multiple stages (or iterations, in the case of certain machine learning problems) reuse the same data sets, and keep those data sets in memory without unnecessary disk operations.
  • Programming model: Besides technical improvements to the execution engine, the Spark project has also focused on a simplified programming model that exposes user-friendly APIs in multiple languages (Java, Scala, and Python). Writing a declarative program and letting the execution engine figure out how to parallelize it can be a very liberating experience! Add in interactive shells that allow developers to prototype quickly, and Spark can be a sheer joy to work with.
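
To make the execution-graph and programming-model points a little more concrete, here is a toy sketch in plain Python. It is not Spark's API: `LazyDataset` and its methods are hypothetical names invented for this illustration, and everything runs in a single process. It only shows the general idea of a program that declares a chain of transformations and leaves it to an engine to decide when and how to run them.

```python
from typing import Callable, Iterable, Iterator, List


class LazyDataset:
    """Hypothetical toy dataset: records transformations, runs them only on demand."""

    def __init__(self, source: Iterable, steps: tuple = ()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn: Callable) -> "LazyDataset":
        # Declaring a step returns a new dataset; no data is touched yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def plan(self) -> List[str]:
        """Return the recorded steps, i.e. a (linear) execution graph."""
        return [kind for kind, _ in self._steps]

    def _iterate(self) -> Iterator:
        # Chain the steps as generators so each record flows through
        # every stage in one pass, with no intermediate collection.
        stream = iter(self._source)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    def collect(self) -> List:
        """The 'action': only now is the recorded plan actually executed."""
        return list(self._iterate())


numbers = LazyDataset(range(10))
pipeline = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = pipeline.collect()
```

In this sketch, building `pipeline` does no work; the squares are computed and filtered only when `collect()` is called. A real engine would additionally split the plan into stages and spread them across a cluster, which this toy does not attempt.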

For more context on the technology, we heartily recommend prior posts on the Cloudera blog describing the technical architecture of Spark and the motivations behind our adoption of Spark.

Spark as part of an enterprise data hub

Spark enables big improvements to several classes of workloads running on a CDH cluster:

Machine Learning

Machine learning algorithms, which tend to perform iterative computations on the same data sets, can get particularly large wins from Spark adoption. By adding a single command to cache data in memory, algorithms work at memory speeds when iterating over working data sets, which can improve performance by orders of magnitude.
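
The effect of caching on an iterative algorithm can be sketched without Spark at all. The example below is a plain-Python illustration under stated assumptions: `load_points` is a hypothetical stand-in for an expensive read from disk, and `fit_mean` is a deliberately tiny gradient-descent loop chosen only because it re-reads the same data on every iteration. In Spark itself, the caching step corresponds to calling `cache()` on a dataset.

```python
load_count = 0  # counts how often the "expensive" load is performed


def load_points():
    """Hypothetical stand-in for reading the working set from disk."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]


def fit_mean(get_points, iterations=20, learning_rate=0.5):
    """Estimate the mean by gradient descent, fetching the data every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        points = get_points()
        gradient = sum(estimate - p for p in points) / len(points)
        estimate -= learning_rate * gradient
    return estimate


# Without caching: the data is reloaded on every iteration.
uncached_estimate = fit_mean(load_points)
loads_without_cache = load_count

# With caching: load once, then iterate over the in-memory copy.
load_count = 0
cached_points = load_points()
cached_estimate = fit_mean(lambda: cached_points)
loads_with_cache = load_count
```

Both runs arrive at the same estimate, but the uncached run performs one load per iteration while the cached run performs a single load. The more iterations an algorithm needs, and the more expensive each load is, the larger the gap becomes.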

Spark Streaming

A particularly interesting add-on to Spark is Spark Streaming, which integrates with external data ingest mechanisms and uses a micro-batch architecture to perform analytics as data is ingested into the cluster. This allows for streaming computations that can be used to draw insight and take action within seconds of data being generated. Because Spark Streaming uses the same architectural model as Spark, the same program can be used with almost no modifications in both a batch and a streaming context!
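
As a rough illustration of the micro-batch idea, the plain-Python sketch below groups timestamped events into fixed-length windows and applies the same function to each window that it would apply to the full data set in a batch job. This is not the Spark Streaming API: `word_count`, `micro_batches`, and the sample events are all hypothetical, and a real system would receive events continuously rather than from a list.

```python
from collections import Counter, defaultdict


def word_count(lines):
    """The analysis logic: count words in any collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def micro_batches(events, interval_seconds):
    """Group (timestamp, line) events into consecutive fixed-length windows."""
    buckets = defaultdict(list)
    for timestamp, line in events:
        buckets[int(timestamp // interval_seconds)].append(line)
    return [buckets[key] for key in sorted(buckets)]


# Hypothetical events: (seconds since start, text)
events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.4, "batch"),
    (2.1, "spark batch"),
]

# Batch context: run the logic once over everything.
batch_result = word_count(line for _, line in events)

# Streaming context: run the same logic once per one-second micro-batch.
stream_results = [word_count(batch) for batch in micro_batches(events, 1.0)]
```

The point of the sketch is that `word_count` is unchanged between the two contexts; only the way the input is sliced differs.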

Faster Batch

Even for traditional batch analytics, where MapReduce has ruled the roost for so long, Spark can give speed boosts. Where traditional MR pipelines are bound by the need to write to and read from disk between jobs, Spark can parallelize across jobs and avoid unnecessary IO. And when working sets fit in memory, batch jobs get a further boost.

As such, more and more customers are turning to Spark as their system of choice for traditional MapReduce-style analytics in pipelined batch jobs.

Integration with CDH

With the first release of Spark on CDH, we have added Spark parcels that can be easily installed onto the cluster and upgraded by Cloudera Manager with the push of a button. As can be expected, we have big plans for further work on Spark, both at the platform and at the management layer.

While Spark is initially bundled as a separate download, we will be making it a native part of the CDH distribution in the short term, deeply integrated into both the platform and the management stack for even better management and control.

Our partnership with Databricks

All of this has been made possible by the hard work of the Spark community, with contributors from all over the world. Having said that, we would like to particularly thank our partners at Databricks, many of whose employees created and developed the Spark project. We are looking forward to a continued close relationship with the team as we jointly bring Spark to the wider marketplace.

Ready?

You can find links to the download, documentation and installation information here. We hope you try it out, give us your feedback and get involved in its evolution.

Filed under: General