When Cloudera became the first vendor to ship and support Apache Spark in February 2014, Spark was already well on its way toward becoming the framework of choice for faster batch processing, machine learning, advanced analytics, and stream processing. Today many Cloudera customers have begun moving these workloads from MapReduce to Spark in their production systems, and the trend is accelerating.
Cloudera’s philosophy is that customer success is only achieved by directly contributing to the full spectrum of Apache Hadoop ecosystem projects. In doing so, we tap deep expertise across our organization and gain the ability to drive the roadmap in directions important to the organizations we support. Toward those goals, six Clouderans contribute to the Spark project full time. Together with an even wider set of contributors from our training, QA, and field teams, Cloudera employees have had more than 100 patches committed upstream. In addition, the Cloudera Spark team works closely with an Intel Spark team of 20 engineers to prioritize and deliver a combined roadmap.
We have already met significant goals toward enterprise hardening, improved operational management, and expansion of use cases. Some specific focus areas have included:
Making Spark-on-YARN Production Ready
Cloudera engineers have driven the stabilization (SPARK-1011) of the integration between Spark and YARN. These patches enable Spark to run in a multi-framework environment alongside MapReduce and Impala, with dynamic pooling of resources across all frameworks.
Stronger integration between Spark and HDFS caching
Cloudera engineers are driving Spark integration with HDFS caching (SPARK-1767) by making Spark’s scheduling decisions aware of the locations of pinned data. This allows multiple tenants and processing frameworks to share the same in-memory data.
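For context, pinning a dataset into HDFS’s cache is an administrative operation done with the hdfs cacheadmin tool; a minimal sketch (the pool and path names below are illustrative):

```shell
# Create a cache pool, then pin a directory into HDFS's off-heap cache.
# Spark's scheduler can then prefer the nodes holding the cached replicas.
hdfs cacheadmin -addPool analytics-pool
hdfs cacheadmin -addDirective -path /data/hot-table -pool analytics-pool

# Confirm which directives are active
hdfs cacheadmin -listDirectives
```

Because the cached replicas live in off-heap memory managed by the DataNodes, any framework reading that path can benefit, not just Spark.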
More robust integration between Spark Streaming and Apache Flume
Cloudera engineers have upgraded Spark Streaming’s Flume receiver from a push to a pull model, eliminating corner cases that could result in data loss in pipelines where data ingested through Flume feeds into Spark Streaming.
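With the pull model, Flume writes events into a custom sink that buffers them until the Spark Streaming receiver pulls and acknowledges them. A sketch of the Flume agent configuration (agent name, hostname, and port below are illustrative):

```properties
# Flume agent config: a SparkSink holds events until Spark Streaming
# pulls them, so a slow or restarting receiver does not drop data.
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = flume-host.example.com
agent.sinks.spark.port = 42424
agent.sinks.spark.channel = memoryChannel
```

On the Spark side, the application connects to this sink with the polling variant of the Flume stream rather than listening for pushed events.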
A standardized interface for launching Spark applications
Cloudera engineers originated spark-submit, the now-standard method for launching Spark applications, which abstracts across all of Spark’s supported deploy modes and cluster managers.
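As an illustration, the same application jar can be sent to different cluster managers by changing only the launch flags (the class and jar names below are hypothetical):

```shell
# Launch on a YARN cluster, with the driver running inside the cluster
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  my-app.jar

# The same jar, run locally for testing -- only the --master changes
spark-submit --class com.example.MyApp --master local[2] my-app.jar
```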
Based on our unique experiences in making organizations successful with Spark, our roadmap includes a variety of projects, many already underway, to both improve Spark and also better integrate it with the Hadoop ecosystem. Near term efforts include:
Bringing Spark to Apache Hive as an execution engine
Along with Databricks, IBM, Intel, and MapR, Cloudera was a co-founder of the effort to bring the Spark execution engine to Hive for faster ETL/batch processing. This initiative is in progress upstream (HIVE-7292) with an expected completion timeframe of Q1 2015.
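Because this work follows Hive’s existing pluggable-engine model, we expect selecting Spark to look like today’s engine switch, on a per-session or per-query basis (the exact property value is subject to the upstream work):

```sql
-- In a Hive session: run subsequent queries on the Spark engine
-- instead of MapReduce (expected shape, per the HIVE-7292 design)
SET hive.execution.engine=spark;
```

Existing HiveQL, UDFs, and table definitions would be unchanged; only the underlying execution engine differs.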
Synthesizing configuration and metrics data to provide higher-level tuning and debugging information
Drawing on our experience tracking down customer issues, SPARK-3682 seeks to build a centralized UI that turns the flood of fine-grained data provided by the Spark execution engine into actionable recommendations for improving performance and preventing failures.
Lossless Spark Streaming
SPARK-3129 seeks to bring lossless data recovery to the Spark Streaming driver, eliminating the possibility of data loss when the driver goes down.
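The approach pairs driver checkpointing with a write-ahead log for received data, so that data buffered but not yet processed at the time of a driver failure can be replayed. Once this lands, we expect enabling it to be a matter of configuration along these lines (the property name below follows the SPARK-3129 work and should be treated as provisional):

```properties
# spark-defaults.conf: write received data to a write-ahead log on
# reliable storage before acknowledging it, so a restarted driver
# can replay unprocessed data (provisional property name)
spark.streaming.receiver.writeAheadLog.enable  true
```

The application must also configure a checkpoint directory on reliable storage (e.g., HDFS) for the driver state itself.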
Integration with YARN Application Timeline Server
Integrating Spark with YARN ATS (SPARK-1537) will allow users to access job history from a central location, without the need for a separate server.
Reporting metrics incrementally
A common pain point for our customers has been a lack of insight into the behavior of running tasks. A set of improvements will enable tasks to report metrics before they complete.
Dynamic resource management through finer-grained integration with YARN
Spark applications currently grab static chunks of resources from YARN at startup. Two initiatives will allow Spark applications both to be better citizens within multi-tenant environments and to scale dynamically with changing resource requirements over their lifetimes. SPARK-3174, a joint effort with Databricks, allows Spark applications to dynamically add and remove executors. YARN-1197 and associated work on the Spark side will enable requesting and releasing resources within executors.
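Once SPARK-3174 is available, we expect executor scaling to be driven by configuration along these lines (property names follow that work and should be treated as provisional; the bounds shown are illustrative):

```properties
# spark-defaults.conf: let Spark grow and shrink its executor count
# between the configured bounds based on pending work
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   20

# External shuffle service, so shuffle output survives the removal
# of the executor that produced it
spark.shuffle.service.enabled          true
```

The external shuffle service is the key enabler here: without it, removing an idle executor would also discard shuffle data that downstream stages still need.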
A Matter of Time
We’re excited to see near-universal adoption of Apache Spark among Hadoop distributions, and alignment around Cloudera’s vision of Spark as the successor to MapReduce as the general execution engine for data processing on Hadoop. Combined with the work Cloudera is doing to move additional workloads such as Hive, Pig, Solr, Crunch, and others to Spark, more and more users can soon expect to gain the advantages Apache Spark brings to the enterprise.