Back in late 2013, I wrote about the rise of Apache Spark in the Apache Hadoop ecosystem. MapReduce had been hugely successful for many years as the data processing and transformation workhorse in Hadoop, but it had problems: MapReduce was hard to program, and it had some design flaws that meant it could never handle real-time or interactive workloads. Spark, created at UC Berkeley’s AMPLab in 2009, was designed expressly to address those shortcomings. It learned from both the success and the limitations of MapReduce.
By the time of my 2013 post, we had already added Spark to our commercial Hadoop-based offering. We believed, by then, that it would supplant MapReduce for general-purpose workloads on the Hadoop platform. We were the first vendor in the market to make that bet, but the others have followed. Spark is now distributed by all of the Hadoop vendors.
Spark catches fire
Since we first shipped Spark as part of CDH nearly two years ago, we’ve seen two exciting, and important, trends.
First, adoption is exploding. Cloudera has more than 150 customers using Spark in production today. User demand is high, and we’re fulfilling that demand at a furious pace. Some of those customers are brand new to big data, but far more of them use Spark as a complement to the rest of the Cloudera platform.
Second, Spark is maturing. That’s driven substantially by the activity of the global open source developer community: Spark is the most active Apache Software Foundation project ever. Over the last six months, Spark has been (by JIRA count) fifty percent more active than the core Hadoop project itself. All of those programmers are doing important work. Cloudera’s been an enthusiastic participant: we have more Spark committers on staff than all the other Hadoop vendors combined.
All that work is good, because enterprise users need a more mature platform. As great as the project was when it first emerged from Berkeley, it was research software, used mostly by PhD students to test and validate academic ideas. Stability, reliability, security, integration with commercial storage systems and more still needed work. Those gaps aren’t closed entirely, but we’ve made huge progress as a community.
Today, Cloudera’s customers use Spark for portfolio valuation, tracking fraud and handling compliance in finance, delivering expert recommendations to doctors in healthcare, analyzing risk more precisely in insurance and more. Spark is the fastest-growing single component in our platform, when we count the number of customers adding or expanding big data use cases.
Our 2013 bet is paying off.
The One Platform Initiative
Given that success, we’re doubling down on Spark. We invested earliest, and we’ve invested most, in making Hadoop enterprise-grade. Today, we’re announcing a new initiative to advance Spark in the same way, with the support and participation of the global Hadoop community. We call it the One Platform Initiative to highlight the fact that Spark is a key part of current and future Hadoop — and that the Hadoop ecosystem offers rich services which Spark, on its own, lacks. We need them more deeply integrated to deliver maximum return on big data.
We’re investing in key areas of development and innovation in Spark — filling in gaps and extending the project. We’ll improve integration with enterprise-grade capabilities in the broader Hadoop platform, so that Spark applications and users get full benefit of their big data infrastructure.
We’ll concentrate on four key areas: security, scale, management and streaming.
Security
Strong security is an absolute requirement for many enterprises planning to use Spark. It must be able to guarantee data privacy, grant and revoke access privileges correctly, track data access accurately and report reliably when regulators ask.
We’ve invested substantially in securing Spark already (see SPARK-6017 and SPARK-5682) — as I said above, we have customers in regulated industry running Spark in production today. There’s more that we need to do, though, including:
encryption of data at rest, including RDDs that spill to disk and shuffle data;
encryption of data during transmission over a network; and
secure access to the WebUI.
For efficient encryption with low performance overhead, and thanks to our unique relationship with Intel, we’ll also take advantage of Intel’s Advanced Encryption libraries. Those libraries use arithmetic support on the chip to speed up computation dramatically.
Spark SQL is often used to pull data from Hive tables. We’ve added an Apache Sentry HDFS plugin to ensure that a Spark user can’t get around security and privacy policies on those tables. We’ll extend the table-level protection to column- and view-level security.
Long-running Spark Streaming jobs have their own security requirements. We’ve added automatic credential renewal for long-running stream jobs. We’ll add authenticated connections for the Apache Kafka Direct Connector soon.
Scale
Spark dramatically outperforms MapReduce on latency and throughput, but today it simply can’t compete with MapReduce on scale. Our big customers are chewing through petabytes of data on thousands of nodes with MR jobs. If Spark is genuinely to replace MapReduce for general-purpose workloads, it has to scale that big, and bigger, in the Hadoop ecosystem.
The One Platform Initiative sets ambitious goals for Spark scalability: handle jobs on thousands of executors each, running simultaneously on large multi-tenant clusters with over 10,000 nodes. Running at that scale requires improvements to a variety of components in the Spark execution framework.
YARN is the resource management framework for Hadoop. As with Spark, Cloudera was the first Hadoop vendor to ship YARN to customers. YARN handles resource allocation and management among all of the engines we ship in CDH, so that multi-tenant and multi-framework applications run well.
Over the last year, we’ve made Spark’s Dynamic Executor Allocation on YARN production-ready. We’ve improved integration with HDFS to enable scheduling based on data locality (SPARK-4352, SPARK-1767) and cached data. When we tested current-generation Spark at scale, though, we uncovered design and implementation issues in the core scheduler. We are now revamping the scheduler logic to ensure resiliency when tasks fail (SPARK-5259, SPARK-5945, SPARK-8029, SPARK-8103). We’re augmenting Spark’s internal data movement to handle large datasets more quickly and more reliably.
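For readers who want to experiment with dynamic executor allocation on YARN, the settings involved look roughly like the sketch below. It is an illustration under stated assumptions, not a recommended configuration: the property names are standard Spark configuration keys, but the helper functions (`dynamic_allocation_conf`, `to_submit_args`) and the executor bounds are hypothetical and chosen only for the example.

```python
def dynamic_allocation_conf(min_executors=2, max_executors=200, idle_timeout_s=60):
    """Return Spark properties that turn on dynamic executor allocation.

    The numeric defaults are illustrative assumptions, not tuned values.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        # Let Spark grow and shrink the executor pool with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


def to_submit_args(conf):
    """Render properties as `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(dynamic_allocation_conf())))
```

The right bounds depend on the cluster and on how many tenants share it, so treat the numbers above as placeholders.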
In collaboration with our close partner Intel, we’ve set up large clusters for scale testing of Spark. There’s still work to do on dynamic resource allocation and on other features aimed at cluster-, job- and task-level elasticity. Among those enhancements, to reduce memory pressure on large Spark jobs, we’ll take advantage of HDFS’ Discardable Distributed Memory support.
Tackling scale entails work not only within Spark, but also in the tooling around Spark — tools are part of the One Platform Initiative.
The Spark Job History Server needs to be substantially improved to scale to thousands of jobs running thousands of tasks across tens of thousands of nodes. We’ll also need the libraries and components that Spark uses to perform well, and to scale. To that end, with Intel, we’re investing in hardware optimization across the platform for next-generation CPUs, memory, storage and networking. We’re taking advantage of low-level enhancements like the Intel Math Kernel Library to boost performance for common Spark workloads like machine learning and numerical analysis.
Management
People are justifiably excited about Spark’s capacity to do high-powered analytics. What is often lost, in all that excitement, is the practical challenge of deploying and operating Spark in production. Large-scale distributed systems are hard; good management tooling is critical.
Integrating Spark with YARN was a no-brainer. YARN lets users centralize deployment and management. Users can dynamically share the same pool of cluster resources for Spark, MapReduce, Impala, Hive, Pig, Search and other frameworks. Cloudera engineers were closely involved in the Spark/YARN integration (SPARK-1101, SPARK-3492), especially on stability and operations. We built a single common interface for launching Spark applications and added additional metrics for easy diagnostics.
As above, though, there is more work to be done.
We’ll improve Spark-on-YARN for better multi-tenancy, performance and ease of use.
Spark needs to report more consumption and utilization metrics so that operators can keep jobs on the cluster running well.
Spark is too knobby: it has too many tuning parameters, and they need constant adjustment as workloads, data volumes and user counts change. We can do much to automate configuration. The platform needs to learn about usage patterns as they happen, and to optimize itself over time.
Python is a very popular language among data scientists, and PySpark is a very popular package. Getting the right Python language libraries installed for use is too hard today. The management system can absolutely do a better job there.
Streaming
Streaming workloads, such as internet-of-things (IoT) data ingest and similar real-time data capture and analysis, represent nearly a quarter of the total production deployment of Spark among Cloudera customers. Making Spark Streaming resilient (SPARK-4026, SPARK-4027, SPARK-6222), so that service outages don’t cause data loss, is crucial to those customers. At Cloudera, we’ve invested substantially in those improvements. We also added integration with the Flume data ingest framework so new applications can take advantage of existing data pipelines.
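One common way to keep an outage from losing received data is a write-ahead log: every event is written durably before it is acknowledged, so a restarted process can replay what it had already accepted. The toy sketch below illustrates that general idea only. It is not Spark Streaming's implementation, and the `WriteAheadLog` and `Receiver` classes are hypothetical names invented for this example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log: a record is on disk before append() returns."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before returning

    def replay(self):
        """Return every record logged so far, oldest first."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class Receiver:
    """Buffers incoming events in memory, logging each one first."""

    def __init__(self, wal):
        self.wal = wal
        # On startup, rebuild the in-memory buffer from the log.
        self.buffer = wal.replay()

    def receive(self, event):
        self.wal.append(event)     # durable first
        self.buffer.append(event)  # then visible to processing


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "events.wal")
    first = Receiver(WriteAheadLog(log_path))
    first.receive({"sensor": "a", "value": 1})
    first.receive({"sensor": "b", "value": 2})
    del first  # simulate a crash: the in-memory buffer is gone

    recovered = Receiver(WriteAheadLog(log_path))
    print(recovered.buffer)  # both events are recovered from the log
```

The cost of this design is an extra synchronous write per event, which is why such logging is usually something operators opt into rather than a default.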
Even though Spark is fast, there’s room for improvement in stream processing. Performance will continue to be a focus area across the platform, but in Spark Streaming in particular, there are some obvious changes we can make, in persistent mutable state management and elsewhere, that will deliver some big benefits.
Real-time workloads like that are a huge growth area for big data. It’s still too hard, though, to build stream processing pipelines — you have to be a programmer. We need to make it easy for non-coders, including business analysts, to do that for themselves. We’ll add higher-level language extensions, a simple declarative interface and simple authoring tools to make that happen.
Stream ingest jobs generally run for days, months or years. Updates to the application code and underlying platform can’t shut those jobs down. The platform has to allow new releases to be installed without taking the service offline.
So Hadoop is dead, right?
That would be a fun story, but no.
When Google originally built the system that Hadoop is based on, it created two things: a scale-out storage platform (in Hadoop, this is called HDFS), and a general-purpose processing framework called MapReduce.
MapReduce was first, but there was no reason it had to be the only choice. The notion of “send the compute to the data” works just as well for distributed SQL query processing (see the Impala framework), distributed document indexing and search (Apache Solr is the basis for our Cloudera Search offering), and even third-party proprietary engines (the SAS High-Performance Analytic Server). All of these run on top of HDFS. All take advantage of massively parallel processing. All can work together on the same data.
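The idea of sending the compute to the data can be sketched in a few lines. The example below is a simplified illustration, not how any of these engines actually schedules work: the `assign_tasks` function, the node names and the slot counts are all hypothetical. It assigns one task per data block and prefers a node that already stores a replica of that block, falling back to a remote node only when no local slot is free.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding a replica.
    free_slots: dict of node -> number of tasks the node can still accept.
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy; the caller's dict is untouched
    assignment = {}
    for block in sorted(block_locations):
        # Nodes that hold a replica of this block and still have capacity.
        local = [n for n in block_locations[block] if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: (slots[n], n))
            is_local = True
        else:
            # No local capacity: the data must travel over the network.
            remote = [n for n in slots if slots[n] > 0]
            if not remote:
                raise RuntimeError("no free slots left in the cluster")
            node = max(remote, key=lambda n: (slots[n], n))
            is_local = False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
    slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
    for block, (node, is_local) in assign_tasks(blocks, slots).items():
        print(block, node, "local" if is_local else "remote")
```

Real schedulers also weigh rack locality, waiting time and fairness between tenants; the sketch ignores all of that to keep the core preference visible.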
Hadoop benefits precisely because it is so flexible: it supports a variety of different frameworks. Spark joins that collection as part of the One Platform Initiative. It’s integrated with the underlying storage layer, so it can share data easily with the rest of the frameworks. Security, governance, management and operations work across them all. Spark is a powerful and flexible addition, to be sure, but far from threatening Hadoop, it actually makes the Hadoop ecosystem much better.
The open source Hadoop platform is, hands down, the most interesting innovation in data management in the last two decades. Since its inception, it’s evolved dramatically — in its use of hardware, in the kinds and quantities of data it can handle, and in the variety of analytic tools it offers users. There’s no question that innovation will continue.
It will be driven, in no small part, by the great work of the global open source community. The emerging big data requirements coming from large enterprise users, though, offer direction to the community: What workloads matter? What data questions are hard to answer?
That will be our focus here at Cloudera. We don’t work alone — the greater community, and Databricks especially, are also investing deeply in Spark. We expect the community to grow further, and we encourage others to get involved.
We’re convinced that Hadoop is the platform that will dominate the landscape in the next decade. Spark is a key part of that platform. Bringing it in more completely, and delivering the same security, governance, operational and other strengths Hadoop offers, is crucial.
If you’d like to learn more about the One Platform Initiative:
Register for the webinar on September 24
Follow our blog for regular updates on the future of Spark
It’s been just about two years since we first welcomed Spark to the family. I am excited to see what happens over the coming years!