About a week ago, I posted an article on Cloudera’s strategy on SQL in the Apache Hadoop ecosystem. In the article, I argued that a special-purpose distributed query processing engine will perform better than one that translates work into a general-purpose MapReduce framework, even if MapReduce is improved to trim latency and improve throughput.
Notwithstanding that bet, we simultaneously believe that the ecosystem needs a high-performance alternative to the current MapReduce implementation. That view is shared by the community generally.
In this piece, I want to walk through the history, the current status and the short- and long-term future of the Hadoop platform, concentrating especially on MapReduce.
Where We Came From
The earliest instance of the architecture at Google combined flexible, scalable storage with a single processing framework — MapReduce — to handle a wide variety of processing and analytic workloads. When Doug Cutting and Mike Cafarella co-founded the Apache Hadoop project in 2005, they adopted precisely that architecture.
Projects like Apache Pig and Apache Hive were built to turn special-purpose queries into jobs that ran on the general-purpose MapReduce framework. They inherited the scalability, fault tolerance and throughput (excellent!) and the latency (ouch!) of MapReduce. The latency of Hive in particular meant that it couldn’t serve interactive applications.
Because real-time performance matters for so many workloads, the community began to create alternative engines that ran alongside MapReduce on the same hardware, and that could use the same data in place. These were often special-purpose engines aimed at narrower problems than MapReduce. One of the most mature, and without question the most widely used, of these is Apache HBase. HBase provides very fast record storage and retrieval services. Other examples include Impala, the Apache Solr engine integrated into Cloudera Search, and Storm.
Where We Are
MapReduce, as conceived at Google and implemented in Hadoop, is an enormously popular and widely-used engine. There are plenty of applications today that know how to decompose their work into a series of MapReduce jobs. Those applications, in production use by large enterprises, must continue to run without change.
Enthusiasm for an Enterprise Data Hub, though, and even for the Hadoop project, is tempered somewhat by the long-standing complaint about MapReduce: high-latency, batch-mode response is painful for lots of applications that want to process and analyze data.
What the Hadoop ecosystem needs is a successor system that is more powerful, more flexible and more real-time than MapReduce. While not every current (or maybe even future) application will abandon the MapReduce framework of today, new applications could use such a general-purpose engine to be faster and to do more than is possible with MapReduce.
The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.
If you’re not interested in technical detail, skip ahead to the next section.
Original MapReduce executed jobs in a simple but rigid structure: a processing or transform step (“map”), a synchronization step (“shuffle”), and a step to combine results from all the nodes in a cluster (“reduce”). If you wanted to do something complicated, you had to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely. Complex, multi-stage applications were distressingly slow.
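The rigid three-stage structure can be sketched in a few lines of plain Python. This is an illustration of the model only, not Hadoop's actual API; the word-count example and function names are hypothetical.

```python
from collections import defaultdict

# Map: transform each input line into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group all values by key -- the synchronization barrier
# between map and reduce.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine the grouped values into one result per key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
print(result["the"])  # -> 2
```

The key constraint is visible in the last line: each phase consumes the complete output of the previous one, so nothing downstream can begin until the barrier clears. A multi-stage application pays that cost at every job boundary.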
An alternative approach is to let programmers construct complex, multi-step directed acyclic graphs (DAGs) of work that must be done, and to execute those DAGs all at once, not step by step. This eliminates the costly synchronization required by MapReduce and makes applications much easier to build. Prior research on DAG engines includes Dryad, a Microsoft Research project used internally at Microsoft for its Bing search engine and other hosted services.
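To see why a DAG helps, consider a toy execution plan, sketched here with Python's standard `graphlib` module. The task names are hypothetical; the point is that the scheduler sees the whole graph up front.

```python
from graphlib import TopologicalSorter

# A DAG of work: each task maps to the set of tasks it depends on.
dag = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "stats": {"clean"},
    "report": {"features", "stats"},
}

# Because the full plan is known in advance, a scheduler can run
# independent branches ("features" and "stats") as soon as their
# inputs are ready, rather than forcing a global barrier between
# every pair of steps as chained MapReduce jobs would.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In a chained-MapReduce world, each of these five steps would be a separate job, and "report" could not even be scheduled until both of its inputs had been written out to disk and read back in.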
Spark builds on those ideas but adds some important innovative features. For example, Spark supports in-memory data sharing across DAGs, so that different jobs can work with the same data at very high speed. It even allows cyclic data flows. As a result, Spark handles iterative graph algorithms (think social network analysis), machine learning and stream processing extremely well. Those have been cumbersome to build on MapReduce and even on other DAG engines. They are very popular applications in the Hadoop ecosystem, so simplicity and performance matter.
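The value of in-memory data sharing shows up most clearly in iterative algorithms. The sketch below runs a toy PageRank-style iteration in plain Python; it is not Spark code (the graph and constants are invented for illustration), but it shows the access pattern Spark accelerates: the same dataset is reused on every pass.

```python
# Toy iterative rank computation. The link graph stays in memory
# across iterations -- the property Spark's cached datasets provide.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

for _ in range(10):  # each pass reuses the in-memory graph
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)
        for dest in outlinks:
            contribs[dest] += share
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

# On MapReduce, every one of those ten iterations would be a separate
# job that re-reads its input from disk and writes its output back out.
```

Ten iterations here are ten cheap in-memory passes; as ten chained MapReduce jobs, each would pay full job-launch and disk I/O costs.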
Spark began as a research project at UC Berkeley’s AMPLab in 2009 and was released as open source in 2010. With several years of real use, it’s had plenty of time to mature. Advanced features available in the latest release include stream processing, fast fault recovery, language-integrated APIs, optimized scheduling and data transfer, and more.
One of the most interesting features of Spark — and the reason we believe it is such a powerful addition to the engines available for the Enterprise Data Hub — is its smart use of memory. MapReduce has always worked primarily with data stored on disk. Spark, by contrast, can exploit the considerable amount of RAM that is spread across all the nodes in a cluster. It is smart about use of disk for overflow data and persistence. That gives Spark huge performance advantages for many workloads.
Why Not Improve MapReduce?
My Impala post explained our thought process when we decided to create a new open source project for SQL in the Hadoop ecosystem, rather than evolve the older Hive project. We were equally careful in thinking through the alternatives for a next-generation processing engine that eliminated the shortcomings of MapReduce.
The Hadoop community has made substantial changes to MapReduce over the last two years. A new version, known as MR2, ships as part of the Hadoop 2.0 release. That work, however, left the batch properties of the original implementation firmly in place. Key assumptions were just too deeply buried in the code. Software developers talk about “technical debt” in older code bases — a collection of decisions on design and implementation over many years, all made to solve problems of the moment and to address the urgent requirements of the past. MapReduce is, in that sense, deeply in arrears.
Redesigning the software in place, without breaking the applications that depend on it today, looked to us to be an extraordinarily difficult undertaking — risky for the developers, for the team here that supports our customers and for those customers themselves. Starting on that effort now would mean several years of laborious design and implementation before the code worked reliably in production.
Creating an entirely new (and debt-free!) code base, designed for the current- and next-generation workloads we see emerging, would be simpler, faster and much less risky. The only question, at that point, was whether we should create a new project as we did with Impala, or work on an existing one.
The choice there was easy. Spark is well-designed, has been in use for several years and is moving very fast. With more than one hundred contributors to the project working for more than twenty-five different companies, Spark has a thriving community today — the Spark Summit in early December was packed with great content. Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.
The Near Future
Earlier this year, the research team behind Spark created a company, Databricks, to develop and support Spark in mission-critical production use. Databricks has posted some considerable successes since its creation, including an impressive round of fundraising.
We announced a strategic partnership with Databricks in October to include Spark as an integrated part of Cloudera’s product, jointly supported by the two companies. Spark’s installed base, already considerable, continues to grow. The software is deployed in production at some of Cloudera’s largest enterprise customers already.
There’s plenty of interest in Spark today, and we expect uptake to accelerate with version 5 of Cloudera’s Enterprise Data Hub, which we announced in October. 2014 will see the introduction of new tools and applications, built by third parties, for business users to take advantage of the analytic and processing power that the platform offers. We expect other players in the market to announce formal support for Spark as well. We expect to see nascent competitive offerings, whether proprietary or open source, lose momentum; investing in both Spark and an alternative is clearly wasted effort.
We recognized when we introduced Impala that the installed base for Hive would demand continued support and investment. So it is for MapReduce, despite the emergence of Spark: There are too many MapReduce applications in production, and far too much business value riding on them, for the first-generation product to fade away.
We have long worked closely with the community on features and fixes for the original project. We will continue to do so.
In the long term, though, we expect the fraction of cycles spent on MapReduce to diminish, and the fraction spent on new frameworks — Impala and Spark for sure — to increase. This is no surprise, really. The workloads running at some of the biggest, and earliest, adopters of this platform architecture have shifted in exactly this way. Google and Facebook, to name just two, have seen workload percentages shift away from MapReduce and toward new frameworks (Pregel and Dremel at Google, Presto at Facebook) that are designed with the lessons learned from generation one.
Updated 12/30/2013: An earlier draft listed Pregel as a DAG execution engine at Google, but Pregel is a graph processing engine. Thanks to William Vambenepe (@vambenepe) for the correction.