If you are a big data practitioner, let me confirm something you have strongly suspected: Apache Spark will replace MapReduce as the general purpose data processing engine for Apache Hadoop.
Spark’s success is due to the combination of dramatic speed improvements with an API that is significantly more intuitive, expressive, and flexible.
Spark was originally written to speed up iterative data processing algorithms, such as those used to train machine learning models. But as Spark matured and stabilized, it established itself as a fast, general-purpose compute engine for processing large volumes of data. This is exemplified by the special-purpose frameworks written on top of Spark for stream and graph processing. Spark Streaming, the framework for processing continuous streams of data, has seen phenomenal adoption over the past year and has established itself as the leading framework for stream processing.
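Much of that speedup for iterative workloads comes from Spark's ability to cache a working dataset in memory across iterations, whereas a chain of MapReduce jobs re-reads its input from disk on every pass. The toy sketch below is plain Python with illustrative helper names, not the actual Spark API; it only demonstrates the load-once, iterate-many pattern:

```python
# Toy illustration (plain Python, not the real Spark API) of why in-memory
# caching matters for iterative algorithms: the dataset is loaded once and
# reused across iterations, instead of being re-read from disk on every
# pass as a chain of MapReduce jobs would do.

def load_dataset():
    # Stand-in for reading (x, y) pairs from HDFS; here y = 3*x exactly.
    return [(x, 3.0 * x) for x in range(1, 101)]

def train(iterations=50, lr=0.0001):
    data = load_dataset()   # "cached": held in memory once, reused below
    w = 0.0                 # single model parameter to fit
    for _ in range(iterations):
        # Gradient of squared error, computed over the cached dataset.
        grad = sum(2 * (w * x - y) * x for x, y in data)
        w -= lr * grad / len(data)
    return w

print(round(train(), 2))    # converges to the true slope, 3.0
```

In real Spark code, the equivalent step is calling `cache()` (or `persist()`) on an RDD before the iterative loop, so each iteration reuses the in-memory partitions rather than recomputing or re-reading them.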
We, at Cloudera, have been ardent supporters of Spark for over two years. However, this article is not about the many virtues of Spark. Spark’s amazing features have been covered ad nauseam in blog posts, media articles, and conference presentations. In this article, let us instead look at how the Hadoop ecosystem will evolve with Spark as its core compute engine.
Future of Data Processing on Hadoop
What will big data processing on Hadoop look like over the next few years? Most batch data processing jobs will be written in Spark. Jobs that perform ETL processing, predictive model training, large-scale search indexing, and exploratory data analytics will all be written in Spark. Spark Streaming will become the de facto standard for writing jobs that process continuous streams of data in real time. However, despite its speed and flexibility, Spark will not be able to cover the entire range of workload types on Hadoop. For BI workloads, where low-latency SQL access with high concurrency is critical, MPP systems like Impala are necessary. Use cases that involve indexing for fast search and retrieval of data, particularly textual data, will also continue to make up a good chunk of big data workloads. For these workloads, a massively parallel distributed search framework like Apache Solr will remain essential.
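For context on the streaming side, Spark Streaming treats a continuous stream as a sequence of small batches and applies ordinary batch-style computation to each interval. The sketch below is a plain-Python stand-in for that micro-batch idea; the function names are illustrative, not the Spark Streaming API:

```python
from collections import Counter

# Plain-Python sketch of the micro-batch model Spark Streaming uses: a
# continuous stream is chopped into small batches, and the same batch-style
# computation (here, a running word count) runs on each interval.

def micro_batches(stream, batch_size):
    """Chop an incoming stream into fixed-size micro-batches."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

def streaming_word_count(stream, batch_size=3):
    running = Counter()        # state carried across batch intervals
    for batch in micro_batches(stream, batch_size):
        running.update(batch)  # same batch computation, applied per interval
    return dict(running)

events = ["error", "ok", "ok", "error", "ok", "warn"]
print(streaming_word_count(events))
# → {'error': 2, 'ok': 3, 'warn': 1}
```

In actual Spark Streaming, the batches arrive as DStreams on a fixed interval and stateful operations carry counts across intervals; the point here is only that each micro-batch is processed with the same operations a batch job would use.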
Is MapReduce dead then? Is it time to start authoring its eulogy? Not quite. MapReduce jobs that crunch through petabytes of data run daily at organisations across the world. Spark has not yet been validated at petabyte scale. It will invariably get there, but until then MapReduce will remain the tool of choice for reliably running petabyte-scale, extremely disk-I/O-intensive workloads.
More to Come…Stay Tuned
Spark has had a meteoric rise in popularity and adoption. It became a top-level Apache project in early 2014, and in less than a year and a half, it has established itself as the data processing engine of choice for Hadoop. Cloudera is proud to be one of the critical drivers of the success of Spark. We were the first large vendor to ship Spark, and in the past year and a half, we have led hundreds of customers to success with Spark. Together with our close partner Intel, we have contributed almost 800 patches to Spark. Some of our significant contribution areas are: Spark on YARN, dynamic resource allocation, Spark Streaming resiliency, Kerberos integration, as well as features and bug fixes for improved stability and debuggability. Not to mention better ecosystem integration via projects like Hive on Spark, SparkOnHBase, Crunch on Spark, and Pig on Spark.
Plenty of work has been done, but there is plenty more to do. Our engineering investments in Spark continue to grow. We will continue to drive improvements in areas like security, governance, performance at scale, ecosystem integration, stream processing, machine learning, and usability. Be on the lookout for more on these in a subsequent post.
What Makes a Comprehensive Big Data Platform?
Let’s look at the components that are fundamental to a comprehensive big-data platform:
- Data Processing Engines: covered in detail in the previous section.
- Data Storage: a storage layer that is reliable, scalable, and cost-effective.
- Data Catalog: with multiple ways of accessing and processing the data, it is essential to have a central catalog holding metadata about the organisation and layout of the data.
- Resource Management: a layer that enables multiple diverse processing engines and workloads to run on shared big data infrastructure.
- Streaming Data Channel: a reliable, low-latency, high-throughput channel for continuous data streams.
- Unified Administration: for easy management and troubleshooting.
- Comprehensive Security & Governance: for compliance-ready protection and visibility.
In certain cases, especially if you are a small organisation with narrow data processing needs, you can get away with a platform that provides only a small subset of the above components: for example, running batch Spark jobs on data in S3. However, any organisation that has massive volumes of data and wishes to maximize the value derived from it will need to invest in a comprehensive Hadoop-based big data platform like Cloudera Enterprise. Cloudera Enterprise provides components that satisfy all of the aforementioned requirements, and these components are tightly integrated with each other.
The best part of Hadoop is that it is modular, with different pluggable implementations for different components of the platform (as evidenced by the availability of different processing engines). It is this modularity that continues to enable constant innovation in the ecosystem, and lets it evolve as big data needs evolve.
Spark has recently received immense media coverage, rightfully touting it as the next-generation big data engine. This has led some to believe that it is a replacement for Hadoop. However, a data processing engine, no matter how powerful or flexible, provides only a subset of the functionality that a comprehensive, enterprise-grade big data platform needs to provide. For success with big data, enterprises will invariably run Spark as an integrated part of their overall Hadoop deployments.