Apache Spark in the Apache Hadoop Ecosystem

Categories: Open Source Software

At the recently concluded Spark Summit conference, Mike Olson spoke about the emergence of Apache Spark as the new standard for Hadoop data processing. As part of that, we announced an industry-wide collaboration with key organizations in the Hadoop community to migrate projects built on top of MapReduce to the Spark execution engine, and we spearheaded efforts to integrate Spark with common Hadoop architectural layers like YARN so that Spark can run alongside purpose-built engines like Impala and Solr.

Two months have passed since the announcement, so we thought the time was right to review the progress on this initiative.

Apache Crunch

The SparkPipeline implementation for Apache Crunch is complete and will be recommended for use starting with the 0.11 release of Apache Crunch, which will ship as part of CDH 5.2 in the upcoming months. With this change, Apache Crunch continues to provide a high-level interface for developing data processing applications on Hadoop, with the additional benefit of an easy migration path from MapReduce to Spark.

In fact, several members of our field team have already used this capability, working with customers to develop MapReduce applications using Crunch and then seamlessly migrating the same applications to Spark simply by switching the pipeline implementation. As such, even today, Crunch provides a compelling way to build data applications that are future-proof against changes in the execution stack.
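The pipeline switch described above amounts to changing a single line: the pipeline construction. A minimal word-count sketch, assuming the Crunch and Spark dependencies are on the classpath (the input/output paths and the "local" Spark master are illustrative, not a recommended production setup):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) {
    // MapReduce execution:
    // Pipeline pipeline = new MRPipeline(WordCount.class);

    // Spark execution -- the only line that changes:
    Pipeline pipeline = new SparkPipeline("local", "wordcount");

    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split lines into words; the rest of the pipeline is engine-agnostic.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Count occurrences of each word and write the results out.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```

Because the `Pipeline` interface abstracts the execution engine, everything below the constructor call runs unchanged on either MapReduce or Spark.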

Kite SDK

The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, and so on. In the recent Kite 0.16 release, Spark support was added as well, so Kite datasets are now accessible through Spark.

Apache Solr

Apache Solr, the most widely adopted, feature-rich, and mature open source search engine (built on Apache Lucene), is also the standard search framework for Hadoop.

The next addition to cross-workload search is indexing large and complex data sets, at scale, via Spark. With the addition of a Spark-based indexing tool, Solr users can expect many benefits including:

  • Fast and easy indexing and re-indexing – on-demand

  • Iterative prototyping of ingestion pipelines

  • Reduced latencies for ingestion as well as indexing jobs

  • Ability to run graph and other complex machine-learning processing as an integrated pre-processing step before serving searchable data

In an interesting twist, Solr has been able to take advantage of the Spark support work done in Crunch to expedite delivery of a Spark-based indexing solution! We are happy to report that the Cloudera Search team has made rapid progress, and we expect to deliver a solution in the upcoming two months.

Apache Pig

Our friends at Sigmoid Analytics have been making rapid progress toward huge performance wins for Pig by running it on the Spark engine. We are happy to report that they have successfully passed 100% of the end-to-end test cases for Pig. This is an exceptional milestone worth celebrating.

As this code moves to an upstream Pig branch, we expect rapid iteration in the community to fix outstanding issues and expose additional Spark functionality before the integration becomes available in an upstream Apache Pig release in the near future.

Apache Hive

Another important project moving to Spark is Apache Hive. Since the original proposal was floated upstream, there has been tremendous activity from many interested parties in moving this project forward.

With contributions from multiple organizations, including Cloudera, Databricks, IBM, Intel, and MapR, this is truly a community effort. While some of us may compete in the marketplace, we also recognize that this key ecosystem project is a huge advance for users. Not only will this collaborative effort improve Hive performance, but it will also drive standardization on Spark as the execution backend, making management and development easier and more productive.
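Because Hive's execution engine is a per-session configuration property, adopting the Spark backend is expected to be a one-line change rather than a query rewrite. A sketch of what that looks like once Hive on Spark ships (the `customers` table is a hypothetical example; the exact property behavior depends on the released version):

```
-- Run subsequent queries on the Spark engine instead of MapReduce
SET hive.execution.engine=spark;

-- Existing HiveQL runs unchanged on the new backend
SELECT country, COUNT(*) AS cnt
FROM customers
GROUP BY country;
```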

There has already been close engineering collaboration on this effort, with nearly 100 patches committed to Hive to date. Forward progress across the community is certainly a milestone worth celebrating.



The movement to Apache Spark, while young, has been very rapid, and we expect that this momentum will only grow. There is a lot to celebrate in the improved performance and other technology benefits that this effort brings.

For those of us who live in the Apache Hadoop open source community, we are particularly proud of the outstanding collaboration we have seen as multiple organizations have endorsed the vision Mike Olson outlined in his post last year and now work together to establish Spark as the successor to MapReduce.

For more information about Spark and its roadmap, register for the webinar, “The Future of Hadoop: A deeper look at Apache Spark” on Thursday, September 25th.
