Though Apache Spark was first created nearly three years ago, the past year has seen tremendous growth and adoption of the project. Spark has now become the most popular Apache Software Foundation project, with 50 percent more activity than the core Apache Hadoop project itself and over 750 contributors across hundreds of companies.
As part of this growth, Cloudera’s own team of Spark committers has been hard at work over the past year to drive the enterprise capabilities of Spark and better unite Spark and Hadoop. Let’s take a look at some of the key development milestones.
Bringing Internet of Things into Production
Whether it’s smart houses sending information about energy usage, smoke detectors, and alarm systems; connected cars that monitor driving behavior and vehicle health; or wearable devices that track heart rate, sleep patterns, and movement, the Internet of Things is everywhere, and more and more customers want to turn this data into insights and value for their business and end users.
To harness this data, customers needed an enterprise-grade stream processing engine on which to build their applications. Spark Streaming was the prime tool for the job; however, some development work was needed to ensure it could support production applications with no data loss. Cloudera led the development effort to make Spark Streaming resilient, so that service outages wouldn’t lead to data loss. In addition to this critical improvement, Cloudera added Apache Kafka to the platform for streaming data ingest and added Spark Streaming integrations with the Apache Flume data ingest framework, so customers can easily build complete real-time streaming applications for Internet of Things use cases.
The Standard Execution Engine for Hadoop
We have long predicted that Spark would emerge as the standard execution engine in Hadoop, succeeding MapReduce due to its ease of development, flexible API, and performance benefits. This past year marked critical progress in making that a reality with the beta release of Apache Hive-on-Spark. With its familiar query language, Hive is a popular tool for batch processing workloads such as ETL development, and its integration with the Spark processing engine is a significant milestone in supporting next-generation data integration workloads and the adoption of Spark.
The One Platform Initiative also lays out a clear roadmap to accelerate Spark’s development for the enterprise and unite it with the broader Hadoop ecosystem, so that Spark applications and users get the full benefits of their big data infrastructure. Impressive progress has already been made in all four of the key focus areas (management, security, scale, and streaming), and we will continue to focus heavily on this development in the coming year.
Extending Usability for Data Science
In the most recent platform release, Cloudera 5.5, we have also added support for Spark SQL, DataFrames, and MLlib. With Spark SQL and DataFrames, the capabilities of Spark are extended to a wider range of developers and data scientists by allowing SQL to be seamlessly embedded within Spark applications. To further drive usability within the platform, Cloudera has worked to ensure interoperability with the Hive metastore, so data and its schemas are available not just to Spark SQL when developing with Spark, but also to Hive for batch processing workloads and to Impala for BI and SQL analytics.
The addition of MLlib extends Spark’s ease of use and performance gains to machine learning within Hadoop – making it easy for customers to take advantage of the pre-built library of machine learning algorithms for more efficient model development and iterative processing.
To learn more about Spark in Cloudera 5.5, register for “Hadoop for the Data Scientist.”
2015 has been an impressive year for Spark – with a thriving community, wide industry adoption (Cloudera alone has over 170 customers running Spark with Hadoop), and many development milestones. For more information on what’s to come in 2016, watch “Uniting Spark and Hadoop.”