Enhanced Streaming and Machine Learning with Apache Spark 2.0


Apache Spark has become the engine of choice for high-scale distributed computation and has solidified itself as the de facto processing engine in the Apache Hadoop ecosystem. In fact, Curt Monash of DBMS2 recently wrote, “The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation.” But the Spark ecosystem is not done addressing the challenges that users face in working with streaming and complex data types while keeping data access and interaction simple. In 2015, Mike Olson introduced the One Platform Initiative, a project to unite development efforts and advance Spark’s role in the Hadoop ecosystem. The initiative was broken into five key focus areas: streaming, security, management, scale, and cloud. Understanding Spark’s importance in these five areas and developing a roadmap of features and supportability has been crucial to our customers. Cloudera is focused on strengthening Spark’s role by advancing projects in the open-source ecosystem while ensuring that Spark meets security and stability requirements in Cloudera Enterprise.

When we condense all the great things that Spark 2.0 promises, we can view them largely in terms of how they advance the possibilities of machine learning and streaming. Spark 2.0 is now fully supported by Cloudera.

The Emergence of Machine Learning

Machine learning and artificial intelligence (AI) are evolving conversations, and they promise an exciting new vector in the world of advanced analytics. In a recent MIT study on machine learning adoption, users were asked how they implement ML technology. Breaking down implementations by strategic goal shows how machine learning is furthering business objectives. The report stated that 76% of respondents used machine learning to target higher sales growth, 40% used it to improve sales and marketing performance, and 10% used it to increase product sales and reduce churn. Machine learning can help us become proactively smarter about our data and about how the models we build work at scale. Enterprises are using machine learning to serve their customers with higher relevance. Machine learning has also changed the way our data teams are structured. We now see a common interaction between data scientists and engineers in which machine learning inputs are gathered and predictions are either made right away or saved and evaluated later. Saving and loading models easily can ease the burden when data scientists want to iterate quickly. As an organization requires more data, it puts pressure on machine learning algorithms to scale beyond a single-node environment. It is easy to see proof of machine learning’s emergence in the number of supported libraries in the Apache Spark project alone.

The Importance of Streaming

An increased focus on real-time decision making has led to a surge of interest in leveraging streaming data. Many streaming use cases involve machine learning, and the ability to leverage streaming data is becoming more of a reality for users. This is also spawning new use cases in predictive medicine and operational efficiency, especially those leveraging the Internet of Things (IoT). A broad range of solutions is being developed to harness the streaming sources that are attractive to modern organizations. Tools in the Hadoop ecosystem, such as Spark Streaming, Apache Kafka, Apache Flume, Apache HBase, and Apache Kudu, now make it easy to build robust and reliable pipelines that collect, transport, process, and serve continuous data streams in real time to end-user applications. However, simply capturing data in motion isn’t enough. A core component of a streaming pipeline is the stream processing engine, which has to provide common abstractions and easy-to-use APIs for defining computations, deliver appropriate performance, and handle the required level of fault tolerance. Spark Streaming embodies these qualities and has established itself as the leading stream processing engine, with production deployments across hundreds of customers.

Spark 2.0’s Promise

The Spark community has recently reached an exciting milestone with the second major release of the Apache Spark software (2.0). Many practitioners will benefit from the advancements in Spark 2.0, and Cloudera aims to provide the new version in short order with the guarantees necessary to ensure production-grade durability. Cloudera announced support for Spark Streaming in 2014 and has since focused on adding enterprise-grade security as well as seamless integration with other components of the canonical Hadoop streaming architecture. Spark 2.0 provides an experimental release of the next iteration of Spark Streaming, called Structured Streaming. Structured Streaming is the first streaming API running on top of Spark SQL. Users can now program against streaming sources with the familiar DataFrame API while Spark automatically plans the query for incremental execution. This is aimed at easing the burden and unpredictability of streaming data, and it further aims to provide better API access to Spark ecosystem components.

Another welcome new feature is the ability to save and load ML models via MLlib persistence. MLlib has also added newer algorithms that enhance the machine learning experience in Spark. Lastly, a newer and more efficient serialization of vectors reduces overhead for MLlib algorithms, so users can expect better performance in certain use cases.

The new capabilities bring exciting opportunities for developers, data scientists, and engineers, but what does this mean for instances running in production today? That is where Cloudera is focusing over the coming months: ensuring that the new functionality addresses our requirements around ecosystem interoperability and enterprise hardening. In addition, while the new capabilities advance focus areas like streaming and scale, we must ensure that end-to-end security is addressed and that users can run Spark in their Cloudera environment on premises the same way they do in the public cloud.

Download Spark 2.0

