In September of last year we reaffirmed that Apache Spark will replace MapReduce as the general-purpose data processing engine of Apache Hadoop. At the same time, we announced the One Platform Initiative to sharpen Cloudera's and the community's focus on maturing the enterprise readiness of Spark as a cohesive component of the Apache Hadoop ecosystem. After all, at Cloudera we work with over 180 enterprise customers running Spark in production – often for large-scale, mission-critical workloads – which gives us insight into the key areas where Spark needs to grow to better support these use cases.
Since the announcement, Cloudera and the ever-expanding Spark community have been very busy. The community has released two new versions, Spark 1.5 and Spark 1.6, and is hard at work on the next major upgrade, Spark 2.0. Hive-on-Spark graduated from Cloudera Labs and went GA with the recent release of Cloudera 5.7. We are already seeing tremendous adoption across many workloads and inside more organizations. Our aim is to cement Apache Spark as a first-class citizen of the Hadoop ecosystem and to elevate its role in the Cloudera Enterprise platform as the primary execution engine.
One Platform Initiative: Notable Wins
The One Platform Initiative was built on four pillars, highlighting the critical areas where Spark needed to mature to fully support enterprise requirements: management, security, scale, and streaming.
Since September, we have made impressive progress along each of these pillars:
Security: With its 1.6 release, Spark takes a major leap forward in security and can now be used for workloads even in highly regulated environments. The key improvement was encryption of data over the wire, i.e., data transmitted over the network between the nodes of a Spark application. Intermediate shuffle files can be encrypted as well. Because of the “in-memory” label attached to Spark, people often forget that every shuffle operation entails writing data to disk, and that data also needs to be encrypted. We worked with our colleagues at Intel to deliver encryption of on-disk shuffle files using libraries that leverage the AES-NI instructions on Intel processors, which provide hardware acceleration for encryption workloads.
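As a rough illustration, a spark-defaults.conf enabling the 1.6-era wire encryption might look like the sketch below. Note that the exact property names are version- and distribution-dependent (later upstream releases superseded these with spark.network.crypto.enabled and spark.io.encryption.enabled), so treat this as a hedged example rather than a definitive configuration:

```properties
# spark-defaults.conf -- Spark 1.6-era settings; property names vary by version.
# Mutual authentication between Spark processes (a prerequisite for encryption):
spark.authenticate                       true
# SASL-based encryption of data in flight between nodes:
spark.authenticate.enableSaslEncryption  true
# (Shuffle-file encryption settings differ by distribution; consult your
# distribution's security documentation for the exact property names.)
```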
Streaming: Spark Streaming has established itself as the de facto tool for processing continuous streams of data. We identified in-memory state management as an area for further optimization, and the Spark community has delivered with flying colors: the new mapWithState() API can deliver up to 10x higher performance than the previous updateStateByKey() API for in-memory state management.
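The heart of both APIs is a user-supplied function that folds each micro-batch's new values for a key into that key's existing state (mapWithState() gains its speedup by invoking this only for keys that actually received data, rather than scanning all state every batch). Since mapWithState() lives in the Scala/Java API, here is a plain-Python sketch of the idea, with a hypothetical simulate_batches() driver standing in for a real StreamingContext:

```python
# Plain-Python sketch of stateful streaming word count. update_count() plays the
# role of the state-update function passed to mapWithState()/updateStateByKey();
# simulate_batches() is a hypothetical stand-in for the streaming runtime.

def update_count(new_values, state):
    """Fold this batch's values for one key into the key's running count."""
    return (state or 0) + sum(new_values)

def simulate_batches(batches):
    """Apply update_count per key across a sequence of micro-batches."""
    state = {}
    for batch in batches:
        # Group this batch's (key, value) pairs by key, as the runtime would.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Only keys present in this batch have their state touched.
        for key, values in grouped.items():
            state[key] = update_count(values, state.get(key))
    return state

# Two micro-batches of (word, 1) pairs, as a streaming word count would see them.
counts = simulate_batches([
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("spark", 1)],
])
# counts is now {"spark": 3, "hadoop": 1}
```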
Scale: A large proportion of Spark applications are written in PySpark, which is not surprising given the popularity of Python in the data science community and the popularity of Spark for data science. However, the performance of UDFs (lambda functions) in PySpark is far from optimal, due to architectural issues around JVM-Python interoperability and the high cost of moving data between the two processes. A shared in-memory columnar data format can enable far more efficient data interchange. In February, Cloudera, alongside the community, announced the Apache Arrow initiative to define and standardize a columnar in-memory data format specification, which will be leveraged to improve the scale and performance of PySpark.
Management: There are new capabilities that will surely bring a smile to administrators managing large multi-tenant Spark clusters. Spark applications can often be memory intensive, which makes it a challenge to set memory parameters appropriately and to understand how and where memory is being used. Spark 1.6 takes a significant step forward by removing the need to configure separate caching and execution memory segments, and by adding metrics to track per-operator memory usage. In addition, for data scientists who rely on a wide array of Python libraries, we partnered with Continuum Analytics to offer one-click deployment of Anaconda Python environments on CDH clusters via Cloudera Manager.
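Concretely, the two separate pre-1.6 memory fractions collapse into a single unified region governed by two new properties. A sketch, with the 1.6 default values shown, is below; in most cases the defaults can simply be left alone:

```properties
# Pre-1.6: two fixed, separately tuned segments (now deprecated):
#   spark.storage.memoryFraction   0.6    # caching
#   spark.shuffle.memoryFraction   0.2    # execution
# Spark 1.6 unified memory management: one region shared by both, with
# execution able to borrow unused storage memory and vice versa.
spark.memory.fraction          0.75   # share of heap for execution + storage
spark.memory.storageFraction   0.5    # portion of that region shielded from eviction
```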
A New Pillar of the Initiative: Cloud
We’re excited to announce the addition of a new pillar: Cloud. A growing number of customers are running Spark in public clouds, and there are many opportunities for enhancing Spark to better leverage the unique characteristics of these cloud environments. Some of our focus areas include:
Spark on Transient Clusters: Spark jobs are particularly well suited to the transient deployment model of the cloud: periodically scheduled jobs can run on clusters that are spun up when required and decommissioned when the job is done. Significant improvements can be made to enable easier debugging and analysis of Spark jobs on transient clusters, even after those clusters are terminated.
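One building block that exists today is shipping Spark event logs to durable object storage, so a long-lived history server can replay a job after its cluster is gone. A hedged sketch, with the bucket name as a placeholder:

```properties
# Persist event logs outside the (transient) cluster.
# The s3a:// bucket below is a hypothetical placeholder.
spark.eventLog.enabled           true
spark.eventLog.dir               s3a://my-logs-bucket/spark-events
# A history server running elsewhere points at the same durable location:
spark.history.fs.logDirectory    s3a://my-logs-bucket/spark-events
```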
Elasticity: One advantage of running in the public cloud is the ability to provision new resources on demand. In such a world, isn't it wasteful to pre-configure a cluster for each Spark application, with a fixed number of executor nodes and a fixed amount of memory? We aim to allow Spark jobs to seamlessly grow and shrink the cluster according to the resource demands of the job.
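Within a fixed cluster, Spark's existing dynamic allocation feature already grows and shrinks an application's executor count with demand; cloud elasticity extends that idea to the nodes themselves. For reference, a sketch of the current knobs (the min/max values are illustrative):

```properties
spark.dynamicAllocation.enabled              true
# External shuffle service keeps shuffle data available after executors are removed:
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1      # illustrative value
spark.dynamicAllocation.maxExecutors         50     # illustrative value
spark.dynamicAllocation.executorIdleTimeout  60s    # release idle executors
```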
Spot Instances for Cost-Effectiveness: The price (pardon the pun) you pay for using spot instances is that they can be revoked when demand goes up. This is not a problem for Spark executor nodes. However, if the driver node is running on a spot instance that suddenly gets yanked away, a great deal of data may have to be redundantly recomputed, increasing job runtime and thus the time and money spent. Spark and the underlying resource management layer therefore need to be optimized to leverage spot instances more effectively.
Innovation at the Edge Complements Innovation at the Core
In addition to leading and innovating in core Spark, Cloudera is also focused on building components that are critical enablers of the architectures and use cases being implemented with Spark. To this end, we are excited to be working on an open source, Apache-licensed REST web service called Livy for managing long-running Spark contexts and submitting Spark jobs. Livy makes Spark development easier by providing a secure, multi-tenant, fault-tolerant way of communicating with Spark contexts over REST. It effectively gives end-user applications an easy interface to Spark, enabling new use cases and architectures. Livy has a budding developer community that includes employees from several companies, including Cloudera, Microsoft, and Intel. More on that in the near future.
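To give a feel for the model: Livy exposes sessions (long-running Spark contexts) and statements (snippets run inside a session) as REST resources. The sketch below, using only the Python standard library, builds the request payloads and a small submit helper; the server URL is an assumed default, and nothing is sent over the network until submit() is called against a live Livy server:

```python
# Minimal sketch of driving Livy over REST with the standard library only.
# LIVY_URL is an assumed default; no request is sent at import time.
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumed Livy host:port

def session_payload(kind="pyspark"):
    """Body for POST /sessions -- asks Livy to start a Spark context."""
    return {"kind": kind}

def statement_payload(code):
    """Body for POST /sessions/{id}/statements -- code to run in that context."""
    return {"code": code}

def submit(path, payload):
    """POST a JSON payload to the Livy server and return the decoded response."""
    req = urllib.request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Against a running server, usage would look like:
#   session = submit("/sessions", session_payload())
#   submit("/sessions/%d/statements" % session["id"], statement_payload("1 + 1"))
```

Because the Spark context lives behind the service, many clients can share it, and a client crash does not take the context down with it.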
By delivering on all five pillars and maintaining a healthy cadence of development, we believe Spark will continue to advance the capabilities of the Hadoop ecosystem as it becomes a first-class citizen in the application stack and the dominant data processing engine in Hadoop. Expect to hear more from us about One Platform and further Spark integrations in the near future. To learn more about Apache Spark or to enroll in Spark training, visit our website.