New Capabilities for Apache Spark Users



In September 2015, Cloudera launched the One Platform Initiative to make Apache Spark the default engine for Cloudera’s modern data platform. At the time, we had about 150 customers using Spark, many of them for simple ETL and data processing. As originally announced by Chief Strategy Officer Mike Olson, the program encompassed four areas: security, scale, streaming, and management.

Last May, we reported substantial progress in all four areas. Some of the enhancements we reported include:

  • Improved memory management and configuration
  • Improved Python integration
  • Support for over-the-wire encryption
  • Integration with Intel AES
  • Stress-testing at scale with mixed multi-tenant workloads
  • Improved streaming state management

We also announced an expansion of the project to support the functionality needed to make Spark useful in the public cloud. These enhancements included such things as support for transient clusters, seamless elasticity, and support for spot instances. We also noted that Hive-on-Spark graduated from Cloudera Labs to General Availability.

Today, we support more than 550 customers on Spark, a 270% increase in adoption in less than two years. These customers include British Telecom (BT), an early adopter of Spark. BT has used Spark extensively to transform data processing operations.

Mission accomplished; Spark is now the default processing engine for Cloudera customers. Nevertheless, work continues; in this post, we announce four new capabilities that help enterprises deliver value with Spark.

Data Lineage. In CDH 5.11, Cloudera Navigator tracks lineage for data produced from Spark applications. Lineage is critical for data discovery, continuous optimization, data quality, auditing, and reproducibility. Regulated industries and use cases require data lineage capabilities for compliance and transparency.

Cloudera Navigator enables users to explore and tag data from all sources, including Spark, with an intuitive search-based interface. By consolidating metadata, and supporting custom tags and comments, Cloudera Navigator makes it easy to track, classify, and locate Spark data.

You can learn more about Cloudera Navigator here.

Support for Azure Data Lake Store. With CDH 5.11, Cloudera users can write Spark applications that run against data in Azure Data Lake Store (ADLS), Microsoft’s new secure and massively scalable data store. This enhancement confirms Cloudera’s commitment to multi-cloud support, cloud portability, and hybrid cloud. Microsoft’s Paige Liu details the benefits of running Spark in Cloudera on ADLS.

Workload Analytics. Cloudera recently announced Altus, a fully managed Platform-as-a-Service cloud offering that supports Spark clusters for data engineering workloads. Users do not need to manage or operate the cluster. Altus’ embedded workload troubleshooting capability identifies and diagnoses issues with Spark jobs that fail or underperform. It provides streamlined access to logs, metrics and configuration details that persist after transient clusters shut down. With Altus, customers can optimize the cost of running large Spark workloads in the cloud.

You can learn more about Cloudera Altus here.

Secure Apache Kafka Integration. In Release 2.1 of Cloudera’s distribution of Apache Spark, Spark Streaming jobs can now read from a Kafka cluster secured with Kerberos. This feature is critical for use cases where security and privacy are a primary concern.
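
For readers wondering what this looks like on the client side, here is a minimal sketch of the consumer properties a job would typically pass along when reading from a Kerberos-secured Kafka cluster. The property keys are standard Kafka client settings; the helper name secure_kafka_params, the broker hostnames, and the group id are hypothetical and chosen only for illustration. This is not Cloudera's documented procedure, and a real deployment also needs a JAAS configuration and a keytab or ticket cache, which are not shown here.

```python
def secure_kafka_params(brokers, group_id, use_tls=False, service_name="kafka"):
    """Build Kafka consumer properties for a Kerberos-secured cluster.

    Hypothetical helper for illustration; the keys are standard Kafka
    client properties, the values below are placeholders.
    """
    if not brokers:
        raise ValueError("at least one broker is required")
    return {
        # Comma-separated host:port list of Kafka brokers
        "bootstrap.servers": ",".join(brokers),
        # Consumer group used by the streaming job
        "group.id": group_id,
        # SASL carries the Kerberos (GSSAPI) authentication;
        # SASL_SSL additionally encrypts traffic with TLS
        "security.protocol": "SASL_SSL" if use_tls else "SASL_PLAINTEXT",
        # Must match the Kerberos service name the brokers run under
        "sasl.kerberos.service.name": service_name,
    }


# Placeholder brokers and group id, assumed for this example only
params = secure_kafka_params(
    ["broker1.example.com:9092", "broker2.example.com:9092"],
    "spark-streaming-demo",
)
for key in sorted(params):
    print(key, "=", params[key])
```

A dictionary like this would be handed to whichever Kafka consumer API the streaming job uses. Keeping it in one small function makes it easy to switch between SASL_PLAINTEXT and SASL_SSL per environment.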

You can learn more about Cloudera’s support for Apache Spark here.

If you plan to attend Spark Summit 2017 at San Francisco’s Moscone Center June 5th-7th, don’t miss these sessions:

  • Jennifer Wu explains how to run data engineering workloads in the cloud.
  • Mark Grover delivers a presentation about common mistakes when writing streaming applications.
  • Jordan Volz reviews the rising role of Apache Spark in compliance use cases.

Come and visit us at booth #401.

