The Next Generation of Analytics


Last year, Mike Olson took the stage at Strata + Hadoop World and announced that “Hadoop will disappear.” Not literally, of course. Hadoop is emerging as the platform powering the next generation of analytics. But to deliver on its promise as an enterprise data hub – one platform for unlimited data, creating value in a variety of ways – it has to enable business applications. Hadoop cannot remain an exclusive, specialized skill. Like the relational database management systems that power most of the online world today, Hadoop must recede into the background. It must evolve.

Over the past decade, the Apache Hadoop community has worked at a furious pace to realize this vision. We have seen tremendous progress as Hadoop has transformed from a monolithic storage and batch architecture into a modern, modular data platform. Three years ago, Hadoop became interactive for data discovery through analytic SQL engines like Impala. Two years ago, Cloudera was the first to adopt and support Apache Spark within the Hadoop ecosystem as the next-generation data processing layer for a variety of batch and streaming workloads, delivering ease of use and increased performance for developers.

But there is still more to do.

This week at Strata + Hadoop World, we are pleased to announce three new open source investments to directly address some of the most fundamental challenges our customers face: Improving Spark for the enterprise, making security universal across Hadoop, and developing a fundamentally new approach to Hadoop storage for modern analytic applications.

Better Data Engineering: Spark and the One Platform Initiative

Before we can even begin to discuss analytics, we need to address data engineering, the foundational role of the next generation of analytics. Data engineers are generally responsible for designing and building the data infrastructure, in collaboration with the data science team. Spark’s meteoric rise in popularity owes much to the ease of use, flexibility, and performance that are critical for good data engineering. Of course, in addition to data processing, applications also need ways to ingest, store, and serve data, and enterprise teams need tools for operations, security, and data governance. This is why Spark is such a natural complement to the comprehensive Hadoop ecosystem.

Over the last 18 months, more than 150 Cloudera customers have deployed Spark workloads on Hadoop in production, across industries and for multiple use cases. We have seen firsthand where Spark succeeds and where it still needs work. This enterprise experience, coupled with our deep bench of Spark committers – more than all other Hadoop vendors together – and broad participation in the Hadoop community, uniquely positions Cloudera to drive Spark and Hadoop forward, together.

To formalize this commitment, Cloudera recently launched the One Platform Initiative to accelerate Spark development for the enterprise, and to better integrate it with the Hadoop ecosystem. Our focus will be on the areas where we’ve seen the greatest customer need: management, security, scale, and stream processing. And one of the most critical places to start is security.

Comprehensive Security: Unified Access Control Policy Enforcement

The ability to access unlimited data in a variety of ways is one of Hadoop’s defining characteristics. By moving beyond MapReduce, users with more diverse skills can gain value from data. Complex application architectures that required many separate systems for data preparation, staging, discovery, modeling, and operational deployment can be consolidated into a single end-to-end workflow on a common platform. Of course, this flexibility must be balanced with security requirements. To ensure that sensitive data cannot fall into the wrong hands, a comprehensive security approach must ensure that every access path to data respects policy in the same way, down to the most granular level.

However, the reality today is that each access engine handles security differently. For example, Impala and Apache Hive offer row- and column-based access controls, with shared policy definitions through Apache Sentry. On the other hand, Spark and MapReduce support only file- or table-level controls. This fragmentation forces a reliance on the lowest common denominator (coarse-grained permissions), resulting in several bad outcomes: limitations on data or access, security silos, or, worse, inconsistent policy due to human error in policy replication. Ultimately, the issue constrains the types of applications you can build.

To address this need, Cloudera is excited to announce RecordService, the first unified role-based policy enforcement system for the Hadoop ecosystem. Coupled with Apache Sentry, the existing open standard for policy management, RecordService brings database-style row and column level controls to every Hadoop access engine, even non-relational technologies like Spark and MapReduce. It works with multiple storage technologies, from HDFS to Amazon S3, so your security team doesn’t have to worry about differences in data representation. By providing a common API for policy-compliant data access, it helps you integrate third party products into your Hadoop cluster with trust. It even provides dynamic data masking for the first time in Hadoop, everywhere.
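
To make the idea of a single enforcement point concrete, here is a minimal Python sketch. It is not the RecordService API: the `Policy` class, the `read_records` function, the role names, and the sample rows are all hypothetical and exist only for illustration. The sketch assumes one shared function applies a row filter, a column allowlist, and masking, so every caller receives the same policy-compliant view regardless of which engine it uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical per-role policy (illustration only, not a real API)."""
    columns: List[str]                                # columns the role may see
    row_filter: Callable[[Record], bool]              # rows the role may see
    masked: Dict[str, Callable[[object], object]]     # column -> masking function


def read_records(rows: Iterable[Record], policy: Policy) -> Iterator[Record]:
    """Yield only permitted rows and columns, masking values where required."""
    for row in rows:
        if not policy.row_filter(row):
            continue  # row-level control: skip rows the role may not read
        out: Record = {}
        for col in policy.columns:  # column-level control: allowlist only
            value = row[col]
            if col in policy.masked:
                value = policy.masked[col](value)  # dynamic masking
            out[col] = value
        yield out


def mask_ssn(value: object) -> str:
    """Keep only the last four characters of the identifier."""
    return "***-**-" + str(value)[-4:]


# Made-up sample data and roles.
ROWS = [
    {"name": "Ana", "region": "EU", "ssn": "123-45-6789"},
    {"name": "Bo", "region": "US", "ssn": "987-65-4321"},
]

POLICIES = {
    "eu_analyst": Policy(
        columns=["name", "region", "ssn"],
        row_filter=lambda r: r["region"] == "EU",
        masked={"ssn": mask_ssn},
    ),
    "marketing": Policy(
        columns=["name", "region"],
        row_filter=lambda r: True,
        masked={},
    ),
}

if __name__ == "__main__":
    for role, policy in POLICIES.items():
        print(role, list(read_records(ROWS, policy)))
```

Because every reader goes through the same function, a policy is defined once and cannot drift between engines, which is the property the text above describes.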

RecordService is here to help more users gain value from data, securely, using their tools of choice. Next, we need to focus on a more fundamental problem: How we store data for the next generation of analytics.

Fast Analytics on Fast Data

The next generation of applications built on Hadoop is becoming more real-time by collapsing the distance between data collection, insight, and action. In the best case, analytical models are embedded right in the operational application, directly influencing business outcomes as users interact with them. Or consider a simpler case: an operational dashboard, which requires the ability to integrate data and immediately analyze it.

It turns out that this is pretty hard in Hadoop today, and it’s because of storage constraints concerning updates. Users face an early choice: Do I pick HDFS, which offers high-throughput reads — great for analytics — but no ability to update files, or Apache HBase, which offers low-latency updates — great for operational applications — but very poor analytics performance? Often the result is a complex hybrid of the two, with HBase for updates and periodic syncs to HDFS for analytics. This is painful for a few reasons:

  • You need to maintain data pipelines to move data and ensure synchronization between storage systems.
  • You are storing the same data multiple times, which increases TCO.
  • There is latency between when data arrives and when you can analyze it.
  • Once data is written to HDFS, if you need to correct it for any reason, you’ll need to rewrite it (remember, no updates).
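
The staleness and duplication described in the list above can be illustrated with a toy Python model. Everything here is hypothetical: `HybridStore` stands in for an update store plus a periodically synced analytic copy, `SingleStore` stands in for one mutable store that serves both workloads, and no real HBase, HDFS, or Kudu API is used.

```python
class HybridStore:
    """Toy model: a mutable operational store plus a synced analytic copy."""

    def __init__(self):
        self.operational = {}  # stands in for the low-latency update store
        self.snapshot = {}     # stands in for the write-once analytic copy

    def upsert(self, key, amount):
        # Updates land in the operational store only.
        self.operational[key] = amount

    def sync(self):
        # The analytic copy cannot be updated in place, so it is rewritten.
        self.snapshot = dict(self.operational)

    def analytic_total(self):
        # Analytics see only what the last sync copied over.
        return sum(self.snapshot.values())


class SingleStore:
    """Toy model: one mutable store serving both updates and analytics."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, amount):
        self.rows[key] = amount

    def analytic_total(self):
        # Analytics read the same data that updates write.
        return sum(self.rows.values())


if __name__ == "__main__":
    hybrid, single = HybridStore(), SingleStore()
    for store in (hybrid, single):
        store.upsert("order-1", 100)
    hybrid.sync()
    for store in (hybrid, single):
        store.upsert("order-1", 120)  # a correction to existing data
        store.upsert("order-2", 50)   # newly arrived data
    print("hybrid before sync:", hybrid.analytic_total())
    print("single store:", single.analytic_total())
    hybrid.sync()
    print("hybrid after sync:", hybrid.analytic_total())
```

In the hybrid model the analytic result lags until the next sync and the data is held twice; in the single-store model neither problem arises.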

Over the past three years, Cloudera has been hard at work solving this problem. The result is Kudu, a new mutable columnar storage engine for Hadoop that offers the powerful combination of low-latency random updates and high-throughput analytics. This enables real-time analytic applications on a single storage layer, eliminating the need for complex architectures. Designed in collaboration with Intel, Kudu is architected to take advantage of future processor and memory technologies. For the first time, Hadoop can deliver fast analytics on fast data.

Looking Ahead

Hadoop has come a long way in its first 10 years. As Matt Aslett of 451 Research recently summarized, “Hadoop has evolved from a batch processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem that includes in-memory processing and high-performance SQL analytics. Naturally Hadoop’s storage options are also evolving and with Kudu, Cloudera is providing a Hadoop-native in-memory store designed specifically to support real-time use-cases and complement existing HBase and HDFS-based workloads.”

And this is just the beginning. With Spark as the new data processing foundation, a new unified security layer, and a new storage engine for simplified real-time analytic applications, Hadoop is ready for its next phase: Powering the next generation of analytics.

We’re excited to work with our customers and the community to continue advancing what’s possible.

