Analytic database systems are at the center of an evolving conversation about delivering high-performance analytics over vast amounts of varied data: not just structured and unstructured data, but streams and time-relevant data that stress the limits of conventional data platforms. In addition, modern hardware profiles are equipping organizations to meet the demands of real-time and interactive data. With this shift, users are no longer managing against a scarcity of resources; they are empowered to build new functionality that helps their companies apply data analytics across more areas of the business. It also means that instead of deploying purpose-built analytic databases to meet the defined demands of today's data, we are capturing, integrating, and incorporating more data points from streams, external sources, and internal core systems.
Apache Hadoop has already delivered on the promise of limitless analytics by providing a distributed framework for collecting massive amounts of data that scale beyond the scope of most analytic database environments. At the same time, by bringing compute resources closer to the data, Hadoop has increased analytic performance by removing many common resource bottlenecks. Hadoop users also now have a choice of processing frameworks and file systems to meet the discrete demands of their use case without needing to employ multiple technology solutions.
Common real-world examples of how Hadoop has changed our ability to deliver analytic value include helping retailers provide real-time offers through recommendation engines and enabling rapid location-based targeting from mobile sources. This is reshaping how marketers target their customers and shape future product development. Increasingly, time-series and high-velocity data is being leveraged to provide point-in-time relevance in scenarios that require actionable performance. Children's Hospital of Atlanta leverages time-series data along with Impala to better understand data from bedside monitors. By feeding this time-series data to an analytic environment, they were able to achieve near real-time event monitoring during surgery and recovery. The project began as merely an underfunded experiment; by demonstrating the value and capabilities of an advanced analytic environment, the team was able to expand its scope and help modernize the organization. While not all use cases are as altruistic as Children's Hospital of Atlanta's, the ability to act on data in real time is becoming increasingly powerful.
With the realization of these new capabilities comes the desire to leverage even more types of data. In the past two years we have seen a huge breadth of innovation around solutions for streaming and online data formats, as well as an increase in analytic tooling that is opening up developer access to these new data types. Some data types have remained difficult, however, including complex, rapidly changing, "mutable" data. Here are some examples of the requirements that mutable data places on analytic systems:
1. Time-series data, where users need insert, update, scan, and lookup capabilities to address use cases such as real-time streaming.
2. Stock market data, where users need to run analytics on a full data set while new information streams in updates in real time.
3. Fraud detection, where systems need to analyze data immediately to actively detect fraudulent activity.
4. Operational data scenarios, where users need to store logs for easy lookup and have reliable information for analytic model building.
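The four access patterns above can be sketched in miniature. The toy in-memory store below is purely illustrative (the class and its methods are invented for this example, not any real system's API); it simply shows why mutable data needs both fast point operations and ordered range scans, which a real distributed store like Kudu must provide at cluster scale:

```python
from bisect import insort, bisect_left, bisect_right

class TimeSeriesStore:
    """Toy store illustrating the four access patterns mutable data
    demands: insert, update, scan, and lookup. Illustrative only."""

    def __init__(self):
        self._rows = {}   # timestamp -> row dict (fast lookup/update)
        self._keys = []   # sorted timestamps (fast range scans)

    def insert(self, ts, value):
        if ts not in self._rows:
            insort(self._keys, ts)          # keep scan order
        self._rows[ts] = {"ts": ts, "value": value}

    def update(self, ts, value):
        self._rows[ts]["value"] = value     # mutate in place

    def lookup(self, ts):
        return self._rows.get(ts)

    def scan(self, start, end):
        # Range scan in timestamp order, as an analytic query would.
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [self._rows[k] for k in self._keys[lo:hi]]

store = TimeSeriesStore()
store.insert(1, 98.6)
store.insert(2, 99.1)
store.update(2, 99.4)                       # a late correction streams in
print(store.lookup(2)["value"])             # 99.4
print([r["ts"] for r in store.scan(1, 2)])  # [1, 2]
```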
Modern solutions do exist for mutable data types, but they may require yet another technology deployment or result in redundant storage. Existing solutions have also been plagued by common drawbacks, including poor analytic performance, complex application design, and inconsistent security/policy enforcement across multiple access engines. Mutable data inside Hadoop has often been handled by data stores like HBase, but frequently at the sacrifice of analytic performance, forcing developers to combine HDFS and HBase to strike a balance.
Enter Kudu. Kudu is a new updatable columnar store for Hadoop, designed for fast analytic performance. It simplifies the architecture for building analytic applications on changing data, complementing the capabilities of HDFS and HBase: a simpler architecture with superior performance, all in a single data store, to support increasingly common real-time use cases. Kudu's entrance greatly enhances the performance of Hadoop components like Impala and helps continue to drive Impala's performance leadership in the ecosystem.
Another benefit of Kudu is that developers no longer have to make a design choice between the scanning analytic capabilities of HDFS and the insert and update capabilities of Apache HBase. This eliminates the need to explore tiering solutions that complicate Hadoop's unified design.
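The trade-off Kudu bridges can be seen in a simplified sketch comparing a row-oriented layout (cheap per-record updates, the HBase strength) with a column-oriented layout (fast full-column scans, the strength of HDFS-resident columnar files). This is an illustration of the storage-layout principle only, not how either system is implemented:

```python
# Row layout: each record's fields live together, so mutating one
# field of one record is a single cheap step -- good for updates.
rows = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 12.5, "qty": 1},
]
rows[1]["price"] = 13.0  # point update touches one record

# Column layout: each column is stored contiguously, so an aggregate
# reads only the columns it needs -- good for analytic scans.
columns = {
    "id":    [1, 2],
    "price": [10.0, 13.0],
    "qty":   [3, 1],
}
total = sum(p * q for p, q in zip(columns["price"], columns["qty"]))
print(total)  # 43.0
```

An updatable columnar store aims to offer both behaviors in one place, which is why it removes the need to tier HDFS and HBase against each other.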
As the storage ecosystem for Hadoop expands, we need to continue to consider the implications for security and data access. How do we ensure we continue to evolve to meet the demands of new data types without compromising the enterprise requirements of current production systems?
RecordService is a new role-based policy enforcement engine for Apache Hadoop that complements services like Cloudera Navigator and Apache Sentry. A common side effect of opening data access to a wider variety of business and technical users is the fragmentation of policy enforcement across access engines. To enable true widespread data discovery, users need to analyze data in many ways, from SQL to Search. Each engine lets users leverage data in its own way, but each has its own level of policy granularity, such as column/row level versus file/table level. This makes developing a unified security approach across a data hub architecture a daunting task, since each access engine is held to a different standard. Previous workarounds that addressed varying degrees of policy enforcement required making multiple copies of data with sensitive attributes removed, which was the equivalent of playing Mad Libs with your analysis: contextual information had to be pieced together to retrieve the right correlations. As more companies use big data systems to handle highly sensitive data, the problem simply can no longer be ignored.
RecordService aims to solve this problem and enable an enterprise data hub to be a single platform for a wide variety of analysis without compromising the security requirements needed to keep your data safe. By providing the controls that allow us to integrate sensitive data sources, we create a better, full-fidelity view of the data. With RecordService as a centralized policy enforcement solution, we can continue to add new features to Hadoop against a common standard of policy management.
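To make the idea of centralized enforcement concrete, here is a hypothetical sketch (the policy table, roles, and function names are invented for illustration; this is not the RecordService API). The point is that every access path reads through one chokepoint, so row- and column-level policy is applied identically whether the request comes from SQL, Search, or any other engine:

```python
# Hypothetical policy table: role -> (visible columns, row predicate).
POLICIES = {
    "analyst": ({"region", "sales"}, lambda row: True),
    "intern":  ({"region"},          lambda row: row["region"] != "EU"),
}

RECORDS = [
    {"region": "US", "sales": 100, "ssn": "123-45-6789"},
    {"region": "EU", "sales": 250, "ssn": "987-65-4321"},
]

def read_records(role, records=RECORDS):
    """Single enforcement chokepoint: drop disallowed rows, then
    project away columns the role may not see."""
    cols, keep = POLICIES[role]
    return [{k: v for k, v in r.items() if k in cols}
            for r in records if keep(r)]

print(read_records("analyst"))  # ssn column is never returned
print(read_records("intern"))   # EU rows filtered, region column only
```

Because enforcement lives in one place, there is no need to maintain redacted copies of the data per engine, which is exactly the duplication problem described above.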
So what’s next?
Kudu and RecordService are excellent examples of how Apache Hadoop is advancing the state of modern analytic databases. We will continue this conversation in an upcoming webinar about the roadmap for Impala, where we will hear from the father of data warehousing, Ralph Kimball, as he discusses the evolution of the data warehouse.