Big data has come a long way, with adoption accelerating as CIOs recognize the business value of extracting insights from the troves of data collected by their companies and business partners. But, as is often the case with innovations, mainstream adoption of big data has exposed a new challenge: how to ingest data continuously from any source and with high quality. Indeed, we have found that changes in the environment in which data is produced and consumed make it next to impossible to scale ingestion using current approaches, and this has serious implications for scaling big data projects.
In this post I will describe the new data reality that creates this challenge, focusing on a problem that we call data drift. I will cover the serious business implications of failing to deal with data drift and then explain how StreamSets Data Collector used in conjunction with Cloudera solutions handles data drift in order to provide an efficient continual ingest mechanism that delivers high-quality data in a timely fashion. As Cloudera helps customers integrate more data by making Hadoop fast, simple and secure, StreamSets complements this vision with a simple and flexible ingest infrastructure that lets enterprises embrace an ever-widening variety of data sources and big data components.
The Opportunity and Challenge of a Decoupled and Decentralized Big Data Architecture
Two defining aspects of big data are that it is often semi-structured or multi-structured, and that the source structure is decoupled from the consuming application. These create powerful leverage points: the same data can be used to power numerous analytics and processing systems, creating a new class of data systems operating on an open substrate of storage with consuming systems isolated from the idiosyncrasies of the producing systems. If handled correctly, the exploding set of new data sources can power workloads that are beyond the realm of relational systems. Enterprises get to use off-the-shelf tooling and low-level frameworks to capture and move this data to where it is consumed, ideally creating a mechanism for continuous data ingestion that keeps consuming systems replenished with the freshest information.
Yet, the path to this ideal state of continuous ingestion has many landmines. We constantly see enterprises struggling to onboard information into their big data platforms for analysis and consumption. What causes these challenges? The answer is simple, yet profound: in the new era of big data, the basic characteristics of data generation and consumption have dramatically changed. We call this data drift.
What is Data Drift?
At StreamSets, we define data drift as the unpredictable and continuous mutation of data characteristics caused by operations, maintenance and modernization of systems producing the data.
Data drift is an inevitable by-product of the decoupled and decentralized nature of the modern data infrastructure. It follows from the fact that most data-producing applications operate independently, going through their own private lifecycle of changes and releases. As these systems change, so do the data feeds and streams they produce. This constant flux creates havoc when trying to create reliable continuous ingest operations.
There are three types of drift we see in modern data systems:
- Structural drift: Also known as schema evolution, these changes could be additions to the data attributes, changes to the structure of existing attributes to accommodate new requirements, or more invasive changes such as dropping of existing attributes or incompatible changes in the representation of existing attributes.
- Semantic drift: This manifests itself when the meaning attributed to the data changes, so that the interpretations previously relied on by consuming applications no longer apply.
- Infrastructure drift: This relates to changes in the underlying producing, consuming or operating systems. This problem grows as data processing architectures move from monolithic traditional stacks to the more fragmented world of open source Big Data, and as control over these systems decentralizes.
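To make structural drift concrete, here is a minimal sketch of how a consumer might compare an incoming record against the schema it expects. The schema, the field names and the `detect_structural_drift` helper are all hypothetical and exist only for illustration; this is not how StreamSets Data Collector is implemented.

```python
from typing import Any

# Hypothetical schema the consumer was built against: field name -> type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "amount": float}


def detect_structural_drift(
    record: dict[str, Any], expected: dict[str, type]
) -> dict[str, list[str]]:
    """Report fields that were added, dropped, or changed type."""
    added = sorted(k for k in record if k not in expected)
    dropped = sorted(k for k in expected if k not in record)
    retyped = sorted(
        k for k in expected
        if k in record and not isinstance(record[k], expected[k])
    )
    return {"added": added, "dropped": dropped, "retyped": retyped}


# A record from a producer that added "currency", dropped "event",
# and started sending "amount" as a string instead of a float.
drifted = {"user_id": 42, "amount": "19.99", "currency": "USD"}
report = detect_structural_drift(drifted, EXPECTED_SCHEMA)
print(report)
# {'added': ['currency'], 'dropped': ['event'], 'retyped': ['amount']}
```

A check like this only catches structural drift. Semantic drift, such as "amount" switching from dollars to cents, leaves the structure intact and needs value-level checks instead.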
Data Drift Kills Quality and Productivity
Unfortunately, if your tools, frameworks or custom-built infrastructure designed to move data do not take data drift into account, they can fail and become a bottleneck to data operations. In many cases such failures may be silent, leading to undetected data corrosion or data loss that pollutes downstream analysis and findings. When these failures are discovered, they expose the costs and risks of data drift, which can lead to foot-dragging when it comes to incorporating new sources or consuming applications. Such risk aversion may be valid in a traditional transactional system, but it could not be further from the spirit of modern big data systems, which operate by orchestrating data flow across disparate, disjointed, and decentralized systems. The opportunity cost of forgone business insights is huge.
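As an illustration of how such a failure can stay silent, consider a small, hypothetical delimited feed whose producer inserts a new column. The feed contents and both helper functions below are assumptions made up for this sketch. A parser that reads columns by position keeps running and returns a wrong total, while one that reads by name either keeps working or fails loudly.

```python
import csv
import io

# Hypothetical feed before and after the producer inserted a "region" column.
OLD_FEED = "user_id,amount\n42,19.99\n"
NEW_FEED = "user_id,region,amount\n42,7,19.99\n"


def total_by_position(feed: str) -> float:
    """Brittle: assumes the amount is always the second column."""
    rows = list(csv.reader(io.StringIO(feed)))[1:]  # skip the header row
    return sum(float(row[1]) for row in rows)


def total_by_name(feed: str) -> float:
    """Sturdier: looks the column up by name and fails loudly if it is absent."""
    reader = csv.DictReader(io.StringIO(feed))
    if "amount" not in (reader.fieldnames or []):
        raise ValueError("expected column 'amount' is missing from the feed")
    return sum(float(row["amount"]) for row in reader)


print(total_by_position(OLD_FEED))  # 19.99
print(total_by_position(NEW_FEED))  # 7.0, silently wrong: it summed "region"
print(total_by_name(NEW_FEED))      # 19.99
```

The positional version never raises an error, which is exactly what makes this kind of drift hard to spot downstream.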
The brittleness, failures, and lack of control described above eventually slow down the progress of big data projects, increase their costs and impact their return on investment.
So how do we address data drift? Next week I will discuss the StreamSets Data Collector, which we built as an answer to the problems that data drift causes for Big Data ingest.
Arvind Prabhakar, Founder and CTO, StreamSets Inc.
Arvind Prabhakar is a seasoned engineering leader who has worked on data integration challenges for over ten years. Before co-founding StreamSets, Arvind was an early employee of Cloudera, where he led teams working on integration technologies such as Flume and Sqoop. A member of the Apache Software Foundation, Arvind is heavily involved in the open-source community as the PMC Chair for Apache Flume, the first PMC Chair of Apache Sqoop, and a member of various other Apache projects.
Prior to Cloudera, Arvind was a software architect at Informatica, where he was responsible for architecting, designing, and implementing several core systems.