Previously, we announced that leaders in the data governance space have joined Cloudera to provide a unified foundation for open metadata and end-to-end visibility for governance. Today, we are happy to host this guest blog from Venkat Subramanian, Chief Technology Officer, and Subra Ramesh, VP of Engineering, of Dataguise.
People often refer to Big Data in the context of four Vs: volume, variety, velocity, and "veracity". (A good backgrounder on how we got the first three Vs can be found here: http://tinyurl.com/c4l6rhw) Veracity, as the new kid on the block, speaks to the tricky nature of data quality in Hadoop. With statistics such as "bad data can cost businesses up to 12% of their revenue" (source: Experian Data Quality), it's perhaps for good reason that people talk about data veracity as a key big data challenge.
Cloudera has developed Cloudera Navigator, which seeks to provide end-to-end data governance for Apache Hadoop-based systems. Cloudera Navigator provides a rich set of features that span four key areas: comprehensive and unified auditing across Hadoop, unified and searchable technical and business metadata, lineage, and lifecycle management. With the inclusion of Navigator in Cloudera’s Accelerator Program, Cloudera is providing an open API framework that ensures that metadata from different repositories and systems can be automatically shared and easily searched, viewed and managed.
That’s why Dataguise is excited to be joining the Cloudera Accelerator Program and taking advantage of the APIs in Cloudera Navigator to fuse intelligent and automated sensitive data discovery directly into Navigator.
Using Dataguise's sophisticated discovery of sensitive data within HDFS, and during ingest via Flume, FTP, and Sqoop, customers can create an interactive reporting system that gives precise details about the location and type of sensitive data across the entire cluster. Now, in a simple, automated two-step process, all files containing sensitive data of any type (credit cards, social security numbers, bank accounts, addresses, names, blood types, etc.) can be automatically detected, counted, and reported by Dataguise, then ingested and tracked as smart metadata tags in Cloudera Navigator.
The diagram below depicts the architecture of the integration.
Users discover sensitive data using DgSecure’s Discovery functionality – either at rest in HDFS or Hive, or during ingest into Hadoop via Flume, Sqoop, or FTP. The HDFS and Hive Discovery uses MapReduce to fully exploit the parallelism of the cluster. In the near future, Dataguise Discovery will also leverage Spark in addition to MapReduce.
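To make the discovery step concrete, here is a minimal sketch of pattern-based sensitive-data scanning over a file's records. The regexes and the per-file aggregation are illustrative assumptions only; DgSecure's actual detection is far more sophisticated (and runs as distributed MapReduce jobs rather than a single-process loop):

```python
import re

# Illustrative patterns for two sensitivity types. Real discovery engines
# combine regexes with context analysis, checksums, and dictionaries;
# these two expressions are simplified assumptions for the sketch.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_record(text):
    """Return the set of sensitive-data types found in one record."""
    return {name for name, pat in PATTERNS.items() if pat.search(text)}

def scan_file(lines):
    """Aggregate hit counts per sensitivity type across a file's records."""
    counts = {}
    for line in lines:
        for hit in scan_record(line):
            counts[hit] = counts.get(hit, 0) + 1
    return counts
```

In the integration described here, per-task results like these are aggregated and, as the next section explains, forwarded to Navigator as tags rather than kept in a standalone report.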
Dataguise Discovery Task sample results: Results above depict a single task run. Aggregated results are sent to Navigator as tags.
Once the sensitive data has been discovered, the results need to be sent to Navigator to augment existing metadata. DgSecure’s DgMAS (Dataguise Metadata Aggregator and Summarizer) module picks up the results of Discovery, and pushes these results to Navigator via Navigator’s REST API, as custom tags. The frequency of the push, as well as the abstraction level at which data sensitivity is reported in Navigator, can be controlled by configuration settings of the DgMAS module. Users have the flexibility to define the abstraction level at either the data, policy or file level. The tags can be viewed via the Cloudera Navigator’s lineage diagram, as shown below.
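A push of aggregated sensitivity results to Navigator might look like the sketch below, which builds the URL and JSON body for tagging one entity. Navigator does expose a REST metadata API, but the host, port, API version, and payload shape used here are illustrative assumptions; consult the Navigator API documentation for the exact contract:

```python
import json

# Assumed Navigator endpoint for this sketch; the real host, port, and
# API version depend on the deployment.
NAV_API = "http://navigator-host:7187/api/v9"

def build_tag_update(entity_id, tags):
    """Build the URL and JSON body for attaching custom tags to one entity."""
    url = "{0}/entities/{1}".format(NAV_API, entity_id)
    body = json.dumps({"tags": sorted(tags)})
    return url, body

# An authenticated HTTP PUT of `body` to `url` would attach the tags;
# the tagged entity then surfaces in Navigator search and lineage views.
```

In the integration, the DgMAS module plays this role on a configurable schedule, batching Discovery results into such tag updates.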
Notice that "salesdata" has been tagged as "sensitive".
In addition to viewing the sensitive information in the lineage diagram, users can also search for files containing specific sensitivity tags (for example, "SSN" or "HIPAA_Policy").
This unique combination of Cloudera Navigator and Dataguise provides organizations with an automated, powerful, and comprehensive way to track sensitive-data risk in Hadoop. We applaud Cloudera's efforts in promoting open standards for governance, and we are very excited to be integrating with Cloudera Navigator to deliver this scalable, simple-to-use solution for managing sensitive data in Hadoop.