Previously, we announced that the leaders in the data governance space have joined Cloudera to provide a unified foundation for open metadata and end-to-end visibility for governance. Today, we are happy to host this guest blog from Sean Ma, Director of Product Management at Trifacta.
In the last couple of years, organizations have dramatically changed the way they store and process data: access to data is no longer the privilege of the few. As more individuals directly access and analyze Apache Hadoop data to drive business insights, the importance of understanding how and why users and applications access data continues to grow. As a result, metadata management continues to play a critical role in effectively working with diverse data in Hadoop.
One of Trifacta’s core strengths is empowering the people who know the data best with an intuitive, self-service way to work directly with diverse data. As users interact with the content of their data, Trifacta translates those interactions into visualizations and recommendations on how the user might want to manipulate the data. With this model, Trifacta automatically generates metadata in context while users structure, transform, and cleanse their data, which ultimately improves ease of use and user productivity within the product.
Taking this a step further, we saw an opportunity to share this metadata with the larger Hadoop ecosystem to improve end-to-end metadata management for our customers.
Working with a large European bank, we tackled a common problem within Hadoop deployments: poor understanding of technical Hadoop metadata due to a lack of business context. The bank’s system architects wanted to trace how data wrangled by analysts related to other datasets within the Hadoop system, a problem classically solved by a metadata lineage view. However, the system architects would often point out that while this view was useful, it was incomplete because it lacked context about the business transformation logic that was applied; that is, it was missing metadata about the transformation steps executed by data analysts. Therefore, the real challenge was consistently capturing the business logic and context from the data analysts and consolidating it into a single lineage view containing both technical and business metadata. This task was particularly difficult since data analysts were often neither trained nor inclined to log into a separate metadata tool to annotate lineage diagrams in a standardized manner.
While Trifacta’s interface is well suited for data analysts, to provide a holistic solution, we needed to pair it with tools oriented toward the work of system architects and engineers. Enter Cloudera Navigator, the customer’s tool of choice due to its ability to collect, organize, and visualize metadata on Hadoop.
Built in collaboration with the Cloudera Navigator team, our joint integration augments the Hadoop metadata captured by Cloudera Navigator with user-generated metadata from data wrangled in Trifacta. Data analysts can now publish the metadata created through the wrangling process directly to Cloudera Navigator. Additionally, from within Navigator, users can search for that metadata and use Navigator’s lineage view to see Trifacta wrangle scripts directly associated with the datasets on the Hadoop cluster.
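To make the idea of pairing technical and business metadata concrete, here is a minimal sketch in Python. It is not Trifacta's or Navigator's actual API: the `WrangleStep` and `LineageEntity` classes, the `annotate_entity` function, the `wrangled` tag, and the property names are all hypothetical and exist only to illustrate how human-readable transformation steps might be attached to a dataset's technical record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WrangleStep:
    """One transformation step recorded while an analyst works (hypothetical shape)."""
    operation: str
    description: str


@dataclass
class LineageEntity:
    """A dataset as a technical metadata store might describe it (hypothetical shape)."""
    path: str
    entity_type: str
    description: str = ""
    tags: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)


def annotate_entity(entity: LineageEntity, script_name: str,
                    steps: List[WrangleStep]) -> LineageEntity:
    """Return a copy of the entity enriched with business context from the steps."""
    # Build a numbered, human-readable summary of the transformation logic.
    summary = "; ".join(
        f"{i}. {step.description}" for i, step in enumerate(steps, start=1)
    )
    return LineageEntity(
        path=entity.path,
        entity_type=entity.entity_type,
        description=summary,
        # Keep existing tags and mark the dataset as wrangled (assumed tag name).
        tags=sorted(set(entity.tags) | {"wrangled"}),
        properties={
            **entity.properties,
            "wrangle_script": script_name,   # assumed property name
            "step_count": str(len(steps)),   # assumed property name
        },
    )


if __name__ == "__main__":
    steps = [
        WrangleStep("split", "Split name into first and last"),
        WrangleStep("filter", "Drop rows with missing account id"),
    ]
    technical = LineageEntity(
        path="/data/out/customers.csv", entity_type="HDFS_FILE", tags=["finance"]
    )
    print(annotate_entity(technical, "clean_customers", steps))
```

The function returns a new record rather than mutating the input, so the technical metadata stays untouched until the analyst chooses to publish. That mirrors the workflow described above only loosely, and a real integration would send such a record to the metadata tool rather than print it.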
Data analysts benefit from having their data quality and data wrangling logic automatically captured and linked to the technical metadata stored within Navigator without ever having to leave the Trifacta interface. System architects benefit from having all of the transformation jobs, HDFS files, and Hive tables used by Trifacta users easily searchable, traceable, and annotated with a human-readable description of the transformation logic. This joint solution enables closer collaboration between Line of Business and IT through shared metadata, which results in faster and more successful Hadoop initiatives for the business.
We are excited to roll out this integration with Cloudera as part of Trifacta’s v3 launch, and we look forward to the future of this initiative and the value we can bring to our joint customers.
Sean Ma, Director of Product Management at Trifacta, brings over 10 years of experience in enterprise data management software. For the last 5 years, Sean has specialized in building Big Data products and solutions for enterprise customers. Prior to joining Trifacta, Sean led the product management team at Informatica for their Enterprise Data Integration platform and launched their Big Data edition product line. He holds a Bachelor of Science in Electrical Engineering and Computer Science from the University of California, Berkeley.