Authors: Mai N. Nguyen, Accenture & Mitch Gomulinski, Cloudera
Imagine storing the DNA of the entire population of the US – and then cloning everyone, twice. That is roughly the equivalent of 1 petabyte (ComputerWeekly) – the amount of unstructured data held across our large pharmaceutical client's business. Now imagine the insights locked inside that massive amount of data. Because the content resides in multiple sources, each with its own limitations, aggregating it and drawing information from it can be daunting. How could they tackle this challenge?
Unstructured content lacks a predefined data model; it must first undergo text extraction, classification, and enrichment to yield intelligence. With over 1 PB of research data stored in a variety of systems (Documentum, Windows and Unix file shares, and SharePoint) and a wide range of content types (internal reports, emails, research documents, electronic lab notes, drug profiles, clinical trials, regulatory reports, images, etc.), the client needed an approach to:
- Simplify data hub ingestion, especially for large volumes of unstructured content
- Ensure content can be reused within the data hub to support pharmaceutical use cases
Using Aspire as a Cloudera Parcel
The solution to this massive data challenge embedded the Aspire Content Processing Framework into the Cloudera Enterprise Data Hub as a Cloudera Parcel – a binary distribution format containing the program files, along with additional metadata used by Cloudera Manager. Aspire, built by Search Technologies, part of Accenture, is a search-engine-independent content processing framework for handling unstructured data. It provides a powerful solution for preparing human-generated content and publishing it to search engines and big data applications.
- Aspire as a Cloudera Parcel, available in the latest 3.2 release, supports unlimited scalability and enterprise security requirements, and communicates natively with the data hub for content storage and indexing. It enabled the ingestion of over 1 PB of unstructured content into the data hub at a peak rate of over two million documents per hour.
- Combined with HBase, the framework allows for unlimited scalability to fulfill growing enterprise needs. Using HBase as its backing store also makes Aspire simpler to install and maintain in the client's big data environments.
- Aspire connectors can acquire binaries, metadata, and access control lists (ACLs) for content held in enterprise data systems. Parallel content acquisition delivers high throughput, enabling efficient ingestion from multiple, disparate content sources.
- A staging repository is central to this architecture as it supports highly-efficient content reuse and continuous updates. This enables data hub users to quickly access up-to-date content across the enterprise.
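To make the architecture concrete, the sketch below illustrates the pattern the bullets describe: connectors pull binaries, metadata, and ACLs from several sources in parallel, and every record lands in a single staging store keyed by document ID so it can be reused and continuously updated. This is a minimal, self-contained illustration, not Aspire's actual API; the source names, `fetch` function, and in-memory `staging` dict (standing in for the HBase staging table) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical document sources; in the real pipeline these would be
# Aspire connectors for Documentum, file shares, SharePoint, etc.
SOURCES = {
    "documentum": ["doc-1", "doc-2"],
    "fileshare": ["doc-3"],
    "sharepoint": ["doc-4", "doc-5"],
}

def fetch(source, doc_id):
    """Simulate acquiring one document's binary, metadata, and ACLs."""
    return {
        "id": doc_id,
        "source": source,
        "metadata": {"title": f"{doc_id} from {source}"},
        "acl": ["research-group"],
    }

def ingest_all(sources):
    """Acquire documents from all sources in parallel and stage them
    in one keyed store (a stand-in for the HBase staging repository)."""
    staging = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(fetch, src, doc_id)
            for src, doc_ids in sources.items()
            for doc_id in doc_ids
        ]
        for f in futures:
            record = f.result()
            # Keying by ID makes re-ingestion an idempotent upsert,
            # which is what enables continuous updates of staged content.
            staging[record["id"]] = record
    return staging

staging = ingest_all(SOURCES)
```

Because each record is keyed by document ID, re-crawling a source simply overwrites stale entries, which is what lets downstream consumers always see up-to-date content.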
Using a search engine to support all stages of a data hub project
Initially, indexing the data hub content with the Solr search engine helped the company better understand the ingested data, and a Cloudera implementation is well suited to supporting the large number of Solr indexes that diverse use cases demand. A first-cut analysis enables a review of the content along multiple dimensions:
- By type: a single unstructured data source may include different types of content (reports, memos, or lab experiments) as well as some low-value information like logs or working files
- By use case: content may also support multiple research and drug manufacturing use cases
- By provenance/authorship: knowing the department, lab, or researcher who generated content is essential
- By publishing/creation date: analyzing the data hub content by date can help understand historical trends and developments
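The dimensions above map naturally onto Solr's standard faceting parameters. The sketch below builds the parameter set for such a first-cut query; the field names (`content_type`, `use_case`, `department`, `publish_date`) are illustrative, and the client's actual Solr schema would define its own fields.

```python
from urllib.parse import urlencode

# Illustrative facet parameters for a first-cut review of the corpus.
facet_params = {
    "q": "*:*",            # match every indexed document
    "rows": 0,             # return facet counts only, no documents
    "facet": "true",
    "facet.field": ["content_type", "use_case", "department"],
    "facet.range": "publish_date",        # by publishing/creation date
    "facet.range.start": "NOW/YEAR-20YEARS",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1YEAR",
}

# doseq=True repeats facet.field once per listed field, which is how
# Solr expects multi-valued request parameters to be encoded.
query_string = urlencode(facet_params, doseq=True)
```

Sent to a Solr `/select` handler, a query like this returns document counts per content type, use case, department, and publication year without fetching any documents – a cheap way to survey a petabyte-scale index before building use-case-specific views.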
With Cloudera Enterprise Data Hub as the foundation for data acquisition, centralization, and indexing, more intelligent applications can be built on top of it to support:
- insight discovery
- compliance reporting
- other search and analytics needs across the organization
For a deep dive into the challenges and potential of handling massive quantities of diverse content with Cloudera, join our upcoming webinar.