Capturing New Data Sources to Deliver Smarter Products and Services


We talk a lot about the enterprise data hub at Cloudera. It’s a technology concept that we believe in strongly, and for good reason: we are seeing our customers succeed in unifying their data systems and launching new functionality and applications, all powered by a single technology platform. We are seeing organizations adopt Apache Hadoop to solve a specific problem and then mature into capabilities that no one could have imagined. Some of the best validation comes when data analytics and access to good atomic data result in protecting people and saving lives. We believe so strongly in the concept that our internal data architecture is an enterprise data hub, and we are using it to get smarter about how we support our customers, deliver our software, and understand our users.

Alan Jackoway is the data platform lead at Cloudera. He manages the operation and scope of our internal data hub and works with our data science team and a multitude of stakeholders across the business to launch applications and functionality. Inside the EDH environment, Alan is able to join data across sales, marketing, and customer data sources, giving Cloudera the ability to make better choices for our customers. The consistency that comes from unifying data systems provides a stable basis for sound reporting.

I encourage you to watch the full video and hear about all the applications we are hosting from the EDH. Here are some of my key takeaways.      

One Structure, Multiple Applications. We are doing many things with our internal EDH. The data science team at Cloudera uses Impala to perform discovery exercises in order to build the best analytic models, while our support team runs batch processing on ticketing data to identify the most pervasive support issues. Cloudera runs cluster diagnostics on data sent from Cloudera Manager in order to better support customers who are troubleshooting issues. Another big use case for our EDH is business intelligence, which helps Cloudera operate better as an organization to serve its customers and the Hadoop ecosystem. We now even have a search interface and can query across multiple data sources. The point is that each of these activities requires different tools, access engines, and data types to be successful, and the internal data hub needs to address all of them.

Master Data Management. Also called data stewardship, this is an often overlooked but critical role in a large-scale data strategy. Data stewards are responsible for cleansing and unifying data across different data sources. They also help define key values and categories for data to ensure consistency across all efforts. When bringing in data from multiple sources, it is rare that the sources use the same identifiers for common fields like name or SKU. This role helps map across all sources to ensure reliability in reporting and to keep all areas of the organization acting in concert.
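To make the mapping idea concrete, here is a minimal sketch of how field identifiers from two sources might be unified. The source names, field names, and sample values are all hypothetical and are not taken from Cloudera's actual data hub.

```python
# Hypothetical per-source mappings from local field names to shared names.
# Every source and field name here is an illustrative assumption.
FIELD_MAP = {
    "sales": {"acct_name": "name", "product_sku": "sku"},
    "support": {"customer": "name", "SKU": "sku"},
}


def normalize(source, record):
    """Rename source-specific fields to shared names and tidy the values."""
    mapping = FIELD_MAP[source]
    unified = {}
    for field, value in record.items():
        key = mapping.get(field, field)  # unmapped fields keep their name
        if isinstance(value, str):
            value = value.strip()
        unified[key] = value
    # Use one casing convention for SKUs so values compare equal across sources.
    if isinstance(unified.get("sku"), str):
        unified["sku"] = unified["sku"].upper()
    return unified


sales_row = normalize("sales", {"acct_name": " Acme Corp ", "product_sku": "edh-100"})
support_row = normalize("support", {"customer": "Acme Corp", "SKU": "EDH-100"})
```

Once both records use the same field names and value conventions, they can be joined or compared directly, which is the consistency the stewardship role is meant to protect.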

Differing SLAs. No one likes delays, especially during a critical support experience. That is why our support team requires data in real time. Other segments of the business do not require that level of data availability. We have written before about this idea of multitenancy. At Cloudera we have multiple SLAs for data availability. Our support organization requires up-to-the-minute data in order to meet our support commitments to you, while other data sources like log files might only be accessed weekly. Sales data is often aligned with our sales milestones, and marketing data is often polled during critical events in our evolution. We asked the audience if they had different data access requirements.


It is likely not surprising that over half of organizations require some sort of real-time capability, but look at the number of respondents who are only worried about access within a given week or quarter. We talk a lot in big data about instant actionable insights, but insight creation needs to align closely with how the business is measured. One observation that Alan makes is that simply by having the capability to provide more point-in-time references to data, we can over time help sectors of the business improve their forecast accuracy.
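As a rough illustration of differing SLAs on a shared platform, the sketch below checks each data source against its own freshness target. The source names and time limits are hypothetical assumptions chosen to mirror the examples above, not Cloudera's actual settings.

```python
from datetime import datetime, timedelta

# Hypothetical freshness targets per data source (illustrative values only).
MAX_AGE = {
    "support_tickets": timedelta(minutes=1),  # up-to-the-minute data
    "log_files": timedelta(weeks=1),          # accessed roughly weekly
}


def is_fresh(source, last_loaded, now):
    """Return True if the source was loaded within its allowed age."""
    return now - last_loaded <= MAX_AGE[source]


# Arbitrary example timestamps: both sources were last loaded three days ago.
now = datetime(2020, 1, 8, 12, 0)
last_loaded = now - timedelta(days=3)

logs_ok = is_fresh("log_files", last_loaded, now)           # within one week
tickets_ok = is_fresh("support_tickets", last_loaded, now)  # far too stale
```

The same three-day-old load is acceptable for one tenant and a breach for another, which is why a single availability target for the whole hub would be either too strict or too loose.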

If you view the full video, you can see a list of all of our lessons learned and our future plans for the internal data hub. We look forward to updating you from time to time. You can also access the whitepaper, which goes into detail on the exact cluster configuration.

We are going to continue the conversation around our internal EDH with a look at how our proactive support capabilities benefit from the data inside the data hub. I hope you can join us.

