This blog was penned by Dr. Geoffrey Malafsky, Founder of the PSIKORS Institute and Phasic Systems, Inc.
In our TechLab series with Cloudera, we have been exploring the ability of Hadoop to make a strong positive impact on the challenge of mission-critical data, what I call Corporate Small Data. In true laboratory mode, this topic is approached as a question, not as a known answer. Thus, TechLab is less marketing and more a joint exploration of the intersection of technology and business.
We must always stay vigilantly focused on real business value and not succumb to “toolitis”, the medically known (certification pending, I believe) condition in which organizations buy expensive tools that have little chance of solving the stated business objectives. As with biological microorganisms, these invading tools come in multiple forms, including Master Data Management hubs, ETL systems (when they claim to solve all data needs), Metadata Management, and the most virulent strain, Semantic Technologies.
Of course, I am a hard-core technologist, but I have learned the lesson many times, especially in my research scientist days, that the real path to success requires producing meaningful analytics that are clear, auditable, and yield insights into important issues. So, while playing with new technology is fun, and can be used to delay and digress for a year or more, anyone who cares about or needs decision quality data analytics should embrace the adjective “meaningful”.
This is the theme of this article and of the final TechLab broadcast with Cloudera on November 5, 2014. The first and most important question is: what is a meaningful analysis, and how does it differ from other analyses? The answer is: it depends. This is not the answer some people want, but those people have absolutely no business being involved in Data Science or advanced Business Intelligence. If you have to ask why “meaningful” is a crucial characteristic of decision quality data analytics, then you should not read any further. I will not only treat it as the central driving requirement of all procedure and technology activities, but also describe the futility, bordering on sabotage, of doing analytics without a clear plan and metrics for gauging the meaningfulness of the results.
A meaningful analytic product is one that presents information in a concise, easy-to-understand manner, where the underlying data has gone through an iterative series of investigation, normalization, review, and, most importantly, adjudication. It does not have to be perfect. Hopefully I got that in before the refrain of “we cannot wait for perfect results” and its cousin “something is better than nothing” are shouted out. There is no such thing as perfect data, so let’s not waste time on that tired cliché. But let’s spend some time on the second tired cliché.
Something is not always better than nothing. Nothing is actually better if the something is wrong and will lead people to take wrong actions based on wrong data or misleading graphical results. In real science, we do not wait for perfection. We produce meaningful results with what we have collected so far, with what we know about the data and our assumptions toward the objective at a point in time, and with indicators of the data’s pedigree and underlying assumptions that can be questioned and revisited later. Sometimes these indicators are the error bars in statistical results, sometimes they are alternate charts that answer “what if this other pattern exists”, and sometimes it is just a label at the top of a chart that says “Current Results, Additional Work Underway”.
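The practice above, reporting partial results with an explicit error bar and a provisional caveat, can be sketched in a few lines of Python. This is a hypothetical illustration only; the function name, sample values, and dictionary keys are invented for the example, not part of any TechLab tooling:

```python
import statistics

def summarize_provisional(samples, label="Current Results, Additional Work Underway"):
    """Summarize partial data with explicit uncertainty indicators.

    Returns the mean, the standard error of the mean (the basis for an
    error bar), and a caveat label so consumers of the chart know the
    analysis is still underway and may be revisited.
    """
    mean = statistics.fmean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return {"mean": mean, "error_bar": sem, "caveat": label}

# Partial data collected so far (invented values for the sketch)
result = summarize_provisional([10.2, 9.8, 10.5, 10.1])
print(result["caveat"])  # the label that would sit atop the chart
```

The point of the sketch is that the uncertainty and the caveat travel with the number, so a downstream reader can question or revisit the result later.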
With previous generations of technology, it took a lot of preparation to both understand the data and build a repeatable approach to updating it. This could consume most of an analyst’s time and require specialized training in multiple tools and coding methods. It could require working with other groups that controlled the technology, the knowledge surrounding the metrics of being meaningful, and the conduit to show the results to the people who might use it for important decisions and actions. These were significant hurdles; some still are. But, the technology part of this challenge is no longer an impenetrable roadblock to efficiency and success.
Cloudera’s Hadoop distribution provides a complete management and execution environment for parallel processing and very large data storage. This seems perfectly suited to enable the iterative, constantly updated, multiple method analysis needed to explore, understand, and digest data into meaningful products all within the daily tempo of modern organizations and analysts. So, is it?
The answer is yes, with a footnote. With its comparatively simple parallel processing environment (relatively speaking, as parallel processing is never really simple), its ability to store immense amounts of data with little overhead and only a trivial amount of new technology knowledge, its easy scaling up and out, and its cluster management, Cloudera’s Hadoop environment is an enormously powerful capability for meeting the Corporate Small Data need. The footnote concerns an organization’s willingness to spend a little time, effort, and funding to learn something new and adapt its business activities. Yikes! Change! I feel faint.
Most of the information about Hadoop being promulgated at conferences and in the literature is either “it’s perfect so buy me” or “get a computer science degree and join our open source community”. Compared to established analytical tools, the learning barrier to using Hadoop in the near term for executive-level results is low. You need people comfortable with Java, but that is not difficult. You also need to ignore most of the hyperbole and apply Hadoop to a narrow use case that meets hard, important challenges. That is a little harder, since it requires distilling all the things occurring in your organization now, all the things that could be done, and all the things someone wants to do that do not seem to make sense, down to a small set of things that are important, clearly showcase value, and keep groups working together.
The path to meaningful data analytics does have a hill to climb. It is a small hill, but a hill nonetheless. At the top is data success. I’ll be waiting for you with refreshments. See you Wednesday, November 5, 2014 for TechLab3 with Cloudera.
Dr. Geoffrey Malafsky is Founder of the PSIKORS Institute and Phasic Systems, Inc. A former nanotechnology scientist for the US Naval Research Center, he developed the PSIKORS Model for data normalization. He currently works on normalizing data for one of the largest IT infrastructures in the world.