In part one of this series (Hadoop for Small Data), we introduced the idea that Small Data is the mission-critical data management challenge. To reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.”
We are excluding stochastic data use cases, which can succeed even with error in the source data and uncertainty in the results, because the business objective there is to spot trends or make general associations. Most Big Data examples are of this type. In stark contrast are deterministic use cases, where the ramifications of wrong results are severely negative. This is the realm of executive decision making, accounting, risk management, regulatory compliance, and security, to name a few.
We chose this so-called Small Data use case for our inaugural Tech Lab series for several reasons. First, such data is obviously critical to the business, and should therefore be germane to any serious discussion of an information-driven enterprise. Second, the multivariate nature of the data presents a serious challenge in and of itself. Third, the rules and other business logic that give the data meaning tend to be opaque, sometimes embedded deep in operational systems, which means that bringing transparency to this layer can yield tremendous opportunity for the business to fine-tune its operations and grow smoothly.
The Tech Lab was designed to bring the rigors of scientific process to the world of data management, a la Data Science. Our mission is to demonstrate the gritty, brass-tacks processes by which organizations can identify opportunities with data big and small, then build real-world solutions to deliver value. Each project features a Data Scientist (yours truly), who takes a set of enterprise software tools into the lab, then tackles some real-world data to build the solution. The entire process is documented via a series of blog posts and several Webcasts, which detail the significant issues and hurdles encountered, along with insights about how they were addressed or overcome.
All too often in the world of enterprise data, serious problems are ignored or, worse, assumed to be unsolvable. This leads to a repeating cycle of spending money, time, and organizational capital with little to show for it. Not only can this challenge be solved, but doing so will vastly improve your personal and organizational success by producing accurate, meaningful data that is understood, managed, and common.
Now, all of this probably sounds exactly like the marketing for the various conference fads of the past decade. We do not need to name them; we all recall the multiple expensive tools and bygone years that, in the end, did not yield much improvement. So, how can we inject success into this world?
The answer is to adopt what has been working for a very long time and is now a hot topic in data management: Data Science. This field, new to data management though not to science in general, comes with a tremendous amount of thoroughly tried and tested methods, and is linked to a strong community with deep knowledge and an ingrained willingness to help. This is the “science” part of Data Science. Data Science uses the fundamental precepts of how science deals with data: maintain detailed, auditable, and visible information about important activities, assumptions, and logic; embrace uncertainty, since there can never be a perfect result; welcome questions about how and why data values are obtained, used, and managed; and understand the differences between raw and normalized data.
It is this last tenet that we will concentrate on for this discussion and for the next Tech Lab with Cloudera. In science, normalizing data is done every day as a necessary and critical activity of work, whether experimental (as I used to do in Nanotechnology) or computer modeling. Normalizing data is more sophisticated than what is commonly done in integration (i.e. ETL). It combines subject matter knowledge, governance, business rules, and raw data. In contrast, ETL moves different data parts from their sources to a new location with some simple logic infused along the way. ETL has failed to solve even medium-level problems with discordant, conflicting, real-world corporate data, albeit not for want of money and time. Indeed, many scientific and engineering fields routinely solve problems of much higher complexity than corporate Small Data, and do so with an order of magnitude less expense, time, and organizational friction.
One real-world example is a well-governed part number used across major supply chain and accounting applications. Despite policy stating the specific syntax of the numbers, in actual data systems there are a variety of forms, with some having suffixes, some having prefixes, and some having values taken out of circulation. Standard approaches like architecture and ETL cannot solve this (although several years have typically been spent trying) because the knowledge of why, who, when, and what is often not available. In the meantime, costs are driven up to support this situation, management is stifled in modernizing applications and infrastructure, legacy applications cannot be retired, and the lack of common data prevents meaningful Business Intelligence. Note that this lack of corporate knowledge also means that other top-down approaches like data virtualization and semantic mediation are doomed, because they rely on mapping all source data values (not just the models or metadata, a critical distinction to understand) to a common model.
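To make the part-number problem concrete, here is a minimal sketch of the kind of normalization rules involved. The prefix, suffix, and canonical width are assumptions invented for illustration; real rules come from the governance and subject-matter knowledge described above.

```java
import java.util.List;

// Hypothetical sketch: reconciling divergent part-number forms to one
// canonical syntax. The specific rules (a "PLT-" site prefix, a single-letter
// revision suffix, an 8-digit canonical width) are illustrative assumptions.
public class PartNumberNormalizer {
    public static String normalize(String raw) {
        String value = raw.trim().toUpperCase();
        // Rule 1: drop the assumed site prefix.
        if (value.startsWith("PLT-")) value = value.substring(4);
        // Rule 2: drop an assumed trailing revision suffix like "-A" or "-B".
        value = value.replaceAll("-[A-Z]$", "");
        // Rule 3: zero-pad to the assumed canonical width of 8 characters.
        while (value.length() < 8) value = "0" + value;
        return value;
    }

    public static void main(String[] args) {
        // Three forms of the same part seen in three different systems.
        for (String raw : List.of("PLT-4471", "4471-B", "00004471")) {
            System.out.println(raw + " -> " + normalize(raw));
        }
    }
}
```

All three variants converge on the single canonical value, which is exactly the outcome the conflicting source systems could not reach on their own.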
This is much more typical of the state of corporate Small Data than simple variations in spelling or code lookups, and that was for a single element among many. Consider this for your company, whether the data relates to clinical health, accounting, finance, or other core corporate data sets, and you can see the scale of the challenge. This also explains why the techniques of recent years have not worked. If you do not have the complete picture, then your architecture does not reflect your actual operations. Similarly, ETL tools work primarily on tables and use low-level “transforms” like LTRIM. When the required transform crosses tables, and possibly even sources (compare element A in Table X in source 1 to elements B and C in source 2, and element D in source 3), it becomes too difficult to develop and manage.
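A cross-source rule like the one just described can be sketched in a few lines. The element names (A, B, C, D) and the matching logic are hypothetical placeholders; the point is that the rule spans three sources at once, which a per-column transform like LTRIM cannot express.

```java
import java.util.Map;

// Illustrative sketch (hypothetical sources and elements): a business rule
// that must compare values across three separate sources in one step.
public class CrossSourceRule {
    // Accept a record only if element A (source 1) matches the combination
    // of B and C (source 2) and status element D (source 3) is active.
    public static boolean consistent(Map<String, String> source1,
                                     Map<String, String> source2,
                                     Map<String, String> source3) {
        String a = source1.get("A");
        String bc = source2.get("B") + "-" + source2.get("C");
        return a.equals(bc) && "ACTIVE".equals(source3.get("D"));
    }

    public static void main(String[] args) {
        Map<String, String> s1 = Map.of("A", "4471-X");
        Map<String, String> s2 = Map.of("B", "4471", "C", "X");
        Map<String, String> s3 = Map.of("D", "ACTIVE");
        System.out.println(consistent(s1, s2, s3)); // prints true
    }
}
```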
This was the status quo. I say “was” because we now have new tools and new methods, well designed, engineered, tested, and understood, that solve this challenge, with the additional benefits that they are cheaper, faster, and more accurate, and that they engender organizational cooperation. This is the combination of normalizing data with the computing power of Hadoop.
Data Normalization excels at correcting this challenge, and does so with high levels of visibility, flexibility, collaboration, and alignment to business tempo. Raw data is the data that comes directly from sensors or other collectors and is typically known to be incorrect in some manner. This is not a problem as long as there is visible, collaborative, and evolving knowledge (remember the tenet that there is never a perfect result) of how to adjust it to make it better. This calibration is part of normalizing the raw data in a controlled, auditable manner to make it as meaningful as possible, while also making explicit how accurate it should be considered. Normalizing data needs adjustable and powerful computing tools. For very complicated data there are general-purpose mathematical tools and specialized applications. Corporate Small Data does not need that level of computing, but it does need a way to code complicated business rules with clarity, openness to review, and ease and speed of updates.
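The controlled, auditable adjustment of raw data can be sketched as a value that carries its own audit trail, so every calibration step stays visible and reviewable. This is a minimal illustration, not a prescribed design; the sample value and rules are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch: a raw value that records every adjustment applied to it, keeping
// the calibration visible and auditable per the tenets above.
public class AuditedValue {
    private String value;
    private final List<String> audit = new ArrayList<>();

    public AuditedValue(String raw) {
        this.value = raw;
        audit.add("raw: [" + raw + "]");
    }

    // Apply a named rule and log the before/after states.
    public AuditedValue apply(String ruleName, UnaryOperator<String> rule) {
        String before = value;
        value = rule.apply(value);
        audit.add(ruleName + ": [" + before + "] -> [" + value + "]");
        return this;
    }

    public String value() { return value; }
    public List<String> audit() { return audit; }

    public static void main(String[] args) {
        AuditedValue v = new AuditedValue("  plt-4471 ")
            .apply("trim", String::trim)
            .apply("uppercase", String::toUpperCase);
        System.out.println(v.value());          // PLT-4471
        v.audit().forEach(System.out::println); // full adjustment history
    }
}
```

Because each rule is named and logged, a reviewer can ask exactly how and why a value was changed, which is the transparency that opaque ETL scripts lack.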
This is what Hadoop provides. Hadoop is ready-made for running programs on demand with the power of parallel computing and distributed storage. These are the very capabilities that enable Data Normalization to become part of mainstream business data management. One key need in solving Small Data challenges, one that prior technologies could not meet, is a short cycle time for making adjustments as more knowledge is gained and business requirements change (which will always occur, sometimes daily). Gone is the era when data could be managed in six- to twelve-month cycles of requirements, data modeling, ETL scripting, database engineering, and BI construction. All of this must respond to and stay in step with the business, not the other way around. With Hadoop, Data Normalization routines written as Java programs can be run as often as desired, with multiple jobs in parallel. This means a normalization routine that might have taken hours on an entire corporate warehouse can now be done in minutes. Results can then be used in any number of applications and tools.
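The shape of such a routine can be shown in plain Java. In a real deployment the per-record function below would sit inside a Hadoop Mapper's map() call; here a parallel stream stands in for the cluster so the logic can be read and tested on its own, and the one-line rule is a placeholder for the richer rules discussed above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: the per-record normalization routine a Hadoop job would run.
// parallelStream() stands in for the cluster's parallel execution.
public class NormalizeJobSketch {
    // Placeholder rule; production rules encode real business logic.
    static String normalizeRecord(String line) {
        return line.trim().toUpperCase();
    }

    public static void main(String[] args) {
        // A tiny stand-in for a corporate warehouse extract.
        List<String> warehouse = List.of("  plt-4471 ", "4471-b", " 00004471");
        List<String> normalized = warehouse.parallelStream()
            .map(NormalizeJobSketch::normalizeRecord)
            .collect(Collectors.toList());
        normalized.forEach(System.out::println);
    }
}
```

Because each record is normalized independently, the same routine parallelizes cleanly across Hadoop nodes, which is what turns an hours-long warehouse pass into minutes.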
A simple Hadoop cluster of just a handful of nodes will have enough power to normalize Small Data in concert with business tempo. Now, you can have accurate data that reflects the real business rules of your organization and that adapts and grows with you. Of course, getting this to work in your corporate production environment requires more than just the raw power of Hadoop technology. It requires mature and tested management of this technology and an assured integration of its parts that will not become a maintenance nightmare or a security risk.
This is exactly what the Cloudera distribution provides. All Hadoop components are tested, integrated, and bundled into a working environment, with additional components specifically built to match the ease of management and maintenance of more traditional tools. The distribution is also managed with clearly planned updates and version releases. There are too many individual components to comment on now, but one that deserves mention as a key aid to Data Normalization, and indeed to the Hadoop environment itself, is Hue, Cloudera’s web tool for browsing the file system, issuing queries to multiple data sources, planning and executing jobs, and reviewing metadata. If you have any questions, comments, or concerns about Small Data, please join me live on Webcast II of our inaugural Tech Lab!
Dr. Geoffrey Malafsky is Founder of the PSIKORS Institute and Phasic Systems, Inc. A former nanotechnology scientist at the US Naval Research Laboratory, he developed the PSIKORS Model for data normalization. He currently works on normalizing data for one of the largest IT infrastructures in the world.