Need for Speed: Parallelizing Corporate Data

Fast. Today. Now. These are the terms of modern business. Data is the fuel of business activities. We need better fuel.

Part two of this discussion series and the last TechLab Webcast (September 17, 2014) focused on Data Normalization as the catalyst for fixing intractable challenges in mission-critical corporate data. Some obvious rebuttal points are: we already do this with our great vendor tool xyz; our ETL vendor says they do this; our database vendor says they do this; our BI vendor says they do this; we are starting a metadata management project to do this; we are forming a committee to talk about this.

If these are real solutions, then there should be examples of successful results now, today. Of course, there are some exemplary organizations that have succeeded, are succeeding, and will succeed in this area. They deserve congratulations, respect, and attention. For many others, this is a work in progress. So, show me the continuously improving, auditable, meaningful use, trustworthy data being used in mission-critical scenarios and openly accepted by leaders as fulfilling requirements well.

A critical success factor is enabling iterative, variable, and transparent results tuned to the personal and organizational work tempo of analysts, managers, and business product delivery. In almost all mission-critical activities, the specific requirements of the business on the data environment are neither static nor known at a level of detail sufficient to supply traditional tools and methods. This leads to the accursed business-technical organizational chasm.

Into this fray comes a little noticed aspect of Hadoop: easily customizable parallel computing. We all know Hadoop provides parallel processing of distributed data. That is why it was built. But, I am talking about real computations (not trivial things like Word Count) that a real CFO needs to have done, done right, done now, and then redone tomorrow. This is where Data Normalization is most beneficial.

Normalizing data entails collecting knowledge (i.e. business rules) from many sources, adjudicating it, putting it into executable computer code, running the computer code with multiple data sources and then analyzing the results. Done well this will lead to new insights that are fed back into the business rules and code to produce new results. At some point, the results are good enough to use to feed business products, applications, reports, and other computational tools for uses as varied as regulatory reports, executive analytics, marketing and customer assessments, and system designs.

Prior to Hadoop, this was only possible with very expensive systems and specially trained personnel. Now, we can do it with inexpensive systems and a wider group of technical and business people. This is because Hadoop is inherently a parallel code environment. More importantly, it is designed to run code written in Java, one of the most commonly used languages. We can build, test, and run programs ranging from simple to complex, spreading the workload over many independent CPUs as often as we need, with as many changes as we desire, and with as large a set of source data as we want.

Let us use an example from the showcase data set in the TechLab series with Cloudera. This is the Federal Procurement Data System (FPDS) and associated IT data sources. The following diagram shows how we are normalizing key tracking data values in FPDS to other sources to more accurately analyze the types, costs, and owners of IT services and products. There are several supposedly unique identifiers, but in practice they are anything but unique and definitely not accurately assigned. This stymies traditional analytics, and even Data Quality methods, since there are no clear, documented, or known rules to correct the values. In contrast, this is the sweet spot for Data Normalization.

Building Key Relationships and Normalizing Record Data Across Multiple Overlapping Data Sources

We want to improve tracking and accuracy by correlating the names and accounting information from multiple data sources using two common identifiers. These are the DUNS (Dun & Bradstreet number) and the CAGE code (used by the US Federal Government). Analyzing the various sets, we determine that the CAGE is the most reliably unique and accurate identifier for companies. The DUNS has many numbers per company that vary over time (mergers, acquisitions, name changes), and there is also a Global DUNS that is supposed to be the parent of a set of regular DUNS. One source has CAGE and DUNS, while FPDS has DUNS and Global DUNS. Both data sources have multiple names per company. Both data sources have multiple accounting entries per event per company.
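
The identifier relationships just described can be sketched in Java as a small in-memory crosswalk. This is only an illustration under stated assumptions, not the code used for the TechLab work: the class name IdentifierCrosswalk, its method names, and the rule of trying the DUNS first and the Global DUNS second are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical DUNS-to-CAGE crosswalk built from the source that carries both identifiers. */
final class IdentifierCrosswalk {
    // Key: DUNS as a trimmed string. Value: CAGE code, upper-cased for consistent comparison.
    private final Map<String, String> cageByDuns = new HashMap<>();

    /** Registers one (DUNS, CAGE) pair taken from the reference source. */
    void add(String duns, String cage) {
        cageByDuns.put(duns.trim(), cage.trim().toUpperCase());
    }

    /**
     * Resolves an FPDS record to a CAGE code.
     * Assumed rule: try the record's own DUNS first, then fall back to its Global DUNS.
     */
    Optional<String> resolve(String duns, String globalDuns) {
        String cage = (duns == null) ? null : cageByDuns.get(duns.trim());
        if (cage == null && globalDuns != null) {
            cage = cageByDuns.get(globalDuns.trim());
        }
        return Optional.ofNullable(cage);
    }
}
```

An empty result is as informative as a match: records that resolve through neither identifier are the ones that need a new business rule or a manual review.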

Attempting to find, review, and extract the relevant correlated data from this relatively simple use case is beyond most data management systems. Each record in FPDS must be correlated to each record, each DUNS, each CAGE, and each name in the second source. Open questions include the degree of difference between names (just a comma or a space?) and whether we should automatically consolidate them, the level of confidence in relationships and whether they can be used to combine records in all cases or just some, and many others, including ones that will only be realized after analyzing the data. Thus, we do not have independent data that can be run in Hadoop MapReduce jobs. Instead, we have intertwined data with so many relationships and nested checks that it overloads even a powerful computer’s resources.
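
The name question (just a comma or a space?) can be made testable with a small helper. The sketch below is a hypothetical illustration, not a rule taken from the data: the class NameKey and its choice to ignore case, punctuation, and extra whitespace are assumptions. It only flags candidate pairs; whether to consolidate them automatically remains the open business decision.

```java
import java.util.Locale;

/** Hypothetical helper that reduces a company name to a comparison key. */
final class NameKey {
    /** Upper-cases the name and collapses every run of non-alphanumeric characters to one space. */
    static String of(String name) {
        return name.toUpperCase(Locale.ROOT)
                   .replaceAll("[^A-Z0-9]+", " ")
                   .trim();
    }

    /** True when two names are written differently but reduce to the same key. */
    static boolean trivialDifference(String a, String b) {
        return !a.equals(b) && of(a).equals(of(b));
    }
}
```

For example, NameKey.trivialDifference("Acme, Inc.", "ACME  Inc") is true, while "Acme Inc" and "Acme Corp" are left as distinct names.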

In my prior life as a research scientist, this is where I would try to get time on a supercomputer. Instead, I now write Java the same way I do for any program, including testing various ways to cleverly maintain fast memory-based lookups of values (i.e. hash tables). This lets me efficiently process the entire set, review it, make changes, and do it again and again until I am confident I have good answers and a new normalized set of data that I can share widely. I can also use natural boundaries in the data to define parallel jobs. In this case, the data was segmented by fiscal year. I ran a pre-process to break up the original data into files based on fiscal year, but this pre-processing could easily be part of the main job and therefore run at one time. Indeed, this is another benefit of this approach: I can simply construct the most efficient way to handle the data for my own needs, and change my mind without a lot of work.
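
The fiscal-year split can be sketched as follows. This is a local stand-in under stated assumptions, not the Hadoop job code: a Java thread pool plays the role of the separate parallel jobs, the class FiscalYearJobs is hypothetical, each input line is assumed to start with a fiscal year followed by a comma, and the per-year work is reduced to a record count where the real job would do the lookups and normalization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: split records by fiscal year, then process each year in parallel. */
final class FiscalYearJobs {
    /** Groups lines by the fiscal year in the first comma-separated field (assumed layout). */
    static Map<Integer, List<String>> partition(List<String> lines) {
        Map<Integer, List<String>> byYear = new TreeMap<>();
        for (String line : lines) {
            int year = Integer.parseInt(line.substring(0, line.indexOf(',')).trim());
            byYear.computeIfAbsent(year, y -> new ArrayList<>()).add(line);
        }
        return byYear;
    }

    /** Runs one task per fiscal year and returns the number of records processed for each year. */
    static Map<Integer, Integer> runAll(Map<Integer, List<String>> byYear, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<Integer, Future<Integer>> pending = new TreeMap<>();
            for (Map.Entry<Integer, List<String>> e : byYear.entrySet()) {
                final List<String> part = e.getValue();
                // Placeholder work: the real job would normalize each record here.
                Callable<Integer> task = () -> part.size();
                pending.put(e.getKey(), pool.submit(task));
            }
            Map<Integer, Integer> counts = new TreeMap<>();
            for (Map.Entry<Integer, Future<Integer>> e : pending.entrySet()) {
                counts.put(e.getKey(), e.getValue().get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("fiscal-year task failed", ex);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because partition is an ordinary method, it can be called as a separate pre-process or as the first step of the main job, which mirrors the choice described above.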

Each job uses a separate set of CPUs, separate memory, and separate disk storage. I ran this on a small Cloudera cluster specially configured for computational needs. Just two nodes have 15TB of disk storage, 18 virtual CPUs, solid state storage, and access via Cloudera’s manager and the HUE (Hadoop User Experience) integrated web site. With this simple arrangement, I could run six parallel jobs, each using a 20GB lookup source on its own 3-5GB data set, in about thirty minutes. This is fast enough for my needs, although it can easily be sped up by breaking up the fiscal year sets into one-half or one-quarter years and then running more parallel jobs using the unused vCPUs or adding more nodes to the cluster.

This worked so well that I reran entire jobs just to make the collected statistics (I always do this with my analytical code) more compact and to include a wider array of information. I have used many tools, programming languages, and methods over the years and this is easily the most powerful and flexible, and the one fitting my natural workflow. I do not pine for the hours of writing customized routines in proprietary languages in what was supposed to be a no-coding analytical system.

There is another aspect of this that is very important. It is fine to write code to do real-world complicated Normalization and analysis, but we also need an easy way to submit, monitor, and manage the jobs in Hadoop. This is the last piece of today’s discussion. Cloudera’s Hadoop distribution not only has the requisite basic Hadoop components and the very good HUE web page I commented on last time, but also all the additional components that make it easy and reliable as a workhorse computing environment. This includes REST APIs for HDFS, multiple interfaces to YARN, and, most important of all, the fact that the parts work together, can be visibly monitored at all times, and actually have meaningful and understandable log files, so tracking issues is part of normal development and not a separate source of frustration.

This would be enough to make the Cloudera distribution my preferred analytical environment but there is more. We use our own Hadoop client built into our Data Normalization platform. Integrating this with Cloudera’s distribution was quick, easy, and reliable. This opens the important opportunity for a powerful expanded set of tools to build on Hadoop, and do so quickly. I will show more of this in the next TechLab Webcast in October. See you then.


Dr. Geoffrey Malafsky is Founder of the PSIKORS Institute and Phasic Systems, Inc. A former nanotechnology scientist for the US Naval Research Center, he developed the PSIKORS Model for data normalization. He currently works on normalizing data for one of the largest IT infrastructures in the world.

