Hadoop For Small Data

The Elephant in the room is real. Normally, this would be a bad thing, as the idiom implies a large object that cannot be ignored but is being avoided. Here, I refer to the structured corporate data that fuels a company's main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision-making, applications, reports, and Business Intelligence. This is Small Data relative to the much-ballyhooed Big Data of the Terabyte range.

Yet Small Data feeds mission-critical activities, whereas Big Data most often targets value-added functions. I say activities instead of applications because the business, including Government, is mostly concerned with human use of data in concert with automated computer applications to provide services to other people, design and build new things, and manage the business itself for efficiency, effectiveness, security, and finances. Small Data does not meet the entry requirements of Big Data: it is often tens to hundreds of Gigabytes at its largest; it has been well dissected and managed as sets of individual data elements and does not require unstructured images or documents; it has a large group of people involved in its lifecycle; and it is commonly subject to laws, regulations, and policies.

Small Data has fed 30 years of Information Technology market growth for established companies like IBM, Oracle, Informatica, and Teradata. The market continued to grow to support the expanding use and importance of the data to daily management activities and new automation and analytic applications. However, this was typically done in a minimalist manner with coordination, correlation, documentation, and cohesive management left for a future time when there would be ample resources of people, time, money, and skills to practice full-blown engineering as done in space travel. Most of us are still waiting for the future to arrive.

Enter Hadoop. Hadoop originated for truly big data with its new requirements for very large storage and processing, and the desire to handle this without the very high cost in people, hardware, and software that was the status quo in the early 2000s. Hadoop is a superb computer engineering feat that has significantly pushed the price point down and the accessibility of distributed computing up. It arose from a particular use case, search engine content digestion and querying, and was inevitably tuned for this type of use. This was Hadoop 1 with the intuitively titled MapReduce engine and Hadoop Distributed File System (HDFS).
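The MapReduce name is intuitive because the model really is two phases, a map and a reduce, with a grouping step in between. The following is a minimal single-process sketch of that pattern in plain Python, using the classic word count. It is not Hadoop's API and nothing here is distributed; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key; a framework does this between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three steps in order on an in-memory list of strings."""
    return reduce_phase(shuffle(map_phase(documents)))
```

In a real cluster the map and reduce steps run in parallel on many machines against data stored in HDFS; the sketch only shows the shape of the computation.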

As with all powerful new technology, people started getting creative and wanted to apply it to other uses. Also, in the United States we still value innovation for profit. Yes, we do. Deal with it.

So, looking for sources of profit, the growing ecosystem locked on to corporate data processing. Enter the need for more general-purpose application processing, more flexible data storage, and clearer, less computer-science-oriented user interfaces. Welcome Hadoop 2. This now general-purpose distributed computing environment is filled with smartly designed and built components that work amazingly well together. But gone are the intuitive names, with components like Oozie, Flume, ZooKeeper, Hive, Thrift, Sqoop, and Mahout. One obvious and excellent analog of the core technology to a long-standing business field is processing vast amounts of data as a modernized form of business analytics for marketing and customer targeting.

Another seemingly obvious use is mainstream corporate data management, yet this is a harder sell since the Elephant in the Room also includes the fragile nature of corporate data: the knowledge of how all the data from multiple sources, multiple targets, multiple organizations, and required audit and retention procedures fits together is mostly undocumented and unavailable. Indeed, it is this latter point which is the major impediment to solving the corporate Small Data challenge. Oh, and your CFO and CEO actually need to trust it.

Here is where Hadoop 2 offers a dramatic new capability. Solving this vexing challenge requires a cheap, fast, flexible, iterative way to blend a continuously evolving set of business knowledge with unified corporate data. This is Data Normalization. It is the combination of subject matter knowledge, governance, business rules, and raw data, that is, real business management of important corporate data. It is also the foundation of science. Every scientist does this every day with all their data. All of those great pictures of the universe and all of those weather reports do not use raw data. They do not simply do ETL. They rely on and constantly grow Data Normalization. In contrast to legacy integration approaches, Data Normalization makes data as accurate as possible at any point in time with the ability to improve, adjust, and retarget frequently with minimal effort.
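To make the idea of blending business knowledge with raw data a little more concrete, here is a minimal sketch in Python. It is only an illustration of the general pattern, not the PSIKORS Model or any Hadoop component. The two source systems, their field names, and the rule tables (`FIELD_MAP`, `COUNTRY_SYNONYMS`) are all invented for this example. The point is that the business knowledge lives in small tables that subject matter experts can revise, and the raw data is simply re-run through them.

```python
# Hypothetical raw records from two source systems (invented for illustration).
RAW_CRM = [{"cust_id": "0042", "country": "USA", "revenue": "1,200.50"}]
RAW_BILLING = [{"customer": 42, "ctry": "United States", "rev_usd": 1200.5}]

# Business knowledge kept as data, so it can be revised without code changes.
FIELD_MAP = {
    "crm": {"cust_id": "customer_id", "country": "country",
            "revenue": "revenue_usd"},
    "billing": {"customer": "customer_id", "ctry": "country",
                "rev_usd": "revenue_usd"},
}
COUNTRY_SYNONYMS = {"usa": "US", "united states": "US", "us": "US"}


def normalize(record, source):
    """Apply the current rules to one raw record from the named source."""
    # Rename each source-specific field to the agreed common name.
    out = {FIELD_MAP[source][name]: value for name, value in record.items()}
    # Coerce values to one agreed representation per field.
    out["customer_id"] = int(out["customer_id"])
    country_key = str(out["country"]).strip().lower()
    out["country"] = COUNTRY_SYNONYMS.get(country_key, out["country"])
    out["revenue_usd"] = float(str(out["revenue_usd"]).replace(",", ""))
    # Keep the origin so the result stays auditable.
    out["source"] = source
    return out
```

When a business meeting settles a new rule, say another spelling of a country name, the change is one new entry in `COUNTRY_SYNONYMS` followed by a re-run over the raw data. That re-run is the iterative part, and cheap storage and parallel processing are what keep it affordable.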

Hadoop 2 makes Data Normalization for corporate Small Data a reality. With its inexpensive, innovative, speedy capabilities for ingesting, digesting, modifying, merging, and delivering data, we can now apply Data Science to regular corporate data. The main Hadoop 2 capabilities are distributed storage and parallel processing with little overhead for hardware, software, or personnel. Now we can marry the knowledge of all the people involved, update it in a realistic manner within days as part of normal business meetings and tempo, and synchronize operational data with varied analytical derivative sets: coordinated, correlated, visible, meaningful.

Yet Hadoop 2 lacks certain important features that users will be reluctant to do without. These are a primary Graphical User Interface (GUI); widespread support for hardware and software combinations, so that installation and maintenance are reasonably fast and, most importantly, do not break anything; and a failsafe way to back out of actions and always be able to retrieve corporate data at any time. I am not talking about database functions. I am talking about managing the IT environment and lowering the barrier to using it in existing business and technical groups with only minimal retraining.

While the Hadoop ecosystem can be acquired and installed completely free, mostly from Apache web sites, this will require dedicated and highly expert people. If your organization is willing to support this, then you are in an excellent place. But for widespread use we need an easier way to handle initial installation and especially updates. In addition, if you are like me, then you are juggling eight tasks at once, and getting something done fast cannot involve memorizing arcane command-line terminology that differs across ten independent tools. That was fine in 1983 when there was no choice, but not today. Approaching Hadoop in this manner is akin to using computers prior to Windows 95. Note that Hadoop relies on the underlying Operating System, and this is mostly Linux, which is itself still very rough to get working properly and keep working through updates and additions.

An easier path is made possible by excellent management tools and a predefined and tested integrated distribution. The leader in this area is Cloudera’s current version 5.1 and its Cloudera Manager. Quickly, Hadoop moves from smartly designed computer engineering into usable, powerful, business-oriented IT. Cloudera Manager provides GUI management for all the components and intelligently aligns them, warns about mismatches and missing dependencies, and provides continuous health monitoring of deployed nodes. It effectively moves Hadoop from the pain and suffering of getting Microsoft Windows Server 2000 and SQL Server 2000, or any version of Oracle, actually working to the relative ease that Microsoft has finally reached with its latest versions.

To see how this all works, up close and personal, please join me on Wednesday, Aug. 6 as I demonstrate the peaks and valleys of setting up a Hadoop cluster, paving the way for an Enterprise Data Hub. I’ll be joined by the Bloor Group’s Eric Kavanagh, purveyor of InsideAnalysis.com, as we launch the inaugural project of the Tech Lab, which will serve as a real-world proving ground for enterprise software.


Dr. Geoffrey Malafsky is Founder of the PSIKORS Institute and Phasic Systems, Inc. A former nanotechnology scientist for the US Naval Research Center, he developed the PSIKORS Model for data normalization. He currently works on normalizing data for one of the largest IT infrastructures in the world.
