In a recent Cloudera webinar, “The Future of Data Warehousing: ETL Will Never be the Same”, Dr. Ralph Kimball, data warehousing / business intelligence thought leader and evangelist for dimensional modeling, and Manish Vipani, VP and Chief Architect of Enterprise Architecture at Kaiser Permanente, outlined the benefits of Hadoop for modernizing the ETL “back room” of a data warehouse environment, and beyond.
Since then Dr. Kimball, the team from Kaiser Permanente, and a few friends from Cloudera have taken time to answer many of the over 250 questions asked in the live chat.
In this first of two Q&A posts, we’ll explore the future of Hadoop and the data warehouse, and matters of data modeling and data governance. Enjoy!
- RK = Dr. Ralph Kimball
- KP = The Kaiser Permanente team: Manish Vipani, VP and Chief Architect of Enterprise Architecture; Rajiv Synghal, Chief Architect, Big Data Strategy; and Ramanbir Jaj, Data Architect, Big Data Strategy
Hadoop and the Enterprise Data Warehouse
Q: How does the “Landing Zone” differ from the “Data Lake” concept?
RK: The Landing Zone is a specific area of the DWH with subzones intended for specific purposes and clients. The Data Lake is a term thought up by the media because they don’t know how to build a DWH.
Q: Does it make sense to consider Hadoop as an ETL solution even when all your source data comes from OLTP databases?
RK: Yes. To me the point of the Landing Zone and performing ETL workloads in Hadoop, is to take advantage of the speed, flexibility, and cost advantages of the Hadoop environment. So taking that point of view, you want to put as much ETL work as you can into Hadoop. Even if your data consists of familiar text and numbers in a relational format, Hadoop is a cost effective environment for offloading your ETL transformations from the EDW and then uploading the results when you are done. Besides Hadoop SQL clients including the open source Impala and Hive SQL engines, there are multiple proprietary SQL engines. Also the major ETL tools play effectively in Hadoop. Informatica PowerCenter jobs, for example, look essentially identical in the Hadoop environment, and have deep hooks into the Hadoop environment.
Q: Is the data warehouse also in Hadoop, or is it on an RDBMS?
RK: You can do full traditional data warehousing on Hadoop. Please see our first webinar, “Hadoop 101 for Data Warehouse Professionals. In general the DW can be either in Hadoop or in the conventional EDW, or both. The Landing Zone doesn’t mandate an answer to your question, as it is still the “back room” and data in general is exported to traditional clients like the DWH, wherever it is.
KP: Hadoop is a cost-effective way for offloading ETL transformations from the EDW or OLTP, and then uploading the results when done. Our traditional RDBMS (OLTP systems) will still exist for now.
Q: Big data platforms seem to be focused on analytics, but a traditional DW will still be the source for 90% of reporting. Do you agree?
RK: The traditional DW will always be the main source for reporting off of OLTP databases. Hadoop will play several important roles however, including offloading the reporting purely for the cost advantages, but also for those forms of ETL on non text-number data sources.
Q: In the near future, are we going to see real time reports — as soon as the data is fed into the operational data sources — without need for a data warehouse?
RK: There is no such thing as a free lunch. Schema on Read still requires all the logic for building the “views” that implement fact tables and dimension tables, surrogate keys, slowly changing dimensions, and conformed dimensions. Many data cleaning, data matching, and foreign-key to primary-key structures must still rely on physical lookup tables because the detailed logic is too lengthy to embed in a query by itself.
Q: With sufficient scale in the Landing Zone, should be able to automate a great deal of traditional ETL?
RK: Absolutely. Schema on Read allows the developer to query and process data from original diverse sources without physically rehosting the data. But like any data virtualization, the downside is that Schema on Read may be slow, and that once a virtual ETL pipeline is built and validated, the developer may then re-implement the transformations in the form of physical tables in order to gain the best performance.
Q: Are we at same stage with Hadoop where we were with RDBMS in 90s?
RK: There are some similarities with the RDBMS marketplace of the 90s but some huge differences. The similarities I mentioned in the webinar include end user departments building stovepipes by themselves without IT involvement. But the differences include MUCH more powerful processing power and analytic tools, greatly expanded data types, the ability to handle huge volumes of data, and the ability to stream data into Hadoop at enormous speeds. These last two were really serious limitations in the early days of data warehousing.
Q: What is the future of RDBMS for OLAP?
RK: Relational databases, and hence SQL, will remain the dominant API for business intelligence. The OLAP world is still fragmented by comparison and the original performance advantages of OLAP tools are not so obvious any more. If you mean by OLAP the new non text-and-numbers kinds of data, then I would agree that RDBMSs will not be able to address these new data types.
Q: What forces data export out of HDFS to DM/DW? What capabilities does the Hadoop ecosystem lack today that forces moving data out?
RK: In my opinion, high performance transaction processing has not been a top level goal of the Apache Hadoop project. Where Hadoop complements high performance OLTP is offloading the ETL work from these critical machines.
Q: What is your range of data sources, e.g. structured vs. unstructured?
KP: Currently it is 95% of structured and about 5% is unstructured. We are working on bringing in voice and image data which will change the mix to around 80% structured, and 20% unstructured.
Q: How should we do data modeling in the new “back room”?
RK: The whole point of the Landing Zone is that unchanged raw data can co-exist with highly curated data which is intended to be exported or analyzed by very specific clients. Manish did a great job of describing the various parts of the Landing one that illustrate the flexibility. There is no single mandated data model for the Landing Zone. The metadata description in HCatalog of a source is open to all clients, who are then free to interpret the contents for their own purposes including extracting data elements that an RDBMS does not know how to process.
Q: For warehousing in Hadoop, I keep hearing, “don’t model stars, don’t normalize.” So what is the warehousing approach for Hadoop?
RK: Star schema modeling is still the fundamental API for data warehousing in Hadoop. The subtle point is that Hadoop provides two approaches to exposing this API. The first is through Schema on Read which trades off up front ETL data movement for more computation at query time. The second is the familiar building of high performance (e.g., Parquet) physical tables in advance of actual querying. Either way, the DW BI tools see the same schemas. The first approach is immediate but slow, and the second approach is deferred but fast.
Q: Is this a compromise of traditional DWH principles?
RK: The traditional DWH architecture remains perfectly intact. Fact tables, dimension tables, surrogate keys, Slowly changing dimensions, conformed dimensions. But where we process and build these structures can be be changed, or we can just implement the usual ETL physical transformation steps in a more cost effective way. Please see the first two Cloudera webinars I did last year that try to nail these issues.
Cloudera: See “Building a Hadoop Data Warehouse: Hadoop 101 for EDW Professionals” and “Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals”.
Q: But do we see conformed dimension and facts as being dead?
RK: Absolutely not! Conformed dimensions are the only practical way to implement integration across diverse data sources whether we are in the EDW and Hadoop. This simply means that the dimensional attributes (or equivalently “descriptors”) attached to separate data sources must be drawn from the same domains and have the same underlying semantics. The data sources don’t need to be relational but if you want to tie them together, you need the whole story of conformed dimensions.
Q: If the physical transformations are optional and without doing the cleaning and normalization, how do you integrate data from different enterprise systems in the new “back room”?
RK: One of my favorite questions. We dealt with this issue of normalization extensively in the second Cloudera webinar I did with Eli Collins. Please watch this webinar. But there is a subtle point in the Schema on Read story. You still must do necessary transformation steps to use the data, but in the extreme case, you do it by computing essentially a complex View declaration at query time. When you decide that this is too slow, you revert to traditional ETL processing, such as loading Parquet columnar data targets. Finally one more point. No matter what your data source or what you processing environment (Hadoop or EDW) you still have to maintain a correspondence table between individual source natural keys and your data warehouse surrogate keys. You can’t avoid this step if you expect to integrate data sources into the data warehouse.
Q: At some point in time in the processing of any data one has to know the schema – why be deliberately lazy with schema?
Cloudera: Because it allows data to be loaded at any time, and schemas to be quickly iterated upon.
Q: With schema-on-read methodology, it seems to be up to the data scientist to determine how they wish to consume the data. Yet I’ve been reading a lot of articles where they are now spending 80% or more of their time trying to understand and prepare data, and less than 20% of their time actually doing analytics. What are your thoughts and insights to this?
Cloudera: Strongly typed schemas can be defined by cluster admins or data owners, exposing SQL tables to a broad range of users. This is in fact the most common way data is exposed. In some cases, end users can get involved in defining the schemas, as you suggest.
Q: We’re trying to build star-schemas with Hive, Impala, etc. What will be the challenges?
RK: The challenge for star schemas in a shared nothing environment like Hadoop or Teradata is processing joins. Look to Impala particularly for techniques of making join intensive schemes work well.
Q: How do you add primary key identifiers across a distributed environment for when you want to move from the open access data to the controlled final data?
RK: Assigning unique keys across a profoundly shared-nothing environment is a hard problem. Probably the most straightforward way is to partition the keyspace in advance so that each node has its own key range.
Q: It seems like the materialized Parquet views may help with speed, but how do we control how measures are aggregated as the user moves up a dimension hierarchy, like we can do with traditional cubes and MDX? This would especially seem to be an issue with non-additive and semi-additive facts.
RK: The issues with building Parquet files at an atomic level or at an aggregated level are exactly the same as building such files in a conventional EDW. Parquet is simply the Hadoop equivalent of a columnar data store.
Q: How do you handle SCD Type II in Hadoop?
RK: There are two fundamental steps for handling time variance in dimensions (mainly SCD2). The first is to capture the change in the dimension member, and the second is to append a new record to the dimension with a newly assigned surrogate key. The brute force way to do this in Hadoop has been to write a new dimension table each time you perform this append operation. Until now, the fundamental rule in Hadoop has been that an HDFS file cannot be updated. Taken literally, this makes some steps awkward because an file needed to be rewritten to perform the update. However, Kudu, recently announced by Cloudera, directly addresses this issue. Please study the Kudu announcement to visualize how this aspect of the Landing Zone will be made easier.
Cloudera: Read more about Kudu here.
Q: How do we implement surrogate keys in Hadoop?
RK: The brute force solution in Hadoop until now has been to implement the dimensional surrogate keys in the EDW and export to Hadoop. Then implement a key correspondence table for the fact keys and use it at query time in Hadoop, or else rewrite the fact tables to add the surrogate foreign key. I am excited by the newly announced capabilities of Kudu however, which may possibly change this whole game since in Kudu tables can be modified.
Q: How does this approach address the prevailing problems associated with multiple data silos – e.g. multiple customer files that are similar but not exactly the same?
RK: The fundamental problem of incompatible data sources does not magically disappear with Hadoop. At the root, you still need a well maintained correspondence table between each source and the corresponding master dimension table describing the entity (say Employee or Customer) and you need an ETL step (virtual or physical) that updates this master table with programmed business rules. We aren’t saying that all users are being invited into the back room (remember the hot liquids and sharp knives kitchen metaphor). We are carefully vetting certain advanced users that aren’t being served by the traditional ETL processes and RDBMS targets. The traditional business users may well not see a big difference.
Q: Have there been problems with inconsistencies in final results between, say, a model built by a data scientist, as compared to an executive running a query with simple aggregations, on the same subject?
RK: This is a variation on the theme of incompatible stovepipes. We want the data scientists to explore new approaches and probably get somewhat different looking results. But when the data scientist’s results go mainstream, then the hard work begins where IT and the data scientist need to reconcile the different approaches. This is a serious issue, since senior decision makers lose confidence when these differences can’t be explained.
Q: Do you see any concerns of exploding multitude of versions of the same data created by different users in the “User Defined Space”?
KP: Yes, but data in User Defined space is a scratchpad mostly used for people to play, and is an ungoverned space. Ultimately, things have to go through Raw Zone and Refined Zone under a controlled processes, enabling us to have a single version of data.
Q: What are the criteria for deciding whether a personal data feed should be standardized for general consumption. What are the cost/benefit criteria?
RK: This sounds like the decision as to whether to take a data scientist’s experiment or prototype and put it in the mainstream. It is at that point that IT must be involved at least to certify the data scientist’s application.
Q: With all this data coming in from many places, how do you ensure quality and consistency?
RK: Well of course there is usually no guarantee of consistency at the original sources. No matter where you do the processing (EDW, conventional ETL in Hadoop, schema on read in Hadoop) you can’t avoid the heavy lifting of cleaning and conforming. Having said that, there is a class of sophisticated data consumers (not traditional business users) who insist on access to original “dirty” data. Yes, I know that is frustrating to us DW folks, but these needs are in carefully managed cases legitimate.
Cloudera: Consistency is managed by validation routines and processes run before data is promoted from the Raw Zone to the Refined Zone.
Q: Does Hadoop takes care of transforming and cleansing data?
RK: Hadoop is a general purpose applications environment typically coupled with the Hadoop Distributed File System (HDFS). Data is ingested with tools like Sqoop and Flume, and the data is often stored in raw form. Many programming languages and application development systems are available to analyze and transform the data. Thus Hadoop itself does not transform or cleanse data.
Cloudera: Hadoop itself is a platform. Users may write code or scripts to manage ETL, or use a third-party tool to make the process even easier.
Q: Do you collect, manage and share the metadata around all these disparate data sources to drive reliable analysis?
KP: Yes, the users have information about metadata specific to their use cases. Also they may choose to publish that for other use cases. We are working on some data wrangling and profiling tools to further enrich our data.
RK: One of the strengths of the Hadoop environment is that there is a central metadata source describing file structures, namely Hcatalog. But I think you are tapping into a bigger and more profound issue, which is the totality of metadata, and where that is stored, and whether there are universal formats and standards. This is a huge work in progress in the Hadoop community and at Cloudera. I think I mentioned in the webinar that I will be following these issues and hope to do a webinar in 2016 describing some significant advances, and maybe making some recommendations, too!
Q: How do you you handle testing and data validation, for example checking for bad data?
RK: This has always been a classic data import/ETL issue. Ultimately you need to tie back to the source for record counts and metric totals to do sanity checks on whether you got everything. The basic architecture and requirements for testing and validation remain unchanged, keeping in mind the flexibility we are offering for Landing Zone users from data scientists who don’t want us touching their data to business users who expect the same standards of trustworthiness that they have always had.
KP: We validate routines and processes, and do refinements before ingesting data in the Raw Zone.
Q: How do you handle incremental loads/CDC in your system?
KP: We use Sqoop for data ingestion with a time-stamp column for performing incremental loads.
Q: What is the volume of data handled by Kaiser?
KP: We have a 100-node cluster with 2.5PB for the Raw Zone alone.
Q: Does the data stored in the Raw Zone kept for long-term storage? Or is it transient?
Cloudera: Typically data is not evicted from the Raw Zone — to support reprocessing or inclusion of additional attributes not originally imported into Refined Zone.
Q: It will take some extra cost to keep data in the Raw Zone. Do you suggest any retention policies?
KP: At Kaiser, we have a mandate to keep 7 years worth of data.
Q: Is the landing zone backed up?
KP: We are currently backing up metadata only.
Q: With the new back room on “doors open” approach, what do you do where auditing and compliance are mandatory, and require governance?
KP: We use Cloudera Navigator for data governance: It provides security, auditing, lineage, tagging, and discovery. Also we have enriched our internal processes around data governance and auditing.
Stay tuned for part two, where we’ll look at what it takes to sell the business value of a Hadoop application, tips for managing security in a regulated environment, how to build the right data team, the tools for success, and recommendations for getting started.