In a recent Cloudera webinar, “The Future of Data Warehousing: ETL Will Never be the Same”, Dr. Ralph Kimball, data warehousing / business intelligence thought leader and evangelist for dimensional modeling, and Manish Vipani, VP and Chief Architect of Enterprise Architecture at Kaiser Permanente, outlined the benefits of Hadoop for modernizing the ETL “back room” of a data warehouse environment, and beyond.
Since then Dr. Kimball, the team from Kaiser Permanente, and a few friends from Cloudera have taken time to answer many of the over 250 questions asked in the live chat.
In this second of two Q&A posts, we’ll look at what it takes to sell the business value of a Hadoop application, tips for managing security in a regulated environment, how to build the right data team, the tools for success, and recommendations for getting started. Enjoy!
- RK = Dr. Ralph Kimball
- KP = The Kaiser Permanente team: Manish Vipani, VP and Chief Architect of Enterprise Architecture; Rajiv Synghal, Chief Architect, Big Data Strategy; and Ramanbir Jaj, Data Architect, Big Data Strategy
Business Value of the Landing Zone
Q: For Kaiser, was the move to the Landing Zone led by the business use cases, or was it the amount of data as you mentioned?
KP: It was both.
Q: What is the business benefit with this newer architecture? Why is it faster to do this via a generic platform (like Hadoop) vs. purpose built system?
KP: The benefits are lower cost, improved performance, and having data from disparate systems available in one place, which makes it easy to correlate and analyze. Our Landing Zone provides users with a platform for performing quick POCs and starting to leverage data.
RK: The Landing Zone is a generic platform, with different regions serving different user profiles. It is faster for those qualified clients who are able to immediately ingest the data in the Raw Zone. It also lets us extend SLAs, liberating processing from existing environments at a cost-effective price.
Cloudera: In addition, unlike purpose-built systems, Hadoop’s flexibility enables a diversity of access and analysis on shared data. This has tremendous business value — enabling a broader user community to access and collaborate with data to drive new insights — as well as more IT-centric benefits including lower TCO from an integrated open source platform with common metadata, and reduced risk from managing data security in one place.
Q: Talk to us about cost. Order of magnitude investment levels to transform an organization?
RK: In my experience, these revolutionary investments are mostly made defensively when an organization perceives or fears that their competitors are already doing it.
Q: How would this impact regular data warehouses where data is not in petabytes, or organizations who have many small- to medium-size data sets that, taken together, are big?
RK: The size of the data is not what is interesting about the Landing Zone or Hadoop in general. First it is the variety: data that simply cannot be processed in a relational database. Second it is the velocity and the expectation of immediate access. It has been difficult or impossible to address these requirements in a traditional EDW environment. Cost is also a factor, depending on whether your legacy EDW environment has existing capacity.
Q: What does self-service mean here? Is the business able to work on their own, without IT involvement?
KP: These are design elements for creating a common data platform for all data. It is a 5-7 year journey, with the goal being to enable business users to self deploy, working with IT teams as needed to run analytics and reporting on top of it.
Q: How are you socializing the Landing Zone with business and IT stakeholders?
KP: We don’t talk to them about the Landing Zone directly, but talk about their problems and do a quick POC using this platform to show how they can benefit.
Q: Does this warehouse support medical research? If so, do the researchers also access the data through the same mechanisms of Hive or Pig?
KP: Yes, we support medical research data as of now. The researchers use analytical and reporting tools to access data.
Q: Is this in production?
KP: Yes, this is in production.
Q: Do you have any objective way of measuring success?
KP: Yes, our success is based on how much data is coming into the Landing Zone from source systems, and also on how many use cases we are providing to solve business problems.
Q: How were you able to hide sensitive data such as patient data, and still give unlimited access to your users such as the data scientist? Tools?
KP: Users only have access to data specific to their use case.
Q: Where does personally identifiable information (PII) protection fit in the picture, especially if we open up the warehouse more?
KP: At Kaiser we follow all health care standards and security practices to keep data protected. We do data masking at the source systems. When it comes to the Landing Zone, the data is protected, and access is authenticated and authorized to satisfy compliance requirements.
Cloudera: One of the exciting features of RecordService, our new unified role-based policy enforcement system for the Hadoop ecosystem, is dynamic data masking, which should provide additional flexibility in designing secure analytics environments.
Q: How do you manage data access/security in the Raw Data zone?
KP: We have a separate network with private IP addresses, and Kerberos with Identity and Access Management to manage data access and security.
Q: How does Kaiser encrypt data-at-rest? Is it field level or full disk encryption? Why do you encrypt even non-sensitive information?
KP: We are using Cloudera Navigator for full disk encryption and key management.
Cloudera: Beyond HDFS, full disk encryption can be critical for regulated environments; logs and metadata also exist outside of HDFS and can just as easily contain sensitive information. Cloudera Navigator protects data at the OS/filesystem level.
Q: Are you seeing any performance degradation by encrypting data at rest?
KP: We have not experienced any significant performance impact by enabling encryption.
Q: What are you using for user data authorization?
KP: We are leveraging Kerberos and Apache Sentry.
Building the Team
Q: Describe the user base. What is the mix of power users when the organization has about 20,000 users with some reporting requirement?
KP: We have a requirement to serve a huge business community on the reporting and analytical side; it is a journey for us. Some community users are happy with their existing environment, and some want to adopt this new platform and benefit from it. We are leveraging new tools for our users to build new dashboards.
RK: Looking at the overall landscape of general business users, probably 5 to 10 percent [have advanced skills to use the new environments]. But in some information-intensive companies there may be dozens or hundreds of data scientists, sometimes in unexpected departments, like manufacturing operations.
Q: Are your data scientists skilled in programming and coding (besides being good statisticians and mathematicians)?
KP: Some of them are, yes.
Q: How big is the team that supports the solution?
KP: We have a team of 20+ supporting the Landing Zone environment. We have separate ingest and data refinement teams; an ops team providing a 24/7 onsite and offshore support model; plus additional data scientists specific to use cases.
Q: What is the lag time from user request to Refined Zone implementation?
KP: This is not done by any centralized organization at Kaiser. There are experts who decide on use cases and apply resources to projects. A project may span anywhere from 2 weeks to 18 months.
Q: Are you leveraging any knowledge management as part of the process?
KP: We are maintaining our internal knowledge base for processes and challenges. We use data SMEs to help understand the data and perform transformations, mapping, and cleanup.
Choosing the Right Tools
Q: Can you talk about the tools that you’ve used in building your Landing Zone?
KP: The Landing Zone comprises the entire Cloudera Enterprise platform, including HDFS, a set of compute frameworks (MapReduce, Impala, Spark, etc.), and governance and management tools [Cloudera Navigator and Cloudera Manager, respectively]. We are also evaluating some new tools outside of Cloudera, e.g., Waterline Data for data wrangling and tagging.
Q: Is it safe to assume that transformations are done as we move from Raw to Refined Zones? If so, what tool/technology are you using for those transformations?
KP: Our pattern is to bring data into Hadoop ASAP, and then perform transforms in-place. For transport, we mostly use Sqoop, Flume, and flat files, with home-grown scripts. For transformation, we are mostly using Hive and Impala.
Q: How are you replicating data from Teradata to Hadoop, and how are you keeping that data in sync? How frequently are you doing that?
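To illustrate the land-first, transform-in-place pattern described above, here is a minimal sketch. It is not Kaiser's pipeline: it uses Python's built-in sqlite3 module as a small stand-in for Hive or Impala, and the table names, columns, and sample rows are all hypothetical. The point is only the shape of the pattern: raw rows are loaded untouched, and the refined table is derived from them with SQL on the same platform.

```python
import sqlite3

# Hypothetical sample rows standing in for a file landed in the Raw Zone.
RAW_ROWS = [
    ("1001", " 2015-03-02 ", "72.5"),
    ("1002", "2015-03-02", ""),        # missing reading
    ("1001", " 2015-03-02 ", "72.5"),  # duplicate delivery
]


def build_refined(conn):
    """Land raw rows unchanged, then derive a refined table using SQL only."""
    # Raw zone: every column is kept as text, exactly as it arrived.
    conn.execute(
        "CREATE TABLE raw_readings (sensor_id TEXT, read_date TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO raw_readings VALUES (?, ?, ?)", RAW_ROWS)

    # Refined zone: typed, trimmed, de-duplicated, with empty readings dropped.
    conn.execute(
        """
        CREATE TABLE refined_readings AS
        SELECT DISTINCT
            CAST(sensor_id AS INTEGER) AS sensor_id,
            TRIM(read_date)            AS read_date,
            CAST(value AS REAL)        AS value
        FROM raw_readings
        WHERE TRIM(value) <> ''
        """
    )
    return conn.execute(
        "SELECT sensor_id, read_date, value FROM refined_readings ORDER BY sensor_id"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(build_refined(connection))
```

Because the raw table is never modified, the refined table can be rebuilt at any time if the cleanup rules change.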
KP: We use Sqoop for replicating data from Teradata to Hadoop. We have daily and weekly ingests.
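As a rough sketch of what a scheduled Sqoop ingest might look like, the Python helper below assembles a `sqoop import` command line. The answer above does not say whether the ingests are full reloads or incremental, so the use of Sqoop's `--incremental append` mode here is an assumption, as are the host, database, table, directory, and column names. Credentials and scheduling are left out.

```python
import shlex


def sqoop_import_args(host, database, table, target_dir, check_column, last_value):
    """Build the argument list for an incremental Sqoop import (assumed setup)."""
    return [
        "sqoop", "import",
        # Teradata JDBC URL; host and database are placeholders.
        "--connect", f"jdbc:teradata://{host}/DATABASE={database}",
        "--driver", "com.teradata.jdbc.TeraDriver",
        "--table", table,
        "--target-dir", target_dir,
        # Assumption: only rows whose check column exceeds the last value
        # already imported are fetched on each run.
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]


if __name__ == "__main__":
    args = sqoop_import_args(
        host="td.example.com",          # hypothetical host
        database="sales",               # hypothetical database
        table="orders",                 # hypothetical table
        target_dir="/data/raw/orders",  # hypothetical HDFS path
        check_column="order_id",
        last_value=150000,
    )
    # Print a shell-safe command line; a scheduler would run it daily or weekly.
    print(" ".join(shlex.quote(a) for a in args))
```

A daily or weekly job would record the highest value imported on each run and pass it as `last_value` the next time.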
Q: When you moved queries from an existing data warehouse (using Teradata) to the Landing Zone, how was the performance comparison on the reports? Don’t you lose the advantage of MPP redundancy?
KP: We have seen a 5-9x improvement in the time from data acquisition, through integration, to decision-making and reports. Hadoop does provide redundancy.
Q: How do users access the data in the user defined space? What tool do you suggest for data discovery and exploration?
KP: Users use Hive and Impala.
RK: A whole cottage industry of BI tools works great in the Hadoop environment, particularly when accessing the Impala SQL engine, which is meant for rapid-response ad hoc querying. These BI tools include all the established players like Tableau, QlikView, Cognos, and many others.
Q: Don’t BI tools need a traditional high-speed database? Aren’t we giving up a lot of speed in terms of how fast results are returned, when we move from materialized views/cubes to HDFS?
RK: I disagree! While it is hard to beat a small cube store entirely in memory on a hot processor, Impala in Hadoop is purpose-built to provide extremely high query performance, especially when built on top of Parquet columnar data files. The BI experience benefits greatly from every increase in performance. By all means please look at the demonstrations by ZoomData with BI tools sitting on Impala on Hadoop. Mind-boggling.
Q: What kind of search technology are you using – Elasticsearch or Solr?
Cloudera: Cloudera Search, which is based on Apache Solr Cloud.
Q: Are you using Kafka or related queuing solutions?
KP: Not at the moment, but the Landing Zone is set up to use it in the near future.
Q: What is the primary language you are using at Kaiser: Python, R, or Java?
KP: Java and Python.
Q: How do we start incorporating this new technology with our current EDW?
RK: One path is to offload ETL and/or query processing from the EDW when the EDW (or the OLTP system itself) is oversubscribed. Another path is to build an ETL application, in Spark for example, on nonstandard data that cannot conveniently be loaded or transformed in the conventional environment.
Cloudera: Completely agreed. Many organizations follow a similar maturity model: Start with an operational efficiency use case — saving money on and improving the performance of existing infrastructure by offloading data and/or workloads to Hadoop — while you build your architecture and teams’ skills. You will soon be able to integrate more data, and more kinds of data, with better performance and for lower TCO. Proceed from there to blending new data sets and delivering self-service BI to expand the number of users who can gain value from your data. With data science skills on board, you can begin to use Hadoop’s diverse exploratory and analytic techniques to develop predictive models of customer and channel behavior. Our most mature customers use Hadoop to build data products, often real-time applications that use data to directly impact how the business operates. The key is to start small and prove value quickly, then iterate to success.
Q: What courses and skills can I acquire for a career in this field?
Cloudera: Depending on your focus and interests, there are a wide variety of self-paced online and classroom training opportunities available at cloudera.com.