Q&A with Greg Rahn – The changing Data Warehouse market


Hi Greg, thank you for joining us today. I would like to start off by asking you to tell us about your background and what kicked off your 20-year career in relational database technology?

Greg Rahn: I first got introduced to SQL relational database systems while I was in undergrad. I was a student system administrator for the campus computing group, and at that time they were migrating the campus phone book to a tool that was new to me, known as Oracle. I was part of that migration project, and after undergrad I went on to be a software engineer for a utility company that was using DB2 on the mainframe and migrating to Oracle on Unix. So that's how I got introduced to databases and SQL systems. I then moved from Madison, Wisconsin to San Francisco in 2000 to chase the dotcom dream. I ended up working for a travel company and did database administration there. After having rebuilt their data warehouse, I decided to take a bit more of a pointed role and joined Oracle as a database performance engineer. I spent eight years in the real-world performance group, where I specialized in high-visibility, high-impact data warehousing competitive evaluations and benchmarks.

Greg Rahn: Toward the end of that eight-year stint, I saw this thing coming up called Hadoop and an engine called Hive. It was interesting to me that there were these big internet companies in the valley running this platform, or a variation thereof, based on Google research papers. So I transitioned out of that group and into the Big Data Appliance group at Oracle, but soon realized that if that was what I wanted to keep doing, this up-and-coming company called Cloudera might be a better place to do it, since these new technologies weren't just a hobby at Cloudera. I decided to jump ship in May of 2012 and joined Cloudera. At that point, there was no publicly known project called Impala. It was still a skunkworks project, if you will, and given my database background, I got contacted by Jeff Hammerbacher and some other folks, including Marcel Kornacker, to help out a little bit with the project. And that's how I got into Impala at Cloudera.

Michael Moreno: Nice! That sounds like a long pathway to get here. It's certainly interesting to hear that your background started in the dotcom age, as I actually worked at a dotcom back in 2000. Interesting times.

Let’s talk about big data and Apache Impala. How did the connection between the two come to exist? So many of our readers might not be familiar with Apache Impala. Can you provide some context on how Apache Impala came about?

Greg Rahn: Sure. I think the prominent engine at the time for processing data residing in HDFS was Hive, and Hive was basically a SQL-to-MapReduce translator. It took SQL text and converted it into a big MapReduce job to run on these large clusters. It was very different from the traditional MPP SQL engines from, say, Oracle or Teradata. A number of folks in the database community saw this approach and understood why it was coming about. MPP did have some limitations at the time, because some of these systems running Hive were quite large, but the database community thought that instead of the future being Hive on MapReduce or something similar, the MPP engines could be extended, bent, and changed to operate in a more scalable manner on such large data.

Greg Rahn: So, folks like Mike Stonebraker commented that Hive and this MapReduce approach were already outdated, and that there was better technology the RDBMS community could help drive. I was a strong believer in that approach. Having spent eight years at Oracle working with relational data, I understood that there was a lot to borrow from RDBMS technologies and theory, and a lot of advancements that could be made. And MPP is a much more efficient execution model. That was the same thinking Marcel Kornacker had when he started the Impala project. His background was from Google, working on the F1 query engine, which was kind of the successor, if you will, to some of the early data crunching applications like Hive. F1 was more on the query side. And so Impala was really about taking the experience of these big MPP systems on top of distributed file systems and moving that into an open source project for the world to use.

Michael Moreno: That’s great. Could you elaborate a bit more about the differences between MPP and distributed systems? There may be some people who don’t quite understand that dynamic.

Greg Rahn: Yeah, I think the biggest difference, if you take an MPP-style database like Teradata and compare it to an MPP query engine like Impala that runs on top of a distributed file system, is that Teradata ships its query compiler, query catalog, and execution engine all in the Teradata box. In the Hadoop world, or the big data world, most of these components are separate and modular, yet they interact to form a system that behaves very similarly. For example, Impala uses the Hive Metastore as its data dictionary and operates directly on data in HDFS, which it locates through the Namenode API. So they have the same components: a catalog, query compilation, query execution, and file management and storage, yet they're independent of each other. That's what I think the big difference is. But many of the execution tricks that SQL engines like Teradata use are being implemented in Apache Impala.
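To make that split between catalog, engine, and storage concrete, here is a minimal, hypothetical sketch in Impala SQL. The table name, columns, and HDFS path are invented for illustration: the table definition is registered in the shared Hive Metastore, while the data files stay in HDFS and are read in place by Impala.

```sql
-- Hypothetical example: register existing HDFS data with the shared Hive
-- Metastore so Impala can query it in place (path and columns are made up).
CREATE EXTERNAL TABLE web_logs (
  event_time TIMESTAMP,
  user_id    BIGINT,
  url        STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///data/raw/web_logs';

-- The definition lives in the Metastore (the catalog); the files remain in
-- HDFS, and Impala's execution engine reads them directly.
SELECT COUNT(*) FROM web_logs;
```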

Michael Moreno: Great. How do you see Apache Impala being used today relative to other, maybe competitive, platforms? Where do you see Apache Impala fitting in?

Greg Rahn: Apache Impala is the de facto standard, I think, for fast analytical SQL queries on data in HDFS, or even in an object store like S3 or ADLS. We have customers at Cloudera that are running on tables, and even single partitions of a table, that are larger in scale than anything I saw during my tenure at Oracle, say circa 2004 when I started there. Oracle used to have a group of customers with the largest amounts of data, nicknamed the Oracle Terabyte Club. If you had a terabyte or more of data in your Oracle data warehouse, you were a big customer in 2004. One of the systems I benchmarked in 2004 was 70 terabytes, which was one of the biggest systems I'd ever heard of in the relational database world at that time. Fast forward 14 years, and here we are with customers who have 750-plus terabytes in one single table.

Greg Rahn: It seems astounding that a single table in 2018 holds an order of magnitude more data than an entire company had in its single largest database in 2004. Operating at that scale is not something that traditional relational systems, I think, can handle very well right now. This is why folks are using Apache Impala and the Hadoop platform: they don't have an alternative that runs at that scale. It really does say something about the scale that Apache Impala handles today.

Michael Moreno: What about your average RDBMS administrator? Let’s say they’re not working with large-scale data sets. Is there a place for Apache Impala under those conditions?

Greg Rahn: Yeah, I think there are a couple of things that differentiate the traditional RDBMS from something like Impala over HDFS. Scale is certainly one aspect, but even on smaller data sets, the flexibility of schema on read, and being able to land data immediately no matter what its shape or form, is very attractive. Right now, if you look across the marketplace, there are probably not too many places where people don't say that data is a competitive advantage. Data gives you benefits. It is how you beat your competitors, right? Everyone wants to be a data-driven company: traders, retailers, advertisers, you name it.

Greg Rahn: But operating with speed and agility requires flexibility from a platform. With a traditional relational database, you have to do all the upfront data modeling and have very clean data before it even gets into the tables. Conversely, on a big data platform it's very easy to land data no matter what. There are no restrictions. It's basically just files at first, but at least you can land them in a place where you can operate on them at scale. In the old days, ETL was a separate system and you always published clean data to your database. Generally a tool like Informatica would crunch the data from its source: you would land flat files on an NFS filer, crunch them, and publish the result to a database. Now you can land those same files, but instead of a separate filer with no compute in it, you land them in a distributed file system like HDFS, which is generally co-located with a data processing engine like Impala. Then you can use Impala or Spark to do your ETL/ELT natively.
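As a rough sketch of that ELT pattern, the hypothetical Impala SQL below (table names, columns, and paths are invented) exposes raw delimited files that were landed as-is in HDFS, then publishes a cleaned Parquet copy for analytics, all inside the same platform.

```sql
-- Hypothetical ELT sketch: raw CSV files landed in HDFS, exposed as-is.
CREATE EXTERNAL TABLE raw_orders (
  order_id STRING,
  amount   STRING,   -- still text; nothing was cleaned before landing
  ts       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/landing/orders/';

-- Transform and publish a cleaned, columnar copy in place (the "T" in ELT).
CREATE TABLE orders_clean
STORED AS PARQUET
AS
SELECT CAST(order_id AS BIGINT)        AS order_id,
       CAST(amount   AS DECIMAL(12,2)) AS amount,
       CAST(ts       AS TIMESTAMP)     AS order_ts
FROM raw_orders
WHERE order_id IS NOT NULL;
```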

Greg Rahn: I refer to this as friction-free data landing. Some people might use the data lake term, but ultimately I think it comes down to how fast you can go from landing data to publishing it for queries to run against. A big data platform does this well, and it's a big differentiator compared to the relational database world.

Michael Moreno: Let’s talk about Cloudera and how they work with Apache Impala. As many of our customers already know, Apache Impala is one of the key components of our Modern Data Warehouse offering. Could you give our readers a better understanding of what that actually means relative to open source solutions?

Greg Rahn: At Cloudera, we don't sell individual Apache projects. We sell a big data platform, and that platform has a number of different product SKUs in it. The one people use for data warehousing includes Apache Impala. Cloudera Data Warehouse is basically a collection of the software that enables people to run a database or data warehouse-like platform. Besides Apache Impala, it also includes technologies like Hive, the Hive Metastore data catalog, and a number of other components. The use case is specifically data warehousing and analytic databases: it is powered by Impala on the SQL query side and Hive or Spark on the ETL side. We label it by the use case because calling things by zoo-animal names doesn't have great brand awareness. Ultimately, we focus on giving our customers technology that allows them to solve their business problems and grow their business without locking them into proprietary data models.

Michael Moreno: How do Cloudera customers currently use our Data Warehouse technology?

Greg Rahn: I would say the primary use case is very similar to what folks have been doing in data warehousing for the last several decades, but oftentimes at a bit larger scale. Sometimes it's more around curating a very large number of data sources, which the platform makes easy. Sometimes it's around building applications that need to scale and serve other data-driven applications. It's very much in the data warehousing arena, but I would not call it a direct replacement for a traditional enterprise data warehouse, which is a very sophisticated piece of software with a specific implementation. There are definitely many use cases that Cloudera Data Warehouse can handle quite well, and some that an EDW can handle a bit better today, so in the Venn diagram, so to speak, there is overlap, but I wouldn't say it's a full-blown replacement for every traditional enterprise data warehousing model.

Greg Rahn: But at the extreme end, I don't think most organizations would use a Teradata EDW to do all their ETL, for example, the way they would use, say, Hive on the CDH platform. Nor would they use Teradata to store hundreds of terabytes of archival history data, due to cost. And they might be using something like Spark to train machine learning models over that data. So there are different use cases, and generally they have to do with size and complexity.

Michael Moreno: How do you see customers making that determination between choosing Cloudera’s Data Warehouse and a traditional enterprise data warehouse?

Greg Rahn: Generally, it has to do with the use case. A Fortune 500 company that has had a pillar enterprise data warehouse for years is very unlikely to simply replace it unless they're reaching its limitations, though there are certainly people moving in that direction. Newer use cases with high volumes of data coming in, say from sensors or events in general, tend to land on a platform like Cloudera. It's certainly no longer like 2000, when every startup picked Oracle as the back-end store for whatever site they were building. In 2018 there's a variety of database and data store engines: MongoDB for document stores, key-value stores, and many other data storage layers to consider.

Greg Rahn: There are still relational databases, and there's even more around the Hadoop ecosystem and platform as well. New use cases tend to adopt newer technologies, and some older technologies are fading out. Legacy implementations are generally centered on the enterprise data warehouse, but often there are also data marts created from the EDW. I have seen a lot of data mart use cases migrating to Cloudera's platform, mostly because they can consolidate those data marts into a single platform with virtual marts inside it, eliminating unnecessary data movement; those workloads also tend to be well suited to what Cloudera's platform is capable of today.

Michael Moreno: Now how do you see this playing out with the impact of cloud? Obviously, there are many organizations now storing their data in the cloud. How do you see that changing the data warehouse market and Cloudera’s Data Warehouse in that same context?

Greg Rahn: I think Cloudera and the software that makes up its platform, including Impala, are very well suited for the cloud. In fact, HDFS has had an API and a connector for S3 for several years, so there was very little that needed to be done to get Impala working on top of an object store. Object storage like S3 is very similar to HDFS in many respects, so it was a relatively easy transition. I think it has been harder for legacy technologies that relied on full-stack implementations with local storage managers to adapt to the cloud. What we've seen, and I think even Redshift is a testament to this, is that most of these cloud-based data warehouses are still deployed very much like the on-premises software they came from. They just use virtual machines instead of physical machines, but they still rely on local storage attached to those machines. Running over an object store and separating, or disaggregating, compute and storage is, I think, the future, and Cloudera's platform is already well suited for that and doing it today.

Michael Moreno: Excellent. How would you compare Cloudera versus cloud native competitors like Snowflake?

Greg Rahn: So, I'd say that Snowflake and Cloudera have a very similar approach. Snowflake looked at the cloud and decided to run natively on top of an object store, unlike Redshift, and Cloudera's Data Warehouse runs natively on top of an object store as well. There are other similarities: Snowflake has a single catalog shared amongst different clusters, and Cloudera's Data Warehouse can be deployed in the cloud in a very similar manner, using the shared Hive Metastore as the catalog and running more than one cluster over the same data in S3. The main difference between the two platforms is that one is completely open source, relies on open file formats, and can run both on-premises and on any cloud: on Google Cloud, and on top of object stores like S3 as well as ADLS. So it's very versatile, whereas Snowflake has to port its application to each other cloud vendor. And the big difference, I would say, between what Snowflake offers and what you can do with Cloudera software is that you get essentially the same kind of behavior, but all the data can reside within the customer's own cloud account in open formats, whereas Snowflake requires data to be loaded into their service and stored in a proprietary format. Cloudera Data Warehouse operates directly on Parquet files in a customer's S3 bucket, which means customers can also process those files with open source tools that don't ship from Cloudera: anything that works on Parquet in S3 can operate on this data. So it completely avoids vendor lock-in.
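As a small, hypothetical illustration of that openness (bucket name, table, and columns are all invented), the Impala SQL below queries Parquet files that stay in the customer's own S3 bucket; because the files remain plain Parquet objects, any other Parquet-capable engine pointed at the same location could read them too.

```sql
-- Hypothetical sketch: an Impala table over open Parquet files that live in
-- the customer's own S3 bucket (bucket, path, and columns are made up).
CREATE EXTERNAL TABLE clickstream (
  event_time TIMESTAMP,
  user_id    BIGINT,
  page       STRING
)
STORED AS PARQUET
LOCATION 's3a://example-customer-bucket/warehouse/clickstream/';

-- The data is never loaded into a proprietary format; Impala (or any other
-- Parquet-aware tool) queries the objects where they sit.
SELECT page, COUNT(*) AS views
FROM clickstream
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```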

Michael Moreno: At Cloudera, we have a lot of great partnerships. I believe the last time I looked at the numbers, it was something over 2,600 partnerships. I know that in my role in marketing, I've worked with Tableau, Qlik, Zoomdata, and many other data analytics vendors. Which visual analytics tools do you see dominating the big data analytics space?

Greg Rahn: The traditional player, I would say, is Tableau. A lot of people are migrating their implementations, over whatever data store they have today, to Tableau running against Impala on the Cloudera platform. I've also seen quite a number of new implementations that use Arcadia Data, which I think offers more of a data lake experience in terms of data browsing and visualization, so they're probably one of the up-and-coming choices in new BI tech. As you mentioned, Qlik is in there, Zoomdata is in there, and there are probably a few others I'm forgetting, but I would say those are the more common tools from a BI perspective.

Michael Moreno: What are the key attributes that analysts or data scientists should be looking for in analytic tools?

Greg Rahn: For lack of a better description, tools that are "Hadoop-aware" or Hadoop-optimized tend to provide a better experience with the platform. There are a lot of features the legacy tools rely on that may not exist in today's platforms, and there are use cases that simply cannot exist in a relational world. So look for things like a shared catalog, easy object browsing, and the ability to work with data that is a little less rigidly defined and to iterate quickly. The speed at which one can get up and running, going from raw data to results quickly, is what I would look for in a tool.

Michael Moreno: And so we're coming to the end of our Q&A session. Looking to the future, where do you see things going?

Greg Rahn: I think there's an interesting set of forces at play in a large number of companies. There's definitely velocity toward the cloud, but there are still large amounts of data gravity in the on-prem data center. I don't think anybody would deny that; we clearly see it at Cloudera in terms of our customers. However, some customers, like the one I mentioned with the 750-terabyte table, probably will not move to the cloud anytime soon, but they're really looking to capitalize on some of the deployment options and benefits one gets from the cloud, specifically the separation of compute and storage and the ability to add new workloads independently of storage. Call it on-prem, cloud-style deployments: how does one bring the cloud's advantages to the data center, rather than moving the data center to the cloud?

Greg Rahn: I think some of these very large organizations are still years away from moving to the cloud, and they're going to want some set of functionality on-prem that resembles what the cloud offers today. There are some interesting things coming up there. It will also be quite interesting to see how the community and the data processing world move from virtual machines to, say, containers on Kubernetes or something similar. There is a lot of interest in, and anxiety around, the direction there and what it's going to mean for the future. It's similar to the move we saw years back from bare-metal servers to virtual machines. So it'll be interesting to see what comes about there.

Michael Moreno: Yeah, there's definitely a lot of activity in that space. Docker pretty much lit it up a few years ago, and I think there are so many options available to data engineers and database managers for how they store and access their data. It's probably a bit like a kid in a candy store at this stage, would you say?

Greg Rahn: Oh, definitely. I would also say that the impact of, or the reason to use, any of these technologies is still not widely understood. A lot of times I've heard people ask, "What's your container story?" I pause for a moment and say, "Well, tell me what your container strategy and production deployment look like," and they don't really have an answer yet. But clearly the buzzword is out there, and they think they should be looking at it. So my advice, unsolicited advice, is to think more about the use case you're trying to solve and why any of these technologies may help you do that, versus just running out and trying to adopt a technology because it's at the top of Hacker News.

Michael Moreno: Yeah, that's a very salient point you're making. Something that went through my mind with these containers is that there needs to be a reason. You have your pre-production zones and your production zones, and you test out different scenarios, but there are many ways to do that, including getting a server of your own and just running pre-production workloads on an isolated server.

Greg Rahn: I think the cost of moving early can be a challenge as well. Until a clear leader emerges, it might be better to wait and see how the market goes.

