We introduced Cloudera Impala more than a year ago. It was a good launch for us — it made our platform better in ways that mattered to our customers, and it’s allowed us to win business that was previously unavailable because earlier products simply couldn’t tackle interactive SQL workloads.
As a side effect, though, that launch ignited fierce competition among vendors for SQL market share in the Apache Hadoop ecosystem, with claims and counter-claims flying. Chest-beating on performance abounds (and we like our numbers pretty well), but I want to approach the matter from a different direction here.
I get asked all the time about Cloudera’s decision to develop Impala from the ground up as a new project, rather than improving the existing Apache Hive project. If there’s existing code, the thinking goes, surely it’s best to start there — right?
Well, no. We thought long and hard about it, and we concluded that the best thing to do was to create a new open source project, designed on different principles from Hive. Impala is that system. Our experiences over the last year increase our conviction on that strategy.
Let me walk you through our thinking.
Origins of Pig and Hive
The very earliest versions of Hadoop supported exactly one way of getting at data: MapReduce. The platform was used extensively at Yahoo! and Facebook to store, process and analyze huge amounts of data. Both companies found it tremendously valuable, but both spent a lot of time writing Java code to run under MapReduce to get useful work out of the system.
Both wanted to work faster and to let non-programmers get at their data directly. Yahoo! invented a language of its own, called Apache Pig, for that purpose. Facebook, likely thinking about using existing skills rather than training people with new ones, built the SQL system called Hive.
The two work in a very similar way. A user types a query in either Pig or Hive. A parser reads the query, figures out what the user wants and runs a series of MapReduce jobs that do the work. By using MapReduce, Pig and Hive were simple to build, so that Yahoo!’s and Facebook’s engineers could spend their time on other new features.
Sins of the Past
A common criticism of Hadoop is that MapReduce is a batch data processing engine. It turns out to be a hugely useful batch data processing engine that’s drastically changed the data management industry, but it’s a fair criticism: By design, MapReduce is a heavyweight, high-latency execution framework. Jobs are slow to start and have all kinds of overhead.
Pig and Hive inherit those properties.
Early on, that didn’t matter. No one was pointing their BI tools at Hadoop for analytics and reporting. Yahoo! and Facebook wanted more users to get at their data directly. If those users had to sit for a while staring at their terminal screens while the work happened, it wasn’t that big a deal. Slow queries were better than no queries. Many of those early workloads were, in any case, about data transformation, not query or analytics. Batch was fine for that.
Tools that use ODBC and JDBC to talk to relational databases are available from lots of vendors. Making those tools work against Hadoop is a really big deal — the number of casual business users who can get at data that way is enormous.
We wrote and shipped the first widely-available ODBC driver for Hive in our early years. Big established vendors used it to connect their reporting tools to our platform. The experience of using those tools was excruciating. A user would construct a query and push a button, and would wait… and wait… and wait. Decades of experience had taught people to expect real-time responses from their databases. Hive, built on MapReduce, couldn’t deliver.
In fact, Hive’s performance meant that some of the vendor tools flat out didn’t work. They assumed that the database had crashed when answers didn’t come back promptly, and reported an error to the user.
That led to a lot of angry users.
We knew we had to make SQL a first-class citizen in the Hadoop ecosystem, so we sat down and began working through the alternatives.
There are decades of experience in the database industry in building scale-out distributed query processing engines. It’s an active area for academic and commercial research. Plenty of very good, very fast distributed SQL systems exist. None of them uses a MapReduce-style architecture.
We also looked to Google, which created the Hadoop architecture in the first place. Google had developed a new distributed query processing engine of its own, not based on MapReduce, to get SQL access to its big data.
We elected, after thinking through all the options, to do exactly the same thing. We were optimistic about its performance, so named it Impala.
We hired a leader for the project, Marcel Kornacker, who had worked for a time at Google and who did his graduate work in database systems at UC Berkeley in the 1990s. Marcel assembled a team of seasoned distributed systems and database kernel engineers, and they got to work.
Every node in a Cloudera cluster has Impala code installed, waiting for SQL queries to execute. That code lives right alongside the MapReduce engine on every node — and the Apache HBase engine (not based on MapReduce), and the Search engine (not based on MapReduce), and third-party engines like SAS and Apache Spark that customers may choose to deploy. Those engines all have access to exactly the same data, and users can choose which of them to use, based on the problem they’re trying to solve.
Impala doesn’t have to translate a SQL query into another processing framework, like the map/shuffle/reduce operations on which Hive depends today. As a result, Impala doesn’t suffer the latencies that those operations impose. An Impala query begins to deliver results right away. Because it’s designed from the ground up for SQL query execution, not as a general-purpose distributed processing system like MapReduce, it’s able to deliver much better performance for that particular workload.
Why Not Improve Hive?
Put bluntly: We chose to build Impala because Hive is the wrong architecture for real-time distributed SQL processing. The landscape of parallel SQL databases is densely populated. No traditional relational vendor — IBM, Oracle, Microsoft, ParAccel, Greenplum, Teradata, Netezza, Vertica — uses anything like MapReduce. The next generation of shared-nothing scale-out systems — Google’s F1, notably, but also lesser-known options like NuoDB — all rely on native distributed query processing, not on a MapReduce foundation.
Facebook built Hive on MapReduce early because it was the shortest path to SQL on Hadoop. That was a sensible decision given the requirements of the time. Recently, however, the company announced Presto, its next-generation query processing engine for real-time access to data via SQL. It’s built, like Impala, new, from the ground up, as a distributed query processing engine.
It’s certainly possible to improve Hive, and to tune MapReduce, to speed things up and to reduce latencies. By design, though, Hive will always impose overhead and incur performance penalties that successor systems, built as native distributed SQL engines, avoid.
Tuning MapReduce to run Hive better is a mistake for another reason. MapReduce is excellent at general-purpose batch processing workloads, and working to specialize it for a specific query workload may impose a tax on non-SQL users. Back in the early days, the idea that a single engine would do all the work in Hadoop made sense. Today we see a proliferation of special-purpose engines (HBase, Apache Accumulo, Search, Myrrix, Spark and others). These extend the capabilities of the platform as a whole without compromising the workloads for which each is specifically tuned.
The Impala Community
We worked on Impala for nearly two years at Cloudera before publishing it under the Apache software license in late 2012. We did that because we think real-time SQL on Hadoop is a big deal, and we didn’t want to signal our intentions before we could actually deliver the product. We have a considerable team working on it.
Hive can’t compete on performance with a modern distributed query engine for real-time SQL. There are, though, components in the Hive ecosystem that don’t rely on MapReduce. We have worked hard to integrate those with Impala, and continue to contribute enhancements to Hive as a result of our work.
We collaborate, for example, with the community on HCatalog and the Hive metastore as a metadata repository. HCatalog is central to our Enterprise Data Hub strategy, since sharing data among lots of different engines requires shared metadata that describes it. Our use of the Hive metastore and HCatalog means that Hive, Impala, MapReduce, Pig, and REST applications can share tables.
Twitter, Cloudera and Criteo collaborate on Parquet, a columnar format that lets Impala run analytic database workloads much faster. Contributors are working on integrating Parquet with Cascading, Pig and even Hive.
User- and role-based authentication and access control are crucial for database applications. Cloudera created the Sentry project to define and enforce those security policies for Impala and collaborates with Oracle and others there. Without Sentry, every Hive user gets to see every table and every record that belongs to every other Hive user. We’ve contributed the code to integrate Sentry into Hive but it’s not uniformly included in commercial distributions today.
Who Ships Impala?
Open source is good because it means that enterprise users can work with the company that offers the best mix of software and services at the best price. They’re not locked into a single vendor.
Impala, to succeed, must be shipped and supported by lots of companies.
Naturally, you can get Impala from Cloudera. It’s core to our platform and central to our Enterprise Data Hub strategy. That means it’s available from our resellers, as well. Oracle ships the entire Enterprise Data Hub software suite from Cloudera, including Impala, on its Big Data Appliance. Cloud vendors including the Softlayer subsidiary of IBM, Verizon Business Systems, CenturyLink Savvis, T-Systems and more deliver “Enterprise Data Hub as a Service” in their managed clouds, Impala included.
Amazon delivers its own big data service called Elastic MapReduce, or EMR. Just last week, Amazon announced availability of Impala — the third engine included as part of its EMR offering, along with MapReduce and HBase. Amazon’s endorsement was really exciting for us. It’s a clear vote of confidence in our architecture, and a huge new collection of potential users. That, in turn, means that more applications and tools vendors will integrate with Impala.
Most interestingly, in my view, is the fact that a head-to-head Cloudera competitor now offers Impala to its customers. MapR customers can use Impala on the company’s commercial product.
For more than five years, Cloudera has been building and shipping new open source code in the Hadoop ecosystem. We have contributors and committers working across a huge range of projects. We’ve created projects — Apache Flume, Apache Sqoop, Hue, Apache Bigtop — that have been picked up by our competitors as part of their offerings.
Impala is merely one more example on that list, and just one more piece of evidence that our commitment to the open source platform is deep and real. No lip service here — we’ve been living up to those words since our very first days. The fact that you can get Impala today from more than eight different vendors, including both our partners and our competitors, absolutely validates our open source strategy.
Why So Many SQL Engines?
Survey the landscape and you’ll see that there are at least as many SQL engines running on Hadoop as there are vendors in the market. Cloudera offers Impala. IBM likes its BigSQL offering. Pivotal is clearly committed to HAWQ. Hortonworks has declared its allegiance to Hive. Hadapt, an early entrant, has expanded its ambitions to take on NoSQL workloads as well, but continues to deliver SQL in the Hadoop framework. And so on.
This is the same dynamic that has driven the established relational database players for several decades. Everyone implements a standard language specification, and all the vendors compete on performance and extended services. There is far too much existing product in the market to collapse those offerings into a single one. Frankly, there’s too much money at stake to drive shared investment.
There’s good news, though. Users are insulated from difficulties because the applications that talk to those engines over SQL, JDBC and ODBC are portable. Cloudera believes that open source is fundamental, and that Impala will win. We nevertheless recognize that the language standard forces us to compete on a mix of power and performance combined with the best ecosystem of third-party products that integrate with our product. That competitive dynamic is the one that is best for our customers and the market at large.
Whither Pig and Hive?
Pig, the Yahoo!-created dataflow language, has no competition from existing vendors. The language is powerful, but its user base is too small to attract the attention of big vendors with alternative implementations. There may be better ways than MapReduce to run Pig queries, but based on the workloads it supports (mostly transformations), that doesn’t matter; the software is plenty good in its current form, and the Pig community continues to enhance it.
Hive, on the other hand, is SQL, and SQL is intergalactic dataspeak. Everyone knows it. Everybody expects it to be fast. Lots of companies support it.
Hive is, therefore, sure to be under fierce competitive pressure. As more big vendors enter the market, that gets steadily worse. We don’t believe that the implementation can keep up with the demands of high-performance interactive query and analytics. A horse can only go so fast, no matter how hard you whip it. Users, and the tool and application vendors that serve them, will migrate to better offerings.
Hive handles data processing jobs just fine, and it’s used widely in the Cloudera customer base for just that purpose. It has, after all, been part of Cloudera’s commercial offering for the past five years. Because it’s got such presence in the installed base, we’ll continue to support it. We’re especially keen to drive areas of overlap with Impala — witness the Hive integration of Sentry and Parquet, described above.
But we must have a first-class, real-time, open source SQL engine in our Enterprise Data Hub. Our forward bet for data analysis, real-time query and high-performance SQL applications is solidly on Impala. We may, of course, be biased, but we’re convinced that it’s already the very best SQL offering available in the Hadoop ecosystem. We’re very pleased at its uptake by customers and by other players in the market. We are excited about the services and features that will roll out in the next several releases.