A week or two ago, I talked about the current and future state of the Apache Hadoop ecosystem, and about several key initiatives here at Cloudera to improve the platform and embrace new big data workloads. We devoted a good deal of time at the end of the session to Q&A with attendees, but there were many more questions than we had time to answer during the hour.
Given that many of the questions concentrate on common themes, I thought I’d use the blog to expand on the original session, and answer more of those questions here.
Just about two years ago, we predicted that Apache Spark would supplant MapReduce as the general-purpose processing engine in the Hadoop ecosystem. That doesn’t mean that MapReduce will disappear — we still have a lot of customers who use it in existing applications every day — but Spark is so much easier to program, and so much more flexible, that we expect most new workloads to start on Spark rather than on the older MapReduce framework.
That’s now clearly true, and I gave some examples of how and where it’s happening in the session. There were a lot of follow-on questions asking for more detail.
Q: Other than MapReduce, will Spark replace other tools in the ecosystem?
There are a bunch of special-purpose tools in the ecosystem: dedicated SQL systems like Impala, search services like Apache Solr running as SolrCloud, and so on. I don’t think those are going to be replaced by Spark. All of these engines can share the same data because they all run in the Hadoop ecosystem, on a shared storage layer.
What we see in practice in enterprise software — and this has been true for decades — is that customers want a collection of tools, not just a hammer. General-purpose engines like MapReduce and Spark give you flexibility and adaptability. They’re hammers and can whack all kinds of things. When there’s a particular job you do a lot, though, you care about performance and ease of use. For those high-value workloads, we can pretty much always build a special-purpose system that is faster, more stable, and more powerful than anything we could adapt from a general-purpose one. For example, our Impala query engine is natively designed to run SQL in a massively-parallel way. We get to make a lot of special-purpose optimizations because we don’t have to worry about non-SQL workloads. It beats the competition handily for analytic SQL jobs.
Prior to the advent of Spark, MapReduce was really the only general-purpose engine to have emerged in the Hadoop ecosystem, so it’s the only candidate I see for the direct replacement I described during the webinar.
Q: How will the Spark ecosystem deal with high concurrency?
The simple answer is that traditional engineering work will make Spark incrementally better over future releases. There are a few easy things to do: whenever there’s a centralized component of Spark that handles scale-out workloads on behalf of multiple users, we need to be sure that it can handle thousands of machines with lots of cores, and that it can keep track of the user population efficiently. We need to pay attention to both throughput and latency.
The Spark runtime code on each of the nodes in a cluster will also improve steadily. The global development community will analyze code paths, look for contention, and generally improve the code to make it run faster and use resources more efficiently. We’re excited about some of the work we’re doing with Intel to optimize Spark for the current and next generation of compute and memory systems.
There’s a big opportunity, as well, to improve not just Spark’s behavior, but the behavior of analytic jobs that run under Spark. Spark executes user code on each of the nodes on which it runs — that’s because it’s a distributed processing framework, and the user code is where the work actually gets done. That’s also true for MapReduce, by the way.
So if users are writing code that is memory-hungry, high-contention, or generally inefficient, overall cluster performance can fall off pretty fast. There are common libraries used widely by application developers — MLlib is an excellent example — where we can concentrate on efficiency and multi-tenancy, and make life much better.
Those efforts are all part of the One Platform Initiative that I described in the session.
Q: How can Spark be used to develop a unified Internet of Things platform? Does Cloudera Enterprise address this?
Spark is an important tool for IoT workloads. There are two areas in particular where it can help:
- Streaming data ingest. Along with other components, like Apache Kafka, Spark (and in particular Spark Streaming) is a good way to capture data, clean it, and do simple window analytics on it as it arrives. MapReduce could never have been used in that way due to its high latencies.
- Ongoing, interactive analytics of IoT data. If you’re catching sensor data, or user activity, you may want to run up-to-the-minute analyses of what’s happening. Spark, like Impala, can run those large-scale analyses quickly and well.
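To make the windowed-analytics idea concrete, here is a minimal pure-Python sketch of a tumbling-window average over timestamped sensor readings. This is only a conceptual illustration of what a Spark Streaming window operation does; the real thing distributes the work across a cluster via Spark’s own APIs, and the function name and data below are hypothetical.

```python
from collections import defaultdict

def tumbling_window_averages(readings, window_seconds):
    """Group (timestamp, value) sensor readings into fixed-size time
    windows and compute the average per window -- conceptually what a
    Spark Streaming window operation computes at cluster scale."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Assign each reading to the window whose range contains its timestamp.
        buckets[ts - (ts % window_seconds)].append(value)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}

# Simulated sensor feed: (epoch seconds, temperature).
feed = [(100, 20.0), (104, 22.0), (111, 30.0), (118, 26.0)]
print(tumbling_window_averages(feed, 10))
# One average per 10-second window (windows starting at 100 and 110)
```

In a real Spark Streaming job the equivalent computation would run continuously over micro-batches as data arrives from a source like Kafka, rather than over an in-memory list.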
It’s important to remember, though, that Spark is just one of the many tools you’re likely to need to build these applications. I’ve already mentioned Kafka and Impala as others that we see used by customers for IoT workloads today. All of these need to integrate with an underlying storage layer, run securely, and handle multi-tenancy well. Those platform capabilities are bigger than just the core Spark project.
Cloudera Enterprise bundles all of these components, with a single consistent security model, data governance framework, and management suite.
Q: How would you compare Hive, Impala, and Spark SQL?
This isn’t really a Spark question per se, but I get asked it often enough that I figured I should answer it here.
Recall that Apache Hive was originally built at Facebook because the company had a bunch of business analysts who knew SQL and wanted big data. The Facebook development team wrote a SQL parser that turned the queries into a collection of jobs that ran on MapReduce. This was good — those business people got their SQL — but it inherited all of the problems of MapReduce. Most especially, it was really slow.
MapReduce and Spark are both general-purpose execution engines, so if Spark is going to replace MapReduce, you’d expect that Hive could be ported to run on Spark instead. You’d be right. That work is already well underway.
The Hive query planner is still pretty limited, and doesn’t handle complex queries with lots of joins and sophisticated predicates well. It does a fantastic job of running simple queries that do data transformation, but it’s really not good at analytic query or interactive BI workloads. As a result, we see it used extensively in production for extract-transform-load (ETL) jobs, but not for much else.
We expect Hive to continue to serve that use case well. It’s been part of our platform from our very first release, and we’ll continue to ship and support it for the long term. I do expect that Hive on Spark will supplant Hive on MapReduce, even for most current production workloads (also true for Apache Pig, by the way). If you love Hive, though, you can rest assured you’ll be able to keep using it.
Impala, by contrast, was designed from the beginning to handle high-concurrency analytic query processing workloads. It takes advantage of decades of research into massively parallel database query processing systems and borrows from publications by Google and others in building those systems for enormous scale. The community will continue to drive concurrency and performance, but it’s already true that if you’re doing high-performance analytic SQL on Hadoop, Impala smokes every alternative. We expect that to stay true.
Like Hive (and very much like Hive-on-Spark!), Spark SQL is query processing on a general-purpose engine, not a dedicated database engine, so we don’t expect it to compete with Impala on performance or concurrency. Where it does shine is for developers building Spark applications who want to run SQL queries against their data from directly inside those apps. Like embedded SQL in programming languages such as C++ and Java, embedded SQL for Spark apps makes it very easy to fetch records and bind them to program variables.
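To show what the embedded-SQL pattern looks like in general, here is a sketch using Python’s built-in sqlite3 module as a stand-in for an in-app SQL interface. Spark’s actual API (a SQLContext over DataFrames) differs in its details, and the table and data below are made up for the example.

```python
import sqlite3

# An in-process database standing in for data registered with a Spark
# SQL context. The application creates it, loads it, and queries it,
# all from inside the same program.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 7), ("alice", 2)])

# Query results bind directly to program variables -- no external
# client or separate query tool involved.
for user, total in conn.execute(
        "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"):
    print(user, total)
```

The appeal for Spark developers is the same as in this toy: SQL results land straight in the variables of the surrounding application code.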
There is, no question, overlap among these three, but they each have special properties that the others lack. While it can be a little confusing to have three choices in the platform, it’s one of the benefits of an open source ecosystem that different options with different design goals can coexist so easily.
I spoke at some length about Kudu, the new storage option we announced in September. Kudu was designed to fill an important gap in Hadoop storage: HDFS is great for storing and processing big files and log data, but can’t handle updates or random access well. Apache HBase is excellent at random access to single records for NoSQL workloads, and can handle updates, but it doesn’t handle common analytic access patterns, like scanning records in order, well.
Kudu, by contrast, is designed to handle updates and support scans — the kind of operation you want if you’re looking over sensor data by time, stock trades by ticker, and so on. We think of this as “fast analytics on fast data” — good performance for the analytics, and the ability to land new data, and update old data, quickly.
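A rough sketch of why key-ordered storage makes those scans cheap, in plain Python with hypothetical data: binary search finds the edges of a time range, so only the matching slice of rows is ever touched, instead of the whole table.

```python
import bisect

def scan_range(sorted_rows, start_ts, end_ts):
    """Return rows with start_ts <= timestamp < end_ts from a list of
    (timestamp, value) rows kept sorted by timestamp. Binary search
    locates the range edges, so the cost is proportional to the size
    of the result, not the size of the table -- conceptually what a
    key-ordered store like Kudu enables for range scans."""
    keys = [ts for ts, _ in sorted_rows]
    lo = bisect.bisect_left(keys, start_ts)
    hi = bisect.bisect_left(keys, end_ts)
    return sorted_rows[lo:hi]

# Sensor readings stored in timestamp order: (epoch seconds, value).
rows = [(100, 20.0), (105, 21.5), (110, 30.0), (115, 26.0), (120, 19.0)]
print(scan_range(rows, 105, 116))
# Only the rows at timestamps 105, 110, and 115
```

New readings appended in key order keep the structure scannable, which is the “land new data quickly, analyze it quickly” property described above.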
Q: Can you give some examples of fast data and when you might use it?
I’ll let one of our partners, Zoomdata, do that. The Zoomdata team built a demo of real-time data ingest, and online query of that data, using Impala running on Kudu. You can watch the YouTube video here. It’s a great demonstration of how Hadoop has grown up to handle the kind of demanding real-time workloads that were impossible to support just a few years ago.
Q: Does Kudu support CRUD operations or transaction management?
For those of you who didn’t grow up in the database world, “CRUD” means “create, read, update, delete.” The answer, for CRUD, is yes — it’s one of the key shortcomings of HDFS that we wanted to eliminate. You can add data, fetch it quickly, change it, and delete it. The store is mutable in all the ways you would expect tables to be.
Transactions are trickier. General-purpose ACID (atomic, consistent, isolated, durable) semantics are very tough to support in large-scale distributed systems. There’s no built-in support in the storage layer — in any of HDFS, HBase, or Kudu — for multi-statement updates across different machines with ACID guarantees. There are companies that offer those services, like Esgyn and Splice Machine, by building additional locking, logging, and recovery features to complement the storage layer.
Kudu offers some limited atomicity guarantees for single-node updates, and the recovery and replication features of Hadoop storage generally provide some durability assurance. On its own, though, Kudu is not a transactional storage engine.
Q: Does Impala work with Kudu? Can Kudu be used instead of Impala?
Yes, you can use Impala to query data stored in Kudu — see the Zoomdata demo I talked about above. But Kudu isn’t a replacement for Impala. Impala is the massively-parallel distributed query processing engine; Kudu is a storage layer. You can choose to store your tables in any of HDFS, HBase, or Kudu, depending on the kind of analytic queries you expect to run.
Q: Will Kudu replace HDFS or HBase?
No. Each of the three is aimed at a specific data access pattern and the workloads that need that flavor of performance. Together, they offer developers and analysts a broad range of reliable, high-performance storage choices. That allows Hadoop to handle more different big data workloads than ever before.
In case you missed any part of the webinar, the full recording is now available.