A simple guide to understanding SQL-on-Hadoop choices
When it comes to SQL-on-Hadoop, it is easy to feel overwhelmed with the number of choices available in the ecosystem. However, in reality, choosing the right tool to best help you tackle the job at hand can be simple if you know what you are planning to achieve, which in most cases is typically apparent from your job title or team function. There are three main classes of SQL-on-Hadoop tools available – ETL and Data Preparation Tools, Analytic Databases, and Data Engineering Tools. Within the Hadoop ecosystem, the prominent choices that emerge for each of these are Apache Hive, Apache Impala (incubating), and Spark SQL, respectively.
Here is a quick summary:
Now, let’s look at each of these in more detail.
ETL and Data Preparation Tool: Hive
Hive was designed to make MapReduce batch processing accessible to a wider group of users via SQL-like language – HiveQL and consequently, Hive is best-suited for the same use cases as MapReduce. Hive-on-Spark provides the same great Hive capabilities, using the next-generation MR-style execution engine of Apache Spark instead of MapReduce for incrementally faster processing (Hive-on-Tez provides similar capabilities but using Tez instead of Spark). Compared to MapReduce, Spark provides easier development with a rich set of APIs in Java, Scala and Python to make use of Spark’s DAG engine, which delivers faster processing by in-memory caching of data between iterations.
Hive’s ability to run scheduled batch jobs with massive ETL sorts with joins makes it one of the most widely used data preparation tool for Hadoop and most data served to BI users in Impala today is prepared by ETL developers using Hive.
Data Engineering Tool: Spark and Spark SQL
Spark is a powerful data engineering framework with a rich a rich set of developer APIs from core Spark primitives, SQL APIs, machine learning algorithms with MLlib, streaming APIs, Python/R interfaces with Spark-R and PySpark etc.
At its core Spark is a next generation MR-style system where instead of being confined to simple Map and Reduce operators, you can develop a more robust set of Spark operators using RDDs as operator interchanges instead of a forced Shuffle stage. SparkSQL, like the rest of Spark, translates SQL into underlying stages of Spark operations on RDDs. This design easily allows the intermixing of SQL with other Spark APIs so data engineers can use SQL for common data preparation tasks like aggregations, joins, and filtered as part of a broader Spark application that might use core Spark APIs and then feed into a machine learning algorithm from MLib.
For example, let us consider iteratively building logistic regression over a data-set using Spark and compare it with MapReduce. With Spark, you can embed SQL within your Spark app written in Scala or Java and use MLlib to build the regression model. As seen in the graph below, Spark performs much better by caching data in memory between iterations compared to MapReduce.
In a recent announcement about The One Platform Initiative, we outlined our Spark roadmap doubling down on security, scale, management, and streaming. With its developer friendliness and much better performance, Spark is set to supplant MapReduce. Many of our customers like AllState and Cox Automotive have already leveraged Spark to boost their data processing pipelines.
Analytic Database: Impala
Impala is modern MPP query engine that’s purpose-built for Hadoop to provide BI and SQL analytics at interactive latencies. Analytic databases like Impala are used by data analysts and BI users, typically in conjunction with tools like Tableau, Zoomdata, or Microstrategy. For these users, there’s a HUGE advantage of clicking on a report or visualization and getting a response in seconds, versus having to wait minutes. Thus, interactivity is critical and must be maintained as these tools scale to hundreds or even thousands of users.
In a recent performance metric, we compared Hive, Impala, and Spark SQL with analytic workloads. Based on the discussion we’ve had here, it should not come as a surprise that Impala — designed for analytic workloads — performs best where interactive query latencies over multiple users is critical.
Not only is Impala the only one of the SQL-on-Hadoop engines that delivers responses in seconds compared to minutes, but also Impala’s performance is better when compared to traditional DBMS and analytic databases. Many of our customers such as Quaero and Epsilon have benefited first hand from Impala’s performance lead when they compared it to traditional systems such as Netezza. Like we have discussed in the past, performance is an ongoing priority for Impala and, the recent release of Impala 2.5 sees 4x better performance compared to the previous version used in the benchmark above. With multi-core joins and aggregates and a variety of other performance enhancements coming up, we expect Impala’s performance lead to widen further on analytic workloads.
As you can see, Hive, Impala, and Spark SQL all provide best-of-breed, but different, functionalities and all serve an important purpose within a complete Hadoop platform to address multiple workloads. No matter what the task or role – be it data preparation, BI, or data science – there is the right tool for you in the ecosystem. In fact many of our customers have different groups of users using all three technologies. Today we find most of the data served up for analytics by Impala is prepared by Hive and 80% of our Spark customers also use Impala to serve up their data to their analysts while their data engineers and data scientistic use Spark on this common data for statistical analysis.
To learn more about these technologies, check out the webinar “SQL-on-Hadoop Technologies in Cloudera”