Open Standards in Apache Hadoop: Impala

This blog was penned by the following Clouderans: Alex Gutow, Justin Erickson and Justin Kestelyn.

Continuing the series on balancing stability and innovation through open standards, let’s take a look at Impala, the open source standard analytic database for Apache Hadoop.

It seems like a long time ago that Apache Hadoop was considered strictly “batch.” When Impala was introduced into the open source ecosystem in late 2012, the game changed: for the first time, business intelligence-style exploratory data analysis on Hadoop became not only possible but practical, with the performance expected from traditional analytic databases. This helped solidify Hadoop’s place as a flexible, multi-access platform, and the community quickly took notice.

Since that time, Impala (now in its 2.x release) has made an impact across the ecosystem, with:

  • More than 1 million standalone downloads (and millions more as a part of CDH and other Hadoop distributions)

  • Product support from multiple distributions, including AWS, Cloudera, MapR, and Oracle (with portions included in the IBM Big SQL code base)

  • Active community involvement, including hundreds of conference and meetup talks and three books on the subject

  • Development of multiple client frameworks from the broader community (including Java, Ruby, Python, and R)

  • Integrations from all the leading business intelligence tools

Impala has developed into a true open standard in the Big Data ecosystem – a fully open source tool with the quality, innovation, and support necessary for any production environment. As an integrated part of Cloudera’s platform, it has also enabled the foundation (along with other open standards such as Apache Spark, Apache Kafka, Apache Parquet, and Apache HBase) of the next-generation analytic workloads that are proliferating before our eyes. Thanks to this status as a de facto standard, Impala will continue to be a critical building block in mainstream, long-term, open architectures.

As with any open standard, integration and ecosystem compatibility are critical, and Impala boasts an impressive list of partners used across nearly all businesses. We asked some of these partners for their views on Impala and where it’s going in the future.

Why has Impala become the open standard for analytic databases on Hadoop?

AtScale:

Impala has changed the game for SQL queries on Hadoop, bringing MPP-scale query performance directly to data stored in Hadoop. By combining Impala with AtScale’s VROLAP (Virtual ROLAP) architecture, enterprises can now bring production Business Intelligence workloads directly to their enterprise data hub – eliminating costly and time-consuming data movement and vastly reducing BI system complexity. Impala clearly satisfies the requirements of a BI-enabling SQL-on-Hadoop engine: it works on Big Data, is fast on Small Data, and is stable for Many Users.

Cloudera:

Impala uniquely unlocks direct business intelligence and analytics on Hadoop through its architecture, built from the ground up, that combines state-of-the-art MPP database technology with the flexibility of the Hadoop ecosystem. Its leadership in multi-user performance, SQL compatibility, and enterprise-wide usability has enabled integration from all the leading BI tools and motivated broad customer and multi-vendor adoption of Impala as the open standard for analytic SQL.

Qlik:

Impala’s open standard, with its established enterprise capabilities, is driving remarkably rapid adoption within the marketplace and among Qlik customers. Qlik business users can bring insights and clarity to where they’re needed the most: the point of decision.

MicroStrategy:

HDFS has undeniably become the preferred technology for storing large volumes of multi-structured data. More recently our customers are looking to utilize Hadoop not only as a low-cost data store but also as a data warehousing platform. Impala provides an excellent analytical database on top of Hadoop in support of this growing use case. Certified to work with MicroStrategy Enterprise Analytics, it provides our customers with an enterprise-strength solution for running analytic workloads against data in Hadoop.

SAS:

There is no question that Impala’s scalability and interactive query performance are the reasons it has become a choice for many analytic workloads on Hadoop. SAS has long recognized the importance of SQL and of accessing data on Hadoop, and will be expanding the integration capabilities of SAS and Impala.

Tableau:

Impala has become the open standard for analytic databases on Hadoop because of its performance, flexibility, and popularity. It is the fastest SQL-on-Hadoop engine and it is the first that truly enables interactive analysis of massive data sets. This is important because as data volumes explode and data insights become the new lifeblood for organizations, analysts need a lens into the rich data stored in Hadoop.

For Tableau, the ability to interactively query data directly on Hadoop is paramount to realizing our mission of “helping people see and understand data.” Impala opens up the deep data stored in Hadoop – data that wouldn’t have been cost effective to keep in a data warehouse. Impala’s speed is important because nobody likes to do things slowly; a good user experience is necessary for facilitating a conversation with your data. Additionally, the ability to query Hadoop directly also means faster time-to-value and access to near real-time data. Data is now available for querying as soon as it lands in Hadoop in its rawest form and without the requirement of moving it into a data warehouse. That’s where Impala’s flexibility comes into play – it enables users to query multiple data types and eliminates the need for data migration, conversion or duplication in many use cases.

Zoomdata:

Impala is emerging as a new standard for analytics databases because it delivers critical functionality that enables products like Zoomdata to deliver Big Data Analytics in a streaming architecture. Zoomdata with Impala lets users visualize enormous volumes of data stored in their Hadoop/HDFS cluster in real-time and without ETL. Instead of needlessly moving data around to enable analytics, with Impala we can now bring the processing to the data and move beyond the batch-oriented architectures of the past.

What’s next for your organization and Impala?

Cloudera:

Despite the already impressive traction thus far with BI on Hadoop, we are just at the tip of the iceberg of what will be possible in the near future. Our customers are pushing Impala into ever higher concurrency, node scalability, and enterprise business-criticality so we are continuing to drive improvements in those areas. We are also working to add even greater flexibility and usability with features like nested structures, updatability, active data optimization, and more integration opportunities with our partner ecosystem. Despite Impala’s large performance lead, this is just the beginning of the multi-user performance of Impala and we are excited about the concurrency and performance initiatives we’re working on together with Intel.

MicroStrategy:

As our customers continue to invest in both technologies, we plan to offer even tighter integration with Impala, enabling users to leverage the full extent of MicroStrategy’s comprehensive suite of analytic capabilities for data discovery, enterprise reporting, and mobile applications, in a Hadoop environment of limitless scalability and unrivaled cost-effectiveness.

Qlik:

At Qlik, innovation = growth, and this is key to the success of our customers. Developing deeper integrations with open standards is a key element of that strategy. Therefore, we will continue to optimize and take advantage of the key capabilities of Impala, as they are readied for the market.

SAS:

Since 2014, the SAS/ACCESS Interface to Impala has allowed SAS users to leverage the power of both SAS and Impala. In our next release, SAS will enhance this integration by adding more in-cluster capabilities to further take advantage of the processing power and performance capabilities of Impala.

Tableau:

Today, Impala is critical to providing our customers with the ability to visualize data on Hadoop and facilitate real-time conversations with their data. Our customers are always seeking to retain more data and roll out their Hadoop deployments to more users. For us, that primarily means pushing the boundaries on performance and seeking ways to make the data discovery experience more frictionless and delightful. We’re very excited to see the continued investments in Impala’s SQL functionality, handling of nested data and support for other data types, and we’re excited to expose that functionality to our users.

Zoomdata:

Zoomdata is working very closely with our customers to better exploit Impala’s capabilities, and deliver sub-second response times on billions of rows of data. We also continue to optimize our unique micro-query architecture to distribute query processing scalably and efficiently. Impala is a critical component in our native support of a big data analytics architecture.

To learn about how to contribute to Impala, read this Cloudera Engineering post.

For more details on Cloudera’s view of open source and open standards, check out “Cloudera’s Commitment to Open Source and Open Standards,” and be on the lookout for the next blog examining the open standards for real-time streaming architectures.
