Apache Spark, Cloudera Search, Impala — Which is best for Analytics?


This blog was penned by Justin Langseth, the CEO of Zoomdata

As the leading Big Data Visualization, Exploration and Analytics platform, Zoomdata has been designed to take advantage of the many advanced features in next-generation data stores such as Cloudera Impala, Cloudera Search and Apache Spark. In addition, rather than moving data from these data stores into a proprietary data environment for reporting, Zoomdata executes its queries directly in these data stores. As a result, Zoomdata’s customers are able to analyze terabytes and petabytes of data in a matter of seconds.

1. Direct Analysis of Raw HDFS Data

There is a lot to be said for analyzing raw data directly. Simply store it in HDFS, the Hadoop Distributed File System, and sort out the structure later. However, the downside of this “schema-on-read” flexibility is latency. With MapReduce, a query can take anywhere from many minutes to hours, depending on the query and the size of the dataset. While there have been efforts to speed up MapReduce, its batch-oriented design means it will always suffer from higher latency than newer query frameworks.
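To make “schema-on-read” concrete, here is a minimal sketch in plain Python rather than MapReduce. The raw lines, the field names (`store`, `amount`) and the helper functions are all hypothetical; the point is only that structure is imposed when the data is read, not when it is written.

```python
import json

# Hypothetical raw lines as they might land in HDFS: nothing enforced a
# schema when they were written, so fields and types vary from line to line.
RAW_LINES = [
    '{"store": "NYC", "amount": "19.5", "ts": "2014-06-01"}',
    '{"store": "SF", "amount": 5, "channel": "web"}',
    'not json at all',
    '{"store": "NYC", "amount": "30.5"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: parse, coerce types, skip bad rows."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete rows are dropped at query time.
            continue


def total_by_store(lines):
    """Sum the amount per store, deciding the schema only now."""
    schema = {"store": str, "amount": float}
    totals = {}
    for row in read_with_schema(lines, schema):
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals
```

Every query pays the parsing and validation cost again, which is one reason this approach is flexible but slow.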

2. Analytic SQL with Impala

Impala is the open source analytic database that runs natively in Hadoop. It gives analysts and business users alike fast, interactive SQL queries for business intelligence and data discovery.

Zoomdata integrated with Impala early on, and the results were dramatic. The video below shows Zoomdata using the micro-query sharpening approach on Impala to analyze a billion rows of sales transaction data nearly instantly. Not only is Impala much faster than direct analysis of raw HDFS data, we believe it is the leading way to store big data while retaining the ability to analyze it very quickly, especially when leveraging the standard columnar storage format, Parquet.
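This post does not spell out how micro-query sharpening works, so the following is only a rough sketch under one assumption: that a large aggregate is split into many small queries over slices of the data, and the chart is refreshed with a progressively sharper running result as each one returns. The `(region, amount)` rows, the partitions and the function name are hypothetical, and plain Python stands in for the SQL that would actually run on Impala.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, float]  # (region, sale amount): a hypothetical schema


def micro_queries(partitions: Iterable[List[Row]]) -> Iterator[Dict[str, float]]:
    """Yield a progressively sharper SUM(amount) GROUP BY region."""
    running: Dict[str, float] = {}
    for part in partitions:
        # Each iteration stands in for one small query against one slice
        # of the data (for example, one partition of a larger table).
        for region, amount in part:
            running[region] = running.get(region, 0.0) + amount
        # Snapshot of the result so far; a UI could render it immediately.
        yield dict(running)


PARTITIONS = [
    [("east", 10.0), ("west", 4.0)],
    [("east", 2.0)],
    [("west", 6.0), ("north", 1.0)],
]

snapshots = list(micro_queries(PARTITIONS))
```

The first snapshot is available after only one slice has been scanned, and the last one equals the exact answer over all slices.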

3. Full-Text Search with Apache Solr (Cloudera Search)

One of the strengths of Hadoop is that it can store full-fidelity data of any type, be it structured, semi-structured, or unstructured. Open source projects like Apache Solr, which powers Cloudera Search, make it easy to search the semi-structured and unstructured data. This full-text search engine also opens up the data in Hadoop to any user who simply knows how to “Google.” By indexing data into Cloudera Search, all data (regardless of structure) can be analyzed, with the added ability to run free-text searches and use facets.
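For readers new to search engines, the combination of free-text search and facets can be sketched with a tiny inverted index. This is not the Solr or Cloudera Search API; it is a plain Python stand-in, and the sample reviews, field names and functions are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up sample documents; not real review data.
REVIEWS = [
    {"id": 1, "city": "Paris", "rating": 5, "text": "Lovely room and great breakfast"},
    {"id": 2, "city": "Paris", "rating": 2, "text": "Noisy room near the station"},
    {"id": 3, "city": "Rome", "rating": 4, "text": "Great location, friendly staff"},
]


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["text"].lower().replace(",", " ").split():
            index[token].add(doc["id"])
    return index


def search(docs, index, query, facet_field):
    """Return documents matching every query term, plus facet counts."""
    terms = query.lower().split()
    if not terms:
        return [], {}
    # AND semantics: a hit must contain all of the query terms.
    matching = set.intersection(*(index.get(t, set()) for t in terms))
    hits = [d for d in docs if d["id"] in matching]
    # Facets: how the hits break down by a structured field.
    facets = Counter(d[facet_field] for d in hits)
    return hits, dict(facets)


INDEX = build_index(REVIEWS)
```

Calling `search(REVIEWS, INDEX, "great", "city")` returns the two reviews that mention “great” along with a per-city count of those hits, which is the kind of breakdown a facet panel displays next to the search results.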

The video below shows Zoomdata leveraging a Cloudera Search index of TripAdvisor hotel reviews, allowing for structured (graphs), semi-structured (facets), and unstructured (search) analysis all at once.

4. In-Memory Analysis with Apache Spark

Apache Spark seems to be everywhere today. It’s a great way to process data, and it’s also a good place to do light data preparation work and test machine learning algorithms. If the dataset is small enough to fit in memory, Spark is blazingly fast. The video below shows Zoomdata operating on sales transaction data, directly in Spark.

Why Choose?

All four approaches (raw HDFS, Impala, Search, and Spark) have their place and are good for different use cases. As a partner in Cloudera’s Accelerator program, Zoomdata is certified in Cloudera 5. Through the Cloudera Accelerator program, Zoomdata is working with the Cloudera team to highlight the use cases that take advantage of Impala and Spark technologies.

From a pharmaceutical company being able to explore billions of rows of data on an iPad in seconds to an adtech company being able to visualize the location of millions of mobile phones on a map, Cloudera and Zoomdata are helping businesses take advantage of big data. The Zoomdata team is also working closely with Cloudera to maximize the impact of analytics across all of the Cloudera engines, to build bridges between them, and most importantly to make analysis using Impala and Spark fast, fun, and easy for business users so they don’t need to worry about what is happening under the covers.
