How Kudu Will Simplify Visual Analytics on Apache Hadoop


At Strata + Hadoop World in NYC in September, we announced the beta of Kudu, a new Hadoop storage engine for fast analytics on fast data. With that announcement, we were pleased to have Zoomdata participate in the beta program. Today, I have the honor of hosting Ruhollah Farchtchi, vice president of Zoomdata Labs, to talk about how Zoomdata will work closely with Kudu to simplify visual analytics on Apache Hadoop.

—–

Zoomdata is excited to announce that it has been working in conjunction with Cloudera and the open source community to integrate big data visual analytics capabilities on top of Kudu, the new Apache Hadoop storage engine for fast analytics on fast data.

Working with the combination of streaming and historical data at scale has typically been a very challenging proposition for enterprises doing analytics on top of Hadoop. At Zoomdata, we see this firsthand with our customers. Customers choose Zoomdata because we work well not only with streaming data but also with large volumes of historical data. But these enterprises are challenged with building out a data management architecture that can support both.

As a result, we have customers who are currently forced to deploy parallel infrastructures. For example, they persist real-time or most-recent data into HDFS using Avro and then periodically (daily) convert that recent data into Parquet format for analytic queries. The real-time analysis runs against the Avro storage, while the historical queries must run against Parquet for good analytic performance.

Having these parallel infrastructures is complicated, and it gets even more so when you consider variability in the velocity of the data. If you have fewer transactions in a given time period, you could compact too soon and end up with small Parquet files, which are not ideal and can lead to performance issues.

Using Kudu greatly simplifies these architectures.

With Kudu, you’re letting a purpose-built storage subsystem handle the distribution of data. It’s easier to manage because you can do intelligent things such as hash partitioning to get really even distribution, which maximizes the parallelism of your queries. You don’t have to worry about low-level data management issues. And when processing restatements of data, you won’t have to go back and re-run your compaction routines.
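
To make the partitioning point concrete, here is a minimal sketch of creating such a table with the Apache Kudu Java client, hash-partitioned on its leading key column. The master address, table name, columns, and bucket count are all hypothetical, and the API shown is the current Apache Kudu Java client, which may differ slightly from the beta-era packages.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateMetricsTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical Kudu master address and table layout.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master.example.com").build();
    try {
      // Primary-key columns come first in the schema and are marked as keys.
      List<ColumnSchema> columns = Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build());
      Schema schema = new Schema(columns);

      // Hash partitioning on the leading key column spreads writes and scans
      // evenly across tablets, which is what maximizes query parallelism.
      CreateTableOptions options = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("host"), 16);

      client.createTable("metrics", schema, options);
    } finally {
      client.shutdown();
    }
  }
}
```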

To illustrate the benefits of a Kudu-based approach, we built a simple demonstration showing both real-time monitoring and analytic queries on the same dataset. It’s a scenario that our customers face across multiple industries. For example, telecoms monitoring current network activity, then looking at network usage trends over time. Or website product managers monitoring current user activity on the site, then examining historical behavior patterns. The general pattern in these applications is to monitor what’s happening over just the past few minutes, then quickly switch to asking questions requiring the full history.
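
To illustrate that access pattern at the storage layer, the sketch below contrasts the two kinds of reads against the hypothetical "metrics" table from the previous example, using the Kudu Java client directly (Zoomdata itself issues its queries through Impala): a monitoring scan restricted to the last few minutes via a pushed-down timestamp predicate, and a full-history scan over the very same table.

```java
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResultIterator;

public class RecentVersusHistory {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master.example.com").build();
    try {
      KuduTable table = client.openTable("metrics");

      // Monitoring view: only rows from the last five minutes, with the
      // timestamp predicate pushed down into Kudu instead of filtered client-side.
      long fiveMinutesAgo = System.currentTimeMillis() - 5 * 60 * 1000L;
      KuduScanner recent = client.newScannerBuilder(table)
          .addPredicate(KuduPredicate.newComparisonPredicate(
              table.getSchema().getColumn("ts"),
              KuduPredicate.ComparisonOp.GREATER_EQUAL,
              fiveMinutesAgo))
          .build();
      System.out.println("rows in the last five minutes: " + countRows(recent));

      // Historical view: the same table scanned in full; no second copy of the
      // data and no format conversion in between.
      KuduScanner history = client.newScannerBuilder(table).build();
      System.out.println("rows over the full history: " + countRows(history));
    } finally {
      client.shutdown();
    }
  }

  private static long countRows(KuduScanner scanner) throws Exception {
    long count = 0;
    while (scanner.hasMoreRows()) {
      RowResultIterator results = scanner.nextRows();
      count += results.getNumRows();
    }
    return count;
  }
}
```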

Now, with Kudu, we can keep the real-time monitoring pointed at the exact same table and also run analytic queries over the full data set. Zoomdata runs analytic queries on the fly, so as soon as data is inserted into Kudu it is available to users for visual analytics; there is no latency from a batch process that converts storage formats. Further, we have a special technique at Zoomdata called Data Sharpening™ that lets us query even huge tables with hundreds of millions to billions of rows while presenting results to the user in seconds, sharpening the image as the rest of the query completes. We really like that Kudu is integrated with Impala and the rest of the existing Hadoop ecosystem, which means that our existing query optimization logic built for Impala works seamlessly with Kudu.
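
As a rough sketch of that write path, assuming the same hypothetical "metrics" table, the snippet below inserts a row through the Kudu Java client; once the session flush returns, the row is visible to subsequent scans and to Impala queries, with no format-conversion step in between.

```java
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class InsertMetric {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master.example.com").build();
    try {
      KuduTable table = client.openTable("metrics");
      KuduSession session = client.newSession();

      // Build a single row; column names match the hypothetical schema above.
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addString("host", "web-01");
      row.addLong("ts", System.currentTimeMillis());
      row.addDouble("value", 0.73);

      session.apply(insert);
      session.flush();  // once the flush returns, the row is scannable and queryable via Impala
      session.close();
    } finally {
      client.shutdown();
    }
  }
}
```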

All in all, this mix of real-time and analytic queries is made possible through the Zoomdata user experience and now the simplified underlying storage architecture provided by open source Kudu.

—–

Ruhollah Farchtchi is Vice President of Zoomdata Labs and is also responsible for Customer Solutions at Zoomdata. He has over 15 years of experience in enterprise data management architecture and systems integration. Prior to Zoomdata, Ruhollah held management positions at BearingPoint, Booz-Allen, and Unisys. He holds an M.S. in Information Technology from George Mason University.
