Recently we announced how Impala provides a compelling new platform for Data Discovery and Analytics. Today, we are happy to host this guest blog from Shant Hovsepian, CTO and Co-founder of Arcadia Data, a contributor to the Apache Impala project.
When people talk big data and Hadoop, it’s usually about the scale of data and scope of the sources. With the latest release of Apache Impala (incubating), the scale of the user base — beyond data scientists and analytic programmers — is set to change dramatically.
Imagine this use case: You wake up at 2 am from a dream. In it, you are a customer service rep, on the phone helping one of your company’s subscribers, looking at her account information. Is there a risk that the customer “churns” — unsubscribes and costs your employer hundreds of dollars (and maybe your bonus)? Or should you offer her a win-win discount promotion? One more thing: in this dream, you are neither a data scientist nor a programmer. Do you wake up in a cold sweat?
Impala could make the difference.
Traditionally, business intelligence (BI) and data discovery have been applied to formally structured relational OLTP-based data output, so it’s easy to imagine transactional data on a cell-phone customer service console: monthly billing, data usage, number of calls, and so on.
Can that predict churn? Unlikely. Now, think about what questions you’d actually ask to determine that customer’s churn risk: Signal strength during her calls? Calls made on weekends or weekdays? Rush hour or all day long? Antenna repairs where she spends the most time? Just as easy to imagine: that data set is huge and resides in multiple data stores, and you can’t predict exactly what questions you might want to answer before you run any queries.
For all its virtues, Hive is not well suited to interactive or complex queries. This is where data discovery comes in: Graphics-driven analytical users and BI specialists, who are often not data scientists, need a way to see what fields are available, what kinds of values they contain, and so on. Ideally, this sort of analysis allows them to iterate through queries of arbitrary complexity: ranging from a simple aggregation all the way to something as complex as augmenting repair status with geospatial info, as well as time-of-day/day-of-week usage derived from call timestamps.
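As a purely hypothetical sketch (not Arcadia's implementation, and with made-up sample records), the time-of-day/day-of-week derivation described above amounts to rolling raw call timestamps up into a weekday-by-hour usage profile:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw call records: (timestamp, duration in seconds).
# In practice these would be scanned from tables in the Hadoop cluster.
calls = [
    ("2016-03-07 08:15:00", 240),  # a Monday, morning rush hour
    ("2016-03-07 17:45:00", 600),  # a Monday, evening rush hour
    ("2016-03-12 11:30:00", 120),  # a Saturday, midday
]

def usage_profile(records):
    """Roll up call counts by (day-of-week, hour) from raw timestamps."""
    profile = Counter()
    for ts, _duration in records:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        profile[(dt.strftime("%a"), dt.hour)] += 1
    return profile

profile = usage_profile(calls)
```

The point of the sketch is that nothing here requires a pre-built cube: the weekday/hour dimensions are derived on the fly from the timestamps already in the data.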
This should require no logistic regressions or other advanced math. In fact, that’s the point: many, many users, whether formally trained in BI/analytics or not, know how to formulate a hypothesis about relationships — and can do so visually with a simple chart.
At Arcadia Data, we’ve built a converged visual analytics and BI platform with Impala, designed specifically for this kind of user. There are three essential problems we’ve set out to solve:
- Securely drillable granularity Visualizing any data should be universal, as long as it doesn’t sacrifice fidelity or security. If your phone reception is good on average compared to the average of all other users, is that a satisfying answer? Arcadia works at full granularity, on the order of hundreds of billions of records, so you can create any visual from any source in your Hadoop data for which you are authorized, directly in your web browser, then drill to the raw record. In other words, source data needs to be securely drillable.
Arcadia Data unifies visual analysis, BI, and data discovery across hundreds of billions of records, in an integrated platform that runs directly on your Hadoop cluster.
- Concurrency Even if you have the entire Hadoop cluster and all its data at your disposal, it wouldn’t help if the system is bottlenecked by hundreds of simultaneous queries. Arcadia uses Impala’s concurrency capabilities, including dynamic resource pools and cost-based query compilation, to ensure your Hadoop cluster stays responsive to hundreds of concurrent users in human real time.
- No more “cube first, ask questions later” Almost all data visualization technology has been premised on extracts — quite literally, subsets — in which the data takes a one-way trip to the end user. Because joins can be costly and difficult to optimize, ad-hoc joins are frowned upon and tools for building OLAP cubes have proliferated. But as a customer representative on the phone, you don’t want to tell the customer that you can’t answer her question until you submit a ticket to IT to build a cube.
Using Impala, we designed Arcadia to make arbitrary data quickly accessible, using an approach known as “analytical views,” which track ad-hoc query behavior across the end users who design and consume the visualizations in Arcadia. Analytical views kick in automatically and can optimize joins, distinct counts, medians, and of course rollups and drilldowns. That way, end users don’t need to create and continuously refine their own cubes.
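As a hedged, self-contained sketch of the general idea (not Arcadia's or Impala's actual mechanism, and using invented sample data), an analytical view behaves like a transparently maintained aggregate: once built, repeated queries are answered from the precomputed rollup instead of rescanning the raw records:

```python
from collections import defaultdict

# Hypothetical raw fact records: (region, product, revenue).
facts = [
    ("west", "phone", 100.0),
    ("west", "tablet", 50.0),
    ("east", "phone", 75.0),
]

# The "analytical view": a rollup keyed by region, built once
# from the facts, then reused for every matching query.
view = defaultdict(float)
for region, _product, revenue in facts:
    view[region] += revenue

def revenue_by_region(region):
    # Answered from the view, not by rescanning the raw facts,
    # which is what makes repeated drilldowns fast.
    return view[region]
```

The contrast with a hand-built cube is that here the rollup is derived automatically from observed query shapes, so end users never maintain it themselves.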
Tackling secure granularity, concurrency, and cross-data queries simultaneously is a tractable approach given one key assumption: you let users of the data define what they need and when they need it, using visual analytics. It’s exactly what data scientists have been doing with advanced programming environments, except starting from visualizations instead of code. The key outcome of this approach is that it lowers the barrier to entry for big data. Broader access among business users, along with the increased operational demands on Hadoop, helps accelerate the ROI of running Cloudera Enterprise as the hub of the data-centric enterprise.
Now, with Impala at the core, big data can have an even bigger audience. Performance and scale can be made accessible to all skill levels, even harried customer service reps. After all, when you call the phone company, don’t you want them to know just how important you are as a customer, even if the person who handles your call is not a data scientist?
Co-Founder and CTO, Arcadia Data
Shant is responsible for long-term innovation and technical direction at Arcadia Data. Previously, he was with Aster Data (later acquired by Teradata), where he was an early member of the engineering team and worked on numerous features across the stack, including high performance cluster inter-connects, data storage, compression, and distributed query planning. He holds a PhD in computer science from UCLA.