As with many technology innovations, there is a troubling tendency to downplay the amount of effort and disruption it takes to change the way information systems operate. We all understand that to make our insights more sound, we need to incorporate more data points and more complex data formats. As the technology matures, we can focus on use cases that present our organization with long-term strategic value, which makes it easier to justify the time and money of modernization efforts against those long-term returns. Even with a sound strategy in place, however, it's often difficult to know where to start.
Cloudera has released some great materials on how our customers are using Apache Hadoop to augment their existing data warehouse resources, including a video series with data warehouse icon Ralph Kimball. From simple ETL offload to more complex active-archive scenarios, users are learning how to optimize their data warehouses with Hadoop.
Consider a common example: as a modern business, you might be running several hundred to 1 million queries per day. With so many queries spread across many different systems, it can be a daunting task just to understand which workloads are powering the business today, let alone how to best leverage new systems like Hadoop and optimize across them all.
When we look at a day in the life of a database administrator (DBA), they are tasked with relieving pressure on their existing EDW, better optimizing ETL, and improving analytic workload performance. But where do they even start? How do they turn 24 hours of workloads that look like the graph below into a more normalized distribution? Looking at the chart below, we see common patterns emerge: ETL feeds the reporting and has to happen first, ad hoc and data discovery need to be available during business hours, and the more complex queries are often run at night.
Often when we talk about an enterprise data hub, we talk about the notion of multi-tenancy (the ability to run multiple applications with varying degrees of data availability). In the same vein, we often plan the query load on these systems to meet the strategic goals of the business. Cloudera Enterprise is designed to facilitate a wide variety of workloads, showcased by the development focus on a powerful general processing tool, Apache Spark, alongside purpose-built tools like Impala. Even on our own internal data hub, a multitude of applications work in concert. Understanding the strengths of each tool and how they can work together is a great first indicator of where to optimize our data systems. Below is an example of how we might approach normalizing our query load in order to properly optimize our data environment.
After we have considered which areas we want to target for optimization, we need to discover how we can optimize at an individual query level. This is often where even well-intentioned expertise falls short, because there is no one-size-fits-all approach; every company is different, so your query strategy should be tailored to you. The most common information DBAs need to begin rethinking their query strategy can be boiled down into a simple acronym we call DCC: Duplication, Complexity, and Compatibility. Let's look at how each of these areas causes problems for DBAs.
Duplication: Across business units, individual users often run queries that are interesting to many other users, so there is a surprising amount of duplication among workloads. Identifying these workloads creates the opportunity to batch similar workloads together and to pinpoint the ones that will make the largest impact when moved to Hadoop.
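To make the idea concrete, here is a minimal sketch of how duplication in a query log might be surfaced. It is not how Navigator Optimizer works internally; it simply normalizes away literals and whitespace so that queries differing only in parameter values count as the same "shape."

```python
import re
from collections import Counter

def normalize(sql: str) -> str:
    """Reduce a SQL statement to a shape that ignores literals and whitespace."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # replace string literals with a placeholder
    s = re.sub(r"\b\d+\b", "?", s)   # replace numeric literals with a placeholder
    s = re.sub(r"\s+", " ", s)       # collapse runs of whitespace
    return s

def duplication_report(queries):
    """Count how often each normalized query shape appears, most common first."""
    return Counter(normalize(q) for q in queries).most_common()

# Hypothetical query log for illustration
log = [
    "SELECT * FROM sales WHERE region = 'EMEA'",
    "select *  from sales where region = 'APAC'",
    "SELECT count(*) FROM orders",
]
for shape, n in duplication_report(log):
    print(n, shape)
```

Even a crude normalization like this tends to reveal that a small number of query shapes dominate a production log, which is exactly the batching opportunity described above.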
Complexity: Complex queries can impose an unwanted tax on your data system's performance and require multiple engineers to manage and maintain. These queries are often extremely long and patched together, making them prime candidates for optimization but nearly impossible to untangle; it is hard to know where to start or even which engine should run them (e.g., Impala or Hive).
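One way to triage a large query inventory is a rough complexity score. The sketch below counts a few constructs that tend to make queries expensive to run and hard to maintain; the weights are illustrative assumptions, not calibrated values.

```python
import re

def complexity_score(sql: str) -> int:
    """Rough heuristic: count constructs that tend to make a query
    expensive and hard to maintain. Weights are illustrative only."""
    s = sql.lower()
    score = 0
    score += 2 * len(re.findall(r"\bjoin\b", s))    # table joins
    score += 3 * s.count("(select")                 # inline subqueries
    score += 2 * len(re.findall(r"\bunion\b", s))   # unions
    score += 1 * len(re.findall(r"\bcase\b", s))    # conditional logic
    return score

simple = "SELECT id FROM users"
hairy = ("SELECT u.id FROM users u JOIN orders o ON u.id = o.uid "
         "WHERE o.total > (SELECT avg(total) FROM orders)")
print(complexity_score(simple), complexity_score(hairy))
```

Sorting an inventory by a score like this is a quick way to find the long, patched-together queries that deserve attention first.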
Compatibility: Once we have identified the workloads, we need to understand which are a good fit for Hadoop components like Impala and Hive, and which queries aren't worth the effort initially. While you can try out workloads in a testing environment, determining the best fit this way takes time and can lead to wasted development effort. What if you need to make the shift faster?
Enter Cloudera Navigator Optimizer (limited beta), the newest part of the Cloudera Navigator suite. Cloudera Navigator is the only integrated data management and governance solution for Hadoop, and Navigator Optimizer extends those data management capabilities with critical visibility and optimization guidance. Navigator Optimizer provides insights and intelligent optimization guidance centered on the DCC concept, and we have outlined some basic use cases below to give you an idea of where it can help in your modernization efforts.
Duplication: Navigator Optimizer gives you the ability to identify duplicated queries. Batching similar queries into a single query can alleviate a huge load on your system. Say 70% of your query load comes from duplicated queries; by migrating those queries to Hadoop, you can free up a corresponding share of your data warehouse resources.
Complexity: A shift to Apache Hadoop is a great time to assess the complexity of the queries you are running. Migrating a complex query to Hadoop can free up your data warehouse or database to handle the workloads best suited to that technology. It is also important to understand which queries are most compatible with which Hadoop tools (Impala, Hive).
Compatibility: Without an exhaustive trial environment in which users test familiar semantics on new platforms, it's hard to understand how queries will perform on new systems and how much development work the move might take. Like any system, Hadoop has its own SQL tools for ETL (Hive) and BI/data discovery (Impala), but running existing workloads on these tools is not always a one-to-one move. Getting guidance on which queries are compatible out of the gate, and how to best tailor your existing queries for these tools, can shorten the time it takes to migrate to new technology.
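As a rough illustration of what a compatibility pre-check looks like, the sketch below flags a few constructs that commonly need rework when moving warehouse SQL to Hive or Impala. The pattern list is a simplified assumption for illustration; real compatibility rules depend on engine versions and are precisely the kind of knowledge a tool like Navigator Optimizer encodes.

```python
import re

# Illustrative, NOT exhaustive: constructs that often need rework when
# moving warehouse SQL to Hive or Impala (assumed patterns for this sketch).
FLAGS = {
    "update_delete": r"^\s*(update|delete)\b",    # row-level DML
    "stored_proc":   r"\bexec(ute)?\b",           # stored procedure calls
    "vendor_func":   r"\b(nvl2|decode|rownum)\b", # vendor-specific constructs
}

def compatibility_flags(sql: str):
    """Return the names of any flagged constructs found in the query."""
    s = sql.lower()
    return [name for name, pat in FLAGS.items() if re.search(pat, s)]

print(compatibility_flags("UPDATE accounts SET flag = 1 WHERE rownum < 10"))
```

Queries that come back with no flags are candidates to try first; flagged queries can be queued for rewriting, which is the triage this section describes.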
We will continue to share success stories from the front lines about how people are augmenting their current capabilities with Apache Hadoop.
To see how Navigator Optimizer can help you get the best results with Hadoop, we encourage you to sign up for our limited beta.
To learn more about Navigator Optimizer, register for the webinar, “Unlocking Hadoop Success with Cloudera Navigator Optimizer.”