Announcing Workload Analytics for Cloudera Altus

Categories: Cloud Data Engineering

When we announced Cloudera Altus, we called out three guiding principles that led us to reimagine running big data workloads in the cloud: simplicity, cost effectiveness, and maintaining the integrity of Cloudera's trusted, enterprise-grade platform at the core. We decided early on that enabling customers to migrate data engineering workloads (which benefit most from cloud elasticity) would be our first step. But we always knew there was a big gap to fill in the market: making that migration smoother, prescribing exactly how to extract the maximum cost savings, and giving customers confidence that mission-critical workloads will still meet SLAs in this new world of transient clusters on elastic infrastructure. In thinking through how best to fill that gap, we knew we needed a cloud-native approach, that existing open source tools and platform-as-a-service offerings often fell short, and that we should design any solution around the three core Altus principles that help organizations succeed with big data in the cloud.

Today, we're pleased to announce Workload Analytics, a new capability available for Cloudera Altus. Workload Analytics empowers users to get the most out of Altus by streamlining every phase of their journey to running big data workloads optimally in the cloud. In addition, we've made meaningful improvements to the troubleshooting experience and worked to simplify the 'black art' of profiling and performance-tuning complex jobs. Finally, we've done all of this in a cloud-native fashion that's equally at home with transient, workload-centric clusters while maintaining 100% compatibility with existing big data workloads built on data processing engines such as Apache Spark, Apache Hive, and MapReduce.

One of the first challenges customers face in cloud migration is simply getting each workload to cope with the inherent changes in the platform. These include the use of cloud storage services for input/output data, dealing with the job submission experience of the service, and running on what might be a bespoke cluster where tasks like bootstrapping metadata and making infrastructure choices are left as an exercise for the end user. Workload Analytics helps in this initial phase by reporting any errors with scripts or parameters specified at submission time, highlighting data permission issues and other problems accessing cloud infrastructure, and ensuring you've met the minimum resource requirements needed for the job to succeed. Previously, just getting these simple answers often required digging through folders to download log files, or worse, discovering that the information critical to root cause analysis is no longer available because your cluster went 'poof' right after the failure. Now, every log, metric, and configuration property is at your fingertips, even after the cluster is long gone, and obvious problems are collected and served up front-and-center.
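To make the idea of submission-time checks concrete, here is a minimal sketch of that style of pre-flight validation in plain Python. The field names and limits are hypothetical for illustration; they are not the actual Altus submission schema.

```python
# Sketch: submission-time sanity checks of the kind described above.
# Job fields ("script_path", "input_path", "executor_memory_mb") are
# hypothetical names, not the real Altus API.

def preflight_errors(job):
    """Return a list of obvious problems to surface before launching a cluster."""
    errors = []
    if not job.get("script_path"):
        errors.append("no script specified at submission time")
    if not job.get("input_path", "").startswith("s3://"):
        errors.append("input is not a cloud storage URI")
    if job.get("executor_memory_mb", 0) < 1024:
        errors.append("below minimum memory needed for the job to succeed")
    return errors

job = {"script_path": "etl.py", "input_path": "/local/data", "executor_memory_mb": 512}
for e in preflight_errors(job):
    print("submission error:", e)
```

Catching these errors before a transient cluster spins up (and disappears) is exactly what saves the log-file archaeology described above.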

The next challenge customers often face is right-sizing the infrastructure to the workload. Above the mandatory minimums, this is a cost/performance tradeoff, but only up to a point, and finding the balance was previously a costly trial-and-error exercise: launching clusters of different sizes to find the scaling limits of a particular workload, or at least the point of diminishing returns (which itself can be expensive). A simple approach would be to help you track performance over time as you experiment with different cluster sizes, but we've demystified the process further by identifying exactly how much resource starvation occurs at each stage on the very first run, or whether there's none at all and you're simply wasting money. This lets you determine what cost or performance upside is to be had by growing the cluster to fit the workload's points of peak load, or by shrinking it and saving a lot of money in exchange for a small drop in speed.
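The right-sizing logic above can be sketched in a few lines. This is an illustrative model only: the stage data shape, thresholds, and advice labels are assumptions, not the actual Altus telemetry or recommendation engine.

```python
# Sketch: per-stage resource starvation as a right-sizing signal.
# "runnable_tasks" and "available_slots" are hypothetical field names.

def starvation_ratio(stage):
    """Fraction of a stage's runnable tasks that had no slot to run on."""
    waiting = max(stage["runnable_tasks"] - stage["available_slots"], 0)
    return waiting / stage["runnable_tasks"]

def sizing_advice(stages, starved_threshold=0.25, idle_threshold=0.05):
    worst = max(starvation_ratio(s) for s in stages)
    if worst >= starved_threshold:
        return "grow"    # peak stages were starved; a bigger cluster buys speed
    if worst <= idle_threshold:
        return "shrink"  # tasks never queued; pay for fewer nodes
    return "keep"

stages = [
    {"runnable_tasks": 400, "available_slots": 100},  # 75% of tasks waiting
    {"runnable_tasks": 80,  "available_slots": 100},  # no queueing at all
]
print(sizing_advice(stages))  # -> grow
```

The key point is that a single instrumented run yields the signal, instead of a bisection search over paid-for cluster launches.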

Some aspects of running big data in the cloud present more of an opportunity than a challenge. With transient clusters, running a workload well (not just in time to meet its SLA) can save money in ways it can't on a permanent cluster. This means that identifying skew in the data, or in the distributed computation of specific stages of your job, can unearth significant opportunities to improve both performance and cost. The same goes for the handbrakes of spilling data to disk or reclaiming memory because there isn't enough available to do the job right. Sometimes simple configuration tuning of the job or cluster can yield order-of-magnitude efficiency gains. This opportunity for immediate cost savings is unique to running on elastic infrastructure, so we've invested substantially in screening the health of workloads as it relates to their ability to stretch every penny of your cloud dollar.
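A common way to spot the skew described above is to compare a stage's slowest task against its typical task: one straggler holding up an otherwise-idle cluster is wasted spend. The heuristic and threshold below are illustrative assumptions, not the actual Workload Analytics health check.

```python
# Sketch: flagging a skewed stage from its per-task durations (seconds).
# The 4x max-vs-median ratio is an arbitrary illustrative threshold.
import statistics

def is_skewed(task_durations, ratio=4.0):
    """True when the slowest task takes far longer than the median task,
    i.e. the whole stage waits on one straggler while other slots sit idle."""
    med = statistics.median(task_durations)
    return med > 0 and max(task_durations) / med >= ratio

balanced = [10, 11, 9, 12, 10]
skewed   = [10, 11, 9, 12, 95]   # one partition got most of the keys

print(is_skewed(balanced), is_skewed(skewed))  # -> False True
```

On a transient cluster, fixing a straggler like this doesn't just shorten the run; it shortens how long you pay for every node.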

Finally, and not particularly unique to the cloud, detecting and monitoring changes in big data workloads over time has been a labor-intensive, frustrating exercise that's ripe for improvement. Whether it's a flagrant SLA violation, a subtle creep upward in runtime, or missing data from an upstream process that throws off a report by a few thousand percent, getting a bead on historical performance has been tough with cluster-centric monitoring tools that aren't built to compare the same pipeline, day after day. Fortunately, Workload Analytics automatically identifies and tracks recurring workloads and detects anomalies in how they run, alerting you to big changes in runtime and I/O and showing exactly how much wiggle room stands between you and an unwelcome SLA violation. To dig deeper, users can also compare the job run in question to historical trends using simple graphs or the automatic baselining capability, which continually recalculates the "new normal" and highlights deviations in any of the dozens of metrics provided by the compute frameworks, or even by a savvy Spark developer who registered their own custom metrics.
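The baselining idea is straightforward to sketch: keep a recent window of a recurring workload's metric, and flag a run that lands far outside the "new normal." The window size and deviation threshold here are illustrative choices, not the model Workload Analytics actually uses.

```python
# Sketch: baseline anomaly detection on a recurring workload's runtime.
# Window and k (standard deviations) are arbitrary illustrative parameters.
import statistics

def is_anomalous(history, latest, window=10, k=3.0):
    """Flag `latest` if it deviates from the recent baseline by more than
    k standard deviations of the last `window` observations."""
    recent = history[-window:]
    mean = statistics.mean(recent)
    stdev = statistics.pstdev(recent)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > k * stdev

runtimes = [62, 58, 61, 60, 59, 63, 60, 61, 58, 62]  # minutes, daily pipeline
print(is_anomalous(runtimes, 61))   # -> False (within the normal band)
print(is_anomalous(runtimes, 140))  # -> True  (e.g. upstream data blew up)
```

The same check applies unchanged to I/O volumes or any framework-reported metric, which is why a continually recalculated baseline per metric catches both flagrant violations and slow creep.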



The problem domain and concepts here aren't new to computing, enterprise software, or even, in all cases, to big data. But we felt that in the big data ecosystem generally, and in the journey of cloud migration specifically, there's a real opportunity to deliver more value to users, save money, avoid unhappy surprises, and make life easier in the process. This is just a first step, but we believe that adding a little 'analytics' to your big data analytics will ultimately deliver on that. So what are you waiting for? Get on board!

