When I joined Cloudera several years ago, I remember that during the new hire orientation, there was a lot of discussion about the need to “move compute to the data”, compared to the legacy approach of moving data to the compute. The claim was that data assets have become so large – in many cases, dozens of petabytes – that the only feasible way to perform analysis was to run analytic workloads on the hardware that hosted the data, rather than the traditional approach of copying data to the hardware that hosted the analytic engine.
Fast forward a few years, and a number of related trends have emerged:
- Organizations now run diverse, multidisciplinary big data workloads that span analytic databases, operational databases, data engineering applications, and data science applications. Many of these workloads operate on the same underlying data.
- Workloads can be transient or long-running in nature, and they might run in a public cloud, a private cloud, on-premises, or in a hybrid environment.
- Many users of modern data management systems are casual business users, not just hard-core data scientists.
A number of challenges have come up as a result of these trends:
- Table definitions, access permissions, business glossary definitions, metadata classifications, and governance artifacts – collectively called “data context” – are difficult to keep consistent across a growing number of workloads.
- Administrators spend too many cycles re-building the essential data context whenever they create new workloads, especially when those workloads are transient.
- Whenever data context changes – for example, when a user is granted access to a new data set – administrators have to ensure that the updated context is maintained consistently everywhere.
Put even more concisely, here’s the challenge:
- Compute is stateless and exists inside the workload, whether it’s cloud-based or on-premises, and whether it’s transient or long-running.
- Data is stateful, and it’s often stored outside the workload, whether in HDFS, Apache Kudu, Amazon S3, Azure Data Lake Storage (ADLS), Isilon, or elsewhere.
- Data context should likewise be stateful and live alongside the data it describes, yet much of it, such as the table metadata in the Hive Metastore and the access policies in Apache Sentry, currently exists inside the workload. Consequently, this data context is not only lost when a transient cluster goes away, but also inaccessible to new application clusters.
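To make the problem concrete, consider the standard Hive mechanism for externalizing one piece of data context: instead of each cluster running its own embedded metastore, every cluster can point at a single long-lived remote Hive Metastore service, so table definitions outlive any transient cluster. The sketch below is a minimal `hive-site.xml` fragment illustrating this pattern; the hostname `shared-metastore.example.com` is a hypothetical placeholder, and this is a generic Hive configuration, not an SDX-specific one.

```xml
<configuration>
  <!-- Point this cluster at a shared, long-lived metastore service
       (hypothetical host) instead of a cluster-local embedded one. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://shared-metastore.example.com:9083</value>
  </property>
</configuration>
```

With this in place, a new or transient cluster sees the same table schemas as every other cluster, which is the kind of consistency SDX aims to provide across all data context, not just table metadata.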
Announcing Cloudera SDX
Today, we announced Cloudera SDX to address this challenge head-on. Cloudera SDX is a modular software framework that ensures a shared data experience across all deployment types, including multiple public cloud, private cloud, hybrid, and bare metal configurations. By applying stateful, centralized, consistent data context services, SDX makes it possible for hundreds of different workloads to run against shared or overlapping sets of data. SDX makes multi-disciplinary data applications easier to develop, less expensive to deploy, and more consistently secure.
Our first release of SDX is available in Cloudera Enterprise 5.13. All SDX-enabled workloads will support stateful data context: they will have consistent table schemas, access permissions, and governance artifacts.
The benefits of SDX are immediate:
- Lower cost of ownership: less hardware and software to manage, and common tools for every use case and environment
- Increased end-user productivity: data is presented consistently to users in every cluster
- Increased agility: admins can easily deploy new use cases without recreating data context services in each new cluster
- Lower risk: security and governance are defined and enforced consistently alongside the data, independent of the analytics application
In the coming months, you can expect to see even more innovations as part of our SDX strategy. For now, please take a look at some of the nuts and bolts of the first release of Cloudera SDX here and sign up for an upcoming SDX webinar here.