Rethinking Data Marts in the Cloud





Many of us are all too familiar with the traditional way enterprises operate when it comes to on-premises data warehousing and data marts: the enterprise data warehouse (EDW) is often the center of the universe. Frequently, the EDW is treated a bit like Fort Knox; it’s a protected resource, with strict regulations and access rules. This setup translates into long lead times (weeks, if not months) for getting new data sets into an EDW. It also makes exploratory analysis on large data sets impractical, because an EDW is an expensive platform whose computational capacity is shared and prioritized across all users. The friction associated with getting a data sandbox has also resulted in the proliferation of spreadmarts, unmanaged data marts, and other data extracts used for siloed data analysis. The good news is that these restrictions can be lifted in the public cloud.

A new set of opportunities for BI in the cloud

Business intelligence (BI) and analytics in the cloud is an area that has gained the attention of many organizations looking to provide a better user experience for their data analysts and engineers. The reason frequently cited for the consideration of BI in the cloud is that it provides flexibility and scalability. Organizations find they have much more agility with analytics in the cloud and can operate at a lower cost point than has been possible with legacy on-premises solutions.

The main technology drivers enabling cloud BI are:

  1. The ability to cost-effectively scale data storage in a single repository using cloud storage options such as Amazon S3 or Azure Data Lake Store (ADLS).
  2. The ease with which one can acquire elastic computational resources of different configurations (CPU, RAM, and so on) to run analytics on data, combined with a utility-based cost model in which you pay only for what you use. Discounted spot instances can also offer unique value for some workloads.
  3. An open and modular architecture consisting of analytics-optimized data formats like Parquet and analytic processing engines such as Impala and Spark, allowing users to access data via SQL, Java, Scala, Python, and R directly and without data movement.

These capabilities make for an amazing one-two-three punch.

Because the cloud decouples storage from compute, all of an organization’s data can now live in a single place, which eliminates data silos, while departments and teams provision compute to run analytics for their own use cases as needed. This arrangement makes self-service BI and analytics a reality for those who adopt such a model. And with an open architecture, there are no worries about technology lock-in.

Architecture patterns for the cloud

Now that we’ve discussed the technology options for BI in the cloud, what should an organization consider when putting them to use?

Generally speaking, there are two common use cases for BI and analytics in the cloud that map to the two main architecture patterns: long-lived clusters and short-lived (or transient) clusters. Let’s discuss each in more detail.

Transient (short-lived) clusters for individuals or small teams

Often, data analysts, data scientists, and data engineers want to investigate new and potentially interesting data sets, but would like to avoid as much friction as possible in doing so. It’s quite common for data sets to originate in the cloud, so storing and analyzing them in the cloud is a no-brainer. Such data sets can easily be brought into S3 or ADLS in their raw form as a first step.

Next, a cluster can easily be provisioned with the instance type and configuration of choice, including potentially using spot instances to reduce cost. Generally, instances for transient clusters need only minimal local disk space, since data processing runs directly on the data in the cloud storage. There are tools, like Cloudera Director, that can assist with the instance provisioning and software deployment, making it as easy as a few clicks to provision and launch a cluster. Once the cluster is ready, data exploration can take place, allowing the data analyst to perform an analysis. If a new data set will be created as part of the work, it can be saved back to the cloud storage.

Another advantage to compute-only clusters is that they can easily and quickly be resized, allowing for growth or shrinkage, depending on data processing needs. When the analysis is finished, the cluster can be destroyed.

One of the main benefits of transient clusters is that they allow individuals and groups to quickly and easily acquire just-in-time resources for their analysis, leveraging the pay-as-you-go cost model, all while providing resource isolation. Unlike an on-premises deployment, in which multiple tenants share a single cluster consisting of both storage and compute and often compete for resources, each team becomes the sole tenant of its own compute cluster while still sharing access to data in a common cloud storage platform.
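To make the pay-as-you-go argument concrete, here is a minimal sketch of the arithmetic in Python. The node count, hourly rate, and usage hours are hypothetical numbers chosen only to keep the math easy to follow; they are not real cloud prices, and `ClusterPlan` is an illustrative name rather than part of any tool mentioned here.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # rough average number of hours in a month


@dataclass(frozen=True)
class ClusterPlan:
    """Hypothetical description of how a compute cluster is billed."""
    nodes: int
    hourly_rate_per_node: float  # assumed price in dollars, not a real quote
    hours_per_month: float       # hours the cluster actually runs

    def monthly_cost(self) -> float:
        # Utility pricing: you pay only for the node-hours you consume.
        return self.nodes * self.hourly_rate_per_node * self.hours_per_month


# Same cluster size and price, different lifetimes.
always_on = ClusterPlan(nodes=10, hourly_rate_per_node=0.50,
                        hours_per_month=HOURS_PER_MONTH)
transient = ClusterPlan(nodes=10, hourly_rate_per_node=0.50,
                        hours_per_month=4 * 20)  # 4 hours a day, 20 days

print(f"always-on: ${always_on.monthly_cost():,.2f}")  # always-on: $3,650.00
print(f"transient: ${transient.monthly_cost():,.2f}")  # transient: $400.00
```

Under these assumed numbers, the transient cluster costs roughly a ninth of the always-on one, simply because it exists only while the analysis runs. Storage costs are left out because the data lives in shared cloud storage in both cases.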

Long-lived clusters for large groups and shared access

The other common use case for BI and analytics in the cloud is a shared cluster that serves many tenants. Unlike a transient cluster, which may run for only a few hours, a long-lived cluster may need to be available 24/7 to provide access to users across the globe or to data applications that are constantly running queries or accessing data. Like transient clusters, long-lived clusters can be compute-only, accessing data directly from cloud storage, or, like on-premises clusters, they can have locally attached storage running HDFS and/or Kudu. Let’s discuss the use cases for both.

Long-lived compute-only clusters

For multitenant workloads that can vary in processing requirements over time, long-lived compute-only clusters work best. Because there is no local data storage, compute-only clusters are elastic by definition and can be swiftly resized based on processing demand. During peak demand, a cluster can be scaled up so that query times meet SLA requirements, and during hours of low demand, the cluster can be scaled down to save on operational costs. This configuration allows the best of both worlds: tenants’ workloads are isolated from each other, as they can be run on different clusters that are tuned and optimized for the given workload. Additionally, a long-lived compute-only cluster consisting of on-demand or reserved instances can be elastically scaled up to handle additional demand using spot instances, providing a very cost-effective way to scale compute.
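The scale-up and scale-down behavior described here can be sketched as a small sizing rule. This is only an illustration under stated assumptions: `ScalingPolicy`, `plan_capacity`, and the idea of sizing by concurrent queries per node are hypothetical, not the behavior of Cloudera Director or any cloud provider's autoscaler.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical sizing rule for a long-lived compute-only cluster."""
    base_nodes: int        # on-demand or reserved nodes that always run
    max_spot_nodes: int    # cap on the extra spot nodes we are willing to add
    queries_per_node: int  # assumed concurrent queries one node serves within the SLA


def plan_capacity(policy: ScalingPolicy, concurrent_queries: int) -> dict:
    """Return how many base and spot nodes to run for the current demand."""
    needed = math.ceil(concurrent_queries / policy.queries_per_node)
    # The base fleet never shrinks; spot nodes cover only the overflow, up to the cap.
    spot = min(max(needed - policy.base_nodes, 0), policy.max_spot_nodes)
    return {"base": policy.base_nodes, "spot": spot}


policy = ScalingPolicy(base_nodes=4, max_spot_nodes=12, queries_per_node=5)

print(plan_capacity(policy, 10))   # low demand: the base fleet is enough
print(plan_capacity(policy, 50))   # peak demand: spot nodes cover the overflow
print(plan_capacity(policy, 500))  # extreme demand: spot nodes are capped
```

Because no data lives on the nodes, adding or removing spot capacity in this sketch involves no data rebalancing, which is what makes the compute-only pattern elastic.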

Long-lived clusters with local storage

While elastic compute-only clusters offer quick and easy scale-up and scale-down because all data is remote, there may be some workloads that demand lower-latency data access than cloud storage can provide. For this use case, it makes sense to leverage instance types with local disk and to have a local HDFS available. This cloud deployment pattern looks very similar to on-premises deployments; however, it comes with an added benefit: access to cloud storage. As a result, a cluster can have tables that store the most recent data in partitions locally in HDFS, while older data resides in partitions backed by cloud storage, providing a form of storage tiering. For example, the most recent month of sales data could reside in partitions backed by local HDFS storage, providing the fastest data access, and data older than one month could reside in cloud object storage.
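The sales example can be sketched as a rule that decides where each partition should live based on its age. The paths, the bucket name, the `day=` partition layout, and the 30-day approximation of "one month" are all assumptions made for illustration; a real deployment would apply the result through its query engine's own partition-location mechanism.

```python
from datetime import date

# Hypothetical storage roots; substitute your own HDFS path and bucket.
HDFS_ROOT = "hdfs:///warehouse/sales"
OBJECT_STORE_ROOT = "s3a://example-bucket/warehouse/sales"


def partition_location(partition_day: date, today: date, hot_days: int = 30) -> str:
    """Pick the storage tier for one daily partition of the sales table.

    Partitions younger than `hot_days` stay on local HDFS for the fastest
    access; older partitions are placed in cloud object storage.
    """
    age_in_days = (today - partition_day).days
    root = HDFS_ROOT if age_in_days < hot_days else OBJECT_STORE_ROOT
    return f"{root}/day={partition_day.isoformat()}"


today = date(2017, 12, 1)
print(partition_location(date(2017, 11, 20), today))  # recent: local HDFS
print(partition_location(date(2017, 9, 1), today))    # older: object storage
```

Queries see a single table either way; only the physical location of each partition differs, which is what gives the tiering effect.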

In summary

Deploying data marts in the cloud can help an organization be more agile with BI and data analytics, allowing individuals and teams to provision their own compute resources as needed while leveraging a single, shared data platform. If you’re interested in learning more about how to architect analytic workloads for the cloud, including the core elements of data governance, be sure to attend my talk at Strata Data in Singapore, Rethinking data marts in the cloud: Common architectural patterns for analytics.

