In my last post, I presented some high-level Apache Spark market research findings. Clearly, Spark is becoming an important, if not the most important, component in many big data applications. In this post we’ll see some of the key motivations for adopting Spark, what users look for in specific Spark offerings, and the biggest challenges to wider adoption.
Spark Cloud Adoption
First, I want to note the growth of Spark in the cloud. Of course, what made big data analytics initially popular was that one could leverage cost-effective commodity infrastructure even as data sets scaled to petabytes. And if you didn’t have your own cluster, or one big enough for your current task, you could cheaply rent one temporarily in a public cloud.
Today, on-premises Spark deployments still dominate the landscape (reported at more than 50%), as cloud providers still work to prove themselves secure, compliant, and cost-effective in a wider class of scenarios. Yet we see strong interest in moving many on-prem deployments to the cloud. Our survey projects cloud deployment (IaaS and/or PaaS) to increase from 23% today to over 36% in the near future, and Spark SaaS to rise from 3% today to 9%. We note that Cloudera (originally named for the “cloud era”) is equally valuable on-premises and in cloud deployments. In fact, it provides platform consistency across environments, which gives organizations the agility to migrate, broker, or hybridize infrastructure under the hood and lowers overall investment in application infrastructure.
This cloud shift isn’t limited to Spark, or even big data, but correlates with other research into broader IT cloud adoption trends (e.g. containerization, hybrid data storage).
Spark Solution Findings
Looking deeper into why Spark itself is so popular compared to other solutions, we can confirm that improved performance was the key motivation for most respondents (74%). Advanced analytics, including machine learning, helped about half decide (49%); stream processing was a surprisingly popular motivation (42%); and ease of programming helped convince many others (37%).
When it came time to select a partner for Spark, organizations cited the quality of support as the most important selection factor (46%), followed by a demonstrated commitment to open source (29%), affordable enterprise licensing costs (27%), and support for cloud deployment (also 27%). Again we see that Cloudera is well positioned, with an enterprise support model that works on-premises or in the cloud and a long history of open source innovation and contribution.
When we asked about internal challenges to wider Spark adoption, six out of ten active Spark organizations reported a significant skills/training gap, while more than a third mentioned complexity in learning/integrating Spark. Some also reported significant cultural barriers (26%), open source concerns (18%), and a lack of enterprise management features in their current source (15%) as remaining challenges.
We note that compared to many previous big data analytics platforms, Spark today offers users a higher-level, and often already familiar, mode of interaction through its support of Python, R, SQL, and seamless desktop-to-cluster operations, all of which no doubt contribute to its growing popularity and rapid rate of adoption. And while the majority of adopters are enthusiastically running downfield, Spark (even with R/Python and SQL-friendly syntax) isn’t intended to be the next MS Excel. It requires some programming chops, some study of distributed/deferred data processing, and often a big data mental paradigm shift. Here, we see Cloudera as having a significant Spark market advantage as well: Cloudera has a long history of excellence in big data training of all kinds, covering use cases, technologies, operations, and more.
I’ll finish this post by pointing out that taking advantage of any new technology requires expertise that goes beyond specific technical knowledge. One must really understand the opportunities, use cases and implementation options, and then align those with the available resources in a realistic implementation plan. Let me restate the obvious – Cloudera is the current big data (and Spark) market leader, has a tremendous professional services organization, offers training in almost every use case you can imagine, and has built up a robust, worldwide partner ecosystem. If you have a potential Spark opportunity, you would be well served to start here.