As an IT industry analyst (and former technical product manager), I’m always fascinated by how enterprises large and small adopt new technologies. What does it take for a new solution to not only present a compelling opportunity, but also prove itself ready for prime time? What separates the eventual market-dominating solution from all the shiny new technologies that seem to show up every day?
I’ve been a big fan of all things big data from the very beginnings of Apache Hadoop, but I admit that much of the initial low-level parallel programming required was beyond my casual hacking skill set. Still, that was what Ph.D. data scientists and supercomputer programmers were for, right? Of course, the resulting skill and accessibility gap hindered wider big data adoption across the broader IT marketplace. In response, we’ve seen years of “up-leveling” technologies emerge that help narrow that gap.
Now, having absorbed the lessons of a decade of commercial big data market evolution, Apache Spark looks poised to close the gap, bringing significant big data superpowers within reach of even us average geeks. Personally, I really enjoy using Spark’s Python API, structured datasets, and machine learning libraries. I can use my laptop to play around, and if needed deploy the exact same code on cheap cloud infrastructure over really big data. For me it’s almost a perfect storm (almost perfect: now if only there were a fully supported Ruby equivalent to those Python libs…).
Spark Market Study
I do know that my personal opinion doesn’t really prove anything, so here at Taneja Group we decided to do some deeper Spark market research. In partnership with Cloudera, we recently conducted a thorough survey of nearly seven thousand (6900+) highly qualified technical and managerial people working with big data from around the world.
In our ongoing research we’ve already explored general perceptions about big data and Spark, and dived into specific use cases, realized productivity gains, and the experienced challenges to wider adoption. Within our highly qualified survey population of Spark users, data scientists, engineers, consultants, admins, big data managers and IT professionals, over 40% make or heavily influence big data purchasing decisions while over 30% set corporate technical requirements or design specs.
The first thing we noted was that across the broad range of industries, company sizes, and big data maturities represented, over one-half (54%) of respondents are already actively using Spark to solve a primary organizational use case. That’s an incredible adoption rate compared to other newly empowering and disruptive technologies, given the relatively short time that Spark has been production-qualified.
Second, almost two thirds of active Spark users (64%) are planning to notably increase and expand their usage within the next 12 months. And we should note that another four out of ten of those familiar with Spark but not yet using it have made plans to adopt Spark soon. What this tells us is that the overall Spark market not only already covers a large majority of organizations processing big data, but has become a default (if not the future standard) go-to solution.
Cloudera clearly recognized Spark’s potential early on, and is currently primed to fully capitalize on Spark’s popularity. We found that 57% of active Spark users have already adopted Spark from Cloudera Enterprise for their most important use case. This high capture is no doubt due to Cloudera’s enterprise security model and ready integration with their complete and highly reliable big data distribution that enables analytic, data processing, and machine learning workloads.
Cloudera may very well capture even more of the growing Spark market given the advanced big data use cases indicated in our research. In addition to the expected Data Processing/Engineering/ETL use case (55%), we found high rates of forward-looking and analytically sophisticated use cases like Real-time Stream Processing (44%), Exploratory Data Science (33%) and Machine Learning (33%). These are all areas in which Cloudera already has market leadership – and available services (reference models, training, recipes, value consulting, etc.).
And the more traditional customer intelligence (31%) and BI/DW (29%) use cases weren’t far behind. The use case percentages add up to well over 100%, which shows that many organizations indicated that Spark was already being applied to more than one important type of use case. The trend towards deploying Spark as a platform for multiple workloads also works in Cloudera’s favor, as their platform is capable of supporting many different kinds of organizational needs in one distribution.
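To make that "adding up" concrete, here is a quick back-of-the-envelope sketch in Python using only the six use case percentages quoted in this post. It rests on two assumptions of mine that the survey summary does not spell out: that all six percentages are measured against the same group of respondents, and that respondents could select more than one use case. The variable names are purely illustrative.

```python
# Use case shares quoted in this post (percent of respondents).
use_case_share = {
    "Data Processing/Engineering/ETL": 55,
    "Real-time Stream Processing": 44,
    "Exploratory Data Science": 33,
    "Machine Learning": 33,
    "Customer Intelligence": 31,
    "BI/DW": 29,
}

# ASSUMPTION: every share uses the same respondent base and
# respondents could pick several use cases.
total = sum(use_case_share.values())

# Average number of these six use cases selected per respondent.
avg_use_cases = total / 100

# Pigeonhole-style lower bound on the share of respondents who picked
# two or more: those with at most one selection contribute at most 1
# each, the rest at most len(use_case_share) each.
min_multi_share = (total - 100) / (100 * (len(use_case_share) - 1))

print(f"Sum of shares: {total}%")
print(f"Average listed use cases per respondent: {avg_use_cases:.2f}")
print(f"At least {min_multi_share:.0%} picked two or more use cases")
```

Under those assumptions the shares sum to 225%, or a bit more than two of the listed use cases per respondent on average. The lower bound is deliberately conservative; the real share of multi-use-case organizations is likely higher.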
Diving a little deeper, nearly half of current users (48%) said they used Spark with HBase, and 41% also with Kafka. We believe that Spark will continue to grow in importance for an increasing number of workloads. Its coupling and alignment with the broader Hadoop ecosystem is what is making the end outcomes more realistic. It is evident from this research that Spark truly blossoms when fully enabled by other supporting big data ecosystem components.
Production big data solutions are actually pipelines of activities that span from data acquisition and ingest through full data processing and disposition. Cloudera clearly strives to deliver the best performing, most reliable set of technologies with which to build production big data pipelines, and so likely has a built-in market “lead-in” when it comes to Spark growth in the broader IT enterprise market.
In my next post I’ll review some of the key motivations for adopting Spark, what users look for in specific Spark offerings, and the biggest challenges to wider adoption. In particular, we’ll look at the growth potential of Spark in the cloud.