At Cloudera, we face a challenge that's fairly unusual in the world of software development.
As an enterprise software company, our customers quite reasonably expect us to deliver and support software with the stability and consistency typical of other enterprise software offerings. The quality bar is no different for open source projects than for closed source ones, and given the growth of the Apache Hadoop ecosystem, the majority of our customers are what you'd consider mainstream large enterprises, not early adopters of advanced technology. Big Data has become ubiquitous, and almost every enterprise out there is using, or looking to use, Hadoop to get better insight from its data (at a fraction of the cost of a traditional DBMS, of course).
At the same time, as an open source software company, our role is to be part of the community that drives innovation in the projects we work on. No single organization, Cloudera included, has a majority of the contributors and committers to the Apache projects we work on, so it's a real community effort. And there isn't just one community: our solution integrates more than 20 different open source projects, each with its own governance (a project management committee, or PMC, in Apache-speak), its own committers, and thus its own culture and goals. None of our customers, of course, are interested in those details and challenges; they look to us to mask this disarray and present an integrated package.
Furthermore, while maturing, the Big Data ecosystem is still evolving rapidly. Our customers have an intense appetite for radical new capabilities that open up frontiers in the applicability of this software. As such, the community and the ecosystem are incredibly dynamic. So to sum it up: we have a rapidly evolving ecosystem, with multiple distinct cultures and communities individually driving pieces of that ecosystem. At the same time, our customers expect what looks like a centrally architected, well-hardened software solution.
The challenge, then, is providing enterprise-quality software in a rapidly evolving ecosystem that spans 20+ independent software projects, with no core organization providing focused guidance. Early in 2015, we started to see strong evidence that quality was becoming more of an issue for Hadoop, and there were two key causes. One was a coincident set of releases that brought significant new capabilities but also a large amount of new code (primarily in HDFS and Hive).

A change in our user base was the other cause. In his essay "The Cathedral and the Bazaar," Eric Raymond highlights how the open source process – the "bazaar" – can dramatically help software quality, and illustrates this through the early history of Linux. When Cloudera and Hadoop started, our user base was primarily tech firms and other early technology adopters; it has since evolved and grown to be dominated by mainstream enterprise customers across the Fortune 8000.

With that shift, our users moved from teams of developers who often delved into the Hadoop source code themselves to teams focused primarily on implementing projects and business applications on top of CDH, spending almost no time with the source code. As such, we had stopped benefiting from Raymond's observed rule that "given enough eyeballs, all bugs are shallow." The code was now being stressed in ways only enterprise IT environments can stress it, but without the benefit of a large group of developers diving into its shortcomings in those environments – the bulk of committers was, and remains, folks at tech companies and early adopters.
We realized we had to counterbalance the rapid evolution of features in the Hadoop ecosystem with an equal focus on quality, especially quality aimed at the use cases seen in enterprise environments. Yet we had no desire to give up the leadership role we'd taken in advancing and evolving Hadoop, so our approach to quality had to be one that a relatively small and nimble team could drive; we had to make extensive use of automation rather than brute-force manual testing.
In June of 2015, we diverted a significant portion of our engineering team to focus on this quality automation. At this point, these frameworks are up and running, and most of those teams have gone back to their "day jobs." The impact has been notable: we've seen a dramatic drop in serious customer issues per cluster and per deployed node in last quarter's release (CDH 5.5), and the timing of that release meant it benefited from only about a third of the effort.
We're pretty excited about these results, and we feel the approaches we've taken will be generally applicable to other open source software that evolves into mainstream adoption. And, of course, Cloudera is growing rapidly, so if you find this sort of work fascinating and want to contribute, let us know.
To learn more details about how this process is done, read the first installment of the new “Quality Assurance at Cloudera” series in the Cloudera Engineering Blog.