As Marten Mickos, one of the masters of successfully monetizing open source software, has often pointed out, open source is a “production model”, not a business model unto itself. The relationship between Cloudera’s open source platform and Apache Hadoop is further evidence for this observation — it is through the open source community that this platform is produced and maintained.
Because of that production model, a symbiotic relationship exists between Cloudera’s platform, CDH, and the upstream Hadoop ecosystem projects it comprises. And that symbiosis has very specific benefits for our users, who can count on:
- Interoperability between upstream projects and the downstream CDH platform (as well as with other downstream platforms with respect to the core),
- Backward compatibility across minor CDH releases,
- Access to new Apache code at a predictable cadence, following extensive testing as well as performance, scalability, and integration work,
- Easier upgrades.
Together, these benefits add up to an optimal combination of stability and innovation.
In the remainder of this post, I’ll explain how.
The Compatibility Layer
It’s important to understand that CDH users, whether on the free or paid version, run precisely the same Apache bits found in upstream releases. In addition, they have access to critical new features and bug fixes that are present in the project repositories but may not yet have been released by their respective communities. Such patches, many of which are written by Cloudera engineers, are “backported” into CDH maintenance releases over time, after rigorous testing to certify backward compatibility across releases and thus ensure that existing applications don’t break. Conversely, we will leave out patches, or even entire components (Apache Hive 0.11, for example, never shipped in CDH), that are proven to break that compatibility.
The key to this process is an “upstream first” policy: all patches written by Cloudera engineers are committed upstream at Apache as an initial step. This policy ensures that the same Apache code is present in the Apache repositories as inside CDH; thus the upstream and downstream stay in sync, and downstream CDH releases maintain backward compatibility over time.
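In git terms, that commit-upstream-then-backport flow can be sketched as follows. This is a minimal toy illustration, not the actual Apache or Cloudera workflow; the repository, branch names, file, and commit message are all hypothetical:

```shell
set -e
# Toy repository standing in for an upstream Apache project
tmp=$(mktemp -d) && cd "$tmp"
git init -q upstream && cd upstream
git config user.email dev@example.com && git config user.name dev

echo "base" > core.txt
git add core.txt && git commit -qm "initial import"

# Downstream maintenance branch (hypothetical name), cut from trunk
git branch cdh-maintenance

# "Upstream first": the fix lands on trunk before anywhere else
echo "fix" >> core.txt
git add core.txt && git commit -qm "HDFS-1234: hypothetical bug fix"
fix_commit=$(git rev-parse HEAD)

# Backport: cherry-pick the identical change into the maintenance
# branch, so upstream and downstream carry the same code
git checkout -q cdh-maintenance
git cherry-pick "$fix_commit" >/dev/null
```

After the cherry-pick, `core.txt` on the maintenance branch contains the same change as trunk; in practice the backport would also be run through the release’s compatibility test suite before shipping.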
For these reasons, CDH is always straddling the present and future of trunk development. Users get the best of both worlds: stable, released code in combination with curated, forward-looking features and bug fixes.
This policy has another significant side effect: non-Cloudera distributions built according to a similar production model should, in theory, become “eventually consistent” with CDH with respect to behavior. So although the timing with which vendors pick up those upstream commits will vary, eventually all platforms sharing this model will converge toward the same Apache code at the core (HDFS, Common, MapReduce/YARN), and the APIs on top of that “compatibility layer” will behave very similarly. By definition, the Apache Software Foundation is where this entire process takes place.
The Innovation Layer
That said, the differences are as important as the similarities. Although Hadoop is commoditized at the core, the various vendor platforms diverge outside that core: in which components they ship, and in how those components work together. For example, CDH contains search-over-Hadoop functionality based on an integration of Apache Solr and HDFS, which is unique across platforms. Similarly, Cloudera was the first platform vendor to ship and support Apache Spark, while other vendors insisted on the unproven Apache Tez as the way forward. This layer is where innovation occurs and where users, customers, and eventually forward-looking vendors place architectural bets, with the components attracting the most bets becoming standards (as evidenced by the vendors backing Tez now providing support for Spark). Without this architectural curation, the platform becomes stagnant and brittle, and is eventually left behind.
Over the next few weeks, look for a series of new posts that spotlight examples of those architectural bets (initially by users, and later by Cloudera) and how they have blossomed into de facto standards.
The Story of Success
Compatibility at the core and innovation at the edges is the story behind every massively successful open source ecosystem, with Linux as the role model. Had customers demanded complete compatibility across all distributions as an absolute requirement, the failed United Linux consortium would have succeeded, and Linux as a platform would have developed too slowly to match the feature advantages of incumbent Unix distributions.
Let’s all work together to keep the story of success going for Hadoop under the umbrella of the Apache Software Foundation!