A lively, diverse ecosystem of Apache products has developed in the big data space. This ecosystem is structured to evolve so that it may survive decades, potentially outlasting any individual component. Central coordination would limit this ecosystem.
Apache has no strategic goals for its projects. It is blind to the functionality of software that its projects produce, concerned instead that each project’s community collaborates openly and fairly. Apache thus welcomes competing projects.
The current Apache big data ecosystem includes:
- Schedulers like YARN and Apache Mesos
- Compute engines like MapReduce and Apache Spark
- A multitude of SQL engines, including Apache Hive, Apache Drill, and Apache Phoenix
- Key-value stores like Apache HBase and Apache Accumulo
- Machine learning libraries like Apache Mahout and Spark’s MLlib
- Stream-processing systems like Apache Storm, Apache Kafka, and Spark Streaming
and so on.
In each area, multiple solutions implement similar functionality. While this can be confusing for consumers, it is in fact critical to the ecosystem’s longevity, providing those consumers with a platform that keeps improving.
Each project of course individually evolves to improve itself. But a given project can only change so much without losing its integrity. Fortunately, the ecosystem itself can also evolve, creating new projects that supplant or augment existing projects. No component is irreplaceable. Each must compete for users.
This Apache big data ecosystem has no center. Yes, Apache Hadoop’s HDFS is nearly universal at present, but it too has competitors, and there’s no guarantee that it won’t be replaced in the long term. That’s a good thing, as it forces HDFS to respond to challengers. And should its use eventually decline, that will benefit users, as it will mean a superior alternative has arrived.
Vendors curate this ecosystem for their customers, providing tested, supported distributions. This is a valued function, but it is crucial for the long-term interests of those customers that there is no anointed curator. Competition and innovation are required among distributions too for the ecosystem itself to evolve. Vendors can collaborate through the Apache Bigtop project, but they are all free to diverge from one another as needed, taking risks on new projects that might advantage their customers.
Compatibility is an important goal. Each project strives to make its releases backward compatible, as does each distribution. Distributions try to be compatible with one another, both to attract their competitors’ customers and to simplify application development. But compatibility can be at odds with progress. Projects and distributions must also be free to make incompatible experiments in order to keep the ecosystem vital.
This ecosystem structure, a loose confederation of independently managed projects, is its long-term strength. No single organization controls its fate. A centrally managed ecosystem would be much weaker. Its controllers would reject changes that threaten their interests, and the whole ecosystem would remain stuck around a static center.
So let us embrace a diversity of distributions just as we embrace the diversity of tools. Together these keep the big data software ecosystem powerful and durable.