Almost 10 years ago I helped found Hadoop at Apache. Since then the project has seen tremendous success, spawning an ecosystem of over 20 projects around it. Institutions throughout the world use Apache Hadoop and related projects to better understand their customers, markets and products. Hadoop has been central to the ongoing shift by enterprises to a new computing platform that is more powerful, scalable and flexible.
Hadoop’s remarkable success is due to a number of factors. Its timing was opportune. Hardware is ever more affordable and industries are using computers in ever more places, from the web to mobile devices and sensors, generating data that reflects business activity. Software that helps to harness all this data is a welcome addition.
Hadoop met this need with an appropriate technology. Running on commodity hardware greatly increases the affordability of its approach, letting folks store and analyze vastly more data than before. Its general purpose approach lets folks store things once and then explore them in a variety of ways, greatly increasing institutions’ agility.
But Apache is equally responsible for Hadoop’s success. As I learned earlier with Lucene, Apache’s approach to open source fosters a virtuous cycle. Folks more readily adopt software that is unencumbered by commercial restrictions and supported by a diverse community. More users lead to more contributors, improving the quality of the software and growing the community. This dynamic creates long-lived projects that maintain high-quality software.
Seven years ago, Cloudera was founded as the first company dedicated to helping folks use Hadoop and its related software tools. I joined Cloudera a year later. Cloudera works with the communities at Apache to enhance these open-source tools. Sometimes we identify gaps in the ecosystem and instigate new projects. Apache Sqoop, Apache Flume and Apache Parquet are past examples of projects founded by Cloudera and brought to Apache. In each case Apache’s processes have enabled these to grow into industry standards.
Today I am pleased to announce that two more projects from Cloudera will be submitted to the Apache Incubator: Impala and Kudu.
Impala is a high-performance SQL engine, optimized for analytic use cases. Cloudera originally released Impala as Apache-licensed open source in 2012. Since then, the project has seen impressive momentum and wide adoption by customers, vendors, and partners alike. We have also begun to see contributions from other organizations, including Google and Intel. It is now time to turn management of the project over to an independent community at Apache so that even more folks can get involved in advancing Impala.
Kudu is a recently announced storage backend. It stores structured data, complementing the capabilities of HDFS and HBase by offering a combination of fast updates and fast scans not previously available in the Hadoop ecosystem. Kudu already integrates with a few other ecosystem components, like MapReduce, Spark, and Impala, but this young project will benefit greatly from a diverse Apache-based community to deepen its integrations and improve it overall.
I helped to launch the Hadoop ecosystem, but I never imagined it would become this strong. Tools like Kudu and Impala significantly expand the scope of applications that can be created on this platform. It’s exciting to submit them for incubation at Apache, where they can best grow to their full potential.