Todd Lipcon is the founder and PMC Chair of the Apache Kudu project, as well as the tech lead of the Kudu team at Cloudera. Previously, he worked on Apache HBase, HDFS, and MapReduce, projects on which he is also a committer and PMC member.
This week, the Apache Kudu team announced the release of Kudu 1.0. This release marks the one-year anniversary of Kudu’s public debut, and is the culmination of much hard work by a growing team of developers and community members.
In this blog post, I’ll recap the original vision for Kudu, review our accomplishments over the last year, and share where I see the project going in the future.
The Origins of Kudu
Though Kudu was first publicly revealed at the 2015 Strata+Hadoop World conference in New York, its story begins much earlier. In late 2012, I had been working at Cloudera for nearly four years, and had spent most of that time contributing to Apache HDFS and Apache HBase, the two storage options available at the time in the Hadoop ecosystem. The experience I had gained was terrific, and I was proud of the contributions I was able to make. But, I had started to get the itch to do something new.
After wrapping up my work on automatic failover for HDFS, I was given the green light to spend some time exploring how we could best expand the Hadoop ecosystem to take on new and different workloads. After thinking back on my experience working with early users of HDFS and HBase, as well as researching industry trends, my thoughts coalesced in a few key areas:
- Query engines were about to get much faster. The first release of Apache Impala was imminent, and Impala would expand Hadoop from a batch-only system to something that could be used for real-time, interactive SQL analytics. This was going to require tabular storage that was fast and efficient.
- We were on a collision course with solid state and memory storage. The price of flash storage was dropping rapidly, and customers were willing to purchase more and more memory per node if it would buy performance. This meant that CPU would become the bottleneck for many workloads, rather than magnetic disks.
- Online systems and analytics were converging. Streaming and real-time architectures were beginning to become popular, as businesses realized that the faster they could incorporate new information into their analytics, the more competitive they would be. We needed a storage system that could live in both worlds.
- The current options were inadequate. Users were able to get great batch performance out of HDFS, and great random-access inserts, updates, and deletes out of HBase. However, HDFS wasn’t designed for random access, and HBase wasn’t designed for relational analytics. Users were spending inordinate engineering effort to bridge this gap, duplicating data in multiple systems and building elaborate workarounds to synchronize or migrate data.
Through some combination of luck and persistent nagging, I got approval to begin prototyping a system to attack these problems. Though the specifics have evolved since those early days, the plan was always to build a next-generation open source tabular storage system with three core features:
- Integrate natively with other tools in the Apache Hadoop ecosystem
- Allow simultaneous online random access (insert/update/delete) with fast analytic scans
- Take advantage of the newest hardware technology
On 10/11/12, I committed the first code to the Kudu repository with the description: “code for writing cfiles seems to basically work. Need to write code for reading cfiles still.” Clearly, there was a long way yet to go.
The Road to Kudu 1.0
Three years and many internal milestones later (including building a terrific team), the first public open source beta of Kudu was unveiled on September 28, 2015. In this beta release, Kudu contained most of the core functionality one would expect from a database storage engine, including:
- Integration with Apache Impala, Spark, and MapReduce for analytics, with performance rivaling Apache Parquet for many workloads
- Support for low-latency random inserts, updates, and deletes via Java and C++ APIs
- Fault tolerance and scalability to hundreds of nodes
With this core functionality, we began to build a community of users and developers. To that end, Cloudera contributed the project to the Apache Software Foundation, where it underwent incubation and graduated as a Top-Level Project (TLP) in July. The project now includes code written by more than 50 contributors, and multiple businesses are running Kudu in demanding production environments, despite its pre-1.0 version number!
In addition to these important accomplishments, the team has been hard at work adding substantial new functionality over the last year. Some of the major improvements since our first beta are:
- Support for redundant and highly available Kudu Master nodes
- Support for manual management of range partitioning, critical for time series workloads
- Substantially improved integration with Apache Spark, including Spark SQL
- Support for the ‘UPSERT’ operation popular in many NoSQL-style databases
- Integration with Apache Flume
- An officially supported Python client library
- Substantial performance improvements both for random access and analytic workloads
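For readers unfamiliar with the UPSERT operation mentioned above: it inserts a row if its primary key is absent and updates the existing row otherwise, saving the client a read-before-write round trip. Here is a minimal pure-Python sketch of those semantics (this is an illustration only, not the Kudu client API; the "table" is just a dict keyed by primary key):

```python
def upsert(table, key, row):
    """Apply UPSERT semantics: insert `row` under `key` if absent,
    otherwise merge the new column values over the existing row."""
    if key in table:
        table[key].update(row)   # key exists: behaves like UPDATE
    else:
        table[key] = dict(row)   # key absent: behaves like INSERT

# Hypothetical time-series-style usage: repeated writes for the same host.
metrics = {}
upsert(metrics, "host1", {"cpu": 0.42, "mem": 0.63})
upsert(metrics, "host1", {"cpu": 0.55})  # updates cpu, preserves mem
```

In Kudu itself, UPSERT is a first-class write operation exposed through the client APIs, so ingest pipelines can replay or re-deliver records without special-casing duplicates.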
With all of these improvements, plus hundreds of smaller features and bug fixes, the team now feels that Kudu is stable enough for critical workloads at a wider range of companies. To represent its stability and complete set of core features, we advanced the version number to 1.0.
What’s up next?
Now that Kudu 1.0 is out the door, we’re already hard at work on upcoming releases. Here are a few of the items that the team at Cloudera will be working on in the coming year:
- Security. Kudu currently has no support for Kerberos security or access control. These features are critical for many use cases, and we’ll be working hard to add them.
- Operability. While operations have always been a focus of our development, as we move into more and more production workloads, the importance of having great tooling for operators only increases.
- Performance. Kudu is fast today, but we’ve got a whole roadmap ahead of us to make it even faster. This includes items like support for next-generation persistent memory hardware, as well as big gains in the performance of SQL engines on Kudu.
- Scalability. Kudu users are already running on 200-node clusters today, but we plan to continue working on stability and performance at scale.
Of course, Kudu is an open source project, and I’m sure we’ll work on hundreds of other items based on the demands of the community and early adopters. If you’re using Kudu, or even just interested in learning more, join the Kudu Slack chat room or mailing list to put in your two cents.
We are proud to have reached this important milestone, but we are far from finished. In fact, Kudu 1.0 is only the beginning. Just as the Hadoop project recently celebrated its tenth birthday, I hope and expect that Kudu will enjoy the same longevity. So, this weekend I’ll be opening a bottle of Champagne not to celebrate the last four years but rather to toast to the next four. And on Monday morning, I’ll be back in the office pushing code for Kudu 1.1.