Apache Hadoop has changed quite a bit since it was first developed ten years ago. A name that once meant only HDFS and MapReduce, for storage and batch processing, now describes an entire ecosystem: dozens of different components supporting a wide range of processing workloads that go well beyond batch.
Much of this innovation has been centered around the access layer – with the addition of tools such as Impala for interactive SQL, Apache Spark for general data processing, and Apache Solr for full-text search. In contrast, the storage layer – consisting of HDFS and Apache HBase, which leverages HDFS as a filesystem – has not seen the same influx of new entrants. But as the range of use cases for Hadoop grew, it became clear that another storage option was necessary.
Using the existing storage options, developers had to make choices based on their capabilities. HDFS provides fast analytics – scanning over large amounts of data very quickly. However, HDFS was not built to handle updates. If data changed, it had to be appended in bulk after a certain volume or time interval, preventing real-time visibility into that data. HBase, on the other hand, complements HDFS' capabilities by providing fast random reads and writes and support for updates. But this online access came at the cost of scan performance. While these two storage engines addressed many of the key needs for big data applications, there was still a gap, especially for developers wanting fast analytics on fast-changing data.
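To make the tradeoff concrete, here is a deliberately simplified sketch in Python. The two toy classes below are analogies for the access patterns described above, not HDFS or HBase code: an append-only log that scans sequentially but cannot update in place, and a key-value store that updates in place but must assemble scans from scattered per-key entries.

```python
class AppendOnlyLog:
    """HDFS-like analogy: data arrives in bulk appends and is read
    via cheap sequential scans; rows are never updated in place."""

    def __init__(self):
        self._rows = []

    def append_batch(self, rows):
        # Bulk append only -- an "update" requires appending a new
        # batch and reconciling it with old data later.
        self._rows.extend(rows)

    def scan(self):
        # One pass over contiguous data: fast full-table analytics.
        return list(self._rows)


class KeyValueStore:
    """HBase-like analogy: fast random reads and writes by row key,
    with updates visible immediately."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # Point write; overwrites any previous value for the key.
        self._rows[key] = value

    def get(self, key):
        # Point read by key.
        return self._rows.get(key)

    def scan(self):
        # Full scans must visit scattered per-key entries in key
        # order -- slower, relatively, than a sequential log pass.
        return [(k, self._rows[k]) for k in sorted(self._rows)]


log = AppendOnlyLog()
log.append_batch([("user1", 10), ("user2", 20)])

kv = KeyValueStore()
kv.put("user1", 10)
kv.put("user1", 15)  # in-place update, immediately visible
```

The point of the analogy: each structure is cheap at what the other is expensive at, which is exactly the gap described above for fast analytics on fast-changing data.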
With growing trends such as the Internet of Things, wearables, and machine data analytics, the desire for real-time analytics increased and this gap became more and more evident. To harness the opportunity of real-time analytics, developers were forced to create complex data pipelines in HDFS or build complex architectures that move data between HBase and HDFS. There needed to be another way.
We are thrilled to announce Kudu, the new, native storage layer for Hadoop that enables fast analytics on fast data. This Apache-licensed open source project complements the strengths of HDFS and HBase, and drastically simplifies the developer experience for building real-time analytic applications on changing data. As an integrated component, it works seamlessly with the popular access frameworks already in use today, including Impala and Spark, and will take advantage of the same administration, security, and governance necessary for enterprise workloads as it continues to mature.
As we’ve seen with the breadth of access engines, there is no “one-size-fits-all” solution in Hadoop. With the addition of the purpose-built Kudu storage engine, developers have the flexibility to address a wide range of use cases with the right tool for the job – on both the access and storage layers.
As Hadoop storage, and the platform as a whole, continues to evolve, we will see HDFS, HBase, and Kudu all shine for their respective use cases.
- HDFS is the filesystem of choice in Hadoop, with the speed and economics ideal for building an active archive.
- For online data serving applications, such as ad bidding platforms, HBase will continue to be ideal thanks to its fast random reads and writes on constantly updating data.
- Kudu will handle the use cases that require a simultaneous combination of sequential and random reads and writes – such as for real-time fraud detection, online reporting of market data, or location-based targeting of loyalty offers.
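That combined access pattern can also be sketched as a toy structure. The class below is an illustration of the workload Kudu targets, not Kudu's implementation: random upserts that are immediately visible, alongside ordered range scans over the same rows. The fraud-detection-style usage at the end is a hypothetical example.

```python
import bisect


class OrderedStore:
    """Toy store supporting both random upserts and ordered range
    scans -- the simultaneous access pattern described above."""

    def __init__(self):
        self._data = {}
        self._keys = []  # kept sorted to support range scans

    def upsert(self, key, value):
        # Random write: insert or update a row, visible immediately.
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def scan_range(self, lo, hi):
        # Sequential scan over a contiguous, ordered key range.
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[i:j]]


# Hypothetical fraud-detection flow: upsert card activity as it
# arrives, then scan a recent window for analytics.
store = OrderedStore()
store.upsert(("card42", 1001), 12.5)
store.upsert(("card42", 1002), 75.0)
store.upsert(("card42", 1002), 80.0)  # correct an already-written row
recent = store.scan_range(("card42", 1000), ("card42", 1999))
```

Keeping one structure efficient for both operations at scale is the hard part; this sketch only shows why an engine that does both removes the need to shuttle data between two systems.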
A beta download of Kudu is now available at cloudera.com/downloads along with a tutorial to help you get started. As an Apache open source project (with intent to donate to the ASF incubator), you can also start contributing to this project at getkudu.io.
For more details on the motivations behind Kudu and its architecture, check out the Developer Blog. Also have a look at the “Zoomdata: Real-time and Big Data Analytics with Kudu” demo.