Driving Standards for Big Data Ingest


At Strata + Hadoop World in London this morning, we announced several new efforts to encourage standardization for data ingest into big data platforms.

First of all, together with Confluent, the Apache Kafka company, we announced a collaborative effort to publish the API compatibility tests we’ve each developed as part of the Kafka project. We’re each committing to the continued enhancement and maintenance of those tests, and inviting the community at large to join the effort to extend them as the Kafka project continues to evolve.

Cloudera will separately publish our API tests for the Apache Flume and Apache Sqoop projects into those projects’ respective code repositories. Again, these tests will become part of the standard projects, and will evolve alongside the software that they test.

Correctness testing is important, of course. Like most software vendors, we have built up a considerable collection of QA tools over the years to confirm that the enterprise software we ship works correctly. Those tests are sometimes encumbered by private test data from customers, and often depend on our internal testing infrastructure, so releasing them isn’t always easy. In this instance, though, making these tests available to the community at large offers some important benefits.

Every customer we work with, and every adopter of our open source platform, has the same first problem: getting data into their big data system. Trivial as that sounds, it’s often a big challenge. Connecting to existing data sources, transforming and formatting the data correctly on ingest, and making sure that it lands safely and correctly is hard. Doing all of that continually, with very large data sources steadily producing bits, requires enterprise-grade tooling.

Each of Kafka, Flume and Sqoop has emerged in recent years to tackle different sources and to integrate with different feeds. Kafka originated inside LinkedIn and is now supported by Confluent. Cloudera created Flume and Sqoop several years ago. All three of these systems are now shipped by virtually all vendors in the big data space, and are widely used by systems integrators, ISVs and customers who deliver services, build products or work with data.

By providing a consistent set of API tests, we make it easier for all parties to ensure that the data ingest pipelines they build work correctly. More importantly, we make it easier for those systems integrators, ISVs and customers to build their pipelines and deploy their tools with confidence. They can verify that the ingest tools they rely on correctly implement the APIs, as defined by the open source communities that build them.
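As a rough illustration of what an API compatibility test does, here is a minimal Python sketch. The `IngestClient` interface, the two toy implementations and the check names are all hypothetical and far simpler than the real Kafka, Flume or Sqoop APIs; the actual tests live in those projects' repositories. The idea is only that one shared set of checks can be run against any implementation that claims to support the API.

```python
from typing import Callable, Dict, List, Protocol


class IngestClient(Protocol):
    """Hypothetical minimal ingest API (an assumption, not a real project API)."""

    def send(self, topic: str, record: bytes) -> None: ...

    def read_all(self, topic: str) -> List[bytes]: ...


class InMemoryClient:
    """Toy implementation that keeps records per topic in arrival order."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[bytes]] = {}

    def send(self, topic: str, record: bytes) -> None:
        self._topics.setdefault(topic, []).append(record)

    def read_all(self, topic: str) -> List[bytes]:
        return list(self._topics.get(topic, []))


class ReorderingClient(InMemoryClient):
    """Deliberately broken variant: returns records in reverse order."""

    def read_all(self, topic: str) -> List[bytes]:
        return list(reversed(super().read_all(topic)))


def check_round_trip_in_order(client: IngestClient) -> bool:
    # Records sent to a topic should come back unchanged and in order.
    records = [b"a", b"b", b"c"]
    for record in records:
        client.send("events", record)
    return client.read_all("events") == records


def check_topic_isolation(client: IngestClient) -> bool:
    # Writing to one topic should not make records appear in another.
    client.send("orders", b"1")
    return client.read_all("payments") == []


def run_conformance(make_client: Callable[[], IngestClient]) -> Dict[str, bool]:
    # Each check gets a fresh client so the checks stay independent.
    checks = {
        "round_trip_in_order": check_round_trip_in_order,
        "topic_isolation": check_topic_isolation,
    }
    return {name: check(make_client()) for name, check in checks.items()}


if __name__ == "__main__":
    print(run_conformance(InMemoryClient))
    print(run_conformance(ReorderingClient))
```

Running the same `run_conformance` function against both classes shows the point of a shared suite: the well-behaved client passes every check, while the reordering one fails the ordering check and is flagged before anyone builds a pipeline on it.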

Because all of the tests are part of their respective projects, they’re freely available to anyone. No fees are required to use them and no membership is needed to participate. The Apache community at large can contribute under the normal rules. Any company can use the tests, and can even build a commercial testing and validation service or product that it can offer to the market. The Apache License encourages that diversity by granting wide latitude in how the tests are used.

Clearly, both Confluent and Cloudera benefit from increased adoption of our platforms. We each understand the dynamics of the open source software market, and recognize the value in releasing our work in this way, for use by our customers, our partners and even our competitors. By encouraging widespread use, we increase the confidence of the market at large in new, next-generation tooling for data collection. That expands the market for all players, notably including the two of us.

Although we had to work hard over the past weeks to put this effort together, we were able to brief a number of our partners and customers on the effort. We’re pleased to have so long a list of endorsements; you can see them in the press releases that I linked to above. Timelines didn’t permit us to approach all the players we’d have liked to include, but we expect to see adoption of the new conformance and correctness tests match the very broad embrace of Kafka, Flume and Sqoop themselves.

I’m really pleased with this effort; we hope to do more of this sort of thing in the near term. I’d like to thank Jay Kreps, CEO at Confluent, for his enthusiastic embrace of our shared effort. Big data is a big deal. Making it easy to ingest data for processing and analysis is critical. Jay and Confluent are fantastic partners to Cloudera and tremendous members of the open source community.

