Cloudera and Cask


Apache Hadoop is just about ten years old, and has come so far in that decade that it’s nearly impossible to recognize. It’s still a hugely powerful, remarkably scalable platform for storing, processing and analyzing big data. It’s grown well beyond its original HDFS-and-MapReduce roots, though, and is now taking on real-time, stream processing and other very complex workloads.

That growth has been driven by a steady advance in the core capabilities of the platform. The storage layer has gotten more sophisticated, and not just because HDFS is now fault-tolerant, much faster and truly highly available. HDFS is also complemented by alternatives, like Isilon from EMC, that allow customers to turn existing data lakes into enterprise data hubs, working with the data they’ve already got, and not forcing them to move it in order to use it.

More importantly still, the original, batch-mode MapReduce framework is now only one of the ways that applications can work with data. Cloudera offers plenty of other frameworks: Apache HBase and Apache Accumulo for NoSQL access, Impala for high-speed, highly concurrent analytic SQL jobs, Cloudera Search for data exploration and discovery, Apache Spark for very fast stream and in-memory processing, and a rich suite of open source and third-party machine learning and analytic engines. All of these run right inside the enterprise data hub, operating on the data in place.

Developers can build applications that use the framework best suited to their needs. They can combine different frameworks in powerful ways, using Search to find data sets of interest, MapReduce to transform them, and Spark to digest and enrich them and to generate tables served out via HBase or Impala on demand. This all works because the data substrate, the storage layer, is shared, consistently secured and integrated with all of the frameworks. There’s no need to move the data to the application.

If you’re a developer building that application, though, life has been pretty tough for a pretty long time. You’ve had to master the intricacies of each framework, understand their separate APIs, know how they behave and design your code to take best advantage of them, singly and in combination. Each of these frameworks arose as a separate open source project or commercial product; each was designed apart from the others. No one thought, early, about consistency, predictability and ease of use for the application developer.

Mature technologies disappear: they get stable, reliable and easy to use, and higher-level abstractions and interfaces appear on top of them. Business people don’t think about the relational database they’re using; they work with the financial reporting application, the point-of-sale system, the ERP package that runs on top of it. It’s time for ten-year-old Hadoop to cross that border, and for a simpler, more powerful app framework to hide it from the eyes of its end users.

At Cloudera, we’ve been collaborating with companies across the Hadoop ecosystem to help create that framework. Recently, we’ve seen the team at Cask do some fantastic work in product and embrace an open source strategy that’s just what developers need to help them build big data applications on Cloudera. The Cask Data Application Platform, or CDAP, is a suite of software that integrates the multi-framework system we deliver. CDAP offers developers the power and simplicity they need to build better applications faster.

While that’s obviously good news for developers, we think it’s better news for end users. If Hadoop is to disappear, we need the tools, the applications and the full-stack business solutions that ordinary people use to work with big data. We’re not going to create them by cloning data scientists — we have to render data science skills into software, and deliver great backend analytics with world-class user experience. Application developers using CDAP on Cloudera’s enterprise data hub can do just that.

We’re announcing, today, a partnership that includes collaboration on product, which we believe will deliver the very best developer experience in the industry for big data applications. Cloudera is taking a seat on the Cask Strategic Advisory Board to make sure we stay tight on strategy and plans. We’re backing up our technical partnership with collaboration in the field, so that we work together to help the ISV and end user communities take advantage of the work we do together.

We’ve been close to the Cask team for years, of course, and those personal relationships matter. They are buttressed, now, by a strategic investment by Cloudera in Cask, which we believe will allow our partner to invest in innovation more aggressively than they’d be able to do otherwise. In a market where speed and commercial capacity separate winners from the pack, money makes a difference. We are fortunate to be able to support our partner in work so strategically important to Cloudera.

I’m excited. We’ve never made an elephant disappear before. We’re very glad to have the help of the magicians at Cask for the trick. I want to thank Boyd Davis and Jonathan Gray especially for the hard work on product, and for their help in forming a relationship that we think will transform the big data market broadly.

