Cloudera + MongoDB: Deep Technical, Business Integration to Exploit Big Data Opportunity


By Matt Asay, VP Community, MongoDB

By a number of metrics, MongoDB and Hadoop are the industry’s two most important Big Data technologies. In fact, of the industry’s hottest job trends, both technologies make the list as two of the fastest growing keywords found in online job postings. Small wonder, then, that when data scientists and other data professionals talk about Big Data technologies, overwhelmingly they’re talking about MongoDB and Cloudera.

What they haven’t been discussing, however, is how best to make the two technologies work effortlessly together. Given the rampant confusion about Big Data in the market, and surging customer demand for answers on how to best use MongoDB and Cloudera together, this is a topic that needs to be addressed, and soon.

We’re On The Road To…Somewhere

After all, despite enterprises crowding into Big Data, there’s still significant confusion as to how to derive value from it.

When Gartner asked enterprises about their top Big Data challenges, “determining how to get value from Big Data” was the top response by a significant margin, with “defining our strategy” as the second-highest concern.

Organizations haven’t been helped by the open-source community, which has invented a broad array of useful Big Data technologies (good) with somewhat weird names (neutral), promoted by vendors with competing agendas (bad).

For example, when I spoke at Strata about MongoDB and Hadoop as perfect complements, many attendees expressed confusion. “Aren’t the two technologies competitive?” they asked. No, not at all. But as vendors we’ve historically done a poor job of describing how popular Big Data software can and should work together.

MongoDB + Hadoop = Petabytes Of Data, Managed

Despite the confusion for some, many organizations have been using MongoDB and Hadoop productively together for some time. Together we’ve found a number of companies that use both MongoDB and Cloudera, and we’re helping them to use the two together more productively. Here are some companies doing some interesting things with MongoDB and Hadoop:

  • Orbitz – As presented at Strata NYC, Orbitz uses MongoDB and Cloudera together to deliver real-time pricing. MongoDB serves as the data collector while Cloudera’s CDH is used to store and analyze the data. The pair has been “entirely worry free” and helps Orbitz compete for travel shoppers.
  • Foursquare – For years MongoDB has been the “source of truth” for data at Foursquare (e.g., total number of users, check-ins at JFK, etc.). But to keep MongoDB highly responsive to reads and writes at all times, Foursquare uses Hadoop to handle “more expensive”, long-running queries, as the company explains. The two together help Foursquare scale the utility of its service to over 45 million users worldwide logging more than 5 billion check-ins.
  • Rangespan – Rangespan uses MongoDB to provide analytics for retailers who want to set the optimal price for existing products and identify new products of interest, helping them expand their offerings more efficiently. Rangespan runs 10-15 Hadoop MapReduce jobs daily on MongoDB to analyze catalogue metadata. Rangespan also leverages Hadoop to tease out unstructured data, such as competitive pricing culled from a web spider, or product data scraped from a supplier’s site.
  • City of Chicago – The Windy City built a futuristic predictive analytics platform on MongoDB and Hadoop through which city officials can get a real-time view into crime, health or other citizen issues – even predicting where a flu outbreak might happen or where the city should staff police officers.

Though there are different ways to use MongoDB and Cloudera together, many organizations use MongoDB to serve applications in real time and Cloudera to aggregate and analyze all of their data assets. Combining the two technologies enables them to serve recommendations, provide operational insights and improve the customer experience by integrating and harnessing data sources that have never before been either available or actionable.

In other words, it’s a big deal.

And yet it’s not enough. Using the two technologies together can and should be easier. You shouldn’t have to be a data scientist to figure them out.

Seamlessly Integrating MongoDB And Cloudera

As MongoDB and Cloudera announced recently, the two companies have joined forces to support another approach for analyzing and accessing Big Data. Working together, MongoDB and Cloudera will help organizations optimize key data sources to deliver a long-term, modern data strategy.

While the partnership is not exclusive, and both companies will continue to work with diverse data infrastructure providers, this partnership offers a way to improve integration between two of the industry’s preferred Big Data vendors.

At its heart, the partnership involves significant engineering to improve MongoDB’s existing Hadoop connector. The combination of Cloudera Enterprise and MongoDB will enable customers to easily develop, operate and manage Big Data infrastructure that powers modern applications. Through the integration, live, operational data from MongoDB can be snapshotted into Cloudera’s Hadoop-based enterprise data hub and processed in parallel for analysis, including native support for BSON, MongoDB’s document data structure. Through the Spark framework or Impala, such data can be quickly analyzed and then passed back to MongoDB to improve the customer experience, e.g., by displaying personalized content or tailored offers.
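To make that lifecycle concrete, here is a minimal sketch of the snapshot, analyze and write-back loop. It is plain Python and deliberately does not use the real Hadoop connector, Spark or Impala: the in-memory `operational_store`, the sample events and the `snapshot`, `analyze` and `write_back` helpers are all hypothetical stand-ins, meant only to show the shape of the data flow.

```python
from collections import defaultdict
from copy import deepcopy

# Hypothetical stand-in for live operational data held in MongoDB.
operational_store = {
    "events": [
        {"user": "ana", "category": "flights"},
        {"user": "ana", "category": "flights"},
        {"user": "ana", "category": "hotels"},
        {"user": "raj", "category": "cars"},
    ],
    "profiles": {"ana": {}, "raj": {}},
}


def snapshot(store):
    """Copy the live events so analysis never touches the operational data."""
    return deepcopy(store["events"])


def analyze(events):
    """Batch step (assumed): count events per user and category, keep the top one."""
    counts = defaultdict(lambda: defaultdict(int))
    for event in events:
        counts[event["user"]][event["category"]] += 1
    return {user: max(cats, key=cats.get) for user, cats in counts.items()}


def write_back(store, results):
    """Push the batch results into the profiles the application reads from."""
    for user, top_category in results.items():
        store["profiles"].setdefault(user, {})["recommended"] = top_category


results = analyze(snapshot(operational_store))
write_back(operational_store, results)
```

The point of the split is that the heavy counting runs against a copy, so the operational store stays free to serve reads and writes, and only the small result set flows back into the documents the application uses for personalization.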

It’s a symbiotic, powerful way for the two systems to improve each other.

In other words, it will make Big Data manageable and actionable through the entire data lifecycle. Given the complexities of Big Data that Gartner’s survey unearthed, it’s a welcome way to help enterprises use data to differentiate themselves and serve customers.

It’s about time.

