Cloudera Search: When SQL is not Enough


As a product manager, I am the person my engineering team turns to for answers on everything from market trends to product direction and prioritization. When there isn’t an obvious answer to their questions, I turn to data for guidance. In this instance, I was on a mission to see how many users, and what kind of users, were putting various powerful features of Cloudera Search to use; specifically, I searched for the combinations of features used together. For some product managers, this could be a daunting task. However, as many are aware, we deploy our own technology at Cloudera to expedite insight. I knew these tools would make finding the answer much easier and faster.

Our internal Enterprise Data Hub (EDH) gathers, among other data, health and monitoring information from our customers’ production clusters, which, in addition to enabling proactive support, helps us gain valuable insights into customer success patterns. Our EDH also serves applications around customer satisfaction, support process optimization, support case audits, and much more. This time, however, I just wanted to browse support ticket data and usage log data to understand feature combination usage – and it just so happened that Cloudera Search turned out to be the best tool for investigating one of my own products, namely Cloudera Search!

At Cloudera, we use Cloudera Search (i.e. Solr integrated with CDH) to serve support ticket data indexed together with other relevant data stored in HDFS and HBase. Through a great set of tools built on Cloudera Search and HBase, I am able to use natural language to query multi-type data without having to worry about the limitations of SQL: misspellings, exact matching, query debugging, etc. Free-text search allows for rapid exploration of results, which can be quickly narrowed down to the relevant information through faceted search. This quick and easy iteration is particularly important for exploratory use cases and data discovery, where both the question and the answer are ambiguous. These are the “bigger questions” we all want to ask, usually without being sure exactly how to find the answers.


Screenshot of the support ticket exploration tool Monocle, an application built on Cloudera Search, a part of Cloudera’s EDH.

To my great satisfaction (and pride, it’s my product!), I quickly found what I was looking for. Additionally, to my surprise, I also found something only a free-text matching engine could provide. Let’s explore what happened:

I started by searching for terms highly correlated with use of Cloudera Search but not separately categorized at the time of support ticket creation: the subcomponents “morphlines” and “SparkIndexer”. While matches within support ticket data wouldn’t yield exact measures, the result set would be a good proxy for the percentage and type of users using this set of Cloudera Search features. I then added “HBase Indexer”, another Cloudera Search component, to further explore the data and gain usage insights.

The advantage of using search as the query tool over a SQL engine is that I didn’t have to spell out every possible variation of a query to get the desired results. I typed in natural language words (e.g. “hbase” “indexer”) and Solr returned results that included “hbase” “HBase” “Hbase” “Hbaseindexer” “HBaseIndexer” “hbaseindexer” “Hbaseindexers” “HBaseIndexers” “hbaseindexers” “indexer” “indexers” “index” “indexing”, and so on. Even misspelled instances were matched. The results came back sorted by relevance, making it easy to explore the most relevant email threads first. Further, I could easily drill down by date range (e.g. only tickets between March and August), filter on case status (open, solved, etc.), or use other facets that had been indexed. This made it quick and easy to find the relevant results that I wanted to inspect manually and understand.
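For readers who want to picture what such a query looks like under the hood, here is a minimal sketch of the request parameters a tool like Monocle might send to Solr. It is not the actual Monocle implementation. The parameter names (`q`, `fq`, `defType`, `qf`, `facet`, `facet.field`, `rows`) are standard Solr query parameters, but the field names (`subject`, `body`, `created_at`, `status`, `component`), the helper name, and the year in the date range are assumptions made up for illustration. Whether variants like “HBaseIndexers” actually match depends on how the fields are analyzed at index time, which this sketch does not show.

```python
from urllib.parse import urlencode


def build_ticket_query(terms, start, end, status=None):
    """Build the query string for a Solr /select request (hypothetical schema)."""
    # Filter queries narrow the result set without affecting relevance scoring.
    filters = [f"created_at:[{start} TO {end}]"]
    if status:
        filters.append(f"status:{status}")

    params = [
        ("q", " ".join(terms)),       # free-text terms typed by the user
        ("defType", "edismax"),      # Solr's extended DisMax query parser
        ("qf", "subject body"),      # assumed text fields to search
        ("facet", "true"),           # ask Solr for facet counts
        ("facet.field", "component"),
        ("facet.field", "status"),
        ("rows", "20"),
    ]
    params += [("fq", f) for f in filters]
    return urlencode(params)


# Example: open tickets mentioning "hbase indexer" between March and August
# (the year is an assumption; the article does not state one).
query_string = build_ticket_query(
    ["hbase", "indexer"],
    "2014-03-01T00:00:00Z",
    "2014-08-31T23:59:59Z",
    status="open",
)
print(query_string)
```

Because no sort is specified, Solr falls back to its default ordering by descending relevance score, which corresponds to the "most relevant first" behavior described above. The facet counts returned for `component` and `status` are what make the drill-down filters possible.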

I was able to estimate the number of Cloudera customers using various aspects of Cloudera Search. The second insight was the realization that our internal reporting on support tickets had gaps. With so many individual projects coming together to form Cloudera Enterprise, Search-related tickets were being tagged with component names like “Flume”, “HBase”, or others, depending on which integration feature of Cloudera Search was in use at the time of ticket creation. Only by running a natural language query was I able to uncover this gap and subsequently remedy it. Now my team and I have more complete visibility into all Search-related tickets, no matter how they were originally tagged, rather than just a well-intentioned subset.
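The reporting gap is easy to reproduce on toy data. The sketch below uses four invented tickets and a crude case-insensitive regular expression as a stand-in for Solr's text matching; the ticket contents, the `component` field, and the pattern are all hypothetical and only illustrate why counting by tag alone undercounts Search-related tickets.

```python
import re
from collections import Counter

# Invented example tickets; "component" is the tag chosen at ticket creation.
tickets = [
    {"id": 1, "component": "Search", "text": "Solr collection fails to load"},
    {"id": 2, "component": "HBase", "text": "HBaseIndexer stops after region move"},
    {"id": 3, "component": "Flume", "text": "morphlines config rejected by Flume sink"},
    {"id": 4, "component": "HDFS", "text": "NameNode heap alert"},
]

# Crude stand-in for free-text search: case-insensitive match on Search terms.
SEARCH_TERMS = re.compile(r"solr|morphline|hbase\s*indexer", re.IGNORECASE)

# Tag-based reporting only sees tickets explicitly tagged "Search".
tagged = [t for t in tickets if t["component"] == "Search"]

# Free-text matching finds Search-related tickets regardless of tag.
matched = [t for t in tickets if SEARCH_TERMS.search(t["text"])]

# Which tags were hiding Search-related tickets?
hidden_by_tag = Counter(
    t["component"] for t in matched if t["component"] != "Search"
)

print(len(tagged), len(matched), dict(hidden_by_tag))
```

In this toy set, tag-based reporting sees one Search ticket, while text matching finds three, with the other two filed under “HBase” and “Flume”.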

Free-text search is truly powerful when it comes to exploring and finding either all of the relevant data or the most relevant data that you need for insights or decisions. Search gives you more flexibility in defining what a match really is, for times when “exact” is not helping! It is the right tool for interactive, iterative, free-text exploration, and for finding the needle in an unstructured haystack.

To learn more about Search, visit the Cloudera Search page on our website, and get in touch if you have questions!

