Search Engines and Big Data: A Perfect Match: Part 2

In Part 1 of this series, I discussed three reasons why search and big data work so well together. In this post, I’ll cover three more.


Your Data Does Not Fit into Tables

While it is true that relational databases handle tables better than search engines, how do you normalize a video? A contract? The genome?

Take Cloudera CDH, on which we have built many big data applications, for example. Many of our engagements go like this:

Customer: “We’re putting everything into Impala, but we have this weird type of multi-valued data that doesn’t fit well. Can search handle it?”
Us: “Sure.”
Customer: “We also have these five text fields. I guess they would be better in the search engine too, right?”
Us: “Sure.”
Customer: “We discovered that we have to do 99 joins and it’s really slow. Can we do multi-valued fields in a search engine?”
Us: “Sure.”
Customer: “Maybe it just makes sense to put everything into Cloudera Search? We’ll still use Impala for preprocessing.”

And that’s how it usually goes.
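
To make the multi-valued point concrete, here is a minimal sketch in Python of folding a joined child table into a single search document. The table names, field names, and the `denormalize` helper are hypothetical and exist only for illustration; they are not part of Impala or Cloudera Search.

```python
from collections import defaultdict

# Hypothetical rows, as they might come out of two relational tables.
customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
customer_tags = [
    {"customer_id": 1, "tag": "manufacturing"},
    {"customer_id": 1, "tag": "enterprise"},
    {"customer_id": 2, "tag": "energy"},
]


def denormalize(parents, children, key, child_field, target_field):
    """Fold child rows into each parent row as one multi-valued field."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[child_field])

    docs = []
    for parent in parents:
        doc = dict(parent)  # copy so the input rows stay untouched
        doc[target_field] = grouped.get(parent[key], [])
        docs.append(doc)
    return docs


docs = denormalize(customers, customer_tags, "customer_id", "tag", "tags")
for doc in docs:
    print(doc)
```

Each resulting document carries its tags as a list, so a query on `tags` needs no join at search time. The join cost is paid once, at indexing time.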


You Don’t Know What’s in Your Data

It’s the mantra of the data lake: “Load all your raw data into a data lake and we’ll figure out what to do with it later!”

So what do you do? You spin up a dozen teams, point them at all of your business systems, and say “Go fetch data!” Now, after a year or so, you have a data lake full of data. There are hundreds of millions of files and folders and billions of records spread throughout your cluster from thousands of systems.

So… Now what?

Point a search engine at it. Search can index pretty much anything. Of course, a search engine will do a better job if it knows what’s inside, but if it doesn’t? Well, no biggie. A search engine will just index it, and you can run any sort of keyword search on it.

The first task after loading up a data lake is to find the files and folders that contain interesting things. Since a search engine does not need a predefined schema (it can index any random bag of tokens, unlike a relational database), it can help sift through your billions of files and folders to find those that contain useful data. Then you can start processing for real.
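
The "bag of tokens" idea can be sketched in a few lines of Python. This is a toy inverted index, not how a production search engine is implemented, and the file paths and contents are made up for illustration. What it shows is that nothing about the files has to be declared up front.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def build_index(files):
    """Map each token to the set of paths whose content contains it."""
    index = defaultdict(set)
    for path, content in files.items():
        for token in tokenize(content):
            index[token].add(path)
    return index


def search(index, query):
    """Return the paths that contain every token in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical data lake contents: a CSV export, a log, and free text.
files = {
    "/lake/crm/export_0001.csv": "id,name,invoice_total\n17,Acme,1200.50",
    "/lake/logs/app.log": "2016-01-01 ERROR payment gateway timeout",
    "/lake/misc/notes.txt": "Invoice dispute with Acme, see payment history",
}

index = build_index(files)
hits = search(index, "acme invoice")
print(sorted(hits))
```

The CSV, the log, and the notes file all go through the same tokenizer with no schema, and a two-word query is enough to narrow the lake down to the files worth a closer look.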

So yes, use a search engine just to find useful data in your data lake, especially when you have a massive lake and you don’t know what’s in there.


It’s Easier Than You Think

It sounds hard – sending everything to a search engine – but it’s really not that hard with the technologies we have today. Why? Because search engines today are tightly integrated with big data.

A common example from our projects is Cloudera Search – the search engine that runs directly on the Cloudera platform. The indexes are stored in HDFS and can be managed through Cloudera Manager. Further, you can use HUE (the Hadoop User Experience) to build searches, dashboards, and cool analytics.

There are also many other tools that help with this process. Search Technologies’ Aspire platform can now index HDFS files and publish them to Cloudera Search or other search engines. And there are many tools and components (including our own Aspire system) for reading through tables and other structured formats (such as JSON Lines formats), parsing them, and indexing them into search engines.
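
As a rough idea of what the read-parse-index step looks like for JSON Lines input, here is a minimal Python sketch. It is not Aspire's implementation; the function names, sample records, and batch size are assumptions for illustration. It stops at building JSON payloads, since how those payloads are sent depends on the search engine in use.

```python
import json


def parse_json_lines(lines):
    """Parse JSON Lines text; return (documents, bad line numbers)."""
    docs, bad_lines = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, not treated as errors
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(number)
            continue
        if isinstance(record, dict):
            docs.append(record)
        else:
            bad_lines.append(number)  # valid JSON, but not an object
    return docs, bad_lines


def batches(docs, size):
    """Yield the documents in fixed-size batches for bulk indexing."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


# Hypothetical input, including a blank line and a malformed line.
sample = [
    '{"id": "1", "title": "Contract A", "tags": ["legal", "2015"]}',
    '',
    '{"id": "2", "title": "Genome notes"}',
    'not json at all',
    '{"id": "3", "title": "Video transcript"}',
]

docs, bad_lines = parse_json_lines(sample)
payloads = [json.dumps(batch) for batch in batches(docs, 2)]
print(len(docs), bad_lines, len(payloads))
```

Recording bad line numbers instead of failing on the first error matters at data lake scale, where a few malformed records in billions are the norm, not the exception.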


So It’s Obvious Now – If You Have Big Data, You Need Search!

Incorporating search into your big data projects can help solve all sorts of problems. So, consider implementing search early on in your project planning.

