This blog was penned by the following Clouderans: Alex Gutow, Justin Kestelyn and Eva Andreasson.
Building an open and integrated enterprise data hub takes more than assembling arbitrary open source components. As described in “Compatibility and Innovation: Where One Ends, the Other Begins,” there needs to be a balance of stability and innovation, and building a platform entirely on open standards ensures that critical balance. Cloudera’s core platform is built on open standards, ensuring sustainable product quality and maturity, ecosystem compatibility, and zero lock-in. Throughout this series, we will look at some of the most popular tools in the platform and see why we and our partners believe that these have become standards.
For flexible, full-text search within Cloudera’s enterprise data hub, Cloudera Search is built on the open standard Apache Solr. Because Cloudera offers the only Hadoop platform with native search, users can leverage all the robust functionality of Apache Solr while also benefiting from the integrated resource management, administration, security, and governance available with Cloudera’s platform.
Many of our customers are using Cloudera Search (Solr) in production, including Camstar, now a Siemens business, for its Omneo supply chain cloud solution. Cloudera Search is one of the many analytic frameworks Omneo leverages to rapidly index all of its raw data in a way that makes sense for customers. Kathleen deValk, senior architect at Omneo, stated, “One of our customers has about 1.5 billion documents in the search engine, and we can search all that in seconds.” CounterTack also built its Big Data Endpoint Detection & Response platform, Sentinel, on top of Cloudera’s enterprise data hub and leverages Cloudera Search for real-time endpoint threat analysis, allowing operators to make better security decisions with instant search results in enterprise endpoint environments.
A core criterion for an open source project emerging as an open standard is compatibility with third parties. Solr certainly fits that bill, and many of our partners also recognize the importance of this standard technology. Below is a look at why the Hadoop ecosystem believes Solr is the standard and what we can expect in the future of search.
In your opinion, why has Apache Solr become a standard for big data search?
Guided by a long-term architectural approach, Apache Solr’s mature codebase and its innovative, rich, and balanced community ensure that the best solutions and capabilities are continuously implemented, while also properly addressing the diverse industry use cases and needs of its broad range of developers. It is, therefore, no surprise that Solr and its community have emerged as the open standard for search.
SolrCloud addressed the industry need for more scalable, yet still reliable, search through its integration with Apache ZooKeeper, a very mature distributed coordination service in the Apache Hadoop ecosystem. Perhaps more importantly, this step into the Hadoop ecosystem also opened the door for the Hadoop community to step in and help drive a vision of multi-workload and analytical search. The arrival of SolrCloud, together with the integrations between Solr and many Hadoop ecosystem components contributed through Cloudera Search, brings two open standard communities together. Enterprises can now build competitive new applications that combine search and analytics on open standards, while also realizing a fully integrated, long-term scalable enterprise data hub for growing data volumes and processing needs.
Solr remains the de facto standard for the world’s largest organizations when it comes to collecting, indexing, and searching their data. Solr’s dominance as the leading open source search platform has cemented its position in the marketplace as the best option for building mission-critical search apps that scale.
We believe Apache Solr has become the standard for search because it addresses the real concerns of enterprises: it’s open, it’s fast, it scales, it’s stable, and it has a rich ecosystem of both commercial and open source software that supports it. It’s important that infrastructure software just works under real load and for real use cases, and Solr does exactly that. The Apache Solr community, including the team at Cloudera, recognizes the criticality of large-scale search in the enterprise. Integration with data protection and access control systems, such as Apache Sentry for role-based collection and document-level access control, is a great example of this.
These days, enterprises are far less tolerant of the lock-in that comes from storing critical data in platforms that aren’t open. They need to know that they control their data as well as their costs, and that any investment they choose to make in purpose-built applications on top of this infrastructure is not inextricably tied to a specific vendor. Apache Solr gives customers choice and flexibility, and that’s important.
Apache Solr is the standard for search and analytics because of its ease of use, reliability, scalability, and popularity. As we bring data and technologies together to solve harder business problems, deliver more value, and potentially enrich lives, Solr is a reliable vehicle to help us achieve those goals.
For Zoomdata, big data search is more than just search; it’s exploration of data. With Solr’s simplicity you can get started with a few commands and start exploring. This decreases the time to analytics (and subsequent value) – a primary goal for our customers as they develop their big data strategy. It also lets users gain deeper insights with causality and inference in large datasets. With its powerful facets, multi-field pivoting, and stats support, Solr doesn’t just return search results, but also provides numerical, spatial, and temporal data alongside textual information. This enables users to ask more interesting questions about the data, focusing on “why” something happened rather than just “what” happened. Combine this with horizontal scalability, fault tolerance, and a vibrant community that keeps improving the platform and adding ecosystem components, and you can achieve a lot with relatively little effort.
Additionally, Solr makes it easy for developers to enhance its functionality to return high-quality results. For example, the native integration with Redis provides a cleaner and more consistent query expansion technique that adds contextual information to the query. The recent query filtering/categorizing technique offers another way to improve the confidence and quality of the results.
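As a rough illustration of the facets and stats mentioned above, the sketch below builds a Solr select URL that asks for per-value counts on one field and summary statistics on another. The host, the collection name (`events`), and the field names (`status`, `response_ms`) are hypothetical and chosen only for illustration; the request parameters (`facet`, `facet.field`, `stats`, `stats.field`) are standard Solr ones. The code only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def facet_stats_url(base_url, collection, facet_field, stats_field, query="*:*"):
    """Build a Solr /select URL requesting facet counts and field statistics."""
    params = [
        ("q", query),                  # match-all by default
        ("rows", "0"),                 # skip documents, return only aggregates
        ("facet", "true"),
        ("facet.field", facet_field),  # count documents per distinct value
        ("stats", "true"),
        ("stats.field", stats_field),  # min, max, sum, etc. for a numeric field
        ("wt", "json"),
    ]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and fields.
url = facet_stats_url("http://localhost:8983/solr", "events", "status", "response_ms")
print(url)
```

Setting `rows` to 0 keeps the response small when only the aggregates are of interest.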
What is next for your organization and Apache Solr?
Yonik Seeley (creator of Solr) recently joined the search team here at Cloudera – a team that includes Mark Miller (co-creator of SolrCloud and Apache Lucene PMC), Wolfgang Hoschek (committer on Apache Lucene and creator of Morphlines), Greg Chanan (committer on Solr and Apache HBase, and Apache Sentry PMC), Patrick Hunt (PMC for ZooKeeper), and of course Doug Cutting himself (creator of both Lucene and Hadoop). This strong team will help drive Solr (and its integration with the Hadoop ecosystem) to the next generation of big data, multi-workload, fully integrated, analytical search, ready to handle the most critical needs of users and developers, as well as our customers and the industry at large.
We plan to continue to focus on innovation around reliable, scalable search – as that was the aim of our core investment in the community. In addition, we will drive cross-workload and analytical search in conjunction with other open source projects in the Hadoop ecosystem. As the leader in security and governance in Hadoop, we will also work to ensure Cloudera Search and Solr are built with the security, governance, and tooling our customers have come to expect from our platform.
Lucidworks continues to dedicate a large portion of our people, capital, and community to furthering the development of Solr to solve the next generation of data-driven challenges at volumes that exceed anything we’ve seen this past decade.
Today, Apache Solr is critical to our ability to provide full fidelity search over hundreds of terabytes of machine-generated event data. Our customers always want to retain more data, for more machines, in order to better understand system behavior, performance, quality of service, compliance, and potential security issues across their data center or cloud infrastructure. For us, that means pushing Solr’s ability to handle more data while retaining the same performance, features, and stability it offers today. We’re very excited to see Solr increasing its support for enterprise features such as advanced access control, encryption by way of integration with HDFS, and its ability to scale to ever larger index sizes. As part of a larger open data management platform, Apache Solr is, and will continue to be, a critical piece of infrastructure that allows us to deliver value to our customers.
Zoomdata allows customers to connect to Solr and start exploring their data right away. We’re working to bring more functionality to that experience with improved faceting and pivoting to allow users to slice and dice their data better. We’re also improving the search query capabilities to allow for search driven analytics with more suggestive searching and type-ahead capabilities. Some specifics include:
- Enabling search everywhere on the page (visualizations, data sources, fields, etc.) with context-relevant results depending on which page the user is searching from.
- Data Visualization Search-Driven Analytics: With the recent release of Solr 5.0, we can provide multi-field pivot with stats (sum, avg, min, max, std) on each pivot point. This will allow “multi-groupby” support in Solr based on textual search filtering and also provide interactions for answering causality-driven questions.
- Leveraging the native integration of Solr with Spark so users can interact with result sets locally and quickly.
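The multi-field pivot with stats described in the list above can be sketched as a set of request parameters. This is a minimal sketch, not Zoomdata's implementation: the field names (`region`, `product`, `revenue`) and the tag name are hypothetical, while the `{!tag=...}` and `{!stats=...}` local parameters follow the Solr 5.0 syntax for combining the stats component with pivot facets.

```python
from urllib.parse import urlencode


def pivot_stats_params(pivot_fields, stats_field, tag="piv"):
    """Return Solr params for a multi-field pivot with stats at each pivot point."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("stats", "true"),
        # Tag the stats field so the pivot can refer to it by name.
        ("stats.field", f"{{!tag={tag}}}{stats_field}"),
        ("facet", "true"),
        # The stats local param attaches the tagged stats to every pivot bucket.
        ("facet.pivot", f"{{!stats={tag}}}{','.join(pivot_fields)}"),
    ]


# Hypothetical fields: group by region, then product, with revenue stats per bucket.
params = pivot_stats_params(["region", "product"], "revenue")
query_string = urlencode(params)
print(query_string)
```

The order of the pivot fields determines the nesting, so `region,product` behaves like grouping by region first and by product within each region.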
Solr is a powerful tool both for Cloudera’s platform and the broader Hadoop ecosystem, with active engineering efforts ensuring sustained innovation. For a look at some of the recent innovations with Apache Solr and the Hadoop ecosystem, check out the following:
- How to do Real-Time Log Analytics with Apache Kafka, Cloudera Search, and HUE
- Improved Cloudera Search App in HUE
- Document-Level Security for Cloudera Search
- Call Me Maybe: SolrCloud, Jepsen, and Flaky Networks
- Solr 2014: A Year in Review
- Solr 5 Preview
For more details on Cloudera’s view of open source and open standards, check out “Cloudera’s Commitment to Open Source and Open Standards,” and be on the lookout for the next post, which examines Apache HBase as an open standard.