Over just a few short years, Apache Hadoop has evolved from a batch processing framework, built on low cost storage and compute hardware, to an enterprise-ready data management platform able to handle real-time storage, processing and analytics applications.
It’s still true, though, that most companies start out using Hadoop as a “data lake” — a scalable data repository built on the cheap-and-deep HDFS (Hadoop Distributed File System) storage economics — to capture data from anywhere, and in any format, for future analysis. That’s no surprise. Consolidating data silos built for special-purpose applications makes that data easier to find, use and combine in new ways.
The data lake use case is a great way for companies to get to know Hadoop. It’s simple and easy to implement, and it delivers an easily measured return on the initial investment. From that starting point, building new analytic and processing applications using Apache HBase, Apache Hive, Apache Pig, Impala, Presto, Apache Spark and other ecosystem components can squeeze new value out of the data. Integrating the data lake with the rest of the tools and platforms that the business depends on turns it into a genuine enterprise data hub.
But what if an organization already owns a data lake? After all, the storage industry has been offering plenty of lower-cost, horizontally scalable platforms for storing large volumes of data for many years. These NAS and SAN systems break down data silos and consolidate less frequently accessed data, for archival and retention purposes, in one multi-petabyte pool. That’s a data lake. Does it solve the big data problem?
Well, no. As Hadoop deployments shift from proof-of-concept sandbox experiments to enterprise-grade, mission-critical production solutions, they take on new workloads, and those workloads need all the power and all the flexibility of those ecosystem components listed above. Customers with existing investments in non-HDFS data lakes are just as excited about attacking new analytic and processing workloads as everyone else.
Setting up an alternative HDFS-based Hadoop cluster using Direct Attached Storage (DAS) would mean copying data from the existing NAS-based data lake into a separate Hadoop installation. Copying is expensive; copying terabytes or petabytes is prohibitively so. To solve this problem, the leading storage solution providers are increasingly active within the Hadoop community, working to bring the power of Hadoop analytics to their non-HDFS data lake customers.
EMC has been especially forward in this regard. I’m very pleased to announce here Cloudera’s ongoing collaboration with EMC, and the integration of EMC storage solutions with Cloudera’s platform. As of the Cloudera Enterprise release shipping in October of 2014, customers using EMC Isilon, EMC’s scale-out NAS solution, will be able to keep their data where it is, and to run any Cloudera Enterprise component on it, in place. That release will be jointly supported and certified.
As you may have guessed from our CEO Tom Reilly’s participation in the EMC Isilon team’s keynote at EMC World, Cloudera and EMC have been working on this for some time. You can read more about our collaboration and the strategy behind it here.
EMC Isilon already supports the Apache HDFS protocol, and EMC Isilon customers today have successfully deployed the 100% open source CDH, version 4. With our joint solution, EMC Isilon customers will get all of the unique, enterprise-grade Hadoop capabilities that are only available with Cloudera Enterprise. Whether motivated by added security capabilities like key management, by compliance requirements like robust audit and data lineage, by automated backup and disaster recovery, or by the critical need for a true enterprise-grade tool like Cloudera Manager to keep the entire Hadoop environment running optimally, EMC Isilon customers will get the processing and analytic power of Cloudera Enterprise atop their existing data lake.
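Because Isilon speaks the HDFS wire protocol, pointing a Hadoop cluster at an Isilon data lake is, at its simplest, a matter of client configuration rather than data migration. As a minimal sketch (the SmartConnect zone hostname here is a hypothetical placeholder for your own environment; 8020 is the customary HDFS RPC port):

```xml
<!-- core-site.xml: direct Hadoop clients at the Isilon HDFS endpoint
     instead of a DAS-based NameNode.
     "isilon.example.com" is a hypothetical SmartConnect zone name. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://isilon.example.com:8020</value>
</property>
```

With a setting along these lines, the compute nodes run the processing frameworks (MapReduce, Hive, Spark and so on) while the Isilon cluster serves the NameNode and DataNode roles over the HDFS protocol, so data stays in place.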
We’re not the only ones excited about turning EMC Isilon data lakes into Cloudera enterprise data hubs. Our collaboration has included EMC Isilon customers who’ve tested and validated the work we’re doing. Those companies are thrilled to be able to preserve the value of their investments in easily managed, centralized, petabyte-scale storage systems. They’re excited about bringing the rich processing and analytics capability of Hadoop to their existing data. Early Cloudera-on-Isilon deployments include both operational use cases (offloading ETL/ELT workloads from more expensive processing platforms to Hadoop) and advanced analytics applications (industry-driven solutions like fraud detection for financial institutions).
Cloudera remains committed to HDFS and the direct-attached storage architecture that Google invented and that the community designed into Hadoop. As the market has matured, though, and as our enterprise data hub has penetrated more data centers, we’ve encountered vast existing data lakes built on systems like Isilon. We’ve discovered customers with good reasons to choose Isilon — its manageability, integration with existing applications and more. That data is of more value if we can bring it to life with Hadoop’s processing and analytic capacity.
I am thrilled to be working with EMC and the EMC Isilon team to turn these data lakes into data hubs.