Rethink Analytics: Insights from a Data Scientist – Part II

Previously, I talked about the three insights I gained from Josh Wills, Cloudera’s Director of Data Science, in preparation for the Rethink Analytics, with an Enterprise Data Hub webinar. Beyond my own revelations while preparing for the webinar, we also received a number of great questions from the webinar audience.

Is an EDH available now?

An EDH, short for enterprise data hub, is one place to store all data, for as long as desired or required, in its original fidelity; integrated with existing infrastructure and tools; with the flexibility to run a variety of enterprise workloads — including batch processing, interactive SQL, enterprise search and advanced analytics — together with the robust security, governance, data protection, and management that enterprises require.

Cloudera supports an EDH through our enterprise-grade product, Cloudera Enterprise, which is currently available for purchase.

What internal tools are used for analytics in an EDH?

Every data scientist and every data science team has their own preferred stack of tools to use for advanced analytics on an EDH. There are many tools, ranging from BI to machine learning algorithms. However, the key is to use the right tool for the job. Josh uses a lot of R, SAS, Python, etc., but he makes sure that he is choosing the tool that fits with customer requirements, and the tool that is appropriate for the data he is working with at the time.

What makes an EDH and Hadoop so exciting is that every single one of these tools, whether it’s an existing machine learning toolkit or one of a number of new entrants, is re-orienting itself around Hadoop. Their makers have realized that Hadoop is where a lot of the data is and where more of the data is going in the future. And thanks to the flexibility of an EDH, these tools can easily be designed for and integrated into this data platform.

What skillsets are necessary to enable the success of using an EDH for advanced analytics?

As with the question about which tool to use for advanced analytics, success does not depend on any one skillset. It depends more on the ability to think and reason in a more sophisticated way about how to structure data for analytics.

Josh recommends Cloudera’s training class, Introduction to Data Science. This course provides an overview of recommendation systems and tools and covers a lot of machine learning algorithms, but the primary purpose of the class is to introduce students to the way a seasoned data scientist thinks about data modeling in Hadoop: how to structure the data so as to optimize the overall analytical pipeline and workflow. It is really about the ability to learn and model the data for analysis.

How do you manage data quality in an EDH architecture? What are the interactions in the flow that address data quality?

Data quality, from a data scientist’s perspective, means that the sampled data accurately reflects the real-world environment the data scientist is building the model for. The good news about doing advanced analytics in a Hadoop environment is that the files on Hadoop are essentially immutable in nature; that is, these files are append-only, and therefore you cannot open up a file, make modifications, and then save the file, as you would on a laptop. With Cloudera Enterprise, we couple the immutable data with the ability to track data lineage, so when a data quality problem is found, you can identify its source and remove it from the analytical pipeline to prevent the issue from propagating any further.
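To make the idea concrete, here is a minimal Python sketch of append-only data combined with a simple lineage tag. It is a toy illustration only and not how Cloudera Enterprise implements lineage tracking; the `Record` and `AppendOnlyStore` classes and the batch names are hypothetical, invented for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """An immutable record tagged with the batch (source) it came from."""
    batch_id: str
    value: float


@dataclass
class AppendOnlyStore:
    """Toy stand-in for append-only files: records are added, never edited."""
    _records: list = field(default_factory=list)

    def append(self, batch_id, values):
        # New data only ever lands at the end of the store.
        self._records.extend(Record(batch_id, v) for v in values)

    def read(self, exclude_batches=frozenset()):
        # Bad batches are filtered out at read time; stored records stay untouched.
        return [r for r in self._records if r.batch_id not in exclude_batches]


store = AppendOnlyStore()
store.append("batch-001", [1.0, 2.0, 3.0])
store.append("batch-002", [-999.0, -999.0])  # hypothetical faulty load
store.append("batch-003", [4.0])

# The lineage tag (batch_id) tells us which load introduced the bad values,
# so the pipeline can exclude that batch without rewriting any stored data.
quarantined = {"batch-002"}
clean_values = [r.value for r in store.read(exclude_batches=quarantined)]
total_stored = len(store.read())
```

Because the original records are never modified, the faulty batch remains available for investigation while downstream analysis simply stops reading it.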

What is Oryx and how does it differ from Apache Mahout?

Oryx started when Cloudera acquired a company called Myrrix, which was founded by Sean Owen (now Director of Data Science for Cloudera in EMEA), a longtime contributor to the Apache Mahout project. When he formed Myrrix, Sean wanted to focus on a small set of very well implemented algorithms designed specifically for recommendation engines. After the acquisition of Myrrix, Cloudera decided to open source the code and formed the Oryx project. As we covered briefly during our Rethink Analytics webinar, there is a significant gap between building a model and deploying it. Oryx differs from Mahout in that it focuses not only on building these models, but also on serving them. With Oryx, data scientists can pick up these models and serve them immediately into production.

What impact will Spark have on an EDH?

We are seeing that Spark is becoming the open-source in-memory analytics component in an EDH. Just as MapReduce has become the general-purpose distributed batch-processing model, we will see Spark become the general-purpose distributed in-memory analytics model.

Right now, the Cloudera data science team is working on porting all the backend algorithms in Oryx over to Spark, to use for machine learning applications. Going forward, we foresee that analytical computing that is done in MapReduce right now will shift over to Spark in the next couple of years.

Check out the Rethink Analytics, with an Enterprise Data Hub webinar replay >>

Check out Cloudera Enterprise and download a trial of the Cloudera Enterprise Data Hub Edition >> 


