We recently wrapped up the third webinar in our series on the release of Cloudera 5.5. In this webinar, “Hadoop for the Data Scientists: Spark in Cloudera 5.5”, Anand Iyer (Senior Product Manager at Cloudera) and Sandy Ryza (Senior Data Scientist at Cloudera) gave an overview of Apache Spark – the most popular Apache project and the emerging standard for data processing on Apache Hadoop – and discussed the recent addition of Spark MLlib to Cloudera’s platform.
Spark has always been a popular tool for data science and machine learning, and MLlib provides a library of machine learning algorithms that makes it easier to apply Spark’s flexible development model and performance to these use cases. In addition to discussing the details of MLlib, we walked through a real-world machine learning use case showing how you can use Spark MLlib, as part of Hadoop, to predict churn at a large telecommunications company. Finally, we looked at what’s next for Spark, especially the roadmap laid out by the One Platform Initiative.
During the webinar, we received some great questions from the audience. We didn’t have enough time to address all of them live, so we followed up with Anand afterward to get answers.
The questions we followed up on include:
- How does Spark MLlib compare to tools like Mahout?
- What is the roadmap for deep learning support?
- What is the primary storage choice for Spark?
- What resources are available to learn MLlib and Python?
If you are an aspiring data scientist looking to learn about the skills, tools, and training required, check out this presentation from Cloudera’s Director of Data Science, Sean Owen, in which he walks you through how to get started.