Roundup: Cloudera and Data Science


In this post, we present a survey of stories about data science and machine learning posted on the Cloudera VISION and Engineering Blogs in the past year.

The Power of Data Science

Open data science plays a vital role in enterprise transformation. In this post, Sean Owen, Cloudera’s Director of Data Science, describes a vision of open data science. Sean explains the power of machine learning with Big Data here.

The insurance industry views data science as a potential game-changer. Satadru Sengupta, an industry expert with Cloudera partner DataRobot, describes three applications with immediate practical impact.

Data Science for Discovery

Answering important questions is a vital task for data scientists. Data scientists respond to questions that ordinary business intelligence cannot answer — because the data sources are new, or the data processing is complicated and ad hoc. For lack of a better term, we call this process “data science for Discovery.”

Telling stories with data is an essential part of the work. Cloudera partner Dataquest’s Vik Paruchuri explains how to tell stories with data. Part Two is here.

Data scientists address complex questions across many different subject areas. To demonstrate the process, Sean Owen asks: who killed the Somerton Man?

Data Science in the Cloud

The computing and storage needs of data science fluctuate. Most projects last for a limited period: weeks or months at most. Workloads are transient and irregular, marked by periodic demands for massive computing power. This pattern of use makes data science an ideal candidate for cloud computing.

However, working in the cloud imposes additional “DevOps” demands on the data science team to configure and maintain the cloud environment. Cloudera Director simplifies data science in the cloud. Jordan Volz explains how to use Cloudera Director to manage a data science environment in AWS.

You can learn more about Cloudera Director here.

Data Engineering in the Cloud

Data science and data engineering work together to drive value. Cloudera recently announced Cloudera Altus, a platform-as-a-service offering for data engineering. Jennifer Wu explains how Altus simplifies big data in the cloud; Philip Langdale offers a deeper dive into data engineering with Altus.

You can try Cloudera Altus here.  

Scoring and Prediction

Predictive modeling is sexy and cool, but it is prediction, the scoring of new data with a trained model, that drives business value. Analytically sophisticated organizations must deliver model scores on demand, with low latency, to support production applications.

Cloudera architect and technology leader John Lynch explains the benefits of migrating model scoring workloads from a traditional HPC grid to Spark under Cloudera governance.

Apache Spark

Apache Spark plays a vital role in the data science and engineering workflow. Among working data scientists surveyed by O’Reilly Media, Spark is the most widely used big data platform. Cloudera was the first major vendor to embrace Spark; today, Cloudera supports more than 400 customers on Spark, more than any other provider.

Late last year, Cloudera commissioned a survey of the Spark market by the Taneja Group. In a two-part post, Taneja’s Mike Matchett summarizes the results. Part one is here; part two is here.

Experience is the best teacher. In this sampling of posts, Clouderans offer tips and explainers for data science with Spark.

  • Anand Iyer explains how to accelerate Spark ML with Intel’s MKL.
  • Mladen Kovacevic demonstrates how to use Apache Spark with Apache Kudu.
  • Juliet Hougland and Sandy Ryza show how to predict customer churn with the Spark machine learning library.
  • Mirko Kämpf explains how to do scalable graph analytics with GraphFrames and Apache Spark.

Genetic datasets are truly massive (a single human genome has almost three billion base pairs), so genomics is a perfect application for Spark. Clouderans Tom White and Jonathan Keebler describe Hail, an open source framework for genomics built on Apache Spark. In a separate article, Tom White explains how the Genome Analysis Toolkit uses Apache Spark for data processing.

SparkR, the “native” R interface to Spark, isn’t widely used. Last year, the team at RStudio, a Cloudera partner, announced sparklyr, a native dplyr interface to Spark. Cloudera engineer Aki Ariga demonstrates flight data analysis with Apache Spark and sparklyr.

You can learn more about Cloudera’s support for Apache Spark here.

Cloudera Data Science Workbench

On May 1, Cloudera announced General Availability of the Cloudera Data Science Workbench (CDSW), a self-service tool for data scientists. CDSW delivers the power of Python, R, and Scala to data science teams working at scale in a secure and managed environment.

In this post, Matt Brandwein and Tristan Zajonc introduce you to CDSW and explain what it does. For a deeper dive, Tristan Zajonc explains how to get started.

Data scientists need to work with specific open source packages; this can create a support nightmare for IT. CDSW makes it possible for organizations to provide the flexibility data scientists need and the secure governance IT demands:

  • Aki Ariga demonstrates how to distribute your favorite Python library on a PySpark cluster with CDSW.
  • Vartika Singh explains how to implement deep learning frameworks like TensorFlow on CDH with CDSW.

You can learn more about CDSW here.


