Deep Learning in Cloudera

Deep learning is in the news.

It’s good to see people excited about technology. But deep learning is a tool that enterprises use to solve practical problems. Nothing more, and nothing less.

In this post, we provide a few examples that show how organizations put deep learning to work. We then introduce Cloudera’s unified platform for data and machine learning and show you four ways to implement deep learning.

Learn more about how to make deep learning work for your organization. Read Deep Learning: A Guide for Enterprise Architects, available here.

Deep Learning in Action

Deep learning emerged as a useful tool when practitioners used it successfully to win competitions in fields such as document analysis and recognition, traffic sign recognition, medical imaging, and bioinformatics. Today, data scientists use deep learning to solve a variety of practical problems:

  • PayPal, a leading payment systems provider, uses deep learning to detect and prevent fraud.
  • Deep Instinct, a startup, uses deep learning to protect against cyber security threats.
  • Researchers at Purdue University demonstrate a system that uses deep learning to analyze images and assess disaster damage.
  • Lloyds Banking Group uses deep learning to confirm the identity of consumers who call its call center, reducing fraud and improving operations.
  • Scientists at Penn State University and the École Polytechnique Fédérale de Lausanne use deep learning to develop a smartphone app that can diagnose disease in plants and crops.
  • Zebra Medical Vision, a startup, uses deep learning to diagnose breast cancer.

Deep learning is a proven technique and a key driver for digital transformation. Demand for tools and infrastructure is growing rapidly as executives learn more about successful applications.

Deep Learning in Cloudera

Cloudera is a unified platform for data and machine learning. With Cloudera, you bring deep learning to your data and not the other way around.

For today’s complex technology environments, enterprises need choices and flexibility. Cloudera offers multiple ways to train and deploy deep learning models, without new silos or data movement.

Cloudera Data Science Workbench

Cloudera Data Science Workbench (CDSW) enables fast, easy, and secure self-service data science. It is secure and compliant by default, with support for full Cloudera authentication, authorization, encryption, and governance.

CDSW provides data scientists with a browser-based development environment for Python, R, and Scala. Users can download and experiment with the latest libraries and frameworks in customizable settings, and easily share projects with peers. The software includes built-in scheduling, monitoring, and email alerting.

Figure 4: Cloudera Data Science Workbench

The latest CDSW release includes support for GPU-enabled devices. GPUs are specialized processors that accelerate computationally intensive workloads, and they are particularly well suited to the training step for deep learning models. With CDSW, data scientists can use conventional hardware for tasks like data preparation and discovery, then train a deep learning model on a GPU-accelerated machine.

CDSW users share available GPU resources. Users request a specific number of GPU instances, up to the total number available on a node. CDSW allocates GPUs to the job for the duration of the run. Projects can use isolated versions of libraries, and even different CUDA and cuDNN versions via CDSW’s extensible engine feature.
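
The sharing rules above can be pictured with a toy model. The Python sketch below is illustrative only and is not CDSW's scheduler: the `GpuNode` class and its `allocate` method are hypothetical names. It assumes that a job may request at most the node's total GPU count and that GPUs return to the shared pool when the run ends.

```python
from contextlib import contextmanager


class GpuNode:
    """Toy model of the GPU pool on one node (hypothetical, not CDSW code)."""

    def __init__(self, total_gpus):
        self.total_gpus = total_gpus
        self.free = set(range(total_gpus))  # ids of GPUs not currently in use

    @contextmanager
    def allocate(self, requested):
        """Reserve `requested` GPUs for the duration of one run."""
        if requested > self.total_gpus:
            raise ValueError("request exceeds the GPUs installed on this node")
        if requested > len(self.free):
            raise RuntimeError("not enough free GPUs right now")
        granted = sorted(self.free)[:requested]
        self.free.difference_update(granted)
        try:
            yield granted  # the job runs while it holds these GPUs
        finally:
            self.free.update(granted)  # release them when the run ends


if __name__ == "__main__":
    node = GpuNode(total_gpus=4)
    with node.allocate(2) as job_a:
        print("job A holds GPUs", job_a, "free:", len(node.free))
    print("after the run, free:", len(node.free))
```

Using a context manager here is a design choice for the sketch: it guarantees the GPUs go back to the pool even if the job fails partway through.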

Data scientists working with CDSW can use any deep learning framework that has a Python, R, or Scala API, including TensorFlow, Keras, Theano, Microsoft Cognitive Toolkit (CNTK), Caffe, PyTorch, DL4J, Apache MXNet, Torch, and BigDL.

Learn more about Cloudera Data Science Workbench here.

Apache Spark in Cloudera

Apache Spark in Cloudera provides an excellent platform for transfer learning and inference in an existing cluster. Four open source packages make this possible: BigDL, DL4J, Spark Packages, and Deep Learning Pipelines.


BigDL

BigDL is a distributed deep learning library for Apache Spark, developed and distributed by Intel. With Scala and Python APIs, the software provides broad support for deep learning model development and inference. Users can also load pre-trained TensorFlow, Caffe, or Torch models and use BigDL for inference.

BigDL uses the Intel® Math Kernel Library (Intel® MKL) and multithreaded programming in each Spark task. According to Intel, it is orders of magnitude faster than out-of-the-box Caffe, Torch, or TensorFlow on a single-node Intel® Xeon® processor.


Deeplearning4j

Deeplearning4j (DL4J) is an open source deep learning framework written in Java. Skymind, a Cloudera partner, leads development for the project and provides commercial support.

DL4J is a distributed, multi-threaded framework that integrates with Spark and trains models within the cluster. In a distributed environment, DL4J shards (splits) large datasets and passes the shards to worker nodes for execution. Each node trains a model on its local data; DL4J then iteratively averages the parameters to produce a single model.
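
The shard, train locally, then average loop described above can be sketched in a few lines. The example below is plain Python rather than DL4J code, and it runs the "workers" one after another instead of in parallel on a cluster. The model (a one-variable linear regression), the function names, and the learning rate are all assumptions chosen to keep the sketch small.

```python
import random


def make_dataset(n, true_w, true_b, seed):
    """Generate noisy samples of y = true_w * x + true_b."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = true_w * x + true_b + rng.gauss(0.0, 0.01)
        data.append((x, y))
    return data


def shard(data, num_workers):
    """Split the dataset into num_workers roughly equal shards."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_sgd(params, shard_data, lr=0.1):
    """Train a copy of the model on one worker's shard with plain SGD."""
    w, b = params
    for x, y in shard_data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return (w, b)


def average(param_list):
    """Average each parameter across all workers."""
    n = len(param_list)
    return tuple(sum(p[i] for p in param_list) / n
                 for i in range(len(param_list[0])))


def train_with_parameter_averaging(data, num_workers=4, rounds=20):
    shards = shard(data, num_workers)
    params = (0.0, 0.0)
    for _ in range(rounds):
        # Each worker starts from the shared parameters and sees only its shard.
        local = [local_sgd(params, s) for s in shards]
        # The averaged result becomes the starting point for the next round.
        params = average(local)
    return params


if __name__ == "__main__":
    data = make_dataset(400, true_w=2.0, true_b=-1.0, seed=0)
    w, b = train_with_parameter_averaging(data)
    print(f"w={w:.3f}, b={b:.3f}")
```

Repeating the average over several rounds is what lets the separately trained copies converge on a single model instead of drifting apart.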

In addition to its Java API, DL4J supports Scala and Clojure and provides a Python interface through Keras.

Spark Packages

Two packages in the Spark Packages library support deep learning: TensorFlowOnSpark and CaffeOnSpark.

Yahoo! developed both packages to bring deep learning to Spark clusters. By combining features from the deep learning frameworks and Apache Spark, the packages enable distributed deep learning on clustered servers. They support neural network model training, testing, and feature extraction. Both packages have Python and Scala APIs. Data scientists can combine functions from the deep learning frameworks with other Spark tasks in a single pipeline.

Deep Learning Pipelines

Deep Learning Pipelines is a project of Databricks. It provides high-level APIs for scalable deep learning in Python with Apache Spark. The project, currently in its first release, aims to provide easy-to-use APIs that enable deep learning in very few lines of code. Deep Learning Pipelines supports TensorFlow and TensorFlow-backed Keras workflows. The developers intend to focus on model inference/scoring and transfer learning of image data at scale.
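
Transfer learning, one of the project's stated focus areas, reuses a pre-trained network as a fixed feature extractor and trains only a small classifier on top of its output. The sketch below illustrates that idea in plain Python. It does not use the Deep Learning Pipelines API: `pretrained_features` is a hypothetical stand-in for a real pre-trained network, and the data is synthetic.

```python
import math
import random


def pretrained_features(point):
    """Stand-in for a frozen, pre-trained network (hypothetical).

    A real featurizer would be a deep network trained elsewhere; here a
    fixed formula plays that role. Nothing in this function is updated.
    """
    x1, x2 = point
    return [x1 * x1 + x2 * x2, 1.0]  # squared radius plus a bias term


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(points, labels, lr=0.5, epochs=200):
    """Fit only the small logistic-regression head on the frozen features."""
    feats = [pretrained_features(p) for p in points]
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)))
            weights = [w - lr * (p - y) * v for w, v in zip(weights, f)]
    return weights


def predict(weights, point):
    score = sum(w * v for w, v in zip(weights, pretrained_features(point)))
    return 1 if score > 0 else 0


def make_points(n, seed):
    """Synthetic task: label 1 if the point lies inside a circle of radius 0.7."""
    rng = random.Random(seed)
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    labels = [1 if x1 * x1 + x2 * x2 < 0.49 else 0 for x1, x2 in points]
    return points, labels


if __name__ == "__main__":
    train_x, train_y = make_points(200, seed=0)
    test_x, test_y = make_points(200, seed=1)
    head = train_head(train_x, train_y)
    hits = sum(predict(head, p) == y for p, y in zip(test_x, test_y))
    print("held-out accuracy:", hits / len(test_x))
```

Because the feature extractor is frozen, training touches only two weights, which is why transfer learning needs far less data and compute than training a full network.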

Cloudera for Deep Learning

Move forward with Cloudera, the unified platform for data and machine learning. You can learn more about Cloudera Data Science and Engineering here.

