No one has to elaborate on the interest and importance of Data Science, so we won’t go into why you should be looking at frameworks and tools to enable AI/ML and more fun things on your Hadoop infrastructure. One way to do this on Oracle Big Data Appliance is to use Cloudera Data Science Workbench (CDSW). See the end of this post for some information on CDSW and its benefits.
How does it work?
Assuming you want to go with CDSW for your data science needs, here is what is being enabled with Big Data Appliance and what we did to enable support for CDSW.
CDSW will run on (a set of) edge nodes on the cluster. These nodes must adhere to some specific OS versions, and so we released a new BDA base image for edge nodes that provides Oracle Linux 7.x with UEK 4. CDSW supports Oracle Linux 7 as of CDSW 1.1 (more version information here).
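As a rough illustration of the OS requirement above, here is a minimal sketch of a pre-install check that reads os-release style text and reports whether a node looks like Oracle Linux 7. The function names are hypothetical, and the assumption that Oracle Linux reports `ID="ol"` with a `VERSION_ID` starting with `7` should be verified on your own nodes. It does not check the UEK version or any other CDSW prerequisite, for which the Cloudera and My Oracle Support documentation remain the reference.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict (quotes stripped)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def is_oracle_linux_7(text):
    """Return True if the os-release text looks like Oracle Linux 7.x.

    Assumption: Oracle Linux identifies itself with ID="ol".
    """
    fields = parse_os_release(text)
    major = fields.get("VERSION_ID", "").split(".")[0]
    return fields.get("ID") == "ol" and major == "7"


if __name__ == "__main__":
    os_release = Path("/etc/os-release")
    if os_release.exists():
        print("Looks like OL7:", is_oracle_linux_7(os_release.read_text()))
    else:
        print("No /etc/os-release found on this host")
```

Run on a candidate edge node, this only confirms the distribution and major version; it is meant as a quick sanity check before following the documented install steps, not as a replacement for them.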
With the OS version squared away, we are set to support CDSW. On a BDA with 8 nodes (schematic shown below), you would re-image the two edge nodes to the BDA OL7 base image, configure the network, and integrate the nodes as edges into the cluster. After this, you apply the CDSW install as documented by Cloudera.
As you can see in the image, the two edge nodes are running OL7, but they form an integral part of the BDA cluster. They are also covered under the embedded Cloudera Enterprise Data Hub license. The remainder of the cluster nodes, as would be done in almost all instances, remain on the regular OL6 OS with the Hadoop stack installed. Cloudera Manager is available for you to administer the cluster (no changes there of course).
And that really is it.
Detailed steps for Oracle customers have been tested and are published via My Oracle Support.
What is Cloudera Data Science Workbench?
[From Cloudera – Neither I nor Oracle takes credit for the below]
The Cloudera Data Science Workbench (CDSW) is a self-service environment for data science on Cloudera Enterprise. Based on Cloudera’s acquisition of data science startup Sense.io, CDSW allows data scientists to use their favorite open source languages — including R, Python, and Scala — and libraries on a secure enterprise platform with native Apache Spark and Apache Hadoop integration, to accelerate analytics projects from exploration to production. CDSW delivers the following benefits:
- For data scientists: Use R, Python, or Scala with their favorite libraries and frameworks, directly from a web browser. Directly access data in secure Hadoop clusters with Spark and Impala. Share insights with their entire team for reproducible, collaborative research.
- For IT professionals: Give your data science team the freedom to work how they want, when they want. Stay compliant with out-of-the-box support for full Hadoop security, especially Kerberos. Run on Private Cloud, Cloud at Customer, or Public Cloud.
[End Cloudera bit]
If you are reading this, you must be interested in analytics and AI/ML on Hadoop. This post is very cool and uses the freely downloadable Big Data Lite VM. Check it out…
NOTE: This blog was originally posted on Oracle’s The Data Warehouse Insider blog. You can find the post here.