Agile Machine Learning on Apache Hadoop – Skytree with Apache Spark

This blog was penned by Abhimanyu Aditya, Co-Founder and MTS @ Skytree

Data scientists tackle some of the toughest problems in the world using machine learning. From analyzing fraud and building recommendation engines to preventing failures before they happen, they are tasked with predicting the future by laboriously walking through the data science lifecycle. This includes the repetitive tasks of gathering, exploring, and preparing data, then experimenting, testing, and validating hypotheses. The batch-oriented nature of processing prevalent today, combined with the advent of big data, can make this process extremely time-consuming and tedious.

For data scientists working with Cloudera’s enterprise data hub, Apache Spark has emerged as a standard tool for Hadoop and a faster alternative to batch processing. Spark’s appeal lies in iterative in-memory processing and data manipulation, which speeds up advanced, machine learning-centric data preparation tasks.

Skytree, an inaugural member of Cloudera’s Accelerator Program, has leveraged Spark to build a rich set of data wrangling, preparation, and transformation features into Skytree Infinity™, an interactive, enterprise-grade software platform designed from the ground up for Enterprise Machine Learning. The integration of Spark with Skytree Infinity enables data scientists to prototype workflows quickly and iteratively build highly accurate and robust models for processing structured data, unstructured text, and time series data.



Running Skytree with Spark on Cloudera’s platform simplifies the data science lifecycle in the following ways:

With Skytree Infinity and Spark, data scientists who work on Cloudera’s enterprise data hub can easily access disparate data sources, including RDBMS, HDFS, Hive and flat files. This allows data scientists to work from a single development environment while accessing various platforms, data sources and types of analysis. Data scientists can apply the most advanced machine learning methods to troves of data and obtain accurate insights to help solve the most challenging business problems.

Integration with Spark enables data scientists to execute advanced machine learning data preparation tasks 10 to 100 times faster. Wrangling large quantities of structured, time series, natural language, and semi-structured data can be particularly laborious. With Skytree Infinity’s data preparation features for machine learning, which are built on top of Spark and Cloudera, data scientists spend less time preparing data and more time discovering insights.

In essence, Spark allows data scientists to execute advanced data preparation tasks for machine learning with speed and scalability. Data scientists working on Cloudera’s enterprise data hub with Skytree’s Infinity platform can now apply best-in-class machine learning algorithms to troves of data on a single platform, while taking advantage of Skytree’s Spark integration to handle the arduous, iterative data preparation tasks that machine learning requires, gaining faster and better insight from their data.

