Optimized Prediction on Enterprise Data Hub to Unlock the Value of Your Big Data

It has become more challenging than ever to stay competitive in an increasingly fast-changing economy. Many organizations are turning to machine-learning (ML) technology to help them predict future events and stay ahead of the competition. But building accurate machine learning models and processing big data is challenging, requiring lots of data and the right advanced algorithms.

Over the years we have learned through our customers some of the common challenges in applying standard machine learning techniques to solve real-world business problems (propensity to purchase, client churn prediction, fraud detection and risk modeling, etc.):

  1. Most machine-learning software does not recognize the varying cost/benefit associated with each business event but assigns an equal value to all events. In other words, not all your clients are equal in value, but most machine-learning software treats them all equally.
  2. Most machine-learning software is tailored towards finding common events instead of the rare ones, which are often the most valuable. For example, businesses often want to focus their resources only on clients that are at risk of leaving or are most likely to purchase a product, and these clients are often only a small portion of the entire population.
  3. The volume and variety of data are accumulating at an explosive rate, but most machine-learning software struggles to scale to big data.

To address these challenges, Yottamine developed a breakthrough patent-pending optimized prediction algorithm and its software implementation, YottamineOP. Compared with standard ML algorithms, Yottamine’s Optimized Prediction algorithm uses an innovative learning mechanism to construct a machine learning model. Instead of treating all the data equally and minimizing the sum of squared errors over the individual data points during the training phase, YottamineOP takes the size of the reward and the rareness of each data point into account. The result is an ML model that provides a higher business return, because the algorithm aligns much more closely with the business’s goal while making sure that the model does not overfit the data.
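The difference between the two objectives can be sketched in a few lines of Python. This is not the YottamineOP algorithm, whose details are not given in this post; it is a minimal illustration of the general idea. The function names and the specific weighting rule (reward multiplied by inverse class frequency) are assumptions made for this example only.

```python
from collections import Counter

def squared_error_loss(y_true, y_pred):
    """Standard objective: every data point contributes equally."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def value_weighted_loss(y_true, y_pred, rewards):
    """Illustrative objective (an assumption, not YottamineOP's formula):
    each squared error is scaled by the business value of the data point
    and by the inverse frequency of its class."""
    counts = Counter(y_true)
    n = len(y_true)
    total = 0.0
    for t, p, r in zip(y_true, y_pred, rewards):
        rarity = n / counts[t]  # rarer classes receive a larger weight
        total += r * rarity * (t - p) ** 2
    return total

# Toy data: one rare, high-value positive among nine low-value negatives.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rewards = [200.0] + [1.0] * 9

misses_rare_event = [0.0] * 10             # never predicts the rare event
one_false_alarm = [1.0, 1.0] + [0.0] * 8   # finds it, with one false alarm

plain_a = squared_error_loss(y_true, misses_rare_event)
plain_b = squared_error_loss(y_true, one_false_alarm)
weighted_a = value_weighted_loss(y_true, misses_rare_event, rewards)
weighted_b = value_weighted_loss(y_true, one_false_alarm, rewards)
```

On this toy data the plain squared-error loss cannot tell the two predictors apart (each makes exactly one unit error), while the weighted loss penalizes missing the rare, valuable event far more heavily than raising a single false alarm.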

Translated into business benefits, YottamineOP can help businesses achieve higher returns by finding the most rewarding events, or reduce losses by finding the costliest events. In a real-world business use case, YottamineOP generated 50% more profit than the highly popular Gradient Boosting Machine and Support Vector Machine algorithms when given the same training data as input. In this case, a rewarding business event occurs only 5% of the time, and the most rewarding event is worth 200 times as much as the least rewarding event. The study is detailed in our “Higher Business ROI with Optimized Prediction” whitepaper.
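A small worked example shows why measuring profit rather than hit rate matters when event values are this uneven. The numbers below are synthetic and chosen only to mirror the proportions mentioned above (5% rewarding events, a 200x spread in value); they are not taken from the whitepaper, and the client IDs, values, and contact cost are hypothetical.

```python
def realized_profit(targeted, event_value, cost_per_contact):
    """Value of the rewarding events reached, minus the cost of every contact."""
    gain = sum(event_value.get(client, 0.0) for client in targeted)
    return gain - cost_per_contact * len(targeted)

# 100 clients, of which 5 (5%) are rewarding events worth between 1 and 200.
event_value = {3: 200.0, 17: 50.0, 42: 10.0, 58: 2.0, 91: 1.0}

# Two hypothetical models each contact 5 clients and each find 3 events.
model_a = [42, 58, 91, 7, 8]  # finds the three least valuable events
model_b = [3, 17, 42, 7, 8]   # finds the three most valuable events

cost = 1.0
profit_a = realized_profit(model_a, event_value, cost)
profit_b = realized_profit(model_b, event_value, cost)
```

Both models have the same hit rate (3 out of 5 contacts), yet their profits differ by more than an order of magnitude, which is the gap a value-aware objective is meant to close.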

To address the last challenge, a machine learning algorithm needs to be able to perform its costliest computation in parallel. This improves the efficiency and speed of deploying ML technology for business. In a test on a training data set with 6.3 billion numerical values, a fully parallel algorithm running on a 20-node Hadoop cluster was 2 to 3 orders of magnitude faster than a single-threaded algorithm on a single server. To put this into perspective, ML models can be built in hours rather than weeks. Furthermore, it allows data scientists to use more data to develop more accurate and reliable models for their business applications. More details of the test can be found here.
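The usual way to parallelize the costly step is to split the data into partitions, compute a partial result on each, and sum the partial results. The sketch below shows that pattern for the gradient of a one-parameter least-squares model. It is an illustration only, not YottamineOP code: the model, the function names, and the use of a thread pool as a stand-in for cluster nodes are all assumptions (Python threads do not give real CPU parallelism; on a cluster each partition would live on a different node).

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(partition, w):
    """Gradient of the squared error of y ~ w * x over one data partition."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)

def parallel_gradient(partitions, w, workers=4):
    """Map: one partial gradient per partition. Reduce: sum the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda part: partial_gradient(part, w), partitions)
        return sum(partials)

# Toy data generated from y = 3x, split round-robin into 4 partitions.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
partitions = [data[i::4] for i in range(4)]

grad_at_zero = parallel_gradient(partitions, 0.0)
grad_at_optimum = parallel_gradient(partitions, 3.0)
```

Because the gradient is a sum over data points, the partitioned result equals the single-pass result, and the work per node shrinks as nodes are added.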

In many cases, the parallel version of an ML algorithm may look very different from its sequential one. Sometimes, it may not be possible to develop a parallel version at all. In our case, YottamineOP was designed from the very beginning to run on parallel computers. The first version of YottamineOP was developed using Hadoop and the Message Passing Interface (MPI). MPI was used because Hadoop was not suitable for the iterative part of the computation, as it does not cache data in memory. While MPI provides the shortest possible CPU runtime because it is written in native code, data needs to be moved out of Hadoop’s HDFS in order to be accessed by MPI. Yottamine’s Engineering team later implemented a version that uses Hadoop and Spark, as Spark has the advantage of storing data in memory while being tightly integrated with Hadoop.

Cloudera Certification

Since its inception, Yottamine has recognized that big data and parallel processing go hand in hand, and has made this an integral part of our software and algorithm design. However, combining the parallel processing power of Hadoop with advanced machine learning was not enough. Businesses also need to provide their data scientists and business analysts with easy access to data, and with the flexibility to add new data sources in a single location so they can adapt to a fast-changing environment.

With increasing concerns over cybersecurity and businesses wanting their sensitive data to stay on-premise, Yottamine realized that Cloudera Enterprise, as the only solution available that provides strong security on sensitive data through Sentry (for role-based access control) and Cloudera Navigator (for data encryption), would be the perfect big data platform for YottamineOP. Our joint solution offers unique machine learning capabilities to solve business problems in a way that maximizes the ROI on a highly secured parallel processing data platform with unlimited data storage capability.

The diagram below illustrates how YottamineOP is integrated with Cloudera Enterprise, allowing data scientists to use R or Python to build optimized prediction models with ease through their existing Cloudera cluster. Once data scientists have finished preparing their training data set inside R for model building, YottamineOP uploads the R dataframe from the user’s R session to HDFS via WebHDFS with a simple method call. The YottamineOP data transformation module then performs pre-processing and transformation on the training data set using the MapReduce API. In this step, YottamineOP also automatically estimates the hard-to-tune hyper-parameter of the machine-learning model, taking much of the guesswork out of the process for data scientists. After the data transformation phase, YottamineOP’s machine-learning engine runs on Spark as a YARN application to perform the computing-intensive model-building process on the transformed training data set.

Yottamine diagram
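For readers unfamiliar with WebHDFS, the upload step in the workflow above relies on Hadoop's public REST interface, in which a file is created with a two-step `PUT` (the NameNode replies with a redirect to a DataNode, which receives the data). The Python sketch below shows that pattern using only the standard library. It does not show YottamineOP's own method call, which is not documented in this post; the host name, port, path, and user are hypothetical, and the `user.name` parameter assumes simple (non-Kerberos) authentication, which a secured cluster would not accept.

```python
import csv
import http.client
import io
from urllib.parse import quote, urlencode, urlsplit

def webhdfs_create_url(host, port, hdfs_path, user, overwrite=True):
    """Build the NameNode URL for a WebHDFS CREATE request."""
    query = urlencode({"op": "CREATE", "user.name": user,
                       "overwrite": str(overwrite).lower()})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"

def rows_to_csv_bytes(header, rows):
    """Serialize an in-memory table to CSV bytes for upload."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")

def upload(host, port, hdfs_path, user, payload):
    """Two-step CREATE. Needs a live cluster, so it is not called below."""
    url = urlsplit(webhdfs_create_url(host, port, hdfs_path, user))
    namenode = http.client.HTTPConnection(url.hostname, url.port)
    namenode.request("PUT", f"{url.path}?{url.query}")  # step 1: no body
    response = namenode.getresponse()
    location = response.getheader("Location")           # DataNode address
    response.read()
    namenode.close()
    target = urlsplit(location)
    datanode = http.client.HTTPConnection(target.hostname, target.port)
    datanode.request("PUT", f"{target.path}?{target.query}", body=payload)
    status = datanode.getresponse().status              # 201 on success
    datanode.close()
    return status

# Hypothetical cluster details, for illustration only.
url = webhdfs_create_url("namenode.example.com", 50070,
                         "/user/alice/train.csv", "alice")
payload = rows_to_csv_bytes(["x", "y"], [[1, 2], [3, 4]])
```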

At the core of most machine-learning algorithms is an iterative optimizer (such as gradient descent or the conjugate gradient method) that performs computation on the same set of data repeatedly, once per iteration. The benefits of Spark’s in-memory, parallel computing capability really shine here, allowing this part of the computation to run much faster than it would with MapReduce. Apart from using R or Python to generate training data, data scientists can also select from a rich set of options (Impala, Pig, and Spark, for example) within Cloudera Enterprise to prepare the data. This gives data scientists more freedom and capability in how they work with their data.
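To make the "same data, every iteration" point concrete, here is a minimal pure-Python conjugate gradient solver for a small symmetric positive-definite system. It is a textbook sketch, not YottamineOP's optimizer; the matrix, the helper names, and the iteration limits are assumptions for illustration. The line marked as the data pass is the one that touches the full data set on every iteration, which is why keeping that data in memory pays off.

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, iterations=25, tol=1e-12):
    """Solve A x = b for symmetric positive-definite A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x starting at zero
    p = list(r)          # initial search direction
    rs_old = dot(r, r)
    for _ in range(iterations):
        if rs_old < tol:
            break
        Ap = matvec(A, p)  # data pass: repeated on the same A every iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b)  # exact answer is [1/11, 7/11]
```

In a MapReduce setting each `matvec` call would re-read its input from disk; with the data cached in memory, only the small vectors change between iterations.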

The integration of YottamineOP and Cloudera Enterprise provides the only solution that overcomes the deficiencies of common ML technology: treating all data equally, failing to find rare events, and being unable to scale to big data. This solution is also a perfect environment for data scientists to be more innovative in solving the business problem at hand using big data and machine learning technology.

The Cloudera certification process was transparent, smooth and easy to follow, and it was a pleasure working with the Cloudera certification team. Yottamine is excited to announce our product certification and we look forward to the future opportunities that this joint partnership brings.


Dr. Te-Ming (David) Huang, Founder of Yottamine Analytics

Dr. Huang is a pioneer in developing machine learning algorithms and software for big data. He has made a number of significant contributions to machine learning theory for big data and authored the monograph “Kernel Based Algorithms for Mining Huge Data Sets: Supervised, Semi-supervised, and Unsupervised Learning”. He won the best paper award at the KES 2004 international conference for his novel contribution to improving the accuracy of graph-based semi-supervised learning algorithms. More recently, one of his earlier algorithms for Support Vector Machines has been incorporated by MathWorks, further validating the quality of his algorithms and contributions in the field of data science.

Prior to starting Yottamine, Dr. Huang worked for a number of corporations, applying large-scale machine learning techniques to key business challenges and operational optimization in digital marketing, text classification, gene microarray analysis, and traffic prediction. Before Yottamine Analytics, Dr. Huang was a research scientist at Microsoft and a senior scientist at INRIX, where he specialized in applying his research to commercial applications, in particular large-scale web classification and real-time traffic prediction. Dr. Huang founded Yottamine in 2009 and has been helping businesses across various industries apply machine learning ever since.

