Machine Learning in the Age of Big Data

Categories: Data Science

Humans are creatures of prediction. It’s evident in our ability to dress ourselves in the morning in anticipation of the weather and occasion. Predictions have shaped how we exist on the planet, from the way we treat illness to the way we grow our cities and governments. At the foundation of good predictions is data, aided by the judgement and complex reasoning skills of practitioners and domain experts.

Computers were put to work collecting and storing data very early on, but only recently has it become realistic to provide accurate and reliable oversight with software and algorithms. Hadoop made it possible to store and analyze data at volumes never before imagined and to deploy and manage predictive models at large scale. As our desire to use predictions to simplify our lives, keep us safe, and guide our decision making increases, it is only natural to want to leverage the power of computing for tasks that are otherwise infeasible or require quicker action.


The Practice of Machine Learning

Machine learning has been practiced as a subfield of computer science since the late 1950s, with the caveat that early data samples were small. Its results could be valuable but did not always hold up under rigorous scrutiny. While early potential was evident, the recipe for success was not complete. After the emergence of the web, machine learning use cases arose specifically for organizations that needed predictions to improve the website experience, improve the purchase experience, or manage threats. As more data was generated on the web, it became possible to include more data in predictions, and a fairly cemented practice received a refreshing new focus: distributed systems.

Even though technology gives us enormous capabilities, the field of predictions still relies heavily on human oversight. There are many aspects of the way the human mind works that can’t be coded. We as humans can use experience, judgement, and instinct to make predictive outcomes more accurate. When solving these problems with technology we employ common statistical methods including regression, clustering, and classification. These methods can vary in accuracy depending on the complexity of the problem, and for many types of analysis the simplest method might be the most effective.
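To make the point about simple methods concrete, here is a minimal sketch of the most basic form of regression: fitting a straight line by ordinary least squares, using only the Python standard library. The function name fit_line and the data points are made up for illustration and are not from any Cloudera tool.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations that follow y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6.0 + intercept  # predict y for an unseen x of 6
print(slope, intercept, prediction)
```

Real data is noisy, so the fitted line will not pass through every point, but the same few lines of arithmetic still give a usable baseline before reaching for anything more elaborate.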


Machine Learning at Cloudera

For users of Cloudera Enterprise, machine learning is an increasingly common practice. We have dedicated our focus and resources to advancing the state of core components of our platform, like Apache Spark. In fact, in a recent adoption survey over 30% of respondents indicated they are leveraging Apache Spark for machine learning.


Apache Spark, a core part of our platform, has tools for mainstream machine learning libraries (MLlib) and tools for orchestrating machine learning (Spark ML). We offer the tools and access to data that modern organizations need to tackle machine learning on large data sets. As we see this rise in machine learning usage, we aim to deliver a scalable and enterprise-ready experience for our users. Cloudera helps with machine learning in three key areas.


More data when building models (empirical) and training models

You might ask yourself, how important is having MORE data?  Machine learning algorithms require empirical data as input. Consider this quote from Banko and Brill.  

“It’s not who has the best algorithms who wins, it’s who has the most data.”  – Banko and Brill

What they are stating here is that fitting a machine learning algorithm to a problem can only get you so far. It is in fact the ability to provide more data that makes machine learning more accurate. Think of it like education: the more you study and learn, the better prepared you are for solving complex problems. Apache Hadoop allows users to store data at massive scale in full atomic format. This provides more data points for developing initial machine learning hypotheses and for training.

Often when building machine learning models you need to test and train the operation of the model before applying it in production.  A standard rule of thumb for sampled data might be 30% of the data to test a model, and the remaining 70% to train the model.  Having an expanded data set means we can lead with fewer assumptions and expand our training across more data points, further validating the model.  
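As a rough illustration of the 70/30 rule of thumb, the sketch below shuffles a data set and holds out 30% of it for testing, using only the Python standard library. The function name train_test_split, the fixed seed, and the numbered records are assumptions made for this example; they stand in for whatever rows and tooling you actually use.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of records and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical data set: 1,000 numbered records standing in for real rows.
records = list(range(1000))
train, test = train_test_split(records)
print(len(train), len(test))  # 700 300
```

Shuffling before splitting matters when the data arrives in some order, such as by date or by customer; otherwise the test set may not resemble the data the model was trained on.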


Better access to compute resources

As you scale your data collection, the compute resources needed to deliver on the machine learning operation must scale to meet the demand as well. Cloudera Enterprise decouples compute from storage, allowing you to design the best environment to fit your workload, with the same familiar operation regardless of size and design. Components like Apache Spark make it increasingly easy and familiar for a data scientist to leverage the power of a large compute cluster to train models. Our platform abstracts away the complexities of orchestrating machine learning on a large cluster, while giving you APIs and libraries you are already familiar with. We partner with the most popular data science development environments to ensure that users can leverage more data with the same tools.


Machine Learning powers the business solutions our users are building

Machine learning powers popular functionality like recommendation engines and predictive maintenance, and is a cornerstone of future IoT workflows. Cloudera customers are delivering strategic functionality with machine learning at its core. As an example, the popular dating site Zoosk uses Cloudera Enterprise to create successful matches and reduce user churn using big data and machine learning. MMO gaming company Wargaming uses machine learning on Cloudera to elevate customer experience and engage users in advanced gameplay. And Transamerica, one of America’s leading financial services companies, uses Cloudera to test and validate data models at a much faster pace. Across industries and verticals there has never been more urgency to explore and implement machine learning solutions. Cloudera internally uses machine learning to quickly analyze the details from customer cases and log data from our Cloudera Manager tool. It is important to our customers that machine learning is instrumented to process and act on this data as it arrives, so we can deliver proactive support to our users.

Looking into the near future, machine learning is a core foundation of the world of AI. While existing machine learning practices aid in capturing experiences and producing actions, AI presents the opportunity to turn experiences into interaction in real time. The players at the forefront of this technology are the massive data collectors themselves. The importance of data volume and access to scalable compute cannot be overstated as we prepare to design the systems of the future. As AI and machine learning continue to captivate the minds and dictate the strategies of today’s top companies, the need for knowledge, best practices, and technology solutions is paramount. So are you ahead of the trend or simply along for the ride? Let me pose a challenge to your organization. If actress Kristen Stewart can implement machine learning and neural networks to artistically redraw an image in the style of a source style image, then what is your organization waiting for?

If you are interested in learning more about the practice of machine learning on Apache Hadoop you can take our intro to machine learning course online.

