In a recent GigaOM article, I shared insights from my analysis of the NFL’s Play by Play Dataset, which is a great metaphor for how enterprises can use big data to gain valuable insights into their own businesses. In this follow-up post, I will explain the methodology I used and offer advice for how to get started using Hadoop with your own data.
To see how my NFL data analysis was done, you can view and clone all of the source code for this project on my GitHub account. I used Hadoop and its ecosystem for this processing. The dataset covers the 2002 NFL season through the fourth week of the 2013 season.
Two MapReduce programs do the initial processing. These programs process the Play by Play data and parse out the play description. Each play includes an unstructured, human-written description of what happened. Using regular expressions, I determined what type of play it was and what happened during it. Was there a fumble? Was it a run, or a missed field goal? Those scenarios are all accounted for in the MapReduce program.
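To give a flavor of that parsing step, here is a minimal sketch in Python. The patterns and play descriptions below are illustrative; the actual regular expressions in the project's MapReduce code are more involved and cover many more scenarios.

```python
import re

# Hypothetical patterns -- ordered so more specific cases (a missed field
# goal) are checked before more general ones (a made field goal, a run).
PLAY_PATTERNS = [
    ("missed_field_goal", re.compile(r"field goal is no good", re.IGNORECASE)),
    ("field_goal",        re.compile(r"field goal is good", re.IGNORECASE)),
    ("fumble",            re.compile(r"fumble", re.IGNORECASE)),
    ("pass",              re.compile(r"pass (?:complete|incomplete)", re.IGNORECASE)),
    ("run",               re.compile(r"(?:left|right|up the middle)", re.IGNORECASE)),
]

def classify_play(description):
    """Return the first play type whose pattern matches the free-text description."""
    for play_type, pattern in PLAY_PATTERNS:
        if pattern.search(description):
            return play_type
    return "unknown"
```

Because the descriptions are handwritten, the ordering of the patterns matters: a fumble note can appear in the middle of a run or pass description, so you have to decide which match wins.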
The original Play by Play dataset had information about plays, such as what happened during a specific play, the yard line, the date and the teams involved. But in a football game, there are other, outside influences that come into play, such as the weather, stadium, and type of turf. Because these factors were not included in the original dataset, I found it necessary to augment the Play by Play dataset. By knowing the date and the physical location where each game was played, I tracked down weather and stadium data for each game. Augmenting the NFL data with these additional datasets enabled me to ask bigger questions about what happens in football games.
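Conceptually, the augmentation is a lookup keyed on the game's date and location. A minimal sketch, with illustrative field names rather than the project's actual schema:

```python
# Toy lookup tables standing in for the scraped weather and stadium datasets.
weather = {
    ("2013-09-08", "Soldier Field"): {"temp_f": 78, "conditions": "sunny"},
}
stadiums = {
    "Soldier Field": {"surface": "grass", "roof": "open"},
}

def augment_play(play):
    """Attach weather and stadium attributes to a single play record."""
    enriched = dict(play)
    enriched.update(weather.get((play["date"], play["stadium"]), {}))
    enriched.update(stadiums.get(play["stadium"], {}))
    return enriched
```

The real work, of course, was building those lookup tables in the first place: tracking down a weather reading and stadium details for every game in the period.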
The next step is joining all of the various datasets together into one massive dataset. This is done with a combination of Hive and MapReduce jobs. The end result of all of these joins is one row per play, each with 101 different data points.
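The shape of those joins can be sketched in a few lines of Python: game-level records (weather, stadium, and so on) are merged onto every play that shares the same game key. The `game_id` key and field names are assumptions for illustration; in the project this happens at scale inside Hive and MapReduce rather than in memory.

```python
def join_on_game(plays, *game_datasets):
    """Merge game-level records into each play row, one wide row per play."""
    game_info = {}
    for dataset in game_datasets:
        for record in dataset:
            # Later datasets overwrite earlier ones on key collisions.
            game_info.setdefault(record["game_id"], {}).update(record)
    # Play-level fields win over game-level fields of the same name.
    return [{**game_info.get(p["game_id"], {}), **p} for p in plays]
```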
Once everything is joined together, you can start to interact with the data. You can use MapReduce, Hive, Pig or any other ecosystem project at that point. I used Hive and Impala extensively for all of the analysis. All of the statistics and graphs are from data generated by a series of Hive queries. I like to use Hive for its SQL-like syntax and good extensibility via Python scripts.
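The Python extensibility works through Hive's streaming interface: Hive pipes tab-separated rows to a script's standard input and reads tab-separated rows back from its standard output. A minimal sketch of such a script, with illustrative columns (play type and yards gained):

```python
import sys

def bucket_yards(yards):
    """Collapse a raw yardage number into a coarse bucket for aggregation."""
    if yards < 0:
        return "loss"
    if yards < 10:
        return "short"
    return "long"

def process_line(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    play_type, yards = line.rstrip("\n").split("\t")
    return "%s\t%s" % (play_type, bucket_yards(int(yards)))

if __name__ == "__main__":
    for line in sys.stdin:
        print(process_line(line))
```

A Hive query would then invoke it with a `TRANSFORM ... USING` clause, letting you express logic in Python that would be painful to write in pure SQL.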
Working with data is not easy. I spent the majority of my time debugging and figuring out issues in the data. The data is often internally inconsistent, and this is especially true of human-generated data. Even the machine-generated weather dataset was missing readings for parts of the period. In your Big Data projects, you will need to spend time dealing with the data itself. Make sure your project timeline accounts for this.
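Much of that debugging time goes into simple sanity checks that surface gaps before they skew your results. A sketch of the kind of check that would have flagged the missing weather readings (field names are, again, illustrative):

```python
def find_gaps(rows, required_fields):
    """Return (row_index, field) pairs for every missing or empty value.

    Note: this treats any falsy value as a gap, which is fine for a quick
    audit but would need refining for fields where 0 is a legitimate value.
    """
    gaps = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if not row.get(field):
                gaps.append((i, field))
    return gaps
```

Running a report like this early, before the big joins, tells you how much cleanup and backfilling your timeline actually needs to absorb.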
Just how difficult is it to get started with Hadoop? You are going to need good developers, but they do not need existing Hadoop experience. Education providers such as Cloudera University regularly train developers, analysts and administrators with no previous knowledge of Hadoop. There are also many books and online resources, such as my screencast or Cloudera’s Udacity course, which cover MapReduce. For anyone with the desire to learn, Hadoop is an approachable technology.