How I Got Into Hadoop


“Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high volume data processing called MapReduce.”

These were the opening words in my first blog post on MapReduce, from September 2005. At the time I was working at a London startup called Kizoom, building mobile apps for public transport information (using technologies like WAP and the Nokia 7110—this was before the iPhone). At Kizoom, we were heavy users of Lucene, the search toolkit, and I had written articles about it and Nutch, the open-source search engine, where the code that was to become Hadoop first saw the light of day.

I thought MapReduce was cool, but I couldn’t find any uses for it in processing modest-sized public transport datasets. Besides, we didn’t have a cluster. So I couldn’t justify using it at work. I followed the mailing list to keep up with developments, and continued to blog about it (such as when Hadoop became a top-level Apache project), but otherwise I didn’t spend time on Hadoop.

Everything changed in 2006 with the advent of Amazon Web Services. In March of that year, Amazon released S3, its cloud storage service. Then in August they released EC2, a service that allowed anyone with a credit card to rent servers by the hour. It’s hard to overstate how transformative this was for small startups and independent developers, since it was now possible to access serious compute resources without the upfront cost of server hardware. Of course, it was a perfect match for Hadoop, and I threw myself headlong into writing a Hadoop filesystem for accessing S3 (my first contribution to the project), and scripts for running Hadoop on EC2. The potent combination of AWS and Hadoop was demonstrated to spectacular effect by Derek Gottfrid at the New York Times, who ran 100 EC2 instances to convert 11 million articles from their archive to PDFs in under 24 hours.
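To give a flavour of what that S3 filesystem made possible: S3 plugs in behind Hadoop’s FileSystem abstraction, so a job can read from and write to S3 buckets just as it would HDFS paths. The sketch below is a minimal illustration of that idea rather than the original code; the bucket and object names are made up, and the s3n:// scheme and credential properties shown belong to the later “native” S3 filesystem found in Hadoop 1.x/2.x.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3CatExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical bucket and key, purely for illustration.
        String uri = "s3n://example-bucket/articles/part-00000";

        Configuration conf = new Configuration();
        // Credentials normally live in core-site.xml or the environment, e.g.:
        // conf.set("fs.s3n.awsAccessKeyId", "...");
        // conf.set("fs.s3n.awsSecretAccessKey", "...");

        // FileSystem.get() resolves the s3n:// scheme to the S3 implementation,
        // so the rest of the code is identical to reading from HDFS.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(new Path(uri))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

In the same way, a MapReduce job could take an s3n:// URI as its input or output path, which is exactly what made the AWS-plus-Hadoop combination so convenient for teams without a cluster of their own.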

In February 2007 I was made a Hadoop committer. At the time, Doug was the only Yahoo! employee who could commit changes, owing to a bureaucratic hold-up in the Yahoo! legal department that prevented other Yahoo! engineers from committing to the repository. So I spent quite a lot of time in the evenings committing patches from Yahoo! engineers (they were allowed to create and review patches, just not commit them). I also did a release when Doug was out on paternity leave.

Kizoom still didn’t need Hadoop, but I continued to contribute to the project in my spare time. I was working on interesting problems with people thousands of miles away whom I had never met in person. One day a colleague at work mentioned that her brother had said he was working with me. I had no idea what she was talking about, until I realized that ‘St.Ack’ on the Hadoop mailing list was in fact Michael Stack (then at the Internet Archive), who, in a remarkable coincidence, had a sister working as a developer at Kizoom!

By the end of 2007 I had left Kizoom to become an independent Hadoop consultant (probably the world’s first). My first project was with Last.fm, the music recommender site, which had been running Hadoop from one of the earliest releases. Then in early 2008 I visited California, and met Doug and the rest of the Yahoo! Hadoop team for the first time. By the end of 2009 I had written the first edition of Hadoop: The Definitive Guide, and I was living in San Francisco and working at Cloudera.

Hadoop in Wales: my daughters in 2008

I’m still amazed that all this happened, and it was only possible because of the Internet and open source software. In the age of GitHub, it’s easy to take open source for granted, but if Hadoop hadn’t been developed in the open from the beginning, it probably wouldn’t have the diverse user and developer base that it enjoys today. I’m just one example of someone who was intrigued by Hadoop, tried it out, saw the potential in it, and could follow along on the mailing lists, sharing experiences, gleaning tips from other users, and contributing small fixes when bugs cropped up. Had it been closed source, it might well have been a successful project at the company where its core developers worked, but even if it had been made open source later, it’s unlikely it would have spread across the industry to the extent it has.

It seems fitting to close with the concluding line from my first post on MapReduce, which proved prescient:

“Nutch MapReduce may not be finished, but most of the major pieces seem to be in place, so it is only a matter of time before this exciting and powerful tool sees wider adoption.”

Editor’s Note: See Tom and Doug Cutting discuss ‘The Next 10 Years of Hadoop’ on stage at Strata + Hadoop World London 2016.

