To paraphrase Nate Silver: ‘There is lots of data coming. Who will speak for all this data?’
Every day I read articles about how Big Data is changing everything. Data scientists are unlocking new approaches that help researchers in medicine and biology search for a cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.
It seems like all I need is an analytics platform like Hadoop and a big pile of data, and actionable insights will just leap out, right? Well… not quite. Hadoop makes the difficult easy and the impossible merely difficult. However, we still have to know what we’re looking for and, once we’ve found it, understand what the results mean. The volume, velocity, and variety of Big Data make it hard to know where to focus and even harder to represent insights in a way that is consumable without sacrificing detail. Finding meaningful patterns and converting them into actionable insights requires plenty of computers, sophisticated software, and experts who can use these tools to coax answers from all our information. This is the realm of data science.
Data Science Defined
Like other scientists, a data scientist produces a hypothesis, runs an experiment, and looks at the results to determine whether the hypothesis holds true. In the Big Data space, though, the underlying processes are not quite so straightforward:
- First, gathering enough perspective on a massive data set to generate a hypothesis can be a significant endeavor on its own.
- Second, data science is most often analytical, not experimental, meaning the data has already been gathered before any hypothesis is formed. This makes a controlled experiment impossible. Instead, data scientists have to do a form of experimental reverse engineering through careful modeling.
- Third, the real work only begins after a data scientist has proven a hypothesis and discovered a useful pattern in the data. The true challenge lies in turning that pattern into a data product that can be used to analyze new data or perform ongoing predictive analysis.
To be successful, an aspiring data scientist needs a highly sought but difficult-to-attain combination of skills: statistics, programming, machine learning, and multiple technologies (e.g., Hadoop, R, visualization tools). Moreover, the best data scientists distinguish themselves and create value for their companies by applying softer skills like domain expertise (e.g., life sciences, behavior classification, climate science), storytelling, and personal qualities like curiosity, resourcefulness, persistence, and mental dexterity. It’s a lot to ask for, and that’s why the likes of the McKinsey Global Institute, Harvard Business Review, and the Gartner Group project a shortage in the hundreds of thousands of individuals with data science skills over the next few years.
Signal to Noise and Wheat from Chaff
Further complicating the supply/demand imbalance for data scientists is the absence of professional accreditations to verify their capabilities. A small handful of universities have begun to offer degrees in advanced analytics and data science, but these programs are fledgling. They also require data professionals to dedicate significant time and resources to returning to a fully academic setting that, although thorough, does not necessarily certify the mix of skills and experience a working data scientist needs beyond the classroom. There is no International Board of Data Science or Data Science Institute, and the vast majority of managers responsible for hiring data scientists have no data science experience themselves, so a résumé and interview alone will prove little. This dual problem of talent gap and talent unverifiability will only become more pronounced as smaller businesses begin to accumulate Big Data and seek firepower in building sophisticated tools for it.
One part of the solution is a formalized data science curriculum built by actual data scientists. Cloudera offers an excellent three-day Introduction to Data Science course that teaches the fundamentals and trains participants to build their own recommender systems based on insights from data science stars like Jeff Hammerbacher and Josh Wills. Another part of the solution is public data science competitions, through which individuals build experience and demonstrate their chops in a realistic setting.
A Challenge to Shape the Industry
But how much education and practice is enough when it comes to a job whose starting salary is regularly reported around $300,000 per year? This is where a formal industry certification would be most valuable, giving businesses a known yardstick by which to measure practitioners of the trade. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address this gap. We recently introduced a two-part Cloudera Certified Professional: Data Scientist (CCP:DS) program, consisting of a Data Science Essentials exam and a twice-annual Data Science Challenge that helps candidates validate their abilities and helps employers identify elite, highly skilled, and hard-to-find data scientists. Participants who successfully achieve CCP:DS certification will be verifiably among the world’s most employable (and extremely sexy) data scientists.
In addition to certification, the CCP:DS program includes a 60-question Data Science Essentials Practice Test for candidates to self-assess their exam-readiness and a free Data Science Challenge Solution Kit consisting of a live data set, a step-by-step tutorial, and a detailed explanation of the processes required to arrive at the correct outcomes for real-world data science questions focused on classification, clustering, and collaborative filtering of web analytics.
The current Data Science Challenge begins today and remains open until June 30, 2014. Designed by Cloudera’s Director of Data Science, Sean Owen, the challenge asks aspiring data scientists to detect possible errors and anomalies in Medicare claims using a massive set of anonymized healthcare data. Successful participants will be able to develop a data science model to answer a series of questions, including:
- Which medical procedures have the highest relative variance in cost?
- Which three providers had the highest average amount claimed for the largest number of procedures?
- Based on amount and type of procedures claimed, which three providers and regions are least like the others?
- Identify 10,000 patients who seem most likely to need review for possible errors or anomalies. Describe some common features of these patients.
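To give a feel for the first question in the list above, here is a minimal Python sketch that ranks procedures by relative variance in cost. It rests on assumptions: "relative variance" is read as the coefficient of variation (standard deviation divided by mean), the claim records and the names `claims`, `relative_variance`, and `rank_by_relative_variance` are invented for illustration, and nothing here reflects the actual challenge data or its expected solution.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy claims as (procedure_code, claimed_amount) pairs.
# These values are made up and are NOT taken from the challenge data set.
claims = [
    ("A", 100.0), ("A", 110.0), ("A", 90.0),
    ("B", 50.0), ("B", 500.0), ("B", 950.0),
    ("C", 1000.0), ("C", 1010.0), ("C", 990.0),
]


def relative_variance(records):
    """Map each procedure to its coefficient of variation (stdev / mean).

    Assumption: "relative variance" means spread relative to typical cost,
    so an expensive procedure is not flagged just for being expensive.
    """
    amounts_by_procedure = defaultdict(list)
    for procedure, amount in records:
        amounts_by_procedure[procedure].append(amount)

    result = {}
    for procedure, amounts in amounts_by_procedure.items():
        average = mean(amounts)
        if average > 0:  # skip procedures with no meaningful mean cost
            result[procedure] = pstdev(amounts) / average
    return result


def rank_by_relative_variance(records):
    """Return procedure codes ordered from most to least variable."""
    cv = relative_variance(records)
    return sorted(cv, key=cv.get, reverse=True)


ranking = rank_by_relative_variance(claims)
print(ranking)
```

In this toy data, procedure C has the largest claims but a tight spread, while procedure B swings widely around a moderate mean. Dividing by the mean is one way to keep the ranking from simply tracking price. At real scale the same grouping and aggregation would more likely run as a distributed job than in a single Python process.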
Join us for a webinar featuring Sean Owen on April 10 to learn more about the current Data Science Challenge, how to prepare, and what types of insights drive business value from advanced big data analytics.