Last month, Thomas Dinsmore penned his take, in his inimitable style, on a recent analyst report surveying data science platforms. He noted that the landscape doesn’t seem to overlap much with technologies commonly found in other surveys of most-used data science tools. To paraphrase: Has anyone yet seen a user of IBM’s platform in the wild? Yet you can’t swing a cat at an MLconf without hitting a (very surprised) R user.
Of course, the analyst report intentionally excluded his list of top “actual” data science tools: R, Python (+ scikit-learn), and Apache Spark. It was examining whole soup-to-nuts vendor platforms that encompass elements of modeling, data storage and access, user environment, and deployment. That is, R isn’t directly competitive with Dataiku, for instance. They complement one another. The 2016 O’Reilly Data Science salary survey indirectly makes this point too: it lists SQL, R and Python as the most-used languages, but does not compare these to, say, Apache Hadoop, which gets its own category.
Still, there are obvious commonalities in data science tooling across industries. And for some reason, they aren’t these impressive, monolithic analytics platforms. They are the core languages and toolkits, and they are open source. Why has the rise of data science been so closely tied to these?
It’s not necessarily driven by technical superiority or high principles about free software. The dominant lesson seems to be that practitioners want to bring their tools to tasks, rather than bring their tasks to a tool set. Portability has been paramount. It’s a variant on the open-core theme in enterprise software that’s at work in the vendor landscape surrounding Hadoop’s ecosystem. It bears examining where these two open source worlds came from, because it’s the combination of both that is the new face of big, open data science.
Roots in Research and Sharing
As recently as a few years ago, I encountered someone — a venture investor no less — who could not understand why people give away intellectual property as open source software. I couldn’t understand why he couldn’t understand it, given the millions of examples. It’s not even new. Academics have been giving away IP by publishing papers for as long as there have been academics. These giveaways are literally the currency of advancement.
Research in any of the sciences needs some statistics. Research is also published to be consumed (cited), and other researchers look to show the previous work they claim to build upon and supersede. This incentivizes standardization of statistical conventions — including software.
It was researchers at Bell Labs who conceived the S language in 1976. Its aim, “programming with data,” describes much of what was innovative about it. It offered a high-level language for describing data, its analysis and its visualization, together. Further, the language was platform-neutral, and interpreted. The source was the executable.
S was onto something, although it is its open-source “clone,” R, that is more widely known today. R was developed at the University of Auckland, maybe in part to advance the cause of free software but, more likely, to scratch an itch. Being open source meant being free, which has always been attractive to cash-strapped researchers. More importantly, it enabled and encouraged sharing of packages via a centralized repository, CRAN. This set in motion a virtuous circle of adoption of R, package creation, and adoption of those packages. Now, I dare say R is used not because of the venerable R language itself, but because of CRAN. Interestingly, the fact that R requires the environment to supply packages enforces more standardization in key packages than in, say, JVM languages, where applications bring their own private copy of packages.
It’s over-simplifying a bit, but one could look at the rise of Python for statistical computing through the same lens:
- Open source
- Interpreted, platform-independent
- Rich libraries (scikit-learn and many others)
- Easy package management (pip)
- Visualization tools
New data scientists typically start with one of the two, and these days frequently with Python, because that’s where the examples they want to learn from are. Both are always there (or just a package install away) in any company or environment, not locked behind an evaluation license or procurement process.
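To make “just a package install away” concrete, here is a minimal sketch of checking which common Python data science packages the current environment already supplies. It uses only the standard library. The `TOOLKIT` mapping and the `toolkit_report` helper are hypothetical names chosen for illustration, and the package list is an assumption, not a recommendation.

```python
from importlib import metadata
from importlib.util import find_spec

# Hypothetical example list: import name -> distribution name.
# The two differ for scikit-learn, which is imported as "sklearn".
TOOLKIT = {
    "numpy": "numpy",
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "matplotlib": "matplotlib",
}


def toolkit_report(toolkit=TOOLKIT):
    """Return {distribution name: installed version, or None if not importable}."""
    report = {}
    for module_name, dist_name in toolkit.items():
        # find_spec locates a top-level module without importing it.
        if find_spec(module_name) is None:
            report[dist_name] = None
            continue
        try:
            report[dist_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            report[dist_name] = "unknown"
    return report


if __name__ == "__main__":
    for dist_name, version in toolkit_report().items():
        if version is None:
            print(f"{dist_name}: missing (try: pip install {dist_name})")
        else:
            print(f"{dist_name}: {version}")
```

Anything reported as missing is one `pip install` away, which is the portability point: the tooling travels with the practitioner, and the shared environment, not each application, supplies the packages.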
There Are Other Worlds Than These
As titanic as the R and Python ecosystems and user bases are, they’re not the only ones for analytics and data science. Continuum’s call for Open Data Science notes the same drivers as above (availability, interoperability) but name-checks Scala and Hadoop as well. (Soon, Julia?) Spark’s MLlib subproject sprang up quite independently of existing statistical languages. While it can’t provide nearly the same depth of library functions after just a few years, it does offer easy scalability and access to the Hadoop ecosystem, where IT departments keep data and secure it. That kind of scale is what Python and R practitioners need today.
Hadoop, as “open big data,” has some parallels in its premises. It’s so valuable to standardize the core elements of storage, formats, and compute, that it’s in many parties’ interests to make it happen as open source as rapidly as possible. Even as organizations and vendors pick and choose different applications and projects to stack around this core, that core is pleasingly stable, ubiquitous, and essentially identical wherever it turns up. The hardware or cloud it runs on, the interfaces and add-ons around it vary, but elements like HDFS, Apache Kafka and Spark don’t.
Yet Python, R, and Hadoop don’t naturally fit together. They’re spun from different languages and runtimes, by different groups of people, for different purposes. The future of open data science, however, seems to call for exploratory analytic strength as well as security, production-readiness and scalability. We can choose both.
Cloudera’s Data Science Workbench is another environment that layers onto these open cores. Like notebook tools, it provides an IDE-like environment for work in languages like R. Yet it also handles the integration of these languages with Hadoop’s security model and with its execution engine, Spark. In a way it is more bridge than wrapper: it brings these two open worlds alongside one another. Serious integration of classic data science languages with the de facto big data platform means a new, unified open data science platform is rising. Maybe we can get back to doing some sciencing!