Over the last five years, the rapid growth of Python’s open source data tools has made it a tool of choice for a wide variety of data engineering and data science needs. Hugely successful projects that we now take for granted, such as Jupyter, Pandas, and scikit-learn, were comparatively nascent efforts only a few years ago. Today, data teams worldwide love Python for its accessibility, developer productivity, robust community, and “batteries-included” open source libraries.
During the same time period, the Apache Hadoop ecosystem has risen to the challenge of collecting, storing, and analyzing accelerating volumes of data with the robustness, security, and scalable performance demanded by the world’s largest enterprises.
While the Python and Hadoop ecosystems have each flourished in their own right, they have struggled to work well together due to both high-level architectural and low-level technological challenges. Python’s data tools grew out of the scientific computing community and are most popular for problems fitting the single-machine model of computation. Hadoop, by contrast, was developed to address the web-scale distributed systems problems of companies like Google, Yahoo!, and Facebook. Additionally, Python’s core libraries are mostly written in C, C++, and Fortran, while the Hadoop ecosystem is mostly written in Java.
At Cloudera, our goal is to enable Python programmers to leverage their existing tools and skills, which today may focus on single machines, across Hadoop clusters. While there have recently been several efforts to align the Python ecosystem to the challenges of big data, in particular through the Apache Spark project, several hard problems remain unsolved.
It’s been a while since we last wrote on this topic, so we wanted to provide an update. In this post, we’ll discuss the options currently available to Python users and concrete steps we are taking to better enable a first-class Python-on-Hadoop experience.
Current Python Data Landscape
Python has been most successful for data wrangling, analysis, visualization, and application development at single-machine scale. Expanding beyond a single machine is more challenging. For big data problems, a common practice is to sample or reduce a dataset to a sufficiently small size, then analyze the result with local tools like Pandas and scikit-learn.
There are several ways to scale Python across a cluster today, including:
- Executing embarrassingly parallel single-machine Python tasks on each node, perhaps configured through a task management system such as Celery.
- Using Python within a partition-parallel distributed system like MapReduce (e.g. Hadoop Streaming) or Spark (i.e. PySpark).
- Creating Python libraries with domain-specific languages that automatically “compile” to other distributed compute representations, such as SQL or Spark jobs.
The original example of Python-on-Hadoop is Hadoop Streaming, a flexible interface for writing MapReduce jobs in any language capable of sending and receiving data through UNIX pipes, one line at a time. While a good fit for some ETL and unstructured data problems, in general Hadoop Streaming’s UNIX-pipe model is a poor match for Python’s fast array- and table-oriented computational tools. In addition, developers must pay substantial data serialization costs that undermine the performance benefits of the compiled C, C++, and Fortran extensions found in libraries like NumPy, SciPy, Pandas, and scikit-learn.
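To make the line-at-a-time model concrete, here is a minimal word-count sketch in the Streaming style. In a real job, the map and reduce phases would be separate scripts reading stdin and writing stdout, wired together by the `hadoop-streaming` JAR; the function-based form here is just for illustration.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit "word\t1" for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    # Reduce phase: Streaming sorts mapper output by key before the reducer
    # runs, so all counts for a given word arrive adjacently.
    keyed = (pair.split("\t") for pair in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# A real invocation looks roughly like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /data/in -output /data/out
```

Note that every value crosses the pipe as text, which is exactly the serialization toll described above: arrays and tables must be flattened to strings and re-parsed on every hop.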
Spark and PySpark
Today, Apache Spark delivers the most accessible and complete Python interface to data stored in a Hadoop cluster. PySpark provides a Python API for Spark’s core computation primitives and a large portion of its Scala-based analytics modules. The Spark DataFrame API was explicitly modeled after the data frame APIs found in Python’s Pandas library and the R language. Because DataFrame operations are planned and executed inside the JVM, it is possible to do data engineering in PySpark without the Python interpreter itself becoming involved until the job returns its final results.
However, one of PySpark’s weaknesses is its interface with custom Python code for data transformations or analytics in a larger Spark computation graph. This could include PySpark lambda functions or Spark SQL user-defined functions (UDFs); both are executed similarly. The current implementation suffers from several challenges, including:
- Performance and memory use problems due to low-level data serialization
- Python interoperability issues with the Java Virtual Machine (JVM)
- A UDF interface that can seem awkward to the average Pandas user
Additionally, to ensure that Python code developed on the desktop also works within the cluster, system administrators have struggled to make popular Python libraries available across Spark worker nodes.
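The serialization cost is easy to see in miniature. Each value a Python UDF touches must be encoded when it leaves the JVM and decoded when it arrives in the Python worker, and the result pays the same toll on the way back. The stdlib-only sketch below simulates that per-row round trip with `pickle`; the batching and wire format are deliberate simplifications, not PySpark’s actual protocol.

```python
import pickle

def udf(x):
    # A trivial stand-in for user-defined Python logic.
    return x * 2

def apply_udf_with_roundtrip(rows):
    # Simulate the JVM <-> Python worker boundary: each input row is
    # serialized and deserialized on the way in, and each result on the
    # way out, so the cost scales with the number of rows, not with the
    # amount of useful work.
    out = []
    for row in rows:
        value = pickle.loads(pickle.dumps(row))          # decode input row
        out.append(pickle.loads(pickle.dumps(udf(value))))  # encode result
    return out
```

Compare this to a vectorized NumPy or Pandas operation, where an entire column is processed in one compiled call with no per-element encode/decode step; that gap is the performance problem described above.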
Higher-level Python Analytics Frameworks
A number of other Python projects, such as Blaze and Ibis, have developed embedded domain-specific languages (DSLs) that resemble Pandas’ well-known API. These DSLs target SQL or some other representation of a data task that can be executed by a scalable compute engine like Apache Impala (incubating) or Spark. Ibis, which began last year within Cloudera Labs, has targeted Impala via SQL to deliver speedy interactive analytics while simplifying data wrangling tasks involving the Hive metastore and storage systems like HDFS or Apache Kudu (incubating), a new distributed storage engine designed for fast analytics on fast-changing data. Ibis’s Python user interface is decoupled from SQL and could also target Spark (via PySpark) or another analytics backend.
While DSLs can be powerful and easy to use, they must be supplemented with custom user-defined Python functions and third-party libraries, which presently results in the same interoperability issues already faced by PySpark.
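To illustrate the “compile to SQL” idea, here is a toy deferred-expression builder in the spirit of, but not the actual API of, Ibis or Blaze. The `Table`, `Column`, and `to_sql` names are invented for this sketch; real DSLs build a proper expression tree and handle quoting, types, and dialects.

```python
class Column:
    """A named column that renders comparisons as SQL predicate strings."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return f"{self.name} > {value!r}"

class Table:
    """Toy deferred table expression: methods accumulate state lazily,
    and to_sql() renders it all at once for a backend engine to run."""
    def __init__(self, name):
        self.name = name
        self._filters = []
        self._columns = ["*"]

    def __getitem__(self, key):
        return Column(key)

    def select(self, *cols):
        self._columns = list(cols)
        return self

    def filter(self, predicate):
        self._filters.append(predicate)
        return self

    def to_sql(self):
        sql = f"SELECT {', '.join(self._columns)} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(self._filters)
        return sql
```

The user writes Pandas-flavored Python, but no data moves through the Python interpreter: the expression is translated to SQL and all the heavy lifting happens in the engine.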
Hadoop Storage Formats and Nested Data in Python
Another pain point for users of conventional Python libraries is weak support for Hadoop binary file formats like Apache Avro and Apache Parquet. This is the result of a few factors:
- Tools like Pandas do not natively support the nested JSON-like data models enabled by Parquet and Avro.
- Hadoop community efforts have generally focused on performance and scalability for MapReduce, Spark, Hive, Pig, and other JVM-based compute components, rather than interoperability with Python libraries.
Of course, Python interfaces are not always an afterthought. Kudu, for example, now includes a first-class Python client, made possible because Kudu includes a native C++ client that was straightforward to integrate with Python through C extensions.
How Cloudera Can Help
Today we’re pleased to announce two initiatives to address several of the challenges we’ve described in this post.
Making Python Easier to Install with Anaconda
First, we are partnering with Continuum Analytics to offer one-click deployment of Anaconda Python environments to CDH clusters via Cloudera Manager. This eases the pain of developers who need the latest and greatest Python libraries to be installed across a cluster, and of the system administrators who must install and manage those libraries.
Enabling Efficient Data Interchange with Apache Arrow
Second, we are collaborating with the broader Apache community on a new project called Apache Arrow. Arrow is a new in-memory columnar data format specification that will enable better interoperability and improved performance of Python in the Hadoop ecosystem, including tools such as Impala and Spark. We expect Arrow will become an important enabler for Python and other non-JVM data tool ecosystems, including R and Julia, at scale. Our initial efforts will focus on:
- High-performance native Python support for binary data formats like Parquet
- Python interfaces to storage engines such as Kudu
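The intuition behind a columnar format: row-oriented data interleaves fields of different types, while a columnar layout keeps each field contiguous in memory, which is exactly what vectorized tools like NumPy and Pandas want and what a shared standard lets engines exchange without re-encoding. The stdlib-only sketch below shows the pivot; Arrow’s actual memory layout is far richer (typed buffers, validity bitmaps, nested types), so treat this only as an illustration of the row-versus-column distinction.

```python
def to_columnar(rows, fields):
    # Pivot a list of row dicts into one contiguous list per field.
    # In Arrow, each of these lists would instead be a typed,
    # contiguous buffer shareable across processes and languages.
    return {field: [row[field] for row in rows] for field in fields}

rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": 0.9},
]
columns = to_columnar(rows, ["id", "score"])
```

With a common columnar representation, an engine like Impala or Spark and a Python library can hand the same buffers back and forth instead of serializing rows across the boundary, which is the interoperability gain Arrow targets.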
We’re excited about the community’s progress in enabling a better experience for Python users on Hadoop. Stay tuned for upcoming blog posts where we’ll explore both efforts in more detail!
Wes McKinney, creator of Pandas and Ibis, and Apache Arrow committer
Matt Brandwein, Director of Product Management, Cloudera