Apache Hadoop for Data Scientists

Data science at big data scale is powerful but hard to put into practice. At Cloudera, we are focused on bridging the gap between the tools on Hadoop and the tools on your laptop. Today, we announced several new initiatives to help users tap into big data and support the use cases of the future.

Python at Scale with Ibis

One initiative in particular aims to bring a native Python experience to Hadoop at scale. Python is the de facto language for modern data engineering and data science due to its power, elegance, and robust libraries and third-party integrations. However, Python development has been confined to local data processing and smaller data sets, requiring data scientists to make many compromises when attempting to work with big data. That’s where Ibis comes in.

Ibis is a new open source data analysis framework that enables advanced data analysis on a 100% Python stack with full-fidelity data, so Python users can finally process data at scale without compromising user experience or performance.

Ibis was co-founded by Cloudera’s Wes McKinney, author of the best-selling Python for Data Analysis and creator of pandas, the most popular Python data analysis and data wrangling toolkit, and Cloudera’s Marcel Kornacker, creator of Impala, the fastest interactive SQL framework for Hadoop.

The initial version of Ibis provides an end-to-end Python experience with comprehensive support for the built-in analytic capabilities in Impala for simplified ETL, data wrangling, and analytics. Upcoming versions will allow users to leverage the full range of Python packages as well as express efficient custom logic using Python.
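To make the idea of a Python front end over Impala's analytics more concrete, here is a minimal sketch of the general pattern: Python objects describe a query, and a SQL string is generated for the engine to execute. This is not the Ibis API. The class names (Table, Aggregate), the methods, and the "trips" table with its columns are all hypothetical, chosen only for illustration.

```python
# Hypothetical sketch of a deferred-expression front end.
# NOT the Ibis API: Table, Aggregate, and the "trips" table are invented here.
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """A deferred aggregate such as AVG(col); nothing is computed in Python."""
    func: str
    column: str

    def to_sql(self) -> str:
        return f"{self.func}({self.column})"


@dataclass(frozen=True)
class Table:
    """A reference to a table that lives in the cluster, not in local memory."""
    name: str

    def mean(self, column: str) -> Aggregate:
        return Aggregate("AVG", column)

    def count(self, column: str) -> Aggregate:
        return Aggregate("COUNT", column)

    def group_by(self, key: str, *aggs: Aggregate) -> str:
        """Compile a grouped aggregation into a SQL string for the engine to run."""
        select_list = ", ".join([key] + [a.to_sql() for a in aggs])
        return f"SELECT {select_list} FROM {self.name} GROUP BY {key}"


# Build the query in Python; only the generated SQL would be sent to the cluster.
trips = Table("trips")
sql = trips.group_by("city", trips.mean("fare"), trips.count("trip_id"))
print(sql)
```

The point of the design is that the data never has to be pulled onto the laptop: the Python code only describes the computation, and the heavy lifting stays in the SQL engine.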

Ibis is an Apache-licensed project and open to contributions from the open source community. It is also available as a preview in Cloudera Labs, a virtual incubator for new projects that further enrich the Hadoop community and ecosystem.

You can learn more about the technical vision and architecture for Ibis on the Developer Blog.

Spark MLlib and Machine Learning at Scale

In addition to bringing a better Python experience to Hadoop, we are also further expanding the machine learning capabilities available to data scientists and developers. In an upcoming release of Cloudera’s platform later this year, we plan to ship and support Apache Spark MLlib, the popular machine learning library. Data scientists will be able to leverage the speed and ease of use that Spark provides while developing against popular, built-in machine learning algorithms for classification, regression, and clustering. Additionally, because MLlib is an integrated part of the Spark ecosystem and Cloudera’s platform, data scientists will be able to interactively explore data and quickly build applications that combine with other processing models, such as Spark Streaming for real-time modeling.

Spark MLlib will be a part of Cloudera’s integrated platform, benefiting from the shared simple administration, compliance-ready security, and comprehensive data management available with Cloudera Enterprise. To learn more about Cloudera and Spark, visit Cloudera.com/spark.

As with all forward-looking statements about roadmaps, please keep in mind that timing and plans are always subject to change.

Wrangle, the Conference for Data Scientists

Finally, even as we introduce new tools for analytics and machine learning into our platform, we are mindful that many of the hardest problems in data science cannot be solved by technology alone. From the smallest startups to the largest enterprises, we see companies struggling with how to acquire and manage new data sources, recruit and train the next generation of data scientists, and create a data-driven culture that crosses every level of the organization. Cloudera is pleased to announce the first-ever Wrangle Conference, an industry event that brings data science practitioners together to discuss how they approach the most difficult aspects of their work.

Wrangle will feature speakers from Facebook, Salesforce, Uber, and others discussing:

  • The state of the art in data measurement, management, and modeling in the real world

  • How to recruit and lead data science teams during every phase of a company’s lifecycle

  • Practices that promote a data-driven product strategy and culture

Wrangle will debut this fall, on October 22, in San Francisco. Registration for Wrangle is currently open by invitation only, with public access available soon.
