Fanning the Flames with Apache Spark: Evolving Big Data Processing


Poca favilla gran fiamma seconda.
“From a little spark follows a great flame.”
– Dante

At Cask, we are passionate about software development and developer productivity in the service of solving big customer challenges. We share our customers’ passion for becoming insight-driven organizations. Such enterprises are always on the lookout for new ways to leverage their strategic data assets to improve revenue, reduce costs, and optimize customer relationships. The challenges to becoming an insight-driven organization have four key dimensions:

  • Infrastructure: Legacy systems lack the performance, power, scalability, and flexibility to advance insight-driven organizations further along their journey, making advanced analytics and apps very costly to deploy on such infrastructure.
  • Technology: The twin challenges here are gaining access to data locked in proprietary silos and the diversity of approaches that entangle big data projects. Advanced analytics and apps are very hard to deploy against the headwinds of complexity driven by multiple technology platforms.
  • Development: Each of these technologies comes with its own low-level API, and entails different development and deployment patterns that vary not just by use case but also by language, framework, and methodology. Big data projects are very hard to deploy predictably in the face of development approaches that entail such high risk.
  • Operations: Even if organizations are able to overcome the above infrastructure costs, technology complexities, and development risks, the deployment, management, and optimization challenges are magnified at the point of operational delivery. Advanced analytics and apps are hard to deliver with legacy operational approaches.

In the transition from legacy infrastructure, technology, development and operations to advanced analytics and apps, Apache Hadoop brought two significant innovations: HDFS (Hadoop Distributed File System) and MapReduce (batch processing at scale). But big data’s most recent sea change is marked by the evolution from MapReduce to Apache Spark for fast stream and memory processing accompanied by a rich suite of open source and third-party machine learning and analytic engines. Our belief is that frameworks like Spark will transform the world of advanced analytics and big data applications, fanning Dante’s little spark into a great flame that illuminates insight-driven organizations.

For us at Cask, Spark is now an integral part of the Cask Data Application Platform (CDAP), helping solve demanding use cases such as machine learning and interactive analysis on complex datasets. CDAP tightly integrates with Cloudera’s One Platform Initiative and offers developers the power and simplicity needed to build smarter applications and advanced analytics faster, leveraging the following Spark capabilities of CDAP:

  • Spark Streaming Integration: CDAP enables application developers to manage multiple event data streams—from log files to sensor data—as a continuous stream. For example, CDAP developers can process streams of data related to financial transactions to identify and refuse fraudulent transactions.
  • DataFrame API integration: CDAP supports the DataFrame framework for interacting with a distributed collection of data organized into named columns, conceptually equivalent to a relational database table or an R/Python data frame but with richer optimizations. This API enables developers to leverage multiple data sets (and data set types) for iterative processing in machine learning and interactive analytics.
  • ETL pipeline integration: CDAP supports swapping ETL pipelines to migrate new or existing data sets from MapReduce workflows to Spark processing—transparently. Given Spark’s support for both iterative and interactive programs, CDAP enables access to datasets as both inputs and outputs and greatly simplifies customer migration to the Spark processing framework.
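To make the streaming example above concrete, here is a minimal sketch of the kind of per-micro-batch fraud check such a pipeline might apply. It deliberately uses only plain Python rather than CDAP or Spark APIs; the field names, the threshold, and the `filter_batch` helper are illustrative assumptions, not part of either platform.

```python
# Illustrative fraud check applied to each micro-batch of transaction
# events. Field names and the 10,000 threshold are assumptions made
# for this sketch, not CDAP or Spark APIs.

FRAUD_THRESHOLD = 10_000  # flag unusually large transactions (assumed limit)

def is_suspicious(txn):
    """Return True if a transaction event looks fraudulent."""
    return (txn["amount"] > FRAUD_THRESHOLD
            or txn.get("country") != txn.get("card_country"))

def filter_batch(batch):
    """Split one micro-batch into (accepted, refused) transactions."""
    refused = [t for t in batch if is_suspicious(t)]
    accepted = [t for t in batch if not is_suspicious(t)]
    return accepted, refused
```

In Spark Streaming proper, a predicate like `is_suspicious` would be applied to each micro-batch of the incoming stream with a `filter` transformation, with CDAP managing the surrounding stream ingestion and dataset plumbing.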
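The DataFrame point is easiest to see with a tiny example of the named-column model it exposes. The sketch below uses plain Python dictionaries to stand in for rows; the column names and the `group_sum` helper are hypothetical. In Spark itself the equivalent operation would be a grouped aggregation over named columns (e.g., grouping by customer and summing amounts).

```python
# Tiny illustration of the named-column model a DataFrame exposes:
# rows with named columns, grouped and aggregated like a SQL GROUP BY.
# Column names and data are assumptions for this sketch.
from collections import defaultdict

def group_sum(rows, key, value):
    """Group rows by a named column and sum another named column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

rows = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 15},
    {"customer": "alice", "amount": 20},
]

totals = group_sum(rows, "customer", "amount")  # {'alice': 50, 'bob': 15}
```

A DataFrame runs this same kind of query over a distributed collection, and its optimizer can push such aggregations down for efficiency, which is what makes the model attractive for iterative machine learning and interactive analytics.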

In contrast to legacy technological evolution, the key challenge is no longer just data access. It’s about building algorithms that put analytics into action, with Spark. It’s about changing data science and driving intelligent applications fueled by data, with Spark. Combining data, abstraction, design, and speed, Cask is creating a new blueprint for innovation, helping customers become insight-driven enterprises. And in our support of the Cloudera One Platform Initiative, Cask Data is working to make Spark enterprise-ready.

Together with Cloudera, Cask is embracing and advancing Spark to help insight-driven organizations accelerate their big data journey.


Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology, driving company engineering initiatives and collaboration. 

Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E.

