Beyond ETL: Real-time, Streaming Architectures


Last year we talked a lot about Cloudera Enterprise as a next-generation operational datastore. This is still a focal use case for our technology, and we continue to have conversations with users about how they can offload their data processing and capture more incoming data with our platform. This year we wanted to evolve the conversation toward what we think is the next set of capabilities Cloudera can enhance. Organizations are increasingly tasked with bringing in more data, and deliberate companies are scoping and researching the best data streams to include in their analytics efforts. As a result, the role of the data engineer becomes ever more central to the conversation: data engineers build these pipelines and work with IT to design the right architectures to support them.

That is why we launched our Beyond ETL webinar series: to survey the landscape and speak with subject matter experts about how they are building, securing, and engineering data pipelines at impressive scale. We wanted to give you an architect's view of how these advancements in the ecosystem affect how you design your environment.

To start, we wanted to focus on three key themes that come up in our conversations with customers when scoping data pipelines:

  • How do I best think about capturing streaming data?
  • How do I effectively secure incoming data?
  • How can I leverage the cloud to further scale my data processing capabilities?

We set out to answer these questions over the month of April and we encourage you to dive into the full content here. Below is a quick recap of the topics we discussed.


  1. Beyond ETL: End-to-End Streaming Architectures. In this session we talked with Amandeep Khurana, a professional services subject matter expert on streaming and author of a popular Hadoop book. Amandeep works with customers on how to best think about workloads like IoT and streaming data. We began with a look at the limitations of traditional systems, including poor performance and the inability to accurately capture streaming data. This led into a primer on the ingestion and stream processing capabilities inside the Hadoop ecosystem, including Kafka, Flume, and Spark Streaming (a minimal sketch of this pattern follows this list). Amandeep then showed a few examples of typical architectures (bus-centric, file-system-centric, and hybrid), and we concluded with a quick note on what these tools and capabilities mean for things like the Internet of Things.
  2. Beyond ETL: Data Pipeline Security. For this conversation we wanted to examine some critical aspects of the pipeline: how each Hadoop component addresses security, and how security is layered over the various components. We spoke with senior systems engineer Sean Pabba, who helps users understand and architect secure, compliance-ready environments. To address security across your entire environment, we encourage people to think about four pillars: perimeter (cluster) security, access (permissions), visibility (reporting), and data security (encryption). Sean took us through the various Hadoop components, explained how each addresses its role in security, and showed how they integrate with system-wide policy and governance tools. From disk encryption to integration with RecordService and Cloudera Navigator, Sean covered how each of these solutions addresses the four pillars of security.
  3. Beyond ETL: Data Processing in the Cloud. We ended April with a look at how the cloud is transforming the way people deploy and scale data processing efforts. Cloudera customer Sabre analyzes millions of travel records a day. They needed to scale to meet this huge volume of incoming processing jobs, and the public cloud gave them the capability to meet that need cost-effectively. David Tishgart spoke with Madhuri Kollu from Sabre about why she chose the public cloud for her data processing needs. Madhuri explained that complications in her datacenter led to slow performance that was not meeting the needs of the business, and that downtime and patching kept her team constantly troubleshooting. With the new solution, Sabre aims to process ever more data on a fault-tolerant cloud with low maintenance overhead. Looking ahead, Sabre wants to create a unified data environment for all types of data with flexible user-based security.
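
To make the ingestion pattern from the first session a bit more concrete, here is a minimal sketch of a Spark Streaming job reading from a Kafka topic, one simple instance of the bus-centric pattern Amandeep described. The broker address, topic name, and per-batch counting logic are illustrative assumptions on our part, not details from the webinar.

    # A minimal sketch of a Kafka -> Spark Streaming pipeline (PySpark).
    # Assumptions: a Kafka broker at broker1:9092, a topic named "events"
    # whose messages are comma-separated with an event type in the first
    # field, and the spark-streaming-kafka package on the classpath
    # (e.g., supplied via spark-submit --packages).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="StreamingIngestSketch")
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches

    # Direct (receiverless) stream; each record arrives as a (key, value) pair.
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

    # Count events per type in each micro-batch and print the results.
    counts = (stream.map(lambda kv: kv[1])                 # keep the message value
                    .map(lambda line: line.split(",")[0])  # first field = event type
                    .countByValue())
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

The direct connector tracks Kafka offsets itself, which keeps the example short; a production pipeline would also need to think through checkpointing and delivery semantics.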

As the dialog continues, we hope to showcase more examples of how Apache Hadoop is redefining modern data processing and ingestion through the voices of our customers, experts, and industry resources. Next, we will look at how continuous pipelines are enabling the world of IoT with Cloudera partner StreamSets. StreamSets and Cloudera are a powerful combination when it comes to solving the complex challenges of fielding massive sensor data. We hope you can join us for the conversation!

