Building an Open Data Processing Pipeline for IoT

Categories: Enterprise Data Hub, IoT / Connected Products

Authors: David Bericat, Global Technical Lead, Internet of Things, Red Hat; and Jonathan Cooper-Ellis, Solutions Architect, Cloudera

Last week, Cloudera introduced an open end-to-end architecture for IoT, along with the components needed to satisfy today's enterprise needs across operational technology (OT), information technology (IT), data analytics and machine learning (ML), and both modern and traditional application development, deployment, and integration.

Red Hat, Eurotech, and Cloudera are working together to address these areas with an open, flexible, modular, and interoperable architecture. A big part of that architecture deals with the flow and management of data, as well as the insights, actions, and decisions that can be derived from that data to produce better business outcomes.

So today, let’s talk DATA!

The open data processing pipeline

IoT is expected to generate a volume and variety of data that greatly exceeds what enterprises handle today, requiring them to modernize their information infrastructure in order to realize its value.

To take advantage of all the disparate types of data, Red Hat, Eurotech and Cloudera have built an open and scalable data processing pipeline as part of an end-to-end IoT architecture. This enables data to be acquired, pre-processed, filtered, aggregated and dynamically routed, with only the meaningful information sent to the centralized hub so it can be stored, analyzed, processed, modeled, acted upon, and shared with different applications and services.

The illustration below highlights the typical IoT data journey, with key components of the IoT data pipeline and their functionalities:

[Figure: the typical IoT data journey, showing the key components of the IoT data pipeline and their functionalities]

What’s happening at the IoT edge?

It is already challenging to effectively ingest and manage data streams coming in from numerous connected devices and assets over a multitude of different field protocols and network alternatives. The additional need to work across those different data sets and correlate the telemetry in real time adds to the complexity. Sending all of that raw sensor data "as is" is often not feasible due to constraints around connectivity and network transmission costs, especially in mobile scenarios with intermittent connectivity over cellular, GPRS, or radio links. That is why the first step of the open data processing pipeline begins at the IoT edge, where protocol translation and different processing patterns transform raw data into meaningful data sets relevant to the business, which are then routed to the proper downstream systems based on predefined scenarios.
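To make that pattern concrete, here is a minimal, hypothetical Python sketch of edge pre-processing: raw samples are summarized over a window, and only the aggregate (or an out-of-range alert) is published upstream over MQTT. The broker address, topic names, threshold, and window size are all illustrative assumptions, and the paho-mqtt client simply stands in for whatever edge runtime is actually deployed.

    import json
    import random
    import statistics
    import time

    import paho.mqtt.client as mqtt

    BROKER_HOST = "edge-gateway.local"                 # hypothetical local broker
    SUMMARY_TOPIC = "site1/line4/temperature/summary"  # hypothetical topics
    ALERT_TOPIC = "site1/line4/temperature/alerts"
    ALERT_THRESHOLD_C = 85.0                           # illustrative threshold
    WINDOW_SIZE = 60                                   # samples per summary

    def read_sensor():
        # Stand-in for a real field-protocol read (OPC-UA, Modbus, ...).
        return 20.0 + random.random() * 70.0

    client = mqtt.Client()  # paho-mqtt 1.x style constructor
    client.connect(BROKER_HOST, 1883)
    client.loop_start()

    window = []
    while True:
        window.append(read_sensor())
        if len(window) >= WINDOW_SIZE:
            summary = {
                "ts": int(time.time()),
                "mean_c": round(statistics.mean(window), 2),
                "max_c": round(max(window), 2),
            }
            # One summary message replaces 60 raw samples...
            client.publish(SUMMARY_TOPIC, json.dumps(summary))
            # ...but out-of-range readings are escalated immediately.
            if summary["max_c"] > ALERT_THRESHOLD_C:
                client.publish(ALERT_TOPIC, json.dumps(summary))
            window.clear()
        time.sleep(1)

The point of the design is in the ratio: one summary (plus the occasional alert) crosses the network instead of sixty raw samples, which is what makes constrained links workable.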

At the edge or at the gateways, a combination of Eurotech Everyware Software Framework (ESF) and Red Hat Fuse and Red Hat AMQ middleware provides the ability to more securely acquire data through different industry protocols such as OPC-UA, Modbus, CANbus, Siemens S7, and Profinet, as well as through custom extensions. Additionally, the Wires functionality in ESF, combined with Apache Camel routes, enables OT architects and developers to create edge applications and to distribute the meaningful business logic needed to optimize business operations.
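In production, this protocol translation is handled by ESF and Camel. Purely as an illustration of the step they perform, the hypothetical Python sketch below polls one Modbus holding register with pymodbus and republishes the reading as JSON over MQTT; the device address, register layout, scaling, and topic are assumptions.

    import json
    import time

    import paho.mqtt.client as mqtt
    from pymodbus.client.sync import ModbusTcpClient  # pymodbus 2.x import path

    # Hypothetical device: a PLC exposing temperature in holding register 0,
    # scaled at 0.1 degrees C per count.
    plc = ModbusTcpClient("10.0.0.20", port=502)
    broker = mqtt.Client()
    broker.connect("edge-gateway.local", 1883)
    broker.loop_start()

    while True:
        result = plc.read_holding_registers(address=0, count=1, unit=1)
        if not result.isError():
            reading = {"ts": int(time.time()), "temp_c": result.registers[0] / 10.0}
            # Re-publish the raw register value as self-describing JSON.
            broker.publish("site1/plc7/temperature", json.dumps(reading))
        time.sleep(5)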

The IoT integration hub

Along with enabling key functionalities such as device management, the IoT integration hub logically centralizes management operations, security, and data access, and routes information to the right channels. By combining Eurotech Everyware Cloud with Red Hat AMQ and Red Hat OpenShift Container Platform, IoT-specific services like device management, health monitoring, and command and control can be executed in an open hybrid cloud leveraging both public and private cloud platforms, with workloads and data flowing into the Cloudera Enterprise Data Hub (EDH).
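Within Everyware Cloud and AMQ this routing is largely a matter of configuration, but a rough, hypothetical Python sketch can illustrate what the hub does with northbound telemetry: subscribe to the device-facing topics and forward each message into a Kafka topic for the EDH. Broker addresses and topic names are assumptions.

    import paho.mqtt.client as mqtt
    from kafka import KafkaProducer  # kafka-python client

    producer = KafkaProducer(bootstrap_servers="edh-broker:9092")

    def on_message(client, userdata, msg):
        # Forward each device message into Kafka, keyed by its MQTT topic so
        # downstream consumers can partition by device and channel.
        producer.send("telemetry", key=msg.topic.encode("utf-8"), value=msg.payload)

    hub = mqtt.Client()
    hub.on_message = on_message
    hub.connect("iot-hub.example.com", 1883)
    hub.subscribe("site1/#")  # hypothetical device topic namespace
    hub.loop_forever()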

The Enterprise Data Hub

Telemetry data routed to the Cloudera Enterprise Data Hub flows into Apache Kafka. From there, it can be easily consumed by an Apache Spark streaming application for processing, enrichment, and near real-time analysis (including machine learning inference), and then persisted in the appropriate storage system.
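A minimal PySpark sketch of that consumption step might look like the following, assuming JSON telemetry on a Kafka topic named telemetry. The message schema, windowing, and console sink are illustrative stand-ins; a production job would persist the results to Kudu, HDFS, or HBase instead.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col, from_json, window
    from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

    spark = SparkSession.builder.appName("TelemetryStream").getOrCreate()

    # Assumed message shape: {"device": "...", "ts": "...", "temp_c": ...}
    schema = (StructType()
              .add("device", StringType())
              .add("ts", TimestampType())
              .add("temp_c", DoubleType()))

    # Requires the spark-sql-kafka connector package on the classpath.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "edh-broker:9092")
           .option("subscribe", "telemetry")
           .load())

    telemetry = (raw.select(from_json(col("value").cast("string"), schema).alias("t"))
                 .select("t.*"))

    # Enrichment/analysis stand-in: 1-minute average temperature per device.
    per_device = (telemetry
                  .withWatermark("ts", "2 minutes")
                  .groupBy(window(col("ts"), "1 minute"), col("device"))
                  .agg(avg("temp_c").alias("avg_temp_c")))

    # Console sink for the sketch; swap in a Kudu/HDFS/HBase writer in production.
    query = per_device.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()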

Cloudera EDH offers several storage options to choose from, including HDFS, HBase, Kudu, and the leading public cloud object storage services, based on specific use case requirements. For many IoT use cases, Apache Kudu is likely an ideal choice, as it provides an optimal combination of features for telemetry data, including fast ingestion, fast updates, and high performance for random reads as well as analytical scans.
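As an illustration of what that choice looks like in practice, the sketch below defines a simple telemetry table with the Kudu Python client; the table name, columns, and partitioning are assumptions for a generic sensor feed.

    import kudu
    from kudu.client import Partitioning

    client = kudu.connect(host="kudu-master.example.com", port=7051)

    # Illustrative schema: one row per device reading, keyed by (device, ts).
    builder = kudu.schema_builder()
    builder.add_column("device").type(kudu.string).nullable(False)
    builder.add_column("ts").type(kudu.unixtime_micros).nullable(False)
    builder.add_column("temp_c").type(kudu.double)
    builder.set_primary_keys(["device", "ts"])
    schema = builder.build()

    # Hash-partition on device to spread high-rate writes across tablets.
    partitioning = Partitioning().add_hash_partitions(column_names=["device"],
                                                      num_buckets=4)
    client.create_table("telemetry", schema, partitioning)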

In addition to being highly performant, Kudu is also highly scalable, which is important for IoT use cases, where the end goal is to match real-time data coming from the field with long-term historical data to offer consumers a complete historical view. This is especially important for data scientists and machine learning applications, where subtle trends may only emerge over long periods of time.

For those data scientists, Cloudera Data Science Workbench enables consumption of data from Kudu and other sources, including Cloudera's other storage systems and contextual data residing in external systems, so they can analyze and develop machine learning models based on all of the information available to them. Cloudera Data Science Workbench lets data scientists use R, Python, or Scala with on-demand compute and secure access to Apache Spark™ and Apache Impala™ to quickly develop or prototype new machine learning models and easily deploy them into production.
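In a Data Science Workbench session, that workflow might look something like the hypothetical sketch below: read the telemetry history through the kudu-spark connector, sample it down to pandas scale, and fit a simple model. The table, feature, and anomaly-detection framing are illustrative choices, not a prescribed approach.

    from pyspark.sql import SparkSession
    from sklearn.ensemble import IsolationForest

    spark = SparkSession.builder.appName("TelemetryModelDev").getOrCreate()

    # Read the full telemetry history through the kudu-spark connector
    # (the connector JAR must be on the session's classpath).
    history = (spark.read.format("org.apache.kudu.spark.kudu")
               .option("kudu.master", "kudu-master.example.com:7051")
               .option("kudu.table", "telemetry")
               .load())

    # Sample the long-term history down to a pandas-sized training set.
    train = history.select("temp_c").sample(fraction=0.01, seed=42).toPandas()

    # Illustrative model: learn what "normal" temperature looks like so
    # out-of-pattern readings can be flagged as anomalies.
    model = IsolationForest(contamination=0.01)
    model.fit(train[["temp_c"]])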

Pushing actionable predictions to the edge

Data science is generally considered an offline practice, but IoT solutions often demand insights and predictions in real time. In many cases, even the subsecond latencies achievable by Spark Streaming are too slow, or the network is too expensive or unreliable, so machine learning and analytics must be pushed to the edge, where true real-time response is possible. What it means to "push a machine learning model to the edge" is a mystery to many, and properly explaining it is outside the scope of this article, but suffice it to say that machine learning models are serializable. This means that once trained, they can be saved, loaded, and transmitted over a network. So, in much the same way that an IoT data pipeline enables telemetry to flow from the edge to the data hub, trained machine learning models can flow from the data hub back to the edge.
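Continuing the illustrative model from the previous sketch, "pushing it to the edge" can be as simple as the hypothetical flow below: serialize the trained model, publish the bytes on a device-facing topic, and have the gateway deserialize it and score readings locally. Topics and transport are assumptions, and a real deployment would add model versioning and signing.

    import pickle

    import paho.mqtt.client as mqtt

    # --- In the data hub: serialize the trained model and transmit it ---
    blob = pickle.dumps(model)  # 'model' from the training sketch above
    hub = mqtt.Client()
    hub.connect("iot-hub.example.com", 1883)
    # Retained, QoS 1: gateways pick up the latest model even if offline now.
    hub.publish("site1/gateway7/models/temp-anomaly", blob, qos=1, retain=True)

    # --- On the edge gateway: load the model and score readings locally ---
    def on_message(client, userdata, msg):
        edge_model = pickle.loads(msg.payload)
        reading = [[91.5]]  # e.g., the latest temperature sample
        # IsolationForest returns -1 for anomalous observations.
        if edge_model.predict(reading)[0] == -1:
            print("anomaly detected: act locally, no round trip required")

    gateway = mqtt.Client()
    gateway.on_message = on_message
    gateway.connect("edge-gateway.local", 1883)
    gateway.subscribe("site1/gateway7/models/#")
    gateway.loop_forever()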

By implementing analytics and machine learning inference at the edge, predictions can be made and decisions executed in real time. Cutting-edge implementations take advantage of this to optimize operations, productivity, and costs, and to improve safety in the field by reacting faster.

In summary, an open data processing pipeline covers the lifecycle of your field telemetry data: moving the right amount of data from the producer to the right location at the right time, so it can be processed, stored, analyzed, and acted upon across the different tiers of the open end-to-end IoT architecture.

