Using Apache Hadoop in Surveillance and Supervision

Categories: General


There has been a spike in regulatory activities in most sectors since the economic crisis of 2008, where a large number of financial institutions had to be bailed out by government.  In banking, for example, there have been a number of regulations introduced including such broad initiatives as the Dodd-Frank Act and the Comprehensive Capital Analysis and Review CCAR. We have also seen governments levy large fines on banking institutions for failing to install proper oversight over their operations.  One example of that was a fine of $1.9 billion paid by HSBC to US authorities over deficiencies in its anti money-laundering controls.

There are three main challenges that most companies face when trying to conduct proper supervision on their operations.

  1. Traditional systems that are used for supervision are not really that powerful in analytics and are fairly expensive.  Typically they can only allow the use of rule based algorithms, which are fairly simple.  For example rules based on document metadata or simple keyword searches.
  2. Most large corporation have various systems where the information is stored.  Emails and instant messages might be stored in a document database on shared drives.  Trading data might be stored in structured data warehouses, such as Netezza or Oracle.  Other data, such as phone call audio files and video surveillance files, might just be stored on a regular NAS filesystem.  It is not trivial to bring all of this information together and conduct an investigation across all of this variety of data
  3. As companies expand their operations and venture into new businesses, its surveillance personnel must keep up with the growth.  Otherwise, it is very easy to find yourself in a position when there are just not enough people to conduct proper investigations.  Especially with a very large number of false positive alerts that are generated by systems in bullet point # 1.

In this blog, we’ll explore how you can offload the surveillance/analytics functionality to a Cloudera Enterprise Data Hub (EDH) powered by Apache Hadoop, which will give you a fast, scalable and cost effective solution with a long history (if desired), while saving some money in the process. An EDH can be used to aggregate all of the different data sources on HDFS filesystem and expose a consolidated view of the data to the internal surveillance groups.  In addition, new frameworks  like Apache Spark can be used to run Machine Learning algorithms to generate actionable alerts with fewer false positive alerts.

Offline vs Online Detection

It is important to differentiate between two use cases that EDH can help with.  We will refer to them as offline vs online detection.  

We cover offline detection in this post and generally define it as the ability to flag a suspicious activity, as it may warrant further investigations. There are two types of Models used in event detection.  

  • Rule based system: An alert is triggered when a set of activities fire up a certain set of encoded rules in the processing engine.  
  • Machine Learning models: We can use some historical data to train a model which uses the live feed of data to quantity the events as suspicious or not with certain probability. If there is no training data available, we can use some unsupervised models such as clustering (KMeans or LDA) to guide the decision making process probabilistically.

Offline design primarily operates on the principles of archiving the data and batch processing. Analysts can use this data to design new rules for rules based systems and the Machine Learning (ML) algorithms can use this data to generate models and backtest/cross-validate the parameters to get higher accuracy with minimal bias and least overfitting. For business analysts and investigators this data can also be used for exploratory analytics and forensics. It is much easier and cost-effective to work with the datasets that reside in the same physical system, than to have to deal with connectivity to multiple siloed systems.  When it comes to training Machine Learning algorithms, suffice to say the more historical data you have and the more processing power, the more successful your algorithms will be.  With a cluster of 50 EDH nodes and an average server with 256GB of memory, 40 HT cores and 24x1TB drives, combined resources of the cluster would add up to 2000 cores, 12TB of memory, 1.2 PB of storage and 1200 hard disk spindles.  That is a lot of processing power to do number crunching when backtesting the models!

Ingesting The Data

Most of the institutions already have all the information they need to properly design ML algorithms and conduct investigations.  The trickier part is to bring in all the data into the EDH., as the data resides in disparate, siloed systems.  This will be the first logical step in our journey.

Most of the content that follows, outlines methods of how to ingest different types of data into EDH in a typical setup.  There could be some variations of ingestion mechanism in a specific firm, but the methodologies here could be easily adapted to handle all of them. In addition, we will leave out real time ingest techniques out of this blog in the interest of concision.  Suffice to say there are quite a few tools in EDH stack, such as Apache Kafka, Apache Flume and Spark Streaming that allow processing high volumes of data at high speeds that customers are using in production today to  speed up detection processes significantly.

Processing Documents

In a typical corporation, documents reside all over the organization in all different shapes and forms.  Some reside in Microsoft Sharepoint, Exchange, Google mail others, are on NAS drives.  Surprisingly, some companies don’t even analyze all of the documents that they possess!  For example some of the current solutions on the market that analyze emails are only looking at the message metadata and simple keywords to flag suspicious behavior.  After that the messages are archived on tapes never to be seen again.  EDH brings a new set of capabilities to help corporations to inexpensively store and analyze all of their data extending on one of the first use cases for Apache Hadoop many years ago as an active archive.

To help companies ingest various documents, open source projects such as Apache ManifoldCF can be utilized.   Many companies utilize Apache SOLR and  Apache HBase to implement eDiscovery systems in their clusters.  Here are few blogs that exemplify how this could be achieved, “Email Indexing using Cloudera Search and HBase” and “Indexing PDFS Images at Scale”

Once the data is in EDH, it stays there indefinitely.  Companies are free to devise their own retention strategies, but the system scales linearly if required and Cloudera Navigator now includes a unique policy engine that automates critical data stewardship and curation activities, including metadata classification and data archiving and retention.Therefore, expanding the cluster will expand both the processing power and storage space.  But what about the WORM (Write Once Read Many) requirement you may ask?  WORM is often a must have requirement for many financial and government organizations around surveillance.  The feature is coming to HDFS sooner rather than later, but if that is something required today, some 3rd party Hadoop based solutions can on the market can be utilized.  One such product is EMC’s Isilon, and EDH works very well on top of it.  

Processing Voice Calls

Voice calls could be transferred to HDFS in a form of audio files, via standard tools, such as hdfs commands or hdfs gateways.  After the files are in HDFS, they would be transcribed via a Spark application, utilizing one of the open source or proprietary transcription libraries.  One of the options is an open source library called CMU Sphinx.  The library was origionally developed by Carnegie Mellon University and is available under a BSD like license (  )

The beauty about using Spark to transcribe the audio files is that it is a fully distributed system, which means all of the cluster resources would be used utilized for processing.  This means many files could be transcribed in parallel, reducing the duration of this process significantly.

Processing trade data

Trade and Position data typically resides in one of two places.  Either it can be fetched from an enterprise Data Warehouse or from a NAS folder as files.  This data type is very structured, therefore it is easy to ingest and analyze it using regular SQL tools in EDH ecosystem.  

If we are ingesting from an RDBMS, Apache Sqoop can be utilized to do periodic delta pulls.  Apache Sqoop already has many fast connectors built for it by various database vendors.  In some cases where fast connectors are not available, a generic JDBC driver can be used to pull data from most database systems on the market.  Sqoop will open multiple connection to RDBMS and land the data in a folder on HDFS.  After the files are on HDFS, a Hive schema can be applied which would very closely mimic the schema in the source RDBMS

If the Trade data is available as files, than the process would be very similar to above.  They would be imported to HDFS via standard ecosystem tools such as hdfs commands, Apache Flume or HDFS NFS Gateway.  After the data is in HDFS the schema is applied.  Many times messages are provided in a FIX (Financial Industry Exchange) format.  There is an excellent blog on how to handle this type of messages in Hive: How to Read Fix Messages Using Apache Hive and Impala.

Processing Video Files

Video servers produce video files, which than can be imported to HDFS.

A Spark application can then use OpenCV (JavaPresets ) library to decode the videos and generate frame images.  It can also detect and track features in each extracted frame.  These features can be saved in HBase for further analytics.

Object detection in video frames

OpenCV can be used here for computing the features and doing object detection, which subsequently feeds into a classification algorithm.

New objects can be trained in OpenCV to be detected. We describe Cascade Classifiers as an example. One can use either the opencv_haartraining or opencv_traincascade applications for training for detection of new class of objects. The later supports both Haar and LBP features and LBP features, being integral in nature are faster to compute and suitable for detecting objects in a high frame rate scenario.

  • Data Set: We need a set of images corresponding to negative and positive sets.
  • Generating/Gathering the samples for training”
    • Positive samples : correspond to images with detected objects.
      • Depending on image type, a large number of samples may be needed to extract the right features.
      • For example, for a rigid image, such as a logo, couple of positive samples should be enough.
      • For an object whose appearance differs with perspective, we may require thousands of images for a good representation of the object, such as a face, or a telephone. This set can be created by specified amount of range and randomness by opencv_createsamples utility.
    • Negative samples: correspond to set of images with no occurrence of the detected objects.
      • These samples are enumerated in a text file, separated by newline.
  • Training: Either opencv_traincascade or opencv_haartraining may be used to train a cascade classifier.
    • Use the options at command line to generate the trained model.
    • After the classifier is finished the trained cascade will be saved in cascade.xml

A simple example for face detection using OpenCV would go something like this:

  • Convert to Grayscale
  • Equalize the histogram
  • Use a frontal face Haar Cascade feature extractor for face detection. ( Face Recognition, however is out of the scope for this blog )

Subsequent objects can be trained for detection within these images as well.

Classification/Anomaly Detection

The alerts for rule based system are easier and tangible to craft. But, considering the number of ingested data sources, these rules can get tedious and unwieldy.

We can have each data source be operated upon by a suitable classification algorithm and combine the results with the respective weighted probabilities. For example, the documents and transcribed phone calls could be classified using a simple Naive Bayes model, using MLLib. The documents will have to be transformed to the TF-IDF model in order to featurize them before feeding into the Naive Bayes classification model. More details could be found here.

All the data sources can similarly have their own classifiers to generate an output class “anomaly(0)” or “not(1)”. Certain data sources could also have multi class outputs such as with the video data, “hasPhone”, “hasFace” etc. or gestures such as “thumbsUp”.

The output of all such classifiers can hence be used to train a simple logistic regression model or use a simple voting mechanism and subsequently use that as prediction mechanism for generating alerts.

We also have the option of training one single classifier using the features from all the data sources combined. We can use the class of ensemble classifiers, such as Random Forests. Random Forests are ensemble of decision trees, which for this particular case, may be very well suited. They do not require feature scaling and are able to capture non-linearities and feature interactions pretty well.

And as we mentioned earlier in the post, in case of lack of training data available, we can use some unsupervised models such as clustering (KMeans or LDA) to guide the decision making process probabilistically.

Either of these approaches, would result in an automated, scalable and probabilistic generation of results.

We can leverage Spark Streaming here, as it can process many data streams in parallel, especially
if all the data streams are of the same type, all the streams get merged
and processed together through the same operations. This pipeline allows for scalability, as we can ingest images/frames as fast as they are produced by a typical large corporation by adding nodes to the cluster as needed.

The support for MLLib on Spark Streaming gives us the advantage of being able to do run clustering or dimensionality reduction algorithms on the image data, if needed, prior to being processed by OpenCV for compression of faster processing.


In this blog we covered the full lifecycle of using an Enterprise Data Hub in Surveillance from ingestion of disparate data sources into an EDH platform to subsequently using that data for generating alerts in an automated fashion using Machine Learning algorithms. We also covered some methodologies to featurize a few data sources using either third party libraries (eg. OpenCV) or MLLib on Spark in order to generate data models for the downstream classification.  We showed how this can help companies conduct surveillance of their businesses, in an automated and scalable fashion.  The overall framework here, however, can be extended or modified to serve many other use cases.  The ability to combine different data sources, featurize the data, run Machine Learning algorithms are all things that pioneer companies like Google, Yahoo, Amazon, Ebay, Facebook have been doing for years.  Now all mainstream companies can do exactly the same things to improve their operations and have structured and demonstrable monitoring built with the gathered data to secure the operations, data, services and hence avoid being levied by severe penalties.  There are numerous cases similar to HSBC case described in paragraph.  Companies can avoid billions of dollars in fines by doing the right thing and investing into a system like EDH and save money by replacing existing systems with a single consolidated hub. Even the regulators themselves are using Hadoop based EDH’s for surveillance to monitor compliance FINRA the Financial Industry Regulatory Body is a great example of a Cloudera customer processing 50 billion events per day. However the most interesting aspect of using an EDH is that you can also then re-purpose the same data for not just surveillance but other useful board level objectives like customer 360, cyber security etc..


One response on “Using Apache Hadoop in Surveillance and Supervision

Leave a Reply