The following was originally published on the Sand Hill Group blog.
Although fraud and abuse are often cited as main drivers for the adoption of Hadoop in highly regulated industries, there has been relatively little focus on applying big data to the prevention of money laundering within commercial verticals. The former White House Deputy Chief Technology Officer, Daniel Weitzner, recently told The Wall Street Journal, “[Companies have] taken it on themselves to spot fraudulent transactions. [They] have invested billions in incredibly sophisticated Big Data techniques… But the understanding is the government—[and not banks]—will do the analysis to spot money laundering.”
However, a series of high-profile decisions by the U.S. Department of Justice against BNP Paribas, JP Morgan Chase, Barclays, and other large, global banks resulting in multi-billion-dollar fines has brought anti-money-laundering (AML) to the top of the financial services industry’s priority list. While the first wave of investment in big data tools and technology has heretofore been targeted at the identification and prevention of nefarious activities that lead to direct costs for banks, payment processors, and their customers, spending in the near term will likely be driven by compliance with three key pieces of AML regulation:
- The Bank Secrecy Act (BSA)
- Know Your Customer (KYC)
- The Foreign Account Tax Compliance Act (FATCA)
Big Data Lightens the Burden of Investigation
Unlike other forms of fraud that are identified with machine learning algorithms that detect anomalies and outliers, money laundering schemes are designed to closely mimic typical banking behaviors and are, therefore, characteristically less anomalous. The thresholds mandated by reporting policies like the BSA and utilized by first- and second-generation AML systems are well known, so criminals have little difficulty shaping their trade and transaction behaviors to appear above-board and remain largely imperceptible, even to specialized software.
As a result, these systems must be enriched with much larger and more diverse data sets to isolate signals of possible money laundering. When a signal is detected, human judgment must be applied: a case is opened, kicking off an inquiry to verify the crime and the extent of the damage. Without big data, AML indicators are often not distinct enough to be caught by computational models, leaving most of the work to a time-consuming and expensive investigation. In fact, respondents to KPMG’s 2014 Global Anti-Money Laundering Survey reported they are “increasingly unhappy with their current automated monitoring efforts, [and are] looking for software that can reduce the burden on the compliance department.”
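To make the threshold problem concrete, consider “structuring”: splitting large cash movements into deposits that each fall just under the BSA’s $10,000 currency transaction report line. The sketch below flags accounts with repeated near-threshold deposits; the fraction, hit count, and data shape are illustrative assumptions, not the rules of any real monitoring system.

```python
from collections import defaultdict

# BSA currency transaction reports trigger at $10,000, so a classic laundering
# pattern ("structuring") is many deposits just under that line. The fraction
# and hit count below are invented for illustration.
CTR_THRESHOLD = 10_000
NEAR_FRACTION = 0.9   # deposits above 90% of the threshold look suspicious
MIN_HITS = 3          # flag after this many near-threshold deposits

def flag_structuring(transactions):
    """transactions: iterable of (account_id, amount) pairs.
    Returns the account ids with repeated just-under-threshold deposits."""
    near_threshold = defaultdict(int)
    for account, amount in transactions:
        if CTR_THRESHOLD * NEAR_FRACTION <= amount < CTR_THRESHOLD:
            near_threshold[account] += 1
    return {acct for acct, hits in near_threshold.items() if hits >= MIN_HITS}

txns = [("A", 9_500), ("A", 9_800), ("A", 9_900), ("B", 4_000), ("B", 12_000)]
print(flag_structuring(txns))  # → {'A'}
```

Because launderers know this exact rule, they spread deposits across accounts, institutions, and time, which is precisely why the wider, longer-horizon data sets discussed below matter.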
Apache Hadoop is the ideal platform for AML because it augments all of the core functions of a specialized system to better handle big data: data collection, data preparation, automated evaluation, model building, and investigation. In the modern AML architecture, fully integrated with an enterprise data hub, Hadoop initially stages massive, complex data for legacy solutions, which provide the runtimes for the predictive models and perform the actual fraud detection. Beyond the introductory use case of more expansive and affordable storage, Hadoop’s natural fit for backtesting against long-term descriptive data is gaining popularity for more advanced AML workloads, as is the use of other components in the Hadoop stack for exploration, discovery, investigation, and forensics.
Building an AML Solution with an Enterprise Data Hub
Here’s a brief overview of the enterprise data hub value chain for AML:
Data Collection. Bank data tends to be segregated into silos, and modeling is usually limited to a few weeks’ or months’ worth of data. In contrast, the cost of storing data on Hadoop is typically orders of magnitude lower than any alternative, meaning data spanning decades can easily and affordably be retained and queried in one place.
Data Preparation. Hadoop excels at enriching, transforming, and vectorizing data before it is scored for fraud. It supports the heuristic matching algorithms required to prepare certain types of data and integrates with familiar ETL tools, offloading the heavy lifting of collection, transformation, and preparation.
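The “vectorization” step can be pictured as turning a raw transaction record into a fixed-length numeric vector for downstream scoring. The field names and features in this sketch are invented for illustration and are not drawn from any specific AML product:

```python
# Hypothetical vectorization sketch: map a raw transaction record to a
# fixed-length numeric feature vector. Fields and features are illustrative.

CHANNELS = ["wire", "ach", "cash", "check"]  # assumed one-hot categories

def vectorize(txn):
    """txn: dict with 'amount', 'channel', 'cross_border', 'hour' keys."""
    one_hot = [1.0 if txn["channel"] == c else 0.0 for c in CHANNELS]
    return [
        txn["amount"] / 10_000.0,           # scale amount to threshold units
        1.0 if txn["cross_border"] else 0.0,
        1.0 if txn["hour"] < 6 else 0.0,    # off-hours activity flag
    ] + one_hot

v = vectorize({"amount": 9500, "channel": "wire", "cross_border": True, "hour": 3})
print(v)  # → [0.95, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

In a production pipeline this per-record logic would run in parallel across the cluster (for example as a MapReduce or Spark job) over years of history, which is exactly the heavy lifting the text describes.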
Fraud Scoring. Access to a variety of predictive models improves the accuracy of fraud models. Hadoop’s support for multiple frameworks can bring multiple computational techniques to bear on the AML problem, including static rules engines, state machines, graph algorithms, natural language processing, and machine learning.
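Of the techniques listed above, a static rules engine is the simplest to sketch: each rule a transaction trips contributes a fixed number of points toward its score. The rules and point values here are invented for illustration.

```python
# Minimal sketch of a static rules engine, one of the scoring techniques the
# text lists alongside graph algorithms and machine learning. Rules and point
# values are invented for illustration.

RULES = [
    ("near_ctr_threshold", lambda t: 9_000 <= t["amount"] < 10_000,   40),
    ("cross_border",       lambda t: t["cross_border"],               30),
    ("new_counterparty",   lambda t: t["counterparty_age_days"] < 30, 30),
]

def score(txn):
    """Sum the points of every rule the transaction trips (0-100)."""
    return sum(points for _, rule, points in RULES if rule(txn))

suspicious = {"amount": 9_600, "cross_border": True, "counterparty_age_days": 10}
routine = {"amount": 120, "cross_border": False, "counterparty_age_days": 900}
print(score(suspicious), score(routine))  # → 100 0
```

A real deployment would blend such a rule score with model-based scores (graph, NLP, machine learning), which is the point of running multiple frameworks side by side on one platform.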
Model Development. Criminal methods evolve to evade detection, requiring predictive models to be improved over time. While some models are relatively static, others use techniques like linear regression and clustering, which require training on a historical data set. Interactive query tools like Cloudera Search and Impala facilitate the discovery of new patterns and associations, while the availability of more data and processing power in Hadoop allows models to incorporate more parameters, train on a longer historical perspective, and iterate more rapidly when backtesting new variations.
Investigation. Improving model accuracy to eliminate false positives, thereby reducing the time- and resource-intensive caseload for the human element of investigation, is a major way Hadoop decreases the cost of AML. As part of an enterprise data hub, ad hoc interactive query reduces the burden of investigation by providing fast answers to arbitrary questions over large data sets.
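The kind of “arbitrary question” an investigator might pose is easiest to show as SQL. In the sketch below, SQLite stands in for an interactive engine like Impala, and the schema and data are invented for illustration:

```python
import sqlite3

# SQLite stands in here for an interactive SQL engine such as Impala; the
# schema and figures are invented. The point is the shape of an ad hoc query
# an investigator can answer in seconds rather than via a batch job.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (account TEXT, amount REAL, country TEXT)")
conn.executemany("INSERT INTO txns VALUES (?, ?, ?)", [
    ("A", 9500, "KY"), ("A", 9800, "KY"), ("B", 120, "US"), ("A", 9900, "KY"),
])

# "Which accounts moved more than $25k through a single offshore country?"
rows = conn.execute("""
    SELECT account, country, SUM(amount) AS total
    FROM txns
    GROUP BY account, country
    HAVING total > 25000
""").fetchall()
print(rows)  # → [('A', 'KY', 29200.0)]
```

Against an enterprise data hub, the same query pattern runs over years of consolidated history instead of a toy table, which is what shrinks the investigation caseload.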
As part of an enterprise data hub, Hadoop’s flexibility, scalability, and affordability are extending existing investments in dedicated fraud-detection solutions by increasing the volume, age, and variety of data that can be examined while speeding up data transformation for faster time to insight. Once such massive data is consolidated, Hadoop can increasingly take on more advanced AML workloads such as entity matching, while Cloudera Search and Impala reduce the complexity of model development, process automation, and case investigation.