The following was originally published by the Wall Street Technology Association in the most recent issue of the WSTA Ticker e-zine.
Recordkeeping and reporting requirements have long challenged the financial services industry and constitute the original definition of the sector’s big data problem. The dual objectives of managing historical data to comply with federal requirements and retrieving and querying more data on an ad hoc basis can be both disruptive to the business and prohibitively expensive. The diversity of data makes reporting expensive due to the variety of workloads required—ETL, warehousing, reporting—while structured query language (SQL), the primary tool for business intelligence and analysis, is not an adequate tool for order linkage.
Audits to comply with the Order Audit Trail System (OATS) regulation of the Securities and Exchange Commission (SEC) are complex and costly because they require data to be found, collected, transformed, stored, and reported on demand from a variety of sources and data formats on relatively short timelines in order to avoid fines (or worse). Once the data is brought together, it typically sits in storage and is no longer easily available to the business. Soon, the Consolidated Audit Trail (CAT) will mandate finer-grained order, cancelation, modification, and execution details in a system governed by the Financial Industry Regulatory Authority (FINRA).
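The shift toward CAT can be pictured as recording every step of an order’s lifecycle as a discrete, timestamped event. The following is a minimal sketch in Python; the record layout and field names are hypothetical illustrations of the kind of finer-grained detail involved, not the actual CAT specification:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical, simplified order-event record: each new order,
# modification, cancelation, and execution becomes its own
# timestamped row rather than an update to a single record.
@dataclass(frozen=True)
class OrderEvent:
    order_id: str         # firm-assigned order identifier
    event_type: str       # "NEW" | "MODIFY" | "CANCEL" | "EXECUTE"
    timestamp: datetime   # event time, UTC
    symbol: str           # security identifier
    quantity: int         # shares affected by this event
    price: Optional[float]  # limit/execution price; None for market orders

# One order's lifecycle expressed as three discrete events.
events = [
    OrderEvent("ORD-1", "NEW",
               datetime(2014, 6, 2, 13, 30, tzinfo=timezone.utc),
               "IBM", 500, 185.25),
    OrderEvent("ORD-1", "MODIFY",
               datetime(2014, 6, 2, 13, 31, tzinfo=timezone.utc),
               "IBM", 300, 185.10),
    OrderEvent("ORD-1", "EXECUTE",
               datetime(2014, 6, 2, 13, 32, tzinfo=timezone.utc),
               "IBM", 300, 185.10),
]
```

At this granularity, a single busy trading day generates billions of such rows across a large firm, which is what pushes the storage and reporting problem to the scales discussed below.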
Expanding reporting requirements—for both industry firms and regulatory agencies—are overwhelming systems that were originally built as siloed data warehouses, then duplicated and archived to tape or RAID to meet Write-Once/Read-Many (WORM) requirements. On the reporting side, traditional relational databases (RDBMSs) were not designed for the increasing volume and variety of data required for OATS (and, eventually, CAT) compliance.
Build a Hadoop Active Archive
As the requirements for compliance with an increasing variety of risk, conduct, transparency, and technology standards grow to exabyte scale, financial services firms and regulatory agencies are building enterprise data hubs with Apache Hadoop at the core. With Hadoop, the IT department works across the different business units to build an active archive for multiple users, administrators, and applications to simultaneously access in real time with full fidelity and governance based on role and profile.
Building an active archive with Hadoop makes the data required for reporting less disparate and reduces its movement to staging and compute. HDFS and MapReduce offer significant cost savings over the vast majority of (perhaps all) other online WORM-compliant storage technologies and are far more format-tolerant and business-amenable than tape storage. The industry-standard servers on which Hadoop clusters are built also provide latent compute alongside storage, which can easily be applied to ETL jobs to speed transformation and cut reporting timelines. Natural-language query tools built on Cloudera Search provide full-text, interactive search over the data in Hadoop and serve as the scalable, flexible indexing component of an enterprise data hub. Impala provides in-cluster reporting and investigation capabilities to keep the data required for auditing accessible in its original format and fidelity for business intelligence and other workloads, while Apache Spark provides significantly faster and more robust order linkage.
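Order linkage—reconstructing an order’s full lifecycle from events scattered across systems—is the workload noted above as a poor fit for SQL joins. The core step is a group-and-sort by order identifier, which Spark distributes across the cluster; the sketch below shows only that linkage logic in plain, single-machine Python, with made-up event tuples for illustration:

```python
from collections import defaultdict

# Each raw event is (order_id, sequence_no, event_type), arriving
# in no particular order from multiple source systems.
def link_orders(events):
    # Group events by order identifier...
    chains = defaultdict(list)
    for order_id, seq, event_type in events:
        chains[order_id].append((seq, event_type))
    # ...then sort each order's events into lifecycle sequence.
    return {oid: [etype for _, etype in sorted(evts)]
            for oid, evts in chains.items()}

raw_events = [
    ("ORD-2", 1, "NEW"),
    ("ORD-1", 3, "EXECUTE"),
    ("ORD-1", 1, "NEW"),
    ("ORD-2", 2, "CANCEL"),
    ("ORD-1", 2, "MODIFY"),
]

linked = link_orders(raw_events)
# linked["ORD-1"] == ["NEW", "MODIFY", "EXECUTE"]
# linked["ORD-2"] == ["NEW", "CANCEL"]
```

In a relational database this becomes a chain of self-joins over a very large table; expressed as a distributed group-and-sort, it parallelizes naturally, which is one reason Spark handles linkage at audit scale more robustly.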
Extend Value with a Data Hub
Thanks to Hadoop’s relatively low cost, scalability, and ease of integration, an enterprise data hub used in conjunction with traditional storage and data warehousing serves both the banks building reports and the agencies, such as FINRA (a Cloudera Enterprise customer), that receive, store, and scrutinize them. In fact, Cloudera users in the broker-dealer and retail banking industries—including many of the biggest names on Wall Street—have reported completing natural-language-processing jobs required for SEC record-keeping in only two hours, compared to at least two weeks to run the same jobs on specialized systems with much larger hardware footprints.