According to the US Treasury – “Money laundering is the process of making illegally-gained proceeds (i.e. “dirty money”) appear legal (i.e. “clean”). Typically, it involves three steps: placement, layering and integration. First, the illegitimate funds are furtively introduced into the legitimate financial system. Then, the money is moved around to create confusion, sometimes by wiring or transferring through numerous accounts. Finally, it is integrated into the financial system through additional transactions until the “dirty money” appears “clean.” (Source – Wikipedia)
Figure 1 – How Money Laundering works
The basic money laundering process has three steps:
- Placement – At the first stage, the launderer (typically a frontman such as a drug trafficker, white collar criminal, corrupt public official, terrorist or con artist) inserts the ill-gotten funds into the cash stream of a legitimate financial institution. This is typically done at a bank using a series of small cash deposits, but it can also be done at a securities broker by channeling the funds into securities purchases.
- Layering – Layering is the most complex stage of an ML operation. Here the funds inserted in the first step are converted into a legitimate-looking holding, either monetary instruments or a physical asset such as real estate.
- Integration – At the final stage, the funds are fully “washed” and brought back into the legitimate economy, for example via the resale of the assets purchased during layering.
As discussed in the first post, now more than ever an efficient & scalable technology implementation underpins an effective AML compliance program. As data volumes grow due to a high number of customers across multiple channels (ATM kiosk, online banking, branch banking & call center) as well as more complex transactions, more types of data are on-boarded at an ever-increasing ingest velocity.
Thus, the challenges for IT Organizations when it comes to AML (Anti Money Laundering) are now manifold:
- Ingest a variety of data on the fly – ranging from Core Banking Data to Terrorist watch-lists to Fraudulent Entity information to KYC information etc
- The need to monitor every transaction for Money Laundering (ML) patterns as depicted in Figure 1 – right from customer on-boarding. This sometimes resembles the proverbial search for a needle in a haystack
- The ability to perform entity linked analysis that can help detect relationships across entities that could signify organized money laundering rings
- The need to create aggregate and individual customer personas that adjust dynamically based on business rules
- Integrating with a BPM (Business Process Management) engine so that the correct information can be presented to the right users as part of an overall business workflow
- Integrating with other financial institutions to support complex business operations such as KYCC (Know Your Customer’s Customer)
- Provide a way to create and change Compliance policies and procedures on the fly as business requirements evolve
- Provide an integrated approach to enforce compliance and policy control around business processes and underlying data as regulation gets added/modified with the passage of time
- Need to enable Data Scientists and Statisticians to augment classical compliance analytics with model building (e.g. fraud scoring) through knowledge discovery and machine learning techniques. There is a strong need to adopt pro-active alerting using advanced predictive analytic techniques
Existing solutions (whether developed in house or purchased off the shelf) in the AML space clearly fall behind in almost all of the above areas. In the last few years, AML has evolved into a heavily quant-based computational domain, not unlike Risk Management. Traditional Compliance approaches based on RDBMSs cannot scale with this explosion of data, nor handle the heterogeneity inherent in reporting across multiple kinds of compliance – both from a compute and a storage perspective.
So what capabilities does Hadoop add to existing RDBMS-based technology that did not exist before? The short answer is depicted in the picture below.
Figure 2 – Big Data Reference Architecture for Banking
Banking organizations are beginning to leverage Apache Hadoop to create a common cross-company data lake for data from different LOBs: mortgage, consumer banking, personal credit, wholesale and treasury banking. Internal managers, business analysts, data scientists and, ultimately, consumers are all able to derive immense value from the data. A single point of data management allows the bank to operationalize security and privacy measures such as de-identification, masking, encryption, and user authentication.
Banks can not only generate insights using a traditional ad-hoc querying model but also build statistical models & leverage data mining techniques (such as classification, clustering, regression analysis and neural networks) to perform highly robust predictive modeling. Such models encompass the behavioral and real-time paradigms in addition to the traditional batch mode – a key requirement in every enterprise-wide AML initiative.
Now, from a technology perspective, Hadoop helps Compliance projects in five major ways –
- Enables easy ingestion of raw data from disparate business systems that contain core banking, wealth management, risk, trade & position, customer account, transaction, wire, payment and event data, etc.
- Enables cost-effective, long-term storage of Compliance data, while allowing for the daily incremental updates of data that are the norm in AML projects. Hadoop helps store data for months and years at a much lower cost per TB compared to tape drives and other archival solutions
- Supports multi-tenancy from the ground up, so that different lines of business can all be tenants of the data in the lake, creating their own views of underlying base data while running analytics & reporting on top of those views.
- MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time.
- Hadoop supports multiple ways of running the models and algorithms that are used to find patterns of fraud and anomalies in the data and to predict customer behavior. Users have a choice of MapReduce, Spark (via Java, Python or R) and SAS, to name a few. Compliance model development, testing and deployment on fresh & historical data become very straightforward on Hadoop.
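To make this concrete, here is a minimal PySpark sketch of such a model run – training a fraud-scoring model on labeled historical transactions in HDFS and scoring a fresh feed. The HDFS paths, column names and the choice of logistic regression are illustrative assumptions, not part of any specific product.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("aml-fraud-scoring").getOrCreate()

# Hypothetical datasets: labeled historical transactions and a fresh daily feed.
hist = spark.read.parquet("hdfs:///compliance/transactions/history")
fresh = spark.read.parquet("hdfs:///compliance/transactions/incoming")

# Hypothetical numeric features engineered upstream (amounts, counts, 0/1 flags).
features = ["amount", "txn_count_7d", "cross_border_flag"]
assembler = VectorAssembler(inputCols=features, outputCol="features")

# Train on history (label = confirmed fraud/ML case), then score the fresh feed.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(
    assembler.transform(hist)
)
scored = model.transform(assembler.transform(fresh)).select(
    "txn_id", "probability", "prediction"
)
scored.write.mode("overwrite").parquet("hdfs:///compliance/scores/latest")
```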
Figure 3 – Big Data Reference Architecture for AML Compliance
One of the first tenets of the above architecture is to eliminate silos of information by consolidating all feeds from the above source systems into a massively scalable central repository, known as a data lake, running on HDFS (the Hadoop Distributed File System). Data redundancy (which is a huge problem at many institutions) is thus eliminated.
The flow of data is from left to right as depicted above and explained below –
1) Data Ingestion: This encompasses creation of the loaders to take in data from the above source systems. Hadoop provides multiple ways of ingesting this data, which makes it an extremely useful solution in a business area with heterogeneous applications that each have their own data transfer paradigms.
The most popular ones include –
- Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. It can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as HP Vertica, Teradata, Netezza, Oracle, MySQL, Postgres and HSQLDB, and the list of certified & fully supported connectors grows with every release. Sqoop provides a range of options & flags that describe how data is to be retrieved from a relational system, which connector to use, how many map tasks to use, split patterns, and final file formats. However, one limitation is that Sqoop (which is translated into MapReduce internally) is essentially a batch process. The tools below address message delivery that needs faster-than-batch processing, i.e. streaming mode.
- Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
- Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers (e.g. JMS- or AMQP-based systems) because of its higher throughput, reliability and replication. Kafka works in combination with Storm, HBase and Spark for real-time analysis and rendering of streaming data, as in the sketch following this list.
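As an illustration of the streaming path, the following is a minimal sketch of a Kafka producer (using the kafka-python client) publishing a wire-transfer event into the pipeline; the broker address, topic name and event fields are all hypothetical.

```python
import json
from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker address; events are serialized as JSON.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical wire-transfer event produced by a payments system.
wire_event = {
    "txn_id": "WT-000123",
    "account_id": "ACC-556677",
    "amount": 9500.00,
    "currency": "USD",
    "beneficiary_country": "KY",
}

# Publish to the 'wire-transfers' topic; a Storm/Spark consumer downstream
# would land the event in HDFS/HBase for transaction monitoring.
producer.send("wire-transfers", wire_event)
producer.flush()
```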
Developing the ingest portion will be the first step to realizing the overall AML architecture, as timely data ingestion is a large part of the problem at most institutions. Part of this process includes a) ingesting data from the highest-priority systems and b) applying the correct governance rules to that data. The goal is to create these loaders for the versions of the different source systems and to maintain them as part of the platform moving forward. The first step is to understand the range of Book of Record transaction systems (lending, payments and transactions) and the feeds they send out. The goal would then be to map those feeds to loaders on a release of an enterprise-grade open source Big Data platform, e.g. HDP (Hortonworks Data Platform), so that the loaders can be maintained as part of the implementation going forward.
Quick note on AML Projects, Hadoop & ETL –
It is important to note that ETL (Extract, Transform and Load) platforms & tools are very heavily used in financial services, especially in the AML space. It then becomes very important to clarify that the above platform architecture does not look to replace these tools at the outset. The approach is incremental, beginning with integration using certified adapters that are developed & maintained by large vendors. These adapters enable developers to build new Hadoop applications that exchange data with ETL platforms in a bi-directional manner. Examples of these tools include IBM DataStage, Pentaho, Talend, Datameer, Oracle Data Integrator etc.
2) Data Governance & Metadata Management: These are the loaders that apply the rules to the critical fields for AML Compliance. The goal here is to look for gaps in data integrity and any obvious quality problems involving range or table-driven data. The purpose is to facilitate data governance reporting as well as to achieve common data definitions across the compliance area. Right from the point that data is ingested into the data lake, Hadoop maintains a rich set of metadata about where each piece of raw data was ingested from, what transformations were applied to it, and which roles of users may operate on it based on a rich set of ACLs (Access Control Lists). Apache Atlas is a project that helps not just with the above requirements but can also export this metadata to tools like SAS, BI platforms, Enterprise Data Warehouses etc. so that these toolsets can leverage it to create execution plans that best access data in HDFS.
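A minimal sketch of such a rule-driven quality check, written in PySpark, might look like the following; the feed path, field names and validation rules are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aml-dq-checks").getOrCreate()

# Hypothetical raw wire feed landed by the ingestion layer.
wires = spark.read.parquet("hdfs:///raw/wires/current")

# Range, completeness and table-driven checks on critical AML fields.
failed = (
    F.col("amount").isNull() | (F.col("amount") <= 0)
    | F.col("beneficiary_country").isNull()
    | ~F.col("currency").isin("USD", "EUR", "GBP", "JPY")
)

# Quarantine failing records for the data governance report.
exceptions = wires.filter(failed)
exceptions.write.mode("overwrite").parquet("hdfs:///governance/dq_exceptions/wires")

print("Total records: {}, failing DQ rules: {}".format(wires.count(), exceptions.count()))
```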
3) Entity Identification: This is the establishment and adoption of a lightweight entity ID service. The service will consist of entity assignment and batch reconciliation. The goal here is to get each target bank to propagate the entity ID back into its booking and payment systems; transaction data then flows into the lake with this ID attached, providing a way to build a Customer 360 view.
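As a sketch of the simplest form of entity assignment – a deterministic ID derived from normalized customer attributes – consider the following Python function. The attribute choices and ID format are hypothetical; a production service would also need fuzzy matching and batch reconciliation.

```python
import hashlib

def assign_entity_id(full_name: str, date_of_birth: str, tax_id: str) -> str:
    """Derive a deterministic entity ID from normalized customer attributes.

    Only the deterministic case is shown; fuzzy matching and reconciliation
    are out of scope for this sketch.
    """
    normalized = "|".join(
        part.strip().upper() for part in (full_name, date_of_birth, tax_id)
    )
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return "ENT-" + digest[:16]

# Two source systems with slightly different formatting resolve to the same ID.
print(assign_entity_id("John Q. Public", "1970-01-01", "123-45-6789"))
print(assign_entity_id("  john q. public ", "1970-01-01", "123-45-6789"))
```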
4) Data Cleansing & Transformation: Once the data is ingested into the lake, all transformation and analysis happen on HDFS. This involves defining the transformation rules that are required in the Compliance area to prep, encode and transform the data for its specific processing. Hadoop again provides a multitude of options from a transformation perspective – Apache Spark, MapReduce, Storm, Pig, Hive, Crunch, Cascading etc. (a short Spark sketch follows the list below).
- Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, and GraphX for graph processing. Sophisticated analysis (and calculations) can easily be implemented using Spark.
- MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS).
- Apache Hive is the de facto standard for interactive SQL queries over petabytes of data in Hadoop. With the completion of the Stinger Initiative, and the next phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive easily integrates with other BI & analytic technologies using a familiar JDBC interface.
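As promised above, here is a minimal PySpark sketch of such a cleansing/transformation step: it deduplicates a raw transaction feed, standardizes dates and currency, and publishes the result as a Hive table. The paths, column names and table names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("aml-transform").enableHiveSupport().getOrCreate()
)

# Hypothetical raw feed landed by the ingestion layer.
raw = spark.read.json("hdfs:///raw/core_banking/transactions/current")

cleansed = (
    raw.dropDuplicates(["txn_id"])                                    # remove duplicate records
       .withColumn("txn_date", F.to_date("txn_timestamp"))            # standardize dates
       .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))  # normalize to one currency
       .filter(F.col("amount_usd").isNotNull())
)

# Persist as a Hive table so downstream analytics and BI tools can query it.
cleansed.write.mode("overwrite").saveAsTable("compliance.transactions_cleansed")
```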
5) Analytic Definition: Defining & executing the analytics that are to be used for each risk and compliance area. These analytics span the gamut from ad-hoc queries to predictive models (written in SAS/R/Python) to fuzzy logic matching to customer segmentation (across both macro & micro populations) etc.
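For instance, customer segmentation could be sketched with Spark MLlib’s KMeans as below, clustering customers into behavioral personas against which individual activity can later be compared. The table and feature names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("aml-segmentation").enableHiveSupport().getOrCreate()

# Hypothetical per-customer aggregates, keyed by the entity_id from step 3.
profiles = spark.table("compliance.customer_profiles")

assembler = VectorAssembler(
    inputCols=["avg_monthly_volume", "cash_deposit_ratio", "wire_count_90d"],
    outputCol="features",
)

# Cluster customers into behavioral personas; activity far from a persona's
# norm becomes a candidate for alerting.
kmeans = KMeans(k=8, seed=42, featuresCol="features")
model = kmeans.fit(assembler.transform(profiles))
segmented = model.transform(assembler.transform(profiles))  # adds a 'prediction' column
segmented.write.mode("overwrite").saveAsTable("compliance.customer_segments")
```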
6) Report Definition: This is the stage where reporting and visualization take over. Defining the reports that are to be issued for each risk and compliance area, as well as creating interesting & context-sensitive visual BI, is the key focus here. These could run the gamut from a BI tool to a web portal that gives internal personnel and regulators a quick view of a customer’s or an entity’s holistic pattern of behavior. The key is to provide views of macro aggregates (i.e. normal behavior for a customer persona) as well as triggering transactions for a single entity, whether a retail customer or an institution. The wide variety of tools depicted at the top of Figure 2 all integrate well with the Hadoop data lake, including MicroStrategy, Tableau, QlikView et al.
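To illustrate the macro/micro split, the following Spark SQL sketch builds persona-level baselines and then surfaces individual transactions that breach them, publishing both as Hive tables that BI tools can query directly. The table and column names continue the hypothetical schema used in the earlier sketches.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aml-reporting").enableHiveSupport().getOrCreate()

# Macro view: baseline behavior per customer persona.
spark.sql("""
    SELECT s.prediction                          AS persona,
           AVG(t.amount_usd)                     AS avg_txn_amount,
           PERCENTILE_APPROX(t.amount_usd, 0.99) AS p99_txn_amount
    FROM compliance.transactions_cleansed t
    JOIN compliance.customer_segments s ON t.customer_entity_id = s.entity_id
    GROUP BY s.prediction
""").write.mode("overwrite").saveAsTable("compliance.persona_baselines")

# Micro view: individual transactions breaching their persona's 99th percentile,
# exposed as a Hive table that MicroStrategy/Tableau/QlikView can query directly.
alerts = spark.sql("""
    SELECT t.*
    FROM compliance.transactions_cleansed t
    JOIN compliance.customer_segments s ON t.customer_entity_id = s.entity_id
    JOIN compliance.persona_baselines b ON s.prediction = b.persona
    WHERE t.amount_usd > b.p99_txn_amount
""")
alerts.write.mode("overwrite").saveAsTable("compliance.triggering_transactions")
```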
Thus, if done right, Hadoop can form a strong backbone of an effective AML program.