A Brief (and highly simplified) History of Banking Data –
Financial Services organizations possess arguably the most diverse set of data of any industry vertical. Corporate IT organizations in the financial industry have been tackling data challenges at scale for many years now. Just take Retail Banking as an example. Traditional sources of data in this segment include Customer Account data, Transaction data, Wire data, Trade data, Customer Relationship Management (CRM) data, the General Ledger and other systems supporting core banking functions.
Shortly after these “systems of record” became established, enterprise data warehouse (EDW) based architectures began to proliferate, with the intention of mining the trove of real-world data that Banks possess. The primary goal was to provide conventional Business Intelligence (BI) capabilities across a range of use cases – Risk Reporting, Customer Behavior, Trade Lifecycle, Compliance Reporting and the like. These operations were run by centralized or localized data architecture groups responsible for maintaining a hodgepodge of systems delivering business metrics across the above functions, as well as supporting systems-based use cases like application & log processing – all of which further added to the maze of data complexity.
The advent of Social technology has added newer and wider sources of data – Social Networks and IoT data – along with the need to collect time series data and detailed information on every transaction, every purchase and the channel it originated from.
Thus, the Bank IT world was a world of silos until the Hadoop-led disruption arrived. The catalyst for this disruption is Predictive Analytics, which provides both realtime and deeper insight across a myriad of scenarios –
- Predicting customer behavior in realtime
- Creating models of customer personas (micro and macro) to track their journey across a Bank’s financial product offerings
- Defining 360 degree views of a customer so as to market to them as one entity
- Fraud monitoring & detection
- Risk Data Aggregation (e.g. the Volcker Rule)
- Compliance reporting etc.
The net result is that Hadoop and Big Data are no longer unknowns in the world of high finance. Banking organizations are beginning to leverage Apache Hadoop to create a common, cross-company data lake for data from different LOBs: mortgage, consumer banking, personal credit, wholesale and treasury banking. Internal Managers, Business Analysts, Data Scientists and, ultimately, Consumers are all able to derive immense value from the data. A single point of data management also allows the bank to operationalize security and privacy measures such as de-identification, masking, encryption and user authentication.
From a data processing perspective, Hadoop supports multiple ways of running the models and algorithms that are used to find patterns of fraud and anomalies in the data and to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python or R), Storm and SAS, to name a few, to create these models.
Financial Application development, model creation, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.
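To make that concrete, here is a minimal sketch, in PySpark (one of the options mentioned above), of training a simple logistic regression model to score transactions for fraud likelihood. The HDFS path, column names and features are hypothetical placeholders rather than any bank’s actual pipeline.

```python
# Hypothetical sketch: fraud scoring with Spark MLlib on data already in the lake.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-scoring-sketch").getOrCreate()

# Labeled historical transactions (illustrative path; is_fraud is a 0/1 label)
txns = spark.read.parquet("hdfs:///lake/cleansed/card_transactions")

# Assemble a few illustrative numeric features into a single vector column
assembler = VectorAssembler(
    inputCols=["amount", "merchant_risk_score", "txn_hour", "days_since_last_txn"],
    outputCol="features",
)
train = assembler.transform(txns).select("features", "is_fraud")

# Fit a simple classifier; clustering, Bayesian filters or neural networks
# could be swapped in here just as easily
model = LogisticRegression(labelCol="is_fraud", featuresCol="features").fit(train)

# Score data with the same feature assembly and model
scored = model.transform(assembler.transform(txns))
scored.select("is_fraud", "probability", "prediction").show(5)
```

The same pattern – assemble features, fit, score – carries over whether the model is run in batch over historical data or against freshly ingested feeds.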
How does Big Data deliver value –
Hadoop (and NoSQL databases) help Bank IT deliver value in five major ways as they –
- Enable more agile business & data development projects
- Enable exploratory data analysis to be performed on full datasets or samples within those datasets
- Reduce time to market for business capabilities
- Help store raw historical data at very short notice at very low cost
- Help store data for months and years at a much lower cost per TB compared to tape drives and other archival solutions
Why does data inflexibility hamper business value creation? The short answer is that anywhere from 60% to 80% of the time spent on data projects goes into ingesting and preparing the data into a format that can be consumed to realize insights, both analytical and predictive.[1]
While Hadoop has always been touted for its ability to process any kind of data (be it streaming, realtime or batch), its flexibility in enabling a speedier data acquisition lifecycle does not get nearly half as much attention. In Banking, multiple systems send data into a Data Lake (an enterprise-wide repository of data). These include systems carrying Accounting, Trade, Loan, Payment and Wire Transfer data, among others.
A constant theme and headache in Banking Data Management is the set of issues and bottlenecks encountered on an almost daily basis as data is moved from Book of Record Transaction (BORT) systems to Book of Record Enterprise Systems (BORES) –
- Hundreds of point-to-point feeds to each enterprise system from each transaction system
- Data being largely independently sourced leads to timing and data lineage issues
- End of Day/Month Close processes are complicated and error prone due to dealing with incomplete and (worse) inaccurate data
- The Reconciliation process then requires a large effort and also has significant data gaps from a granular perspective
Illustration – Data Ingestion and Processing Lifecycle
I posit that five major streams of work encompass the lifecycle of every large Hadoop project in financial services –
1) Data Ingestion: Ingestion is almost always the first piece of the data lifecycle, and developing this portion is the first step toward a highly agile & business-focused architecture. The lack of timely data ingestion frameworks is a large part of the problem at most institutions. As part of this process, data is a) acquired, typically from the highest-priority systems, and b) run through initial transformation rules.
2) Data Governance: These are the L2 loaders that apply rules to the critical fields needed for Risk and Compliance. The goal here is to look for gaps in the data and any obvious quality problems involving range or table-driven values, so as to facilitate data governance reporting (a rough sketch of such a check follows this list).
3) Data Enrichment & Transformation: This will involve defining the transformation rules that are required in each marketing, risk, finance and compliance area to prep the data for their specific processing.
4) Analytic Definition: Defining the analytics that are to be used for each Data Science or Business Intelligence Project
5) Report Definition: Defining the reports that are to be issued for each business area.
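To ground stage 2 a little, the PySpark sketch below shows the kind of rule-driven quality check an L2 loader might apply before data governance reporting. The feed name, field names and thresholds are purely illustrative assumptions.

```python
# Hypothetical data governance check on an already-ingested wire transfer feed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("l2-governance-sketch").getOrCreate()

wires = spark.read.parquet("hdfs:///lake/raw/wire_transfers")

# Rule 1: critical Risk/Compliance fields must not be null
# Rule 2: amounts must fall within a plausible range
checked = wires.withColumn(
    "dq_failures",
    F.concat_ws(
        ",",
        F.when(F.col("originator_id").isNull(), F.lit("missing_originator")),
        F.when(F.col("beneficiary_id").isNull(), F.lit("missing_beneficiary")),
        F.when((F.col("amount") <= 0) | (F.col("amount") > 1e9), F.lit("amount_out_of_range")),
    ),
)

# Split clean records from exceptions; the exception set feeds governance reports
clean = checked.filter(F.col("dq_failures") == "")
exceptions = checked.filter(F.col("dq_failures") != "")

clean.write.mode("overwrite").parquet("hdfs:///lake/cleansed/wire_transfers")
exceptions.write.mode("overwrite").parquet("hdfs:///lake/dq_exceptions/wire_transfers")
```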
As can be seen from the above, Data Ingestion is the first and one of the most critical stages of the overall data management lifecycle.
The question is: how do Big Data techniques help here?
At a high level, Hadoop techniques can help as they –
- Help centralize data, business and operations functions by populating a data lake with a set of canonical feeds from the transaction systems (see the sketch after this list)
- Incentivize technology to shrink not grow by leveraging a commodity x86 based approach
- Create Cloud based linearly scalable platforms to host enterprise applications on top of this data lake, including hot, warm and cold computing zones
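As a rough illustration of the “canonical feed” idea in the first bullet, the PySpark sketch below normalizes two hypothetical source layouts into one shared transaction schema before landing them in the lake. All the system names, columns and paths are invented for the example.

```python
# Hypothetical normalization of two differently shaped feeds into one canonical layout.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("canonical-feed-sketch").getOrCreate()

CANONICAL_COLS = ["txn_id", "account_id", "amount", "currency", "booked_at", "source_system"]

# Source A: a core banking extract with its own column names
core = (spark.read.parquet("hdfs:///landing/core_banking/postings")
        .select(F.col("posting_ref").alias("txn_id"),
                F.col("acct_no").alias("account_id"),
                F.col("amt").alias("amount"),
                F.col("ccy").alias("currency"),
                F.col("value_date").alias("booked_at"),
                F.lit("core_banking").alias("source_system")))

# Source B: a card processing feed with yet another layout
cards = (spark.read.parquet("hdfs:///landing/cards/auths")
         .select(F.col("auth_id").alias("txn_id"),
                 F.col("card_account").alias("account_id"),
                 F.col("auth_amount").alias("amount"),
                 F.col("auth_currency").alias("currency"),
                 F.col("auth_ts").alias("booked_at"),
                 F.lit("cards").alias("source_system")))

# One canonical feed replaces many point-to-point extracts
canonical = core.select(*CANONICAL_COLS).unionByName(cards.select(*CANONICAL_COLS))
canonical.write.mode("append").partitionBy("source_system").parquet("hdfs:///lake/canonical/transactions")
```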
Thus, Banks, insurance companies and securities firms that store and process huge amounts of data in Apache Hadoop gain better insight into both their risks and their opportunities. In the Capital Markets space, and for Stock Exchanges as well, deeper data analysis and insight can not only improve operational margins but also protect against one-time events that might cause catastrophic losses. However, the story around Big Data adoption in the average financial services enterprise is not all that revolutionary – it typically follows an evolutionary cycle in which a rigorous engineering approach is applied to gain small business wins before scaling up to more transformative projects.
The other key area where Hadoop (and NoSQL databases) enable an immense amount of flexibility is what is commonly known as “Schema On Read”.
Schema On Read (SOR) is a data ingestion technique that enables any kind of raw data to be ingested, at massive scale, without regard to the availability (or lack thereof) of a target schema/model, into the filesystem underpinning Hadoop – HDFS (the Hadoop Distributed File System). Once the data is ingested, it is cleansed, transformed, normalized and encoded based on the requirements of the processing application.
Schema On Read is a radical departure from the way classical data modeling is performed. Historical data architectures are based on relational databases and warehouses, where upfront modeling has to be performed before data can be ingested in relational form. This typically leads to lengthy cycles of transformation, development & testing before end users can access the data. Even then, far too much handcrafted work – what data scientists call “data wrangling,” “data munging” and “data janitor work” – is still required: according to interviews and expert estimates, data scientists spend from 50 to 80 percent of their time mired in this mundane labor of collecting and preparing unruly digital data before it can be explored for useful nuggets.[1]
Illustration – Schema on Read vs Schema On Write
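A minimal PySpark sketch of Schema On Read follows, assuming a hypothetical trades feed: the raw files land on HDFS untouched, and a schema is only projected onto them when an application reads the data.

```python
# Hypothetical schema-on-read: the schema is applied at read time, not at ingest time.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Step 1 (ingest): raw delimited files are copied into the lake as-is, with no upfront modeling,
# e.g. `hdfs dfs -put trades_2016-03-01.csv /lake/raw/trades/`

# Step 2 (read): each consuming application projects its own schema onto the raw data
trade_schema = StructType([
    StructField("trade_id", StringType()),
    StructField("instrument", StringType()),
    StructField("notional", DoubleType()),
    StructField("trade_ts", TimestampType()),
])

trades = (spark.read
          .schema(trade_schema)          # schema imposed here, on read
          .option("header", "true")
          .csv("hdfs:///lake/raw/trades/"))

# Only now is the data cleansed/normalized for this particular application's needs
trades.filter("notional > 0").createOrReplaceTempView("trades_clean")
```

Under Schema On Write, by contrast, the modeling, cleansing and loading would all have had to happen before a single file could land.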
Data Ingestion & Simple Event Processing are massive challenges in financial services, as there is a clear lack of solutions that can provide strong enterprise-wide dataflow management capabilities.
What are some of the key requirements & desired capabilities such a technology category should provide?
- A platform that can provide a standard for massive data ingest from all kinds of sources ranging from databases to logfiles to device telemetry to real time messaging
- Centralized ingest & dataflow policy management across thousands of applications – for instance, ingesting data from hundreds of application feeds that support millions of ATMs, hundreds of websites and mobile channels
- Pipe the ingested data over to a Hadoop Datalake for complex processing
- Extensibility with custom processors for application specific processing as data flows into the lake
- Robust Simple Event Processing capabilities (Compression, Filtering, Encryption etc.) – a rough sketch of this kind of per-record processing follows this list
- The ability to model the ingested data for consumption by different kinds of audiences – Business Analysts, Data Scientists, Domain Experts etc.
- The ability to apply appropriate governance and control policies on the data
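A dedicated dataflow platform provides these capabilities declaratively; purely to illustrate what “simple event processing” means here, the hand-rolled Python sketch below filters, lightly enriches and compresses a hypothetical ATM event feed in flight. The file names and fields are invented.

```python
# Hypothetical per-record processing of an ATM event feed: filter, enrich, compress.
import gzip
import json
import time

def process_feed(in_path: str, out_path: str) -> None:
    """Read raw ATM events, drop heartbeats, tag each record, gzip the output."""
    with open(in_path) as src, gzip.open(out_path, "wt") as sink:
        for line in src:
            event = json.loads(line)
            if event.get("type") == "heartbeat":       # filtering
                continue
            event["ingested_at"] = int(time.time())    # light enrichment
            sink.write(json.dumps(event) + "\n")       # compression via the gzip sink

process_feed("atm_events.jsonl", "atm_events.filtered.jsonl.gz")
```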
The next (and second) post in this series will examine an emerging but potentially pathbreaking 100% Open Source technology (now incubated by Hortonworks) that satisfies all of the above requirements and more in the area of scalable data ingestion & processing – Apache NiFi [2].
The final (and third) post will then examine the use of NiFi in the Financial Services use cases frequently discussed in this blog.
References –
[1] “For Big Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, The New York Times
[2] Apache NiFi – https://nifi.apache.org/