“In God we trust. All others must bring data.” – Dr. W. Edwards Deming, statistician, professor, author, lecturer, and consultant.
The first post in this three-part series described key ways in which innovative applications of data science are slowly changing a somewhat insular banking & financial services industry. The second post then delineated key business use cases enabled by a data-driven or data-native approach. This final post examines the foundational Data Science tasks & techniques that are commonly employed to extract value from data, with financial industry examples. We will round off the discussion with recommendations for industry CXOs.
The Need for Data Science –
It is no surprise that Big Data approaches were first invented & then refined at web scale businesses such as Google, Yahoo, eBay, Facebook and Amazon. These web properties offer highly user friendly, contextual & mobile native application platforms which produce large amounts of complex and multi-varied data from consumers, sensors and other telemetry devices. All this data is constantly analyzed to drive higher rates of application adoption, thus driving a virtuous cycle. We have discussed the (now) augmented capability of financial organizations to acquire, store and process large volumes of data by leveraging HDFS (the Hadoop Distributed File System) running on commodity (x86) hardware.
One of the chief reasons these webscale shops adopted Big Data is the ability to store the entire data set in Hadoop in order to build more accurate predictive models. The ability to store thousands of attributes at a much finer grain, over a longer period of history, instead of depending on a statistically significant sample, is a significant gain over legacy data technology.
Every year Moore’s Law keeps driving the costs of raw data storage down. At the same time, compute technologies such as MapReduce, Tez, Storm and Spark have enabled the organization and analysis of Big Data at scale. The convergence of cost effective storage and scalable processing allows us to extract richer insights from data. These insights then need to be operationalized at scale to provide business value, as the use cases highlighted in the last post (http://www.vamsitalkstech.com/?p=1582) demonstrate.
The differences between Descriptive & Predictive Analytics –
Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal of BI is primarily to surface macro or aggregate business trends across different dimensions such as time, product lines, business units & operating geographies.
BI is primarily concerned with “What happened, and what trends exist in the business, based on historical data?”. The typical use cases for BI include budgeting, business forecasts, reporting & key performance indicators (KPIs).
On the other hand, Predictive Analytics (a subset of Data Science) augments & builds on the BI paradigm by adding a “What could happen” dimension to the data in terms of –
- being able to probabilistically predict different business scenarios across thousands of variables
- suggesting specific business actions based on the above outcomes
Predictive Analytics does not intend to, nor will it, replace the BI domain; rather, it adds significant business capabilities that lead to overall business success. It is not uncommon to find real world business projects leveraging both of these analytical approaches.
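To make the contrast concrete, the minimal sketch below runs a BI-style aggregate query and a predictive churn model over the same hypothetical transaction extract. The file name and column names (region, quarter, amount, churned) are assumptions for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical transaction-level extract; file and columns are illustrative.
df = pd.read_csv("transactions.csv")  # columns: region, quarter, amount, churned

# Descriptive / BI: "What happened?" - aggregate revenue by region and quarter.
revenue_by_region = df.groupby(["region", "quarter"])["amount"].sum()
print(revenue_by_region)

# Predictive: "What could happen?" - estimate each customer's churn probability.
features = df[["amount"]]   # real models would use many more variables
labels = df["churned"]      # 1 = churned, 0 = retained
model = LogisticRegression().fit(features, labels)
print(model.predict_proba(features)[:5, 1])
```

The first half only summarizes history; the second assigns a forward-looking probability to each record, which is the essence of the “What could happen” dimension.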
Data Science –
So, what exactly is Data Science?
Data Science is an umbrella concept that refers to the process of extracting business patterns from large volumes of structured, semi-structured and unstructured data. Data Science is the key ingredient in enabling a predictive approach to the business.
Some of its key aspects are –
- Data Science is not just about applying analytics to massive volumes of data. It is also about exploring the patterns, associations & interrelationships of thousands of variables within the data. It does so by adopting an algorithmic approach to glean the business insights embedded in the data.
- Data Science is a standalone discipline that has spawned its own set of platforms, tools and processes across its lifecycle.
- Data science also aids in the construction of software applications & platforms to utilize such insights in a business context. This involves the art of discovering data insights combined with the science of operationalizing them at scale. The word ‘scale’ is key. Any algorithm, model or deployment paradigm should support an expanding number of users without the need for unreasonable manual intervention as time goes on.
- A data scientist uses a combination of machine learning, statistics, visualization, and computer science to extract valuable business insights hiding in data and builds operational systems to deliver that value.
- The machine learning components are classified into two categories: ‘supervised’ and ‘unsupervised’ learning. In supervised learning, the constructed model captures the effect a set of inputs has on known outputs. In unsupervised learning, there are no known outputs; the structure of the data is explained by so-called latent variables. Hybrid approaches are also possible for certain types of mining tasks (see the sketch after this list).
- Strategic business projects typically begin leveraging a Data Science based approach to derive business value. This approach then becomes integral and eventually core to the design and architecture of such a business system.
- Contrary to what some of the above may imply, Data Science is a cross-functional discipline and not just the domain of PhDs. A data scientist is part statistician, part developer and part business strategist.
- Working in small self sufficient teams, the Data Scientist collaborates with an extended group that includes visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOps. The success of data science projects often relies on the communication, collaboration and interaction with this extended team, both inside and possibly outside the organization.
- It needs to be clarified that not every business project is a fit for a Data Science approach. Such an advanced approach is called for when the business initiative needs to provide knowledge based decisions (beyond the classical rules engine/expert systems based approaches), deals with large volumes of relevant data, operates in a rapidly changing business climate, & finally requires scale beyond what human analysts can supply.
- Indeed, any project where hugely improved access to information & realtime analytics for customers, analysts (and other stakeholders) is a must for the business is fertile ground for Data Science.
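To illustrate the supervised/unsupervised distinction mentioned in the list above, here is a minimal sketch using scikit-learn on synthetic data; the dataset, algorithms and parameters are illustrative assumptions rather than a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Supervised: labeled outputs (y) guide the fit, so the model learns the
# effect the inputs have on the outputs.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised training accuracy:", clf.score(X, y))

# Unsupervised: no labels at all; the algorithm infers latent structure
# (here, cluster membership) from the inputs alone.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))
```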
Algorithms & Models –
The word ‘model’ is highly overloaded and means different things to different IT specialities – e.g. in the RDBMS world models imply data schemas, while statistical models are built by statisticians. However, it can safely be said that models are representations of a business construct or a business situation.
Data mining algorithms are used to create models from data.
To create a data science model, the data mining algorithm looks for key patterns in the data provided. The results of this analysis are used to define the best parameters for the model. Once identified, these parameters are applied across the entire data set to extract actionable patterns and detailed statistics.
The model itself can take various forms, ranging from a set of customer clusters, to a revenue forecasting model, to a set of fraud detection rules for credit cards, to a decision tree that predicts outcomes based on specific criteria.
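As a minimal illustration of an algorithm producing a model from data, the sketch below fits a toy revenue forecasting regression; the learned coefficients are the ‘parameters’ that, once identified, are applied to new data. The figures are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic history: quarterly marketing spend (input) vs. revenue (output).
spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # e.g. $M spent
revenue = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # e.g. $M earned

# The algorithm (ordinary least squares) searches for the best parameters.
model = LinearRegression().fit(spend, revenue)
print("Learned parameters:", model.coef_, model.intercept_)

# The resulting model is then applied to unseen data to make a forecast.
print("Forecast for $6M spend:", model.predict([[6.0]]))
```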
Common Data Mining Tasks –
There are many different kinds of data mining algorithms, but all of them address a few fundamental types of tasks. The popular ones are listed below along with relevant examples:
- Classification & Class Probability Estimation – For a given population, predict the discrete class (or set of classes) that each individual belongs to. An example classification question is – “Of all the wealth management clients in a given population, who are most likely to respond to an offer to move to a higher segment?”. Common techniques used in classification include decision trees, Bayesian models, k-nearest neighbors, induction rules etc. Class Probability Estimation (CPE) is a closely related concept in which a scoring model is created to predict the likelihood that an individual belongs to a given class (see the sketch after this list).
- Clustering is an unsupervised technique used to find classes or segments of populations within a larger dataset without being driven by any specific purpose. For example – “What are the natural groups our customers fall into?”. The most popular use of clustering techniques is to identify clusters to use in activities like market segmentation. A common algorithm used here is k-means clustering.
- Market basket analysis is commonly used to find associations between entities based on the transactions that involve them, e.g. the affinity grouping behind recommendation engines.
- Regression algorithms aim to estimate or predict a numerical value for each individual, e.g. “How much will a given customer use a new banking service?”.
- Profiling algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. Profiling is frequently used in anomaly detection systems, such as those that flag AML (Anti Money Laundering) violations and credit card fraud.
- Causal Modeling algorithms attempt to find out which business events influence others.
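Below is a minimal sketch of the classification & class probability estimation task from the list above, using a decision tree on hypothetical wealth management client data; the feature names and values are stand-ins, not a real schema.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical wealth management client attributes (stand-in values).
clients = pd.DataFrame({
    "assets_under_mgmt": [120, 800, 60, 450, 950, 300],
    "num_products":      [1,   4,   1,  3,   5,   2],
    "responded_before":  [0,   1,   0,  1,   1,   0],
})
moved_up = [0, 1, 0, 1, 1, 0]  # did the client move to a higher segment?

tree = DecisionTreeClassifier(max_depth=2).fit(clients, moved_up)

# Classification: predict the discrete class for each client.
print(tree.predict(clients))

# Class Probability Estimation: score the likelihood of belonging to the class.
print(tree.predict_proba(clients)[:, 1])
```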
There is no reason to be limited to just one of the above techniques when forming a solution. One example is to use one algorithm (say clustering) to determine the natural groups in the data, and then apply regression to predict a specific outcome based on that data (see the sketch below). Another example is to use multiple algorithms within a single business project to perform related but separate tasks, e.g. using regression to create financial reporting forecasts, and then using a neural network algorithm to perform a deeper analysis of the factors that influence product adoption.
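Here is a hedged sketch of the first combination: clustering discovers the natural groups, and the cluster label then feeds a regression as an extra feature. All data and parameters are synthetic assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Synthetic customer behavior: two natural groups with different dynamics.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y = np.concatenate([X[:100, 0] * 2.0, X[100:, 0] * 0.5]) + rng.normal(0, 0.1, 200)

# Step 1 - clustering determines the natural groups in the data.
labels = KMeans(n_clusters=2, n_init=10, random_state=7).fit_predict(X)

# Step 2 - regression predicts the outcome, with cluster membership as a feature.
X_aug = np.column_stack([X, labels])
reg = LinearRegression().fit(X_aug, y)
print("R^2 with the cluster feature included:", reg.score(X_aug, y))
```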
The Data Science Process –
A general process framework for a typical Data Science project is depicted below. The flow suggests a sequential waterfall, but it allows for Agile/DevOps loops in the core analysis & feedback phases. The process is also not a one-way pipeline; it allows for continuous improvement.
Illustration: The Data Science Process
- The central pool of data that hosts all the tiers of data processing in the above illustration is called the Data Lake. The Data Lake enables two key advantages – the ability to collect cross business unit data so that it can be sampled/explored at will, & the ability to run any kind of data access pattern across a shared data infrastructure: batch, interactive, search, in-memory, custom etc.
- The Data Science process begins with a clear and contextual understanding of the granular business questions that need to be answered from the real world dataset. The Data Scientist needs to be trained in the nuances of the business to achieve the appropriate outcome, e.g. detecting fraudulent transactions in the credit cards space, or predicting which customers in the Retail Bank are likely to churn over the next few months based on their usage patterns.
- Once this is known, relevant data needs to be collected from the real world. In Banking, these sources range from –
- Customer Account data, e.g. names, demographics, linked accounts etc.
- Transaction Data, which captures the low level details of every transaction (e.g. debit, credit, transfer, credit card usage etc.)
- Wire & Payment Data
- Trade & Position Data
- General Ledger Data and data from other systems supporting core banking functions
- Unstructured data, e.g. social media feeds, server logs, clickstream data & mobile application data etc.
- Following the planning stage, Data Acquisition is an iterative process of pulling data from the actual sources by creating appropriate loaders with the chosen technology components, e.g. Apache NiFi, Kafka, Sqoop, Flume, the HDFS API, Java etc.
- The next step is Data Cleansing. Here the goal is to look for gaps in the data (given the business context), ensuring that the dataset is valid, with no missing values, consistent in layout and as fresh as possible from a temporal standpoint. This phase also involves fixing any obvious quality problems involving range or table driven data. The purpose at this stage is also to facilitate & perform appropriate data governance.
- Exploratory Data Analysis (EDA) supports trial & error analysis of the data. This is the phase where plots and graphs are used to systematically work through the data. Its importance cannot be overstated, as it provides the Data Scientist and the business with a first flavor of the data.
- Data Analysis: the generation of features or attributes that will be part of the model. This is the step where the actual data mining takes place, leveraging models built using the algorithms described above (a minimal end-to-end sketch follows this list).
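The sketch below walks through the cleansing, EDA and analysis stages in miniature using pandas and scikit-learn; the file name, column names and churn label are illustrative assumptions, not a reference implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Acquisition stand-in: in practice the data would arrive via NiFi/Kafka/Sqoop.
df = pd.read_csv("retail_bank_customers.csv")  # hypothetical extract

# Data Cleansing: drop records with missing values, fix an out-of-range field.
df = df.dropna(subset=["age", "balance", "monthly_logins", "churned"])
df = df[df["age"].between(18, 110)]

# Exploratory Data Analysis: a quick, systematic look at the distributions.
print(df.describe())
print(df.groupby("churned")["monthly_logins"].mean())

# Data Analysis: feature generation, then model fitting and evaluation.
df["balance_per_login"] = df["balance"] / (df["monthly_logins"] + 1)
features = df[["age", "balance", "monthly_logins", "balance_per_login"]]
X_train, X_test, y_train, y_test = train_test_split(
    features, df["churned"], test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```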
The Data Cleansing and Data Analysis stages in particular contain further iterative steps of their own.
Once the models have been tested and refined to the satisfaction of the business, and their performance has been put through a rigorous test phase, they are deployed into production. Once deployed, they are constantly refined based on end user and system feedback.
The Big Data ecosystem (consisting of tools such as Pig, Scalding, Hive, Spark and MapReduce etc.) enables sea changes of improvement across the entire Data Science lifecycle, from data acquisition to data processing to data analysis. Big Data/Hadoop unifies all data storage in one place, which renders the data more accessible for modeling. Hadoop also scales up machine learning analysis thanks to its inbuilt parallelism, which adds a tremendous amount of value, for instance by training multiple models in parallel to improve their efficacy. The ability to work with complete datasets, as opposed to small samples, also helps greatly.
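As a hedged sketch of that parallelism, the snippet below fits a logistic regression with Spark MLlib on a dataset read straight from HDFS, so training is distributed across the cluster; the HDFS path and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

# Read the full historical dataset from the data lake (path assumed).
df = spark.read.csv("hdfs:///data/retail/transactions.csv",
                    header=True, inferSchema=True)

# Assemble raw columns (names assumed) into MLlib's feature vector format.
assembler = VectorAssembler(inputCols=["amount", "balance", "num_products"],
                            outputCol="features")
train = assembler.transform(df)

# The fit is distributed across the cluster, so the entire dataset -
# not just a sample - can be used to train the model.
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = lr.fit(train)
print("Model coefficients:", model.coefficients)
```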
Recommendations –
Developing a strategic mindset towards Data Science and predictive analytics should be a board level concern. This entails –
- To begin with, ensuring buy-in & commitment, in the form of funding, at a Senior Management level. This support needs to extend across the entire lifecycle depicted above, starting with the identification of business use cases.
- Building extensive but realistic ROI (Return On Investment) models during due diligence, with periodic updates for executive stakeholders.
- On a similar note, ensuring buy-in through a strategy of co-opting & aligning with Quants and other high potential areas of the business (as covered in the use cases in the last post).
- Identifying leaders within the organization who can not only lead important projects but also create compelling content to evangelize the use of predictive analytics.
- Beginning to tactically bake in or embed data science capabilities across the different lines of business and horizontal IT.
- Slowly extending adoption to the Risk, Fraud, Cybersecurity and Compliance teams as part of a second wave. This is critical in ensuring that analysts across these areas move from a spreadsheet intensive model to adopting advanced statistical techniques.
- Creating a Predictive Analytics COE (Center of Excellence) that enables cross pollination of ideas across the fields of statistical modeling, data mining, text analytics and Big Data technology.
- Informing the regulatory authorities of one’s intentions to leverage data science across the spectrum of operations.
- Ensuring that issues related to data privacy, audit & compliance have been given a great deal of forethought.
- Identifying & developing human skills in the toolsets (across open source and closed source) that facilitate adapting to data lake based architectures. A large part of this is organically growing the talent pool by instituting a college recruitment process.
While this post concludes the current series on Data Science in financial services, it is my intention to explore each of the above data mining techniques in greater depth, as applied to specific business situations, in 2016 & beyond. That said, we will take a look at another pressing business & strategic concern – Cybersecurity in Banking – in the next series.