Risk management is not just a defensive business imperative; the best-managed banks use it to deploy their capital for the best possible business outcomes. The last few posts have more than set the stage from a business and regulatory perspective. This one will take a bit of a deep dive into the technology.
Existing data architectures are siloed, with bank IT creating or replicating data marts or warehouses to feed internal lines of business. These data marts are then accessed by custom reporting applications, copying data many times over, which leads to massive data management and governance challenges.
Furthermore, the explosion of new types of data in recent years has put tremendous pressure on the financial services datacenter, both technically and financially, and an architectural shift is underway in which multiple LOBs can consolidate their data into a unified data lake.
Banking data architectures and how Hadoop changes the game
Most large banking infrastructures, on a typical day, process millions of derivative trades. The main implication is that there are a large number of data inserts and updates to handle. Once the data is loaded into the infrastructure, complex mathematical calculations need to be performed on it in near real time to calculate intraday positions. Most banks use techniques like Monte Carlo modeling and other computational simulations to build and calculate these exposures. Hitherto, these techniques were extremely expensive in terms of both the hardware and the software needed to run them. Nor were tools available that supported a wide variety of data processing paradigms – batch, interactive, real-time and streaming.
The Data Lake supports multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set, which is the unified repository of all financial data. It also enables users to transform and view data in multiple ways (across various schemas) and to deploy closed-loop analytics applications that bring time-to-insight closer to real time than ever before.
Figure 1 – From Data Silos to a Data Lake
Also, with the advent and widespread availability of open source software like Hadoop (by which I mean a full Hadoop platform ecosystem such as Hortonworks Data Platform (HDP), with its support for multiple computing frameworks like Storm, Spark, Kafka, MapReduce and HBase), which can turn a cluster of commodity x86-based servers into a virtual mainframe, cost is no longer a limiting factor. The application ecosystem of a financial institution can now be a deciding factor in how data is created, ingested, transformed and exposed to consuming applications.
Thus clusters of inexpensive x86 servers running Linux and Hortonworks Data Platform (HDP) provide an extremely cost-effective environment for deploying and running simulations and stress tests.
Figure 2 – Hadoop now supports multiple processing engines
Finally, an HDP cluster with tools like Hadoop, Storm, and Spark is not limited to one purpose, like older dedicated-computing platforms. The same cluster you use for running stress tests can also be used for text mining, predictive analytics, compliance, fraud detection, customer sentiment analysis, and many other purposes. This is a key point: once siloed data is brought into a data lake, it is available for running multiple business scenarios – limited only by the overall business scope.
Now, typical risk management calculations require that, for each time point and for each product line, separate simulations are run to derive higher-order results. Once this is done, the resulting intermediate data needs to be aligned to collateral valuations, derivative settlement agreements and any other relevant regulatory data to arrive at a final portfolio position. Further, there needs to be a mechanism to pull in reference data that must be available for a given set of clients and/or portfolios.
The following are the broad architectural goals for any such implementation –
* Provide a centralized location for house-wide aggregation and subsequent analysis of market data, counterparties, liabilities and exposures
* Support the execution of liquidity analysis on an intraday or multi-day basis while providing long-term data retention capabilities
* Provide strong but optional capabilities for layering in business workflow and rule-based decisioning as an outcome of analysis
At the same time, long-term positions need to be calculated, for stress tests for instance, typically using at least 12 months of data pertaining to a given product set. Finally, the two streams of data may be compared to produce a CVA (Credit Valuation Adjustment) value.
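For reference, a standard simplified (unilateral) discrete-time approximation of CVA sums discounted expected exposure over the time buckets, weighted by the counterparty's default probability and loss given default:

```latex
% Simplified unilateral CVA. R is the recovery rate, DF(t_i) the discount
% factor, EE(t_i) the expected exposure at time t_i, and PD(t_{i-1}, t_i)
% the counterparty's probability of default in that interval.
\mathrm{CVA} \approx (1 - R) \sum_{i=1}^{N} DF(t_i)\, EE(t_i)\, PD(t_{i-1}, t_i)
```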
The average investment bank deals with potentially 50 to 80 future dates and up to 3,000 different market paths, so the computational resource demands are huge. Reports are produced daily, and under special conditions multiple times per day. What-if scenarios with strawman portfolios can also be run to assess regulatory impacts and to evaluate business options.
Figure 3 – Overall Risk Mgmt Workflow
As can be seen from the above, computing arbitrary functions on a large and growing master dataset in real time is a daunting problem (to quote Nathan Marz). There is no single product or technology approach that satisfies all business requirements. Instead, one has to use a variety of tools and techniques to build a complete Big Data system. I present two approaches, both of which have been tried and tested in enterprise architectures.
Solution Patterns
Pattern 1 – Integrate a Big Data Platform with an In-Memory Data Grid
Based on the business requirements above, two distinct data tiers can be identified –
- It is very clear from the above that data needs to be pulled in near real time and accessed in a low-latency pattern, with calculations performed on that data. The design principle here needs to be “Write Many and Read Many”, with an ability to scale out tiers of servers. In-memory data grids (IMDGs) are very suitable for this use case as they support a very high write rate. IMDGs like GemFire and JBoss Data Grid (JDG) are highly scalable and proven implementations of distributed data grids that give users the ability to store, access, modify and transfer extremely large amounts of distributed data. Further, these products offer a universal namespace for applications to pull in data from different sources for all of the above functionality. A key advantage here is that data grids can pool memory and scale out across a cluster of servers in a horizontal manner. Further, computation can be pushed into the tiers of servers running the data grid, as opposed to pulling data into the computation tier.
To meet the needs for scalability, fast access and user collaboration, data grids support replication of datasets to points within the distributed data architecture. The use of replicas allows multiple users faster access to datasets and preserves bandwidth, since replicas can often be placed strategically close to, or within, the sites where users need them. IMDGs support WAN replication, clustering and out-of-the-box replication, as well as clients in multiple languages.
- The second data access pattern that needs to be supported is storage for data ranging from next-day to months to years. This is typically large-scale historical data. The primary data access principle here is “Write Once, Read Many”. This layer contains the immutable, constantly growing master dataset stored on a distributed file system like HDFS. The HDFS implementation in HDP 2.x offers all the benefits of a distributed filesystem while eliminating the SPOF (single point of failure) issue with the NameNode in an HDFS cluster. With batch processing (MapReduce), arbitrary views, so-called batch views, are computed from this raw dataset, so Hadoop (MapReduce on YARN) is a perfect fit for the concept of the batch layer (a minimal example of such a batch view follows below). Besides being a storage mechanism, the data stored in HDFS is formatted in a manner suitable for consumption by any tool within the Apache Hadoop ecosystem, like Hive, Pig or Mahout.
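As a minimal illustration of the batch-layer idea, the PySpark sketch below reads raw trade records from HDFS and precomputes an exposure-by-counterparty view for downstream consumption. The paths, record layout and aggregation are illustrative assumptions, not a prescribed design.

```python
# Minimal sketch of a "batch view" computed over raw data in HDFS.
# Paths and record layout are illustrative assumptions.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("exposure-batch-view"))

# Immutable master dataset: raw CSV trade records landed on HDFS.
# Assumed layout: business_date,counterparty_id,mark_to_market
raw_trades = sc.textFile("hdfs:///data/raw/trades/*.csv")

def parse(line):
    business_date, counterparty_id, mtm = line.split(",")
    return ((business_date, counterparty_id), float(mtm))

# Batch view: total exposure per counterparty per business date.
exposure_view = raw_trades.map(parse).reduceByKey(lambda a, b: a + b)

# Persist the precomputed view for downstream tools (Hive, BI, etc.).
(exposure_view
 .map(lambda kv: "%s,%s,%.2f" % (kv[0][0], kv[0][1], kv[1]))
 .saveAsTextFile("hdfs:///data/views/exposure_by_counterparty"))
```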
Figure 4 – System Architecture
The overall system workflow is as follows –
- Data is injected into the architecture either in an event-based manner or in batches. HDP supports multiple ways of achieving this: one could use a high-performance ingest layer like Kafka or an ESB like Mule for the batch updates, or insert data into the IMDG directly via a Storm layer (a minimal ingestion sketch follows this workflow). For financial data stored in RDBMSs, one can write a simple CacheLoader to prime the grid. Each of these approaches offers advantages to the business. For instance, using CEP one can derive real-time insights via predefined business rules and optionally spin up new workflows based on those rules. Once the data is inserted into the grid, the grid can automatically distribute it via consistent hashing. Once the data is all there, fast incremental algorithms are run in memory and the resulting data can be stored in an RDBMS for querying by analytics/visualization applications.
- Such intermediate data, or data suitable for modeling or simulation, can also be streamed into the long-term storage layer.
- Data is loaded into different partitions in the HDFS layer in two different ways: a) from the data sources themselves directly; b) from the JDG layer via a connector.
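To make the ingestion step concrete, here is a minimal, hypothetical sketch using the kafka-python client to publish trade events onto a Kafka topic, from which a Storm topology or a grid CacheLoader could consume them; the broker addresses, topic name and message fields are assumptions.

```python
# Hypothetical sketch of event-based ingestion into Kafka.
# Broker addresses, topic name and message fields are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

trade_event = {
    "trade_id": "T-000123",
    "counterparty_id": "CP-42",
    "product": "IR_SWAP",
    "notional": 10000000,
    "mark_to_market": -125000.0,
}

# Downstream, a Storm topology (or a grid CacheLoader) consumes this topic
# and writes the events into the in-memory data grid and/or HDFS.
producer.send("trade-events", value=trade_event)
producer.flush()
```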
Pattern 2 – Utilize the complete feature set present in a Big Data Platform like Hortonworks HDP 2.3
*This integration was demonstrated at Red Hat Summit by Hortonworks, Mammoth Data and Red Hat, and is captured in detail at http://mammothdata.com/big-data-open-source-risk-managment/*
The headline is self-explanatory, but let's briefly examine how you might perform a simple Monte Carlo calculation using Apache Spark. Spark is the ideal choice here due to the iterative nature of these calculations, as well as the performance benefits of running them in memory. Spark enables major performance gains: applications in Hadoop clusters running Spark tend to run up to 100 times faster in memory and up to 10 times faster even on disk.
Apache Spark provides a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data etc) as well as the source of data (batch v. real-time streaming data).
A major advantage of using Spark is that it allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern while supporting in-memory data sharing across DAGs, so that data can be shared across jobs.
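To make this concrete, here is a minimal Monte Carlo sketch using the PySpark RDD API; it simulates one-day returns for a single position and reads off a 99% VaR from the simulated loss distribution. The position value, drift, volatility and path count are made-up parameters, and the example is deliberately far simpler than a production risk engine.

```python
# A minimal Monte Carlo VaR sketch using the PySpark RDD API.
# Position value, drift, volatility and path count are illustrative
# assumptions, not calibrated inputs.
import random
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("simple-monte-carlo-var"))

POSITION_VALUE = 1000000.0   # current value of the position
MU, SIGMA = 0.0, 0.02        # assumed daily drift and volatility
NUM_PATHS = 1000000          # number of simulated market paths
CONFIDENCE = 0.99

def simulate(_):
    """Simulate the one-day P&L for a single market path."""
    daily_return = random.gauss(MU, SIGMA)
    return POSITION_VALUE * daily_return

# Distribute the paths across the cluster and simulate them in parallel.
pnl = sc.parallelize(range(NUM_PATHS), numSlices=200).map(simulate)

# 99% VaR: the loss not exceeded in 99% of the simulated paths.
losses = pnl.map(lambda x: -x)
tail = losses.top(int(NUM_PATHS * (1 - CONFIDENCE)))
var_99 = tail[-1]

print("Simulated 99% one-day VaR: {:.2f}".format(var_99))
```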
One important metric used in financial modeling is LVaR – Liquidity-Adjusted Value at Risk. As we discussed in earlier posts, an important form of risk is liquidity risk, and LVaR is one important metric to represent it. “Value at Risk” or VaR is simply the loss threshold that a given portfolio is not expected to exceed over a given period of time, at a given confidence level.
For mathematical details of the calculation, please see Extreme value methods with applications to finance by S.Y. Novak.
Now, liquidity risk is divided into two types: funding liquidity risk (can we make the payments on this position or liability?) and market liquidity risk (can we exit this position if the market suddenly turns illiquid?).
The incorporation of external liquidity risk into a VaR results in LVaR. This essentially means adjusting the time period used in the VaR calculation, based on the expected length of time required to unwind the position.
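One common simplified way to express this horizon adjustment, assuming returns are i.i.d. so that risk scales with the square root of time, is:

```latex
% Simplified holding-period adjustment for LVaR, assuming i.i.d. returns
% (square-root-of-time scaling). VaR_{1d} is the one-day VaR and
% T_unwind is the expected number of days needed to unwind the position.
\mathrm{LVaR} = \mathrm{VaR}_{1d} \times \sqrt{T_{\mathrm{unwind}}}
```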
Given that we need to calculate LVaR for a portfolio, we can accomplish this in a distributed fashion using Spark by doing the following (a minimal PySpark sketch of these steps follows the list):
- Implement the low-level LVaR calculation in Java, Scala, or Python. With Spark it is straightforward to work with code written in any of these three languages, and Spark ships with a built-in set of over 80 high-level operators.
- Data ingestion – all kinds of financial data (position data, market data, existing risk data, General Ledger, etc.) is batched in, i.e. read from flat files stored in HDFS, or the initial values can be loaded from a relational database or other persistent store via Sqoop.
- Spark code written in Scala, Java or Python can leverage the database support provided by those languages. Once the data is read in, it resides in what Spark calls an RDD – a Resilient Distributed Dataset. A convenient representation of the input data, which leverages Spark’s fundamental processing model, would include in each input record the portfolio item details, along with the input range and probability distribution information needed for the Monte Carlo simulation.
- If you have streaming data requirements, you can optionally leverage Kafka integration with Apache Storm to read one value at a time and persist the data into an HBase cluster. In a modern data architecture built on Apache Hadoop, Kafka (a fast, scalable and durable message broker) works in combination with Storm, HBase and Spark for real-time analysis and rendering of streaming data. Kafka has been used to carry everything from geospatial data from fleets of long-haul trucks, to financial data, to sensor data from HVAC systems in office buildings.
- The next step is to perform a transformation on each input record (representing one portfolio item) which runs the Monte Carlo simulation for that item. The distributed nature of Spark means that each simulation runs in its own task on some worker node in the cluster.
- After each individual simulation has run, another transform is run over the RDD to perform any aggregate calculations, such as summing the risk across all instruments in the portfolio at each given probability threshold.
- Output data elements can be written out to HDFS, or stored to a database like Oracle, HBase, or Postgres. From here, reports and visualizations can easily be constructed.
- Optionally, workflow engines can be layered in to present the right data to the right business user at the right time.
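Putting these steps together, the sketch below strings the whole pipeline into a single PySpark job: it reads hypothetical portfolio items from HDFS, runs a per-item Monte Carlo simulation with a square-root-of-time unwind adjustment, aggregates across the portfolio, and writes results back to HDFS. The file layout, the simulation model and the simple summation used for aggregation are all illustrative assumptions rather than a production implementation.

```python
# Hypothetical end-to-end sketch of the distributed LVaR workflow above.
# The file layout, the simulation model and the unwind-time adjustment
# are illustrative assumptions, not a production implementation.
import math
import random
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("portfolio-lvar"))

NUM_PATHS = 10000
CONFIDENCE = 0.99

def parse(line):
    """Assumed CSV layout: item_id,value,volatility,unwind_days."""
    item_id, value, vol, unwind_days = line.split(",")
    return (item_id, float(value), float(vol), float(unwind_days))

def simulate_lvar(item):
    """Run the Monte Carlo simulation for one portfolio item."""
    item_id, value, vol, unwind_days = item
    losses = sorted(-value * random.gauss(0.0, vol) for _ in range(NUM_PATHS))
    var_1d = losses[int(CONFIDENCE * NUM_PATHS) - 1]
    # Square-root-of-time adjustment for the expected unwind period.
    return (item_id, var_1d * math.sqrt(unwind_days))

# 1. Ingest: portfolio items batched in as flat files on HDFS.
portfolio = sc.textFile("hdfs:///data/risk/portfolio_items.csv").map(parse)

# 2. Transform: one Monte Carlo simulation per portfolio item, each running
#    as a task on some worker node in the cluster.
item_lvar = portfolio.map(simulate_lvar)

# 3. Aggregate: sum the adjusted risk numbers across the portfolio
#    (a crude aggregation that ignores diversification effects).
portfolio_lvar = item_lvar.map(lambda kv: kv[1]).sum()

# 4. Output: write per-item results back to HDFS for reporting tools.
(item_lvar
 .map(lambda kv: "%s,%.2f" % kv)
 .saveAsTextFile("hdfs:///data/risk/lvar_results"))

print("Aggregate portfolio LVaR: {:.2f}".format(portfolio_lvar))
```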
Whether you choose one solution pattern over the other, or mix both of them, depends on your business requirements and other characteristics, including –
- The existing data architecture and the formats of the data (structured, semi-structured or unstructured) stored in those systems
- The governance process around the data
- The speed at which the data flows into the application and the velocity at which insights need to be gleaned
- The data consumers who need to access the final risk data, whether they use a BI tool, a web portal, etc.
- The frequency at which this data needs to be processed to produce risk reports, i.e. hourly, intraday, or (dare I say?) near real time and ad hoc