An intelligent RA platform has a few core technology requirements (based on the above business requirements).
- A Single Data Repository – A shared data repository, called a Data Lake, is created to capture all client data (explained in more detail below) as well as external data. The RA data lake provides greater visibility into all data for a variety of stakeholders. Wealth Advisors access processed data to view client accounts; clients can access their own detailed positions, account balances, etc.; and the Risk group accesses the shared lake to process position, execution and balance data. Data Scientists (or Quants) who develop models for the RA platform also access this data to perform analysis on fresh data (from the current workday) or on historical data. All historical data is available for at least five years, much longer than before. Moreover, the Hadoop platform enables ingest of data across a range of systems despite their disparate data definitions and infrastructures. All data that pertains to trade decisions and their lifecycle needs to reside in a general enterprise storage pool running on HDFS (the Hadoop Distributed Filesystem) or a similar cloud-based filesystem. This repository is augmented by incremental feeds of intra-day trading activity data, streamed in using technologies like Sqoop, Kafka and Storm (a minimal ingest sketch follows this list).
- Customer Data Collection – Existing financial data across several categories is collected and aggregated into the data lake, ranging from customer data and reference data to market data and other client communications. All of this data can be ingested via an API or pulled into the lake from relational systems using connectors supplied with the RA Data Platform (a simple API ingest sketch appears after this list). Examples of data collected include the customer's existing brokerage accounts, savings accounts, behavioral finance surveys and questionnaires, and so on. The RA Data Lake stores all internal and external data.
- Algorithms – The core of the RA Platform is its data science algorithms. Whatever algorithms are used, a few critical workflows are common to them. The first, Asset Allocation, takes the customer's input in the “ADVICE” tab for each type of account and tailors the portfolio based on that input. The others include Portfolio Rebalancing and Tax Loss Harvesting (a simplified rebalancing sketch appears after this list).
- The RA platform should be able to store market data across years, both from a macro and from an individual portfolio standpoint, so that several key risk measures, such as volatility (e.g. position risk, residual risk and market risk), Beta, and R-Squared, can be calculated at multiple levels: for individual securities, for a specified index, and for the client portfolio as a whole (see the risk-measure sketch after this list).
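To make the ingest path above concrete, here is a minimal sketch of publishing an intra-day trade event onto a Kafka topic feeding the data lake. It uses the kafka-python client; the broker address, topic name and event fields are illustrative assumptions rather than a prescribed schema.

```python
# Minimal sketch: publishing intra-day trade activity onto a Kafka topic
# for ingestion into the RA data lake. The broker address, topic name and
# the message fields shown here are illustrative assumptions.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",  # assumed broker endpoint
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

trade_event = {
    "client_id": "C-10234",
    "account_id": "BRK-88821",
    "symbol": "VTI",
    "side": "BUY",
    "quantity": 25,
    "price": 268.40,
    "ts": datetime.now(timezone.utc).isoformat(),
}

# Keyed by client so all of a client's activity lands on the same partition.
producer.send("intraday-trades", key=b"C-10234", value=trade_event)
producer.flush()
```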
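For the customer data collection step, the sketch below shows one hedged way to pull account data from an internal REST API and land it as raw JSON in the lake over WebHDFS. The endpoint, landing path and the hdfs client library are assumptions; relational sources would instead come in through Sqoop-style connectors.

```python
# Sketch of pulling customer account data from an internal REST API and
# landing it in the lake via WebHDFS. The API endpoint, namenode address
# and landing path are illustrative assumptions.
import json

import requests
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

api_response = requests.get(
    "https://internal.example.com/api/v1/customers/C-10234/accounts",
    timeout=30,
)
api_response.raise_for_status()
accounts = api_response.json()

client = InsecureClient("http://namenode-host:9870", user="ra_ingest")
client.write(
    "/ra/lake/raw/customer_accounts/C-10234.json",
    data=json.dumps(accounts),
    overwrite=True,
    encoding="utf-8",
)
```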
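The rebalancing workflow can be illustrated with a deliberately simplified, self-contained sketch: given the target weights produced by asset allocation and the current holdings, compute the trades needed to bring the portfolio back in line. The asset names, weights and the 2% drift threshold are illustrative assumptions.

```python
# Simplified sketch of the Portfolio Rebalancing workflow: compare current
# holdings against the target allocation produced by the asset-allocation
# step and emit the buy/sell amounts needed. Asset names, weights and the
# 2% drift threshold are illustrative assumptions.

def rebalance(holdings: dict, target_weights: dict, drift_threshold: float = 0.02) -> dict:
    """Return the dollar amount to buy (+) or sell (-) per asset."""
    total_value = sum(holdings.values())
    orders = {}
    for asset, target_w in target_weights.items():
        current_value = holdings.get(asset, 0.0)
        current_w = current_value / total_value if total_value else 0.0
        # Only trade when the position has drifted past the threshold.
        if abs(current_w - target_w) > drift_threshold:
            orders[asset] = round(target_w * total_value - current_value, 2)
    return orders


if __name__ == "__main__":
    holdings = {"US_EQUITY": 62_000, "INTL_EQUITY": 18_000, "BONDS": 20_000}
    target = {"US_EQUITY": 0.55, "INTL_EQUITY": 0.25, "BONDS": 0.20}
    print(rebalance(holdings, target))
    # e.g. {'US_EQUITY': -7000.0, 'INTL_EQUITY': 7000.0}
```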
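The risk measures listed above can be computed directly from return series held in the lake. The sketch below shows annualized volatility, Beta and R-Squared for one asset against an index; the synthetic return data is purely illustrative.

```python
# Sketch of the risk measures mentioned above (volatility, Beta, R-Squared)
# computed from daily return series. The return arrays are synthetic; in the
# platform these would come from the historical data in the lake.
import numpy as np

def risk_measures(asset_returns: np.ndarray, index_returns: np.ndarray) -> dict:
    # Annualized volatility from daily returns (assumes ~252 trading days).
    volatility = asset_returns.std(ddof=1) * np.sqrt(252)
    # Beta: covariance of asset vs. index over the variance of the index.
    cov = np.cov(asset_returns, index_returns)
    beta = cov[0, 1] / cov[1, 1]
    # R-Squared: squared correlation with the index.
    r_squared = np.corrcoef(asset_returns, index_returns)[0, 1] ** 2
    return {"volatility": volatility, "beta": beta, "r_squared": r_squared}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    index = rng.normal(0.0004, 0.01, 252)             # stand-in index returns
    asset = 1.2 * index + rng.normal(0, 0.005, 252)   # correlated asset returns
    print(risk_measures(asset, index))
```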
Illustration: Architecture of a Robo-Advisor (RA) Platform
The overall logical flow of data through the system is as follows –
- Information sources are depicted at the left. These encompass a variety of institutional, system and human actors, potentially sending thousands of real-time messages per hour or delivering data over batch feeds.
- A highly scalable messaging system brings these feeds into the RA Platform architecture, normalizes them and sends them on for further processing. Apache Kafka is a good choice for this tier. Real-time data is published by a range of systems over Kafka queues. Each transaction could potentially include hundreds of attributes that can be analyzed in real time to detect business patterns. Kafka's integration with Apache Storm is leveraged to read one value at a time and persist the data into an HBase cluster (a consumer-to-HBase sketch follows this list). In a modern data architecture built on Apache Hadoop, Kafka (a fast, scalable and durable message broker) works in combination with Storm, HBase (and Spark) for real-time analysis and rendering of streaming data.
- Trade data is thus streamed into the platform (on a T+1 basis), which ingests, collects, transforms and analyzes core information in real time. The analysis can involve both simple and complex event processing, based on pre-existing rules defined in a rules engine that is invoked from Apache Storm. A Complex Event Processing (CEP) tier can process these feeds at scale to understand the relationships among them, where those relationships are defined by business owners in a non-technical language or by developers in a technical one (a simplified rule-evaluation sketch follows this list). Apache Storm integrates with Kafka to process the incoming data.
- For real-time or batch analytics, Apache HBase provides near real-time, random read and write access to tables (or ‘maps’) storing billions of rows and millions of columns. Once this rapidly and continuously growing dataset from the information producers is stored, very fast lookups can be performed for analytics irrespective of the data size.
- Data that has analytic relevance and needs to be kept for offline or batch processing can be stored using the Hadoop Distributed Filesystem (HDFS) or an equivalent filesystem such as Amazon S3, EMC Isilon or Red Hat Gluster. The idea is to deploy Hadoop-oriented workloads (MapReduce or machine learning) directly on the data layer, in order to perform analytics on small, medium or massive data volumes over a period of time (a batch-analytics sketch appears after this list). Historical data can be fed into the machine learning models created above and commingled with streaming data as discussed in step 1.
- Horizontal scale-out (read: cloud-based IaaS) is the preferred deployment approach, as it helps the architecture scale linearly as the loads placed on the system increase over time. This approach enables the platform to distribute the load dynamically across a cluster of cloud-based servers based on trade data volumes.
- It is recommended to take an incremental approach to building the RA platform; once all data resides in a general enterprise storage pool, it becomes accessible to many analytical workloads including trade surveillance, risk, compliance, etc. A shared data repository across multiple lines of business provides more visibility into all intra-day trading activities. Data can also be fed into downstream systems in a seamless manner using technologies like Sqoop, Kafka and Storm. The results of processing and queries can be exported in various data formats: simple CSV/txt, more optimized binary formats, JSON, or custom formats via a custom SerDe. Additionally, with Hive or HBase, data within HDFS can be queried via standard SQL using JDBC or ODBC (a Hive query sketch follows this list). The results are returned as standard relational DB data types (e.g. String, Date, Numeric, Boolean). Finally, REST APIs in HDP natively support both JSON and XML output by default.
- Operational data across a range of asset classes, risk types and geographies is thus available to investment analysts during the entire trading window, while markets are still open, enabling them to reduce the risk of that day's trading activities. The advantages of this approach are twofold: existing architectures can typically hold only a limited set of asset classes within a given system, which means data is assembled for risk processing only at the end of the day; in addition, historical data is often not available in sufficient detail. Hadoop accelerates a firm's speed-to-analytics and also extends its data retention timeline.
- Apache Atlas is used to provide data governance capabilities in the platform, using both prescriptive and forensic models enriched by a given business's data taxonomy and metadata. This allows for tagging of trade data across the different businesses' data views, which is a key requirement for good data governance and reporting (a hedged tagging sketch follows this list). Atlas also provides audit trail management as data is processed in a pipeline in the lake.
- Another important capability that Big Data/Hadoop can provide is the establishment and adoption of a lightweight Entity ID service, which aids dramatically in the holistic viewing and audit tracking of trades. The service consists of entity assignment for both institutional and individual traders. The goal is to get each target institution to propagate the Entity ID back into its trade booking and execution systems, so that transaction data flows into the lake with this ID attached, providing a way to build a Client 360 view (a minimal sketch follows this list).
- Output data elements can be written out to HDFS, and managed by HBase. From here, reports and visualizations can easily be constructed. One can optionally layer in search and/or workflow engines to present the right data to the right business user at the right time.
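The streaming tier described above (Kafka feeding Storm/HBase) can be sketched in miniature as a consumer that persists each trade event into an HBase table for fast random reads. This uses kafka-python and the happybase Thrift client; the hostnames, the 'trades' table and the 'cf' column family are illustrative assumptions, and in the actual platform this logic would live inside Storm bolts.

```python
# Sketch of the streaming tier: consume trade events from the Kafka topic and
# persist them into an HBase table for near real-time random reads. Hostnames,
# table name and column family are illustrative assumptions.
import json

import happybase
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "intraday-trades",
    bootstrap_servers="kafka-broker:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

connection = happybase.Connection("hbase-thrift-host")  # assumed Thrift gateway
table = connection.table("trades")

for message in consumer:
    event = message.value
    # Row key: client id + timestamp, so a client's activity is co-located.
    row_key = f"{event['client_id']}#{event['ts']}".encode("utf-8")
    table.put(row_key, {
        b"cf:symbol": event["symbol"].encode("utf-8"),
        b"cf:side": event["side"].encode("utf-8"),
        b"cf:quantity": str(event["quantity"]).encode("utf-8"),
        b"cf:price": str(event["price"]).encode("utf-8"),
    })

# Near real-time random read: fetch everything stored for one client on a day.
# for key, data in table.scan(row_prefix=b"C-10234#2024-05-01"):
#     print(key, data)
```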
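The rules-engine/CEP step can likewise be sketched as plain predicates evaluated over each incoming event. In the platform these rules would be authored by business owners and executed inside Storm; the rule names and thresholds below are illustrative assumptions.

```python
# Simplified stand-in for the rules/CEP tier: evaluate a set of named
# predicates over each incoming trade event. Rule names and thresholds are
# illustrative assumptions defined here in code.

RULES = [
    ("large_order", lambda e: e["quantity"] * e["price"] > 50_000),
    ("closes_position", lambda e: e["side"] == "SELL" and e.get("closes_position", False)),
]

def evaluate(event: dict) -> list[str]:
    """Return the names of all rules triggered by a single trade event."""
    return [name for name, predicate in RULES if predicate(event)]


if __name__ == "__main__":
    event = {"client_id": "C-10234", "symbol": "VTI", "side": "SELL",
             "quantity": 300, "price": 268.40, "closes_position": True}
    print(evaluate(event))   # ['large_order', 'closes_position']
```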
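For the batch/offline layer, the sketch below reads historical trade data from HDFS with Spark and computes per-client aggregates that could feed the machine learning models. The HDFS paths and column names are assumptions.

```python
# Sketch of the batch/offline tier: read historical trade data from HDFS with
# Spark and compute per-client daily aggregates. Paths and columns are
# illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ra-historical-analytics").getOrCreate()

trades = spark.read.parquet("hdfs:///ra/lake/trades/")  # assumed lake path

daily_turnover = (
    trades
    .withColumn("notional", F.col("quantity") * F.col("price"))
    .groupBy("client_id", F.to_date("ts").alias("trade_date"))
    .agg(F.sum("notional").alias("turnover"),
         F.count("*").alias("trade_count"))
)

daily_turnover.write.mode("overwrite").parquet("hdfs:///ra/lake/analytics/daily_turnover/")
```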
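SQL access to the lake via Hive can be sketched with the PyHive client (the same query could be issued over standard JDBC or ODBC); the host, database and table names are illustrative assumptions.

```python
# Sketch of querying the lake through HiveServer2 with PyHive. Host, database
# and table names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, database="ra_lake")
cursor = conn.cursor()

cursor.execute("""
    SELECT client_id,
           SUM(quantity * price) AS turnover
    FROM   trades
    WHERE  trade_date = '2024-05-01'
    GROUP  BY client_id
""")

for client_id, turnover in cursor.fetchall():
    print(client_id, turnover)
```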
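Tagging data for governance in Apache Atlas can be sketched against its V2 REST API, as below; the Atlas host, credentials, entity GUID placeholder and the 'TRADE_DATA' classification name are all illustrative assumptions.

```python
# Hedged sketch of tagging a lake dataset in Apache Atlas for governance and
# reporting, assuming Atlas's V2 REST API. Host, credentials, the entity GUID
# and the classification name are illustrative assumptions.
import requests

ATLAS_URL = "http://atlas-host:21000"
ENTITY_GUID = "REPLACE-WITH-ENTITY-GUID"   # GUID of the table/dataset to tag

response = requests.post(
    f"{ATLAS_URL}/api/atlas/v2/entity/guid/{ENTITY_GUID}/classifications",
    json=[{"typeName": "TRADE_DATA"}],
    auth=("admin", "admin"),
)
response.raise_for_status()
```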
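Finally, a lightweight Entity ID service can be sketched as a small resolver that mints a stable identifier per counterparty, so every booking and execution record entering the lake carries the same ID. The normalization and in-memory storage here are deliberately minimal and illustrative.

```python
# Sketch of a lightweight Entity ID service: assign a stable identifier to each
# institutional or individual counterparty so records flowing into the lake can
# be joined into a Client 360 view. Normalization and storage are minimal and
# illustrative.
import uuid

class EntityIDService:
    def __init__(self):
        self._ids: dict[str, str] = {}   # normalized name -> entity id

    @staticmethod
    def _normalize(name: str, country: str) -> str:
        return f"{name.strip().upper()}|{country.strip().upper()}"

    def resolve(self, name: str, country: str) -> str:
        """Return the existing Entity ID for a counterparty, or mint a new one."""
        key = self._normalize(name, country)
        if key not in self._ids:
            self._ids[key] = f"ENT-{uuid.uuid4().hex[:12].upper()}"
        return self._ids[key]


if __name__ == "__main__":
    svc = EntityIDService()
    a = svc.resolve("Acme Capital LLC", "US")
    b = svc.resolve(" acme capital llc ", "us")
    assert a == b   # the same counterparty always maps to the same Entity ID
    print(a)
```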
Conclusion…
As one can clearly see, though automated investing methods are still in the early stages of maturity, they hold out a tremendous amount of promise. As they are unmistakably the next big trend in the WM industry, industry players should begin developing such capabilities.