The Evolution of Alternative Data in Finance is Being Driven by LLMs

I have worked on the business development side of alternative data (alt data) for nearly eight years, and have had a front-row seat to alt data’s evolution.

Alt data firms started to emerge about a decade ago, and the sector has since progressed through several phases. But everything changed in a flash with the emergence of large language models (LLMs) and ChatGPT.

What is Alternative Data?

The term “alternative” implies that this type of data is outside of the traditional scope of financial data (e.g., balance sheets, income statements, cash flow statements). No longer treated as a niche strategy, alternative data is becoming increasingly important in investment analysis, as it can help investors gain a competitive edge by providing unique insights and uncovering previously unknown trends or patterns. In some cases, this can be accomplished in milliseconds.

Traditional financial data has been around for over fifty years, with a focus on tabular data. Financial database provider Compustat has supplied information on thousands of publicly held companies since the 1950s, including detailed income statement, balance sheet, flow of funds, and supplemental data.

The demand for alternative data has been driven by the explosion of digital data, the availability of new technologies for processing and analyzing data, and the need for businesses to find new ways to gain insights and make better decisions. Many of the most successful firms now sit at the intersection of a Venn diagram with alt data on one side and traditional data on the other, integrating the two to draw deeper insights.

The emergence of alt data firms over a decade ago drove a watershed change in the amount and frequency of financial data available. The early adopters of alt data were systematic hedge funds and market makers. The financial industry as a whole embraced alternative data more slowly, with firms having to build teams to ingest, warehouse and interpret the data while integrating this data into their processes.

Firms relied heavily on third-party alt data vendors to normalize unstructured data and provide overlays like sentiment via natural language processing (NLP). The most significant leap in alt data came with the incorporation of AI in textual data, allowing the processing of massive amounts of text in any document format.
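To make the idea of a sentiment overlay concrete, here is a deliberately simple sketch of lexicon-based scoring. It is not how any particular vendor works: the word lists and the `sentiment_score` function are hypothetical, and production NLP systems are far more sophisticated about context, negation, and domain vocabulary.

```python
import re

# Hypothetical, illustrative word lists; real financial lexicons are much larger.
POSITIVE = {"growth", "beat", "strong", "record", "improved"}
NEGATIVE = {"decline", "miss", "weak", "loss", "impairment"}


def sentiment_score(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    total = pos + neg
    # Text with no lexicon words is treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total


headline = "Record revenue and strong growth offset a small loss."
print(sentiment_score(headline))  # 3 positive hits, 1 negative hit -> 0.5
```

A score like this can be attached to each document as an extra column, which is roughly what "overlay" means here: the unstructured text becomes a number that can sit alongside tabular data.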

The Four Phases of Alternative Data

Alt Data Phase A

The first phase of alt data saw the emergence of start-ups focused on unique data sets such as news, geographic and satellite data, social media data, consumer spending patterns, payment histories, and fraud detection data.

One of the key advantages of alternative data is that it can provide insights beyond what is available from traditional data sources. For example, social media data can provide real-time insights into consumer sentiment, while satellite imagery can provide insights into global supply chains. Credit card data, foot traffic data, and web traffic data can be used to predict sales in near real time, far ahead of quarterly reports.

Investors were beginning to see how alternative data could help businesses make better decisions and gain a competitive edge in their industries.

The alt data vendors during this initial phase specialized in a specific class of data. Alt data start-ups produced back-testable trading signals that were quickly adopted by quantitative systematic trading firms. This phase also saw the emergence of textual data parsing.

Alt Data Phase B

Textual alt data began to evolve into its second phase about five years ago when alt data firms started using their intellectual property (IP) to process long-standing, fundamental data sets such as SEC filings, earnings call transcripts and Q&A, and company CSR reports. More recently, firms have begun processing sell-side equity research and fund manager investment management letters.

In addition to the word “quantitative”, the new term “quantamental” was being used to describe participants who leveraged alt data in their fundamental research. (Quantamental is a merging of “quantitative” and “fundamental”.) The days of reading massive amounts of reports were starting to come to an end. Quantamental alt data still offers traceability: it can point back to the source documents if an analyst wishes to read the text behind a programmatic extraction.

The emergence of both quantitative and quantamental alternative data led to an increase in exchange trading volume driven by programmatic systematic trading. The data involved ranges from feeds measured in milliseconds, used by decision makers and algorithmic execution models, to long-term holdings data used by asset managers and hedge funds.

In trading, the term “systematic” refers to a method of trading that relies on a predefined set of rules to make investment decisions. A systematic trading approach uses computer algorithms and mathematical models to identify trading opportunities and execute trades automatically, without the need for human intervention. This approach was made famous by Fama-French factor modeling, which was first introduced in the 1990s.

Systematic trading can be based on a variety of factors, such as technical indicators, fundamental data, statistical analysis, and alternative data. The goal of systematic trading is to remove emotions and biases from investment decisions and improve consistency in performance. By using a systematic approach, traders can potentially generate higher returns and reduce the risks associated with subjective decision-making.
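As a rough illustration of what "a predefined set of rules" can look like, the sketch below implements a moving-average crossover, one of the simplest technical-indicator rules. The function name, window lengths, and price series are all made up for illustration; this is a toy example, not a strategy described in this article or a recommendation.

```python
from statistics import mean


def crossover_signals(prices, fast=3, slow=5):
    """Return one signal per price: 1 (long), -1 (short), 0 (no view).

    The rule: go long when the fast moving average is above the slow one,
    short when it is below. No signal is given until `slow` prices exist.
    """
    if fast >= slow:
        raise ValueError("fast window must be shorter than slow window")
    signals = []
    for i in range(len(prices)):
        if i + 1 < slow:
            signals.append(0)  # not enough history yet
            continue
        fast_avg = mean(prices[i + 1 - fast : i + 1])
        slow_avg = mean(prices[i + 1 - slow : i + 1])
        if fast_avg > slow_avg:
            signals.append(1)
        elif fast_avg < slow_avg:
            signals.append(-1)
        else:
            signals.append(0)
    return signals


# Hypothetical price series: a rise followed by a fall.
prices = [10, 11, 12, 13, 14, 13, 12, 11, 10, 9]
print(crossover_signals(prices))  # [0, 0, 0, 0, 1, 1, 1, -1, -1, -1]
```

The point is that the same inputs always produce the same decision, with no discretion involved. An alt data signal, such as a sentiment score, could feed a rule of the same shape in place of the price averages.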

Systematic trading has become a popular approach among professional traders and investors who are looking to achieve consistent returns, while minimizing their exposure to market volatility and human error.

Alt Data Phase C

Over the past year, a third phase of alt data has been emerging with firms focusing on proprietary data lakes. A data lake is a central repository that stores data in its native format, whether structured or unstructured.

Large market intelligence firms have deep data lakes. Many hedge funds, asset managers, banks, insurance firms, and others have large proprietary data lakes.

An example of a proprietary data lake would be a large hedge fund processing inbound emails from the sell side at the ticker level for alpha, or an asset manager or insurance company making sense of inbound chat, voice, email, URL search, and surveys.

NLP has evolved and is being customized for data sets within a firm’s proprietary data lake. Phase C of alternative data is still in its early stages and will continue to develop. But another phase, Phase Z, has taken over and drastically altered the conversation.

Alt Data Phase Z

Large language models (LLMs) are the reason for the emergence of a Phase Z. Phase Z represents a dramatic state change for alternative data, setting it apart from the essentially linear developments of Phases A to C.

As I have discussed, alternative data took ten years to evolve from Phase A to Phase C. But with the release of ChatGPT, the conversation in finance has been redirected overnight.

Finance is making the monumental leap from data mining through curated data sets to having any question answered in a clear and concise manner.

A large language model (LLM) is a type of artificial intelligence model designed to process and generate human-like language on a large scale. LLMs use deep learning techniques to analyze vast amounts of natural language data to learn the patterns and structures of human language, enabling them to generate coherent and contextually appropriate language responses.

The development of LLMs has revolutionized natural language processing tasks such as language translation, text summarization, and chatbot development. Some of the most well-known types of LLMs include GPT (Generative Pre-trained Transformer) models, BERT (Bidirectional Encoder Representations from Transformers) models, and Transformer-XL.

LLMs have become increasingly sophisticated and powerful in recent years, with some models boasting trillions of parameters. These models have demonstrated impressive language processing capabilities, such as generating realistic and contextually appropriate language responses to complex questions, translating text between multiple languages, and even generating coherent and engaging creative writing.

The major distinction to be made when it comes to LLMs and the financial world is that financial firms are interested in large language models run on specific data sets or data within their proprietary data lakes on-prem. They want “bumpers” on the underlying data sources in order to prevent bad input data. Firms also do not want their data to become part of the public domain. As a result, financial firms will not use ChatGPT. In effect, ChatGPT will itself become an alt data source.
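One way to picture the “bumpers” idea is as an allowlist applied before any text reaches the model. The sketch below is an assumption about how such a guardrail could be structured, not a description of any firm's system: `Document`, `APPROVED_SOURCES`, and `build_context` are hypothetical names, and the keyword-overlap ranking stands in for whatever retrieval method a real deployment would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # where the text came from, e.g. "sec_filings"
    text: str


# Hypothetical allowlist of vetted sources inside the firm's data lake.
APPROVED_SOURCES = frozenset(
    {"sec_filings", "earnings_transcripts", "internal_research"}
)


def build_context(documents, question, approved=APPROVED_SOURCES, max_docs=3):
    """Pick the documents an on-prem model would be allowed to see."""
    # The "bumper": anything from an unapproved source is dropped first.
    allowed = [doc for doc in documents if doc.source in approved]
    # Naive relevance ranking by shared words with the question.
    words = set(question.lower().split())
    ranked = sorted(
        allowed,
        key=lambda doc: len(words & set(doc.text.lower().split())),
        reverse=True,
    )
    return ranked[:max_docs]


docs = [
    Document("social_media", "ACME revenue will moon"),
    Document("sec_filings", "ACME reported revenue growth of five percent"),
    Document("earnings_transcripts", "Management discussed margins"),
]
context = build_context(docs, "What was ACME revenue growth?")
print([doc.source for doc in context])  # ['sec_filings', 'earnings_transcripts']
```

Because the filter runs before ranking, an unvetted document cannot reach the model no matter how relevant it looks, and nothing in the flow requires sending data outside the firm.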

Additionally, LLMs will primarily drive the quantamental research and chatbot space, delivering a concise, curated information flow without the data mining layer.

Alternative data firms that already have large data lakes and sophisticated NLP capabilities have made a quick leap to using LLMs. In the financial alternative data space, firms that have LLM capabilities will drive both the quantitative and quantamental spaces to new heights within the next 12 to 18 months.
