Researchers from the University of Chicago have found that GPT-4, the large language model (LLM) developed by OpenAI, can analyze financial statements with remarkable accuracy, often surpassing professional analysts. The findings, detailed in a working paper titled “Financial Statement Analysis with Large Language Models”, point to potential shifts in the financial analysis landscape.
The researchers have developed an interactive web application to demonstrate GPT-4's financial analysis capabilities. This tool is available to ChatGPT Plus subscribers, allowing users to explore the AI's predictive power in real time. The application aims to provide a hands-on experience of how GPT-4 processes financial data and generates insights. The research involved analyzing the financial statements of publicly listed enterprises, with a focus on predicting future earnings growth.
Chain-of-Thought Prompts Enhance Analysis
The study employed “chain-of-thought” prompts to guide GPT-4 through the analytical process typically used by financial analysts. Chain-of-thought prompting is a technique in prompt engineering designed to enhance the performance of language models on tasks that demand logic, calculation, and decision-making. This is achieved by organizing the input prompt to reflect the process of human reasoning.
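To make the idea concrete, here is a minimal sketch of how a chain-of-thought prompt for this task might be assembled. The wording, function name, and statement figures below are illustrative assumptions, not the exact prompt or data used in the paper:

```python
# Illustrative chain-of-thought prompt construction for financial analysis.
# The step wording below is hypothetical, not the paper's actual prompt.

def build_cot_prompt(balance_sheet: str, income_statement: str) -> str:
    """Assemble a prompt that walks the model through an analyst's steps."""
    steps = [
        "1. Identify notable trends across the statement periods.",
        "2. Compute key financial ratios (e.g., margins, liquidity, leverage).",
        "3. Interpret what the trends and ratios imply about the firm.",
        "4. Predict whether earnings will increase or decrease next period.",
    ]
    return (
        "You are a financial analyst. Reason step by step:\n"
        + "\n".join(steps)
        + "\n\nBalance sheet:\n" + balance_sheet
        + "\n\nIncome statement:\n" + income_statement
    )

prompt = build_cot_prompt(
    "Total assets: 500\nTotal liabilities: 300",
    "Revenue: 200\nNet income: 20",
)
```

The resulting string would then be sent to the model (for example, via OpenAI's chat API); structuring the instructions as explicit numbered steps is what distinguishes chain-of-thought prompting from simply asking for a prediction.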
By providing standardized, anonymized balance sheets and income statements, the researchers enabled GPT-4 to identify trends, compute ratios, and synthesize information. With this method, GPT-4 achieved a 60% accuracy rate in predicting future earnings growth, outperforming human analysts, who typically achieve 53-57% accuracy.
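As a toy illustration of the ratio-computation step, the sketch below derives a few standard ratios from made-up statement figures; the function name and the choice of ratios are assumptions for illustration, not the paper's methodology:

```python
# Toy example of the kind of ratios an analyst (or the model) might
# compute from a standardized statement. All figures are made up.

def key_ratios(total_assets: float, total_liabilities: float,
               revenue: float, net_income: float) -> dict:
    """Compute a few common ratios from headline statement figures."""
    return {
        "net_margin": net_income / revenue,            # profitability
        "leverage": total_liabilities / total_assets,  # balance-sheet risk
        "roa": net_income / total_assets,              # return on assets
    }

r = key_ratios(total_assets=500.0, total_liabilities=300.0,
               revenue=200.0, net_income=20.0)
# e.g., a net margin of 20/200 = 0.10, i.e., 10%
```

In the study's setup, computations like these are not hard-coded; the chain-of-thought prompt asks the model itself to carry them out from the raw statements.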
The study compared GPT-4's performance with that of human analysts and specialized machine learning models, finding that GPT-4's accuracy is on par with narrowly trained state-of-the-art models. This highlights the AI's ability to perform complex numerical reasoning and underscores the potential of large language models to handle specialized financial analysis tasks traditionally reserved for human experts.
Skepticism and Benchmark Concerns
Despite the promising results, some experts have expressed skepticism regarding the study's benchmarks. AI researcher Matt Holden questioned the validity of comparing GPT-4's performance to that of human analysts in stock picking, citing the use of an outdated artificial neural network model from 1989 as a benchmark. This concern points to the need for further validation and comparison with more contemporary models.
“Not sure about this framing. Seems misleading, no? The ‘median analyst’ can't actually successfully ‘pick stocks’ and beat a simple vanguard index fund, so why compare that with an LLM? I don't doubt an LLM can outperform median analysts at specific tasks like writing…” — Matt Holden (@holdenmatt), May 24, 2024
Numerical analysis has historically been a challenging area for language models, which excel in textual tasks but often struggle with numbers. Alex Kim, one of the study's co-authors, emphasized that LLMs typically derive their understanding of numbers from narrative context. The study's findings are therefore particularly noteworthy, as they demonstrate GPT-4's ability to perform complex judgments and computations in the numerical domain.
The ability of GPT-4 to match and exceed human analysts in financial predictions suggests a potential transformation in the role of financial analysts. While human expertise and judgment remain invaluable, tools like GPT-4 could augment analysts' capabilities, providing more accurate and efficient financial insights. This points to a future where AI and human analysts work in tandem to enhance financial decision-making.