Recent research questions the reliability of GPT-4’s performance on the Uniform Bar Exam (UBE). When OpenAI released GPT-4 last year, the company said the model achieved human-level performance on various professional and academic benchmarks, such as passing a simulated bar exam and writing creative stories.
A study published in the journal Artificial Intelligence and Law highlights numerous methodological flaws that could compromise the integrity of the reported scores. Its author, Eric Martínez, a doctoral student in MIT’s brain and cognitive sciences department, argues that OpenAI’s estimate of GPT-4’s percentile rank on the UBE is likely exaggerated: the figures, presented as conservative, may not accurately represent the model’s genuine abilities, which is particularly problematic when they are treated as lower-bound estimates.
Methodological Flaws in GPT-4’s Bar Exam Performance
In its GPT-4 technical report, OpenAI claimed that the large language model (LLM) achieved human-level performance on a range of benchmarks, including passing a simulated bar exam. The study identifies significant methodological issues that undermine this claim: Martínez argues that the percentile ranks OpenAI reported for the UBE, although presented as conservative, likely overstate the model’s true abilities.
Hyperparameters and Prompting Techniques
Martínez analyzed how different hyperparameter settings, such as temperature adjustments, and various prompting techniques influence GPT-4’s performance on the Multistate Bar Examination (MBE). He found that changing the temperature setting had minimal impact on performance.
Conversely, using few-shot chain-of-thought prompting substantially improved outcomes compared to straightforward zero-shot prompting, underscoring the importance of prompt engineering in AI performance evaluation.
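To illustrate the distinction, the sketch below shows roughly what such a comparison could look like using the OpenAI Python client. The sample question, the prompt wording, and the model identifier are placeholders chosen for illustration; they are not the materials or settings used in Martínez’s study.

```python
# Illustrative sketch only: comparing zero-shot vs. few-shot chain-of-thought
# prompting on a bar-style multiple-choice question, at two temperatures.
# The question, prompts, and model name are placeholders, not the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = (
    "A landowner conveyed Blackacre 'to A for life, then to B and her heirs.' "
    "What interest does B hold?\n"
    "(A) Contingent remainder (B) Vested remainder (C) Executory interest (D) Reversion"
)

ZERO_SHOT = f"Answer the following MBE-style question with a single letter.\n\n{QUESTION}"

FEW_SHOT_COT = (
    "Answer MBE-style questions. Reason step by step, then give a final answer.\n\n"
    "Q: O conveys 'to A and his heirs.' What does A hold?\n"
    "Reasoning: The words of limitation 'and his heirs' create the largest estate, "
    "a fee simple absolute.\nAnswer: Fee simple absolute.\n\n"
    f"Q: {QUESTION}\nReasoning:"
)

def ask(prompt: str, temperature: float) -> str:
    """Send one prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4",                      # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,            # the hyperparameter varied in the analysis
    )
    return response.choices[0].message.content

# Vary temperature (reportedly a minor effect) and prompt style (reportedly a major effect).
for temp in (0.0, 0.7):
    print("zero-shot,    T =", temp, "->", ask(ZERO_SHOT, temp))
    print("few-shot CoT, T =", temp, "->", ask(FEW_SHOT_COT, temp))
```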
Martínez recently addressed a New York State Bar Association continuing legal education course, examining whether GPT-4’s bar exam results reflect its capability as a lawyer in a 90-minute dialogue with Luca CM Melchionna, chair of the Tech and Venture Law Committee of the New York State Bar Association’s Business Law Section.
“If I can draw the analogy of if you’re trying to pass a fitness test for the military and you need to get a 7-minute mile, you might not train very hard to get a faster time than that,” he said. “That doesn’t mean that you’re not capable of a much faster time, it might be more efficient to use resources elsewhere.”
“It seems the most accurate comparison would be against first-time test takers or to the extent that you think that the percentile should reflect GPT-4’s performance as compared to an actual lawyer, then the most accurate comparison would be to those who pass the exam,” Martinez pointed out.
Martínez also said that GPT-4’s jump from the 10th to the 90th percentile on the bar exam relative to its predecessor GPT-3.5 far exceeded its improvement on related exams such as the Law School Admission Test (LSAT), where the score rose by 40 percentage points.
In-depth Analysis of Performance and Grading
Martínez’s study scrutinizes the grading processes for the Multistate Performance Test (MPT) and Multistate Essay Examination (MEE) sections. While the study successfully replicated GPT-4’s MBE score, it identified several flaws in the methodologies used to grade the MPT and MEE sections, casting doubt on the reliability of the reported essay scores.
The study’s conclusions highlight four main findings regarding OpenAI’s claim of GPT-4’s 90th percentile UBE performance:
- Skewed Estimates: GPT-4’s score approaches the 90th percentile only when measured against February administrations of the Illinois Bar Exam. These estimates are skewed toward low scorers, because the majority of February test-takers are repeaters who failed the July administration and typically score lower than the general test-taking population.
- Lower Percentile with July Data: Using July data from the same source, GPT-4’s estimated performance drops to the 68th percentile, with below-average performance on the essay portion.
- Comparison with First-Time Test Takers: Measured against first-time test-takers, GPT-4’s performance falls to an estimated 62nd percentile overall, and the 42nd percentile on the essay portion.
- Performance Among Those Who Passed: When compared only with those who passed the exam, GPT-4’s performance falls to the 48th percentile overall, and a mere 15th percentile on essays.
Additionally, the study questions GPT-4’s reported overall UBE score of 298. Although Martínez replicated the MBE score of 158, methodological issues in grading the MPT and MEE sections cast doubt on the reported essay score of 140.
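The underlying statistical point, that a single raw score translates into very different percentile ranks depending on the reference population, can be illustrated with a short Python sketch. The score distributions and the passing threshold below are invented purely for demonstration; they are not data from the study or from real bar exam administrations.

```python
# Illustration only: how one raw score yields different percentile ranks
# against different reference populations. All distributions below are
# invented for demonstration and are NOT actual UBE data.
import random

random.seed(0)

def percentile_rank(score: float, population: list[float]) -> float:
    """Percent of the reference population scoring at or below `score`."""
    return 100.0 * sum(1 for s in population if s <= score) / len(population)

raw_score = 298  # GPT-4's reported UBE score

# Hypothetical reference groups with different score profiles:
february_takers = [random.gauss(260, 30) for _ in range(5000)]   # skewed toward repeat takers
first_time_takers = [random.gauss(285, 25) for _ in range(5000)]
passers_only = [s for s in first_time_takers if s >= 266]        # hypothetical passing threshold

for name, group in [("February takers", february_takers),
                    ("First-time takers", first_time_takers),
                    ("Passers only", passers_only)]:
    print(f"{name:>18}: {percentile_rank(raw_score, group):5.1f}th percentile")
```

The same score of 298 lands near the top of the weakest hypothetical group and much lower among the strongest, which is the mechanism behind the study’s four re-estimates above.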
Broader Implications for AI and the Legal Profession
These findings have significant implications for both the legal profession and AI research. For the legal profession, the study suggests that practicing lawyers might find a sense of relief, as GPT-4 performs worse than many lawyers on the essay portion, which closely resembles real-world legal tasks. However, the widely publicized “90th percentile” claim could lead to inappropriate reliance on GPT-4 for complex legal tasks, potentially increasing the risk of legal errors and professional malpractice.
For AI research, the study underscores the importance of rigorous and transparent evaluation methods. The transparency of AI capabilities research is crucial for ensuring the safe and reliable development of AI systems. Implementing stringent transparency measures can help identify potential warning signs of transformative AI progress and prevent false alarms or unwarranted complacency.
Martínez recommends that future studies focus on refining assessment methods and exploring the effects of various prompting techniques in greater detail. This approach could lead to more accurate and trustworthy evaluations of AI capabilities, particularly on complex tasks like the UBE.