Study Finds People Prefer AI Legal Advice When Given Blind, Even If They Can Spot It

Despite being able to spot AI-written text better than chance, people show more willingness to act on LLM legal advice than on a lawyer’s advice when the source is hidden, a new study finds.

Research presented at the CHI 2025 conference reveals a curious pattern in how laypeople react to legal advice: they report being more willing to act on guidance from Large Language Models (LLMs) like ChatGPT than from human lawyers, but only when the source isn’t identified.

This preference emerges even though people can differentiate between AI- and human-written text better than chance would predict, suggesting a complex relationship with automated expertise. The findings are detailed in the paper “Objection Overruled! Lay People can Distinguish Large Language Models from Lawyers, but still Favour Advice from an LLM.”

The findings stem from a series of three experiments involving 288 participants. Researchers presented legal scenarios covering common issues like traffic, planning, and property law, based on real-world online queries sourced from platforms like Reddit’s r/LegalAdviceUK. They then showed participants corresponding advice produced either by OpenAI’s ChatGPT-4o or by specialist UK lawyers, measuring either participants’ willingness to act on the advice or their confidence in identifying its source.

Preference Depends On Knowing The Source

The core finding emerged from comparing participant reactions based on whether they knew who wrote the advice. When participants were unaware of the source (tested in Experiment 1’s ‘source unknown’ group and confirmed in Experiment 2), they consistently expressed a significantly higher willingness to act on the LLM’s advice compared to the human lawyers’ advice. However, when the source was explicitly labeled as ‘LLM’ or ‘lawyer’ (Experiment 1’s ‘source known’ group), this difference vanished; willingness-to-act ratings became statistically similar for both sources.

The researchers suggest this discrepancy might stem from social factors when the source is revealed: people might feel obligated to rate advice from a human professional higher, conforming to perceived social expectations only when the source is made explicit.

Another factor could be the textual characteristics themselves. The study noted that while the LLM’s advice was generally shorter, its language registered as more complex based on Lix readability scores (a measure where higher scores indicate greater difficulty). The paper discusses that LLMs might present information with an air of confidence, contrasting with the more cautious, hedge-filled language typical of human lawyers concerned with liability and nuance.
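For context, the Lix score combines average sentence length with the proportion of long words. The short Python sketch below illustrates how such a score is typically computed; it is not the authors’ analysis code, and the sample sentences are invented.

```python
import re

def lix(text: str) -> float:
    """Compute a Lix readability score: higher values indicate harder text.
    Lix = (words / sentences) + 100 * (long words / words),
    where long words have more than six letters."""
    words = re.findall(r"[A-Za-z']+", text)  # simple tokenizer; digits are ignored
    # Sentence boundaries are approximated by ., :, ! and ?
    sentences = max(1, len(re.findall(r"[.:!?]+", text)))
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / sentences + 100 * long_words / len(words)

sample = ("You may appeal the penalty charge notice within 28 days. "
          "Provide photographic evidence of the signage.")
print(round(lix(sample), 1))
```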

Distinguishing AI, Yet Favoring It Blindly

Adding to the complexity, the third experiment showed participants aren’t completely oblivious to the source. Tasked with rating how likely each piece of advice was to be AI- versus human-generated (when unlabeled), participants performed significantly better than chance.

Using Receiver Operating Characteristic (ROC) analysis—a method from signal detection theory to measure discriminability independently of response bias—the study measured an Area Under the Curve (AUC) of 0.59. An AUC of 0.50 represents chance performance, while 1.0 is perfect accuracy, so 0.59 indicates a genuine, though imperfect, ability to discern the textual differences.
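Put simply, the AUC can be read as the probability that a randomly chosen piece of LLM-written advice receives a higher “AI-likeness” rating than a randomly chosen piece of lawyer-written advice. The Python sketch below, which uses invented ratings rather than the study’s data, shows one straightforward way such a figure can be computed.

```python
# Illustrative only: estimating an AUC from hypothetical "how AI-like is this advice?"
# ratings, where label 1 marks advice actually written by the LLM and 0 marks a lawyer.
labels  = [1, 1, 1, 0, 0, 0, 1, 0]
ratings = [0.8, 0.6, 0.4, 0.5, 0.3, 0.2, 0.7, 0.6]

def auc(labels, scores):
    """Probability that a randomly chosen AI item is rated more AI-like
    than a randomly chosen human item (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(labels, ratings))  # 0.5 = chance, 1.0 = perfect discrimination
```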

This presents a puzzle: even with some capacity to detect AI authorship, the default inclination when operating ‘blind’ appears to favor the LLM’s output.

The Reliability Shadow: Hallucination Risks

This preference for potentially undetectable AI advice is particularly relevant given the known reliability issues plaguing current LLMs. As these models expand into more sensitive advice-giving applications, like OpenAI’s recently updated Deep Research feature in ChatGPT, the risk of users acting on inaccurate information increases.

Reports following the mid-April 2025 release of OpenAI’s o3 and o4-mini models, for instance, pointed to higher rates of hallucination—generating plausible but false statements—compared to previous versions. OpenAI’s own data showed the o4-mini model produced incorrect or fabricated responses 48% of the time on one specific benchmark (PersonQA).

Research exploring why these powerful models hallucinate points to several potential factors in their training. One theory is that Reinforcement Learning (RL) can reward convincing fabrications if they lead to a seemingly correct final answer. Another is that Reinforcement Learning from Human Feedback (RLHF) struggles when human raters cannot easily verify the accuracy of complex intermediate reasoning steps. A third is that models sometimes lack access to their own step-by-step reasoning history between conversational turns.

Independent analysis by Transluce AI documented pre-release o3 models inventing actions they couldn’t perform. As Transluce researcher Neil Chowdhury suggested to TechCrunch, “Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines.” This backdrop of potential unreliability underscores the concerns raised by the CHI paper’s findings on user trust.

Broader Context and Moving Forward

The findings fit into a wider discussion about public trust in AI. While usage grows, a recent Pew Research Center study found continuing low public confidence in AI for sensitive guidance like medical or financial planning.

The CHI study adds nuance, suggesting this distrust might be more pronounced when the AI’s involvement is explicit, contrasting with the preference found in the blind scenario. The tendency to favor ‘blind’ AI advice also complicates efforts to ensure accountability when AI tools contribute to professional guidance.

The CHI paper authors propose improving AI literacy as one path forward, possibly using training methods developed for identifying misinformation. They also point to the importance of transparency measures, referencing regulations like the EU AI Act that mandate clear labeling of AI-generated content (Article 50) and risk disclosures for certain applications.

Ultimately, ensuring LLMs are clearly identifiable, rather than mimicking human experts, may be key to fostering more calibrated user trust. The full study materials, including data and analysis scripts used in the jsPsych/JATOS experiments, are available to researchers via the Open Science Framework.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.