Experts Challenge Validity and Ethics of Crowdsourced AI Benchmarks Like LMArena (Chatbot Arena)

Following its incorporation, LMArena (formerly Chatbot Arena) has drawn expert critiques over the validity and ethics of its widely cited AI model leaderboard.

A growing chorus of academics and AI ethics specialists is casting doubt on the reliability and fairness of popular crowdsourced platforms used to rank artificial intelligence models, directly challenging a method increasingly favored by tech giants like OpenAI, Google, and Meta.

At the center of this debate is LMArena, the platform formerly known as Chatbot Arena. Its head-to-head comparison system has become influential, yet it now faces pointed questions about its scientific grounding and the unpaid labor that drives it.

Background: From Research Project to Funded Startup

LMArena’s approach has users interact with two anonymous AI models and select the preferred output. These votes feed an Elo rating system, a method originally used in chess to estimate relative skill levels, which produces the public rankings.
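
To make the mechanism concrete, the sketch below shows how a generic Elo update turns head-to-head votes into a ranking. It is a minimal illustration of the standard Elo formula, not LMArena’s actual code; the model names, the K-factor, and the starting rating are hypothetical choices made for the example.

```python
# Minimal sketch of a standard Elo update driven by pairwise votes.
# Generic illustration only, not LMArena's implementation; the K-factor,
# base rating, model names, and vote list are hypothetical.
from collections import defaultdict

K = 32          # step size per vote (hypothetical choice)
BASE = 1000.0   # starting rating for every model

ratings = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo/logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(winner: str, loser: str) -> None:
    """Shift both ratings toward the outcome of one head-to-head vote."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

# Three hypothetical votes produce a small leaderboard.
for winner, loser in [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]:
    record_vote(winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

In this scheme, an upset win against a highly rated model moves the ratings more than a win the system already expected, which is how sparse crowd votes can still separate models over time.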

Launched in early 2023 by researchers associated with UC Berkeley’s Sky Computing Lab, the public leaderboard quickly became a go-to resource, attracting a million monthly visitors and serving as a testing ground, sometimes even for unreleased models.

Recognizing the need for resources, the academic team—led by recent UC Berkeley postdoctoral researchers Anastasios Angelopoulos and Wei-Lin Chiang, alongside UC Berkeley professor and notable tech entrepreneur Ion Stoica (co-founder of Databricks and Anyscale)—established Arena Intelligence Inc. on April 18.

Operating under the LMArena brand, the new company aims to secure funding for expansion, stating “Becoming a company will give us the resources to improve LMArena significantly over what it is today.” This followed initial support from grants and donations by organizations including Google’s Kaggle platform, venture capital firm Andreessen Horowitz, and AI infrastructure company Together AI. Coinciding with the incorporation, a new beta website launched at beta.lmarena.ai, focused on improving speed and user experience.

Measuring What Matters? Validity Under Scrutiny

A central criticism asks whether LMArena’s crowdsourced voting actually measures meaningful model qualities or robustly reflects genuine user preference. Emily Bender, a linguistics professor at the University of Washington, highlighted concerns about the benchmark’s underlying methodology in a statement to TechCrunch.

“To be valid, a benchmark needs to measure something specific, and it needs to have construct validity — that is, there has to be evidence that the construct of interest is well-defined and that the measurements actually relate to the construct,” Bender asserted.

She noted a lack of evidence showing that LMArena’s method effectively captures preference, stating, “Chatbot Arena hasn’t shown that voting for one output over another actually correlates with preferences, however they may be defined.” These criticisms build on earlier scrutiny of the platform over the subjectivity of votes, potential demographic biases in its user base, dataset transparency, and differing evaluation conditions for various model types.

Concerns extend to how results might be interpreted or potentially misrepresented. Asmelash Teka Hadgu, co-founder of AI firm Lesan, suggested labs might be “co-opted” into using platforms like LMArena to “promote exaggerated claims.” He cited the controversy around Meta’s Llama 4 Maverick model, where the company was criticized for benchmarking a specifically tuned version that reportedly outperformed the standard version eventually released to the public. Hadgu advocates for dynamic, independently managed benchmarks tailored to specific professional domains, utilizing paid experts.

The Ethics of Volunteer Evaluation

The platform’s reliance on unpaid user contributions also draws ethical examination. Kristine Gloria, formerly of the Aspen Institute, drew parallels to the often-exploitative data-labeling industry, an area where labs such as OpenAI have previously faced questions. While seeing value in diverse perspectives, Gloria maintains that crowdsourced benchmarks “should never be the only metric for evaluation” and risk becoming unreliable.

Matt Frederikson, CEO of Gray Swan AI, which uses crowdsourcing for AI red teaming, conceded that public benchmarks “aren’t a substitute” for internal testing and paid expert analysis. “It’s important for both model developers and benchmark creators, crowdsourced or otherwise, to communicate results clearly to those who follow, and be responsive when they are called into question,” Frederikson advised.

LMArena Defends Its Role and Looks Ahead

LMArena co-founder Wei-Lin Chiang pushes back against some characterizations, positioning the platform’s purpose differently. “Our community isn’t here as volunteers or model testers,” Chiang told TechCrunch.

“People use LM Arena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects the community’s voice, we welcome it being shared.”

He attributed benchmark controversies to labs misinterpreting rules, not inherent design flaws, noting LMArena has updated policies for fairness. Co-founder Anastasios Angelopoulos added context to their goals, stating, “Our vision is that this will remain a place where everybody on the internet can come and try to chat and use AI, compare different providers and so on.”

This aligns with the company’s public declaration: “Our leaderboard will never be biased towards (or against) any provider, and will faithfully reflect our community’s preferences by design. It will be science-driven.”

As Arena Intelligence Inc. seeks funding and defines its business model—potentially charging companies for evaluations—it also plans a broad expansion beyond large language model comparisons. Specific initiatives mentioned include WebDev Arena, RepoChat Arena, and Search Arena, with future plans targeting vision models, AI agents, and dedicated AI red-teaming environments. This expansion arrives amid a wider industry discussion about evaluation methods, a point conceded by figures like OpenRouter CEO Alex Atallah, who agreed open testing alone “isn’t sufficient.”

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
