While companies explore plans for AI agents to take over complex jobs, even aiming to automate tens of thousands of federal roles, a dose of reality comes from Carnegie Mellon University researchers.
Their benchmark, which simulates a software company staffed entirely by AI, found that current agents struggle mightily with realistic professional tasks. In the study, dubbed “TheAgentCompany,” even the top performer completed less than a quarter of its assigned duties successfully, challenging narratives suggesting AI is on the verge of widespread job automation.
TheAgentCompany benchmark, described in an arXiv paper, placed AI agents within a detailed simulation of a small software firm. This environment included internal websites built on open-source platforms, namely GitLab (code hosting), OwnCloud (office suite), Plane (project management), and RocketChat (internal communication), alongside a sandboxed local workspace with terminal and coding access.
Agents, run primarily on the OpenHands agent framework (an open-source system for building agents that can operate computer applications), were assigned 175 tasks spanning software engineering, finance, HR, project management, and administrative duties. These tasks were designed from real-world job descriptions in sources like the O*NET database and from the authors’ own experience.
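The paper’s exact task format isn’t reproduced here, but conceptually each task bundles instructions, the services the agent must use, and a set of gradable checkpoints. The Python sketch below is purely illustrative; the class and field names are assumptions for explanation, not TheAgentCompany’s actual schema.

```python
# Illustrative sketch only: structure and field names are assumptions,
# not taken from TheAgentCompany's actual task definitions.
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """One gradable milestone within a task."""
    description: str                # e.g. "merge request opened on GitLab"
    points: int = 1                 # weight toward the partial-credit score
    needs_llm_judge: bool = False   # subjective checks may defer to an LLM


@dataclass
class Task:
    """A benchmark task: instructions, required services, and checkpoints."""
    task_id: str
    category: str                   # e.g. "SDE", "finance", "HR", "admin"
    instructions: str               # what the simulated manager asks for
    services: list[str] = field(default_factory=list)   # e.g. ["gitlab"]
    checkpoints: list[Checkpoint] = field(default_factory=list)


# How one SDE-flavored task might look under this hypothetical schema.
example_task = Task(
    task_id="sde-001",
    category="SDE",
    instructions="Fix the failing unit test in the repo and open a merge request.",
    services=["gitlab"],
    checkpoints=[
        Checkpoint("test suite passes", points=2),
        Checkpoint("merge request opened with a clear description",
                   points=1, needs_llm_judge=True),
    ],
)
```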
Evaluation relied on automated checks against predefined checkpoints, awarding partial credit for intermediate progress and, for more subjective checks, sometimes using LLM-based evaluation. The environment also featured simulated colleagues, NPCs powered by Anthropic’s Claude 3.5 Sonnet via the Sotopia platform (a framework for creating simulated social environments), to test interaction capabilities.
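Scoring can then be pictured as weighted partial credit over those checkpoints. As a hedged sketch only (the benchmark’s real formula for combining checkpoint results into its partial score may differ), the grading logic amounts to something like:

```python
# Illustrative scoring sketch: the benchmark's actual weighting between
# partial progress and full completion may differ from this.

def score_task(results: list[tuple[int, bool]]) -> tuple[bool, float]:
    """Score one task from (points, passed) pairs, one per checkpoint.

    Returns (fully_completed, partial_score), where partial_score is the
    fraction of available checkpoint points the agent earned.
    """
    total_points = sum(points for points, _ in results)
    earned_points = sum(points for points, passed in results if passed)
    fully_completed = all(passed for _, passed in results)
    partial_score = earned_points / total_points if total_points else 0.0
    return fully_completed, partial_score


# Example: passing 2 of 3 points' worth of checkpoints yields no
# full-completion credit but a partial score of roughly 0.67.
print(score_task([(2, True), (1, False)]))  # (False, 0.666...)
```

Aggregated over all 175 tasks, this is the gap the headline numbers capture: an agent can earn partial credit on many tasks it never fully finishes.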
AI Agents Stumble on Everyday Tasks
The results paint a picture of nascent, often clumsy, capability. Anthropic’s Claude 3.5 Sonnet led the pack but only achieved a 24.0% full task completion rate (34.4% partial score). This performance came at a considerable operational expense, averaging over $6 and nearly 30 interaction steps per task. Google’s Gemini 2.0 Flash was notably cheaper ($0.79/task) but much slower (almost 40 steps) and less successful (11.4%). OpenAI’s GPT-4o registered 8.6% success ($1.29/task), while Meta’s open-weight Llama 3.1 405b achieved 7.4% ($3.21/task). Other models, including Amazon’s Nova Pro v1 (1.7%), trailed further behind. These low success rates stem from a range of observed issues.
Where Agents Falter
Analysis of the failures pointed to fundamental limitations in the agents’ capabilities. Basic common sense often seemed absent; agents might treat a “.docx” file as plain text or, as noted in one source, prove unable to dismiss an “innocuous pop-up” blocking necessary files. Social skills were also weak: agents using the simulated RocketChat system misinterpreted conversations or failed to follow up appropriately.
The researchers documented one instance where an agent, unable to find the correct contact in the chat system, “decides to create a shortcut solution by renaming another user to the name of the intended user.” Navigating complex web UIs proved particularly difficult, especially within the OwnCloud office suite environment. The researchers broadly identified common failure points as a lack of common sense, poor social skills, and incompetence in web browsing.
Uneven Progress Across Different Work Types
Performance wasn’t uniform across task categories. Agents generally fared better on software engineering (SDE) tasks than on administrative, financial, or data-science work, where success rates were often near zero. The researchers hypothesize this disparity might stem from the vast amount of public code available for training models on SDE tasks, whereas workflows for administrative or financial jobs are often proprietary and less represented in training data.
The ability to interact with different platforms also varied. Agents showed particular difficulty with tasks involving the RocketChat communication platform and the OwnCloud office suite, suggesting that both social reasoning and complex web UI navigation remain major hurdles. Performance on tasks involving GitLab (code hosting) and Plane (project management) was comparatively better, though still far from reliable.
A Reality Check for Automation Ambitions
These benchmark results stand in stark contrast to the high expectations and ongoing development efforts within the tech industry. Microsoft began previewing “computer use” agents in Copilot Studio in April 2025, aiming to automate GUI interactions. OpenAI was reported in March 2025 to be exploring high-cost “PhD-level” research agents for enterprise automation.
Perhaps most strikingly, plans linked to Elon Musk’s DOGE initiative surfaced in late April 2025, involving recruitment for a project aiming to deploy AI agents capable of replacing the work equivalent of “at least 70k FTEs” within a year. The proposal was met with skepticism inside a Palantir alumni network, with one critic retorting, “You’re complicit in firing 70k federal employees and replacing them with shitty autocorrect.” TheAgentCompany’s findings underscore the feasibility questions hanging over such large-scale automation plans.
The agents’ struggles in the benchmark align with known weaknesses in current AI models. Anthropic’s Chief Information Security Officer warned in April 2025 that the industry is unprepared for the security and management challenges posed by autonomous “virtual employees,” highlighting known issues like AI hallucination and vulnerability to prompt injection.
The difficulty agents faced with communication and complex instructions in TheAgentCompany reflects these underlying challenges, recently exemplified when Cursor AI’s support bot reportedly invented a non-existent company policy in late April 2025. The Carnegie Mellon researchers concluded that while agents might accelerate portions of human work, they are “likely not a replacement for all tasks at the moment.”
They drew parallels to the machine translation market, where efficiency gains led to increased demand rather than mass job displacement for human translators. Companies currently experimenting with agents, like Johnson & Johnson, emphasize keeping humans involved, viewing AI as a tool for collaboration rather than replacement for the foreseeable future.