- Internal Evaluation: Bridgewater Associates and Thinking Machines Lab say a tuned Qwen3-235B model outperformed GPT, Claude, and Gemini variants in an internal finance-task evaluation.
- Expert Feedback: Expert labels, prompt rules, and fine-tuning helped encode private workflow judgments that public web knowledge lacked.
- Cost Caveat: The companies reported 84.7 percent accuracy and a 13.8-fold reduction in inference cost, but both figures are company-run measurements.
- Deployment Limits: Financial firms still need GPUs, latency tuning, engineering staff, and maintenance as filings and regulations change.
Bridgewater Associates’ AIA Labs and Thinking Machines Lab say a fine-tuned Qwen3-235B open-weight model outperformed leading commercial AI models in a finance-task evaluation. Bridgewater AIA Labs is Bridgewater’s artificial-intelligence research and investment lab, and open-weight means the model’s weights can be accessed or adapted rather than only queried through a closed service.
In the same internal evaluation, the trained model reached 84.7 percent accuracy versus 78.2 percent for the strongest frontier model tested, and cut inference cost per 1,000 tasks by a factor of 13.8 relative to that model.
For financial firms deciding whether to automate document triage, the caveat is that these numbers are company-run measurements, not an independent public benchmark.
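To put the reported figures in more familiar terms, the short sketch below restates them as an accuracy gap, a relative drop in error rate, and a relative cost. Only the three input numbers come from the companies; the function name and the derived quantities are illustrative arithmetic, not additional reported results.

```python
def restate_reported_figures(tuned_acc, frontier_acc, cost_factor):
    """Derive simple comparisons from the company-reported numbers.

    tuned_acc and frontier_acc are accuracies in percent;
    cost_factor is the reported fold reduction in inference cost.
    """
    gap_points = tuned_acc - frontier_acc        # percentage-point gap
    tuned_err = 100 - tuned_acc                  # error rate of the tuned model
    frontier_err = 100 - frontier_acc            # error rate of the frontier model
    # Share of the frontier model's errors that the tuned model avoids.
    error_reduction = (frontier_err - tuned_err) / frontier_err * 100
    # Tuned-model cost as a percentage of the frontier model's cost.
    relative_cost = 100 / cost_factor
    return gap_points, error_reduction, relative_cost


gap, err_cut, rel_cost = restate_reported_figures(84.7, 78.2, 13.8)
print(f"accuracy gap: {gap:.1f} points")            # 6.5 points
print(f"relative error reduction: {err_cut:.1f}%")  # 29.8%
print(f"cost relative to frontier: {rel_cost:.1f}%")  # 7.2%
```

Read this way, a 13.8-fold reduction means the tuned model's reported inference cost is roughly 7 percent of the comparison model's, within the same internal test.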
Why Private Investor Judgment Mattered
Document triage, not open-ended idea generation, supplied the setting for the model comparison. Relevance decisions were hard to automate because a correct answer often depended on Bridgewater’s private workflow rather than public web knowledge.
Variants of Google’s Gemini models, Anthropic’s Claude models, and OpenAI’s GPT model family averaged roughly 50 percent accuracy when given only the task descriptions.
Expert-written prompts pushed frontier-model averages into the mid-70 percent range, still below the authors’ 80 percent threshold for trustworthy deployment. GPT 5.4 also cost 43 percent more than GPT 5.2 while delivering only marginal accuracy gains on the evaluated finance-document tasks, a comparison that, like the rest, comes from the same narrow internal test.
Contractor labels were not enough on their own. A training data cleanup process routed examples to investment experts when a model trained on vendor labels disagreed with those labels. Custom fine-tuned models may outperform on domain-specific tasks requiring expert judgment, but the Bridgewater workflow points to a narrower lesson: private feedback, labels, review rules, and corrections gave the model a repeatable path to investor judgment.
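The cleanup step described above, where disagreements between vendor labels and a model trained on them are sent to experts, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `Example` record, its field names, and `split_for_review` are hypothetical and are not taken from Bridgewater’s actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Example:
    doc_id: str
    vendor_label: str  # label supplied by the contractor (hypothetical field)
    model_label: str   # prediction from a model trained on vendor labels


def split_for_review(examples):
    """Separate examples where the model agrees with the vendor label
    from those that should be routed to an investment expert."""
    agreed, needs_expert = [], []
    for ex in examples:
        if ex.vendor_label == ex.model_label:
            agreed.append(ex)
        else:
            needs_expert.append(ex)  # disagreement: send to expert review
    return agreed, needs_expert


batch = [
    Example("doc-1", "relevant", "relevant"),
    Example("doc-2", "relevant", "irrelevant"),
    Example("doc-3", "irrelevant", "irrelevant"),
]
agreed, needs_expert = split_for_review(batch)
print([ex.doc_id for ex in needs_expert])  # ['doc-2']
```

The design choice here is that expert time is spent only on the contested examples, which is one plausible reading of why the process was described as a cleanup rather than a full relabeling.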
Bridgewater’s training run used Thinking Machines Lab’s Tinker platform. The Murati-led AI startup launched the Tinker API for model customization in 2025, before the Bridgewater evaluation. For this project, the Tinker training API let researchers control model training while the company handled infrastructure.
Thinking Machines Lab paired Tinker with Qwen3-235B for task-specific extra training, not a general chatbot rollout. Tinker uses LoRA (low-rank adaptation) and limits how customer data is used, so organizations can fine-tune a small adapter rather than the full model while keeping customer data tied to customer models.
Financial teams get clearer control over their data when research includes sensitive client and strategy information. Beyond the adapter method, Tinker exposes forward-backward, optimizer-step, sampling, and save-state functions for model training workflows. Recipe details also included on-policy distillation, where a student model learns from its own attempted outputs while stronger teachers grade them.
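As a rough illustration of the LoRA adapter idea mentioned above, the sketch below shows the standard low-rank update, W + (alpha / r) · B·A, and why the adapter is small compared with the full weight matrix. It uses plain Python lists and hypothetical helper names; it does not call or imitate the Tinker API, and the dimensions are illustrative, not Qwen3-235B’s.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A).

    w: frozen base weight, shape d_out x d_in
    b: trainable matrix, shape d_out x r (typically initialized to zero)
    a: trainable matrix, shape r x d_in
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def param_counts(d_out, d_in, r):
    """Parameters in the full matrix versus in the LoRA adapter."""
    return d_out * d_in, r * (d_in + d_out)


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # d_out=2, d_in=3
a = [[1.0, 0.0, 1.0]]                        # r=1
b_zero = [[0.0], [0.0]]                      # zero init leaves W unchanged
assert lora_effective_weight(w, a, b_zero, alpha=2.0) == w

full, adapter = param_counts(4096, 4096, 8)  # illustrative sizes only
print(full, adapter)  # 16777216 65536
```

Because only the two small matrices are trained, the adapter can be stored and swapped separately from the frozen base weights, which is consistent with the article’s description of fine-tuning a smaller adapter.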
What Bridgewater’s Six-Task Result Does Not Prove
Bridgewater warns that AI-tool outputs can contain inaccuracies, errors, defects, or security vulnerabilities found only after use, keeping the finance-task numbers tied to careful deployment rather than automated trust.
Independent enterprise-AI commentator and analyst Vijay Vijayasankar frames the Bridgewater result as a scope caveat rather than a verdict on every larger model. He called the advantage “a better feedback loop” because expert correction, task definition, and deployment constraints matter alongside model size. A large adaptable model still requires GPUs, batching, latency tuning, and engineering staff, so lower inference cost does not remove the need to maintain a narrow document-filtering system.
Bridgewater and Thinking Machines Lab still have to show whether the six-task evaluation holds up in broader, independently checked financial workflows. Fresh filings, central-bank documents, and regulatory language can shift the judgments a model must learn. For financial firms, those changes will test whether ongoing model maintenance can keep the reported accuracy edge intact outside the company-run evaluation.


