Generative AI models are closing the gap with non-specialist doctors when it comes to medical diagnosis, but they remain considerably less accurate than human experts, according to a large-scale analysis from Osaka Metropolitan University. The research, led by Dr. Hirotaka Takita and Associate Professor Daiju Ueda, systematically reviewed 83 studies to compare AI performance against physicians, revealing an average AI diagnostic accuracy of 52.1%.
Published in npj Digital Medicine on March 22, the meta-analysis sifted through more than 18,000 papers published since June 2018. It evaluated a range of AI models, from the heavily studied GPT-4 to Llama3 70B, Gemini 1.5 Pro, and Claude 3 Sonnet.
The core comparison showed AI’s diagnostic performance was statistically similar to that of non-expert physicians, with only a 0.6% difference favoring the humans. However, medical specialists maintained a clear edge, outperforming the AI models by a substantial 15.8% margin in accuracy.
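For a rough sense of what those gaps imply, the reported differences can simply be added to the pooled AI figure; the short sketch below does only that back-of-the-envelope arithmetic and is an illustration, not the study's own statistical pooling.

```python
# Rough, illustrative arithmetic based only on the figures reported above;
# the meta-analysis pools results statistically rather than by simple addition.
ai_accuracy = 52.1        # pooled diagnostic accuracy of generative AI (%)
gap_non_specialist = 0.6  # reported difference favoring non-specialist physicians (%)
gap_specialist = 15.8     # reported margin by which specialists outperformed the AI (%)

print(f"Implied non-specialist accuracy: ~{ai_accuracy + gap_non_specialist:.1f}%")  # ~52.7%
print(f"Implied specialist accuracy:     ~{ai_accuracy + gap_specialist:.1f}%")      # ~67.9%
```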
Performance Varies by Field and Complexity
The AI models demonstrated variable success across different medical disciplines. They showed particular strength in dermatology, a field where visual pattern recognition – a forte of current AI – plays a large part. Yet, the researchers caution that dermatology also demands complex reasoning beyond visual matching.
Conversely, findings suggesting AI proficiency in urology were tempered by the fact that they came primarily from a single large study, limiting how far those results can be generalized. Overall, the analysis indicated that AI tends to falter on complex cases that require interpreting extensive, detailed patient information, an area where specialists often excel through experience and nuanced clinical reasoning.
AI as Assistant, Not Replacement
Despite the accuracy deficit compared to specialists, the study highlights potential roles for AI in healthcare support and training. Osaka Metropolitan University, in an April 18, 2025 statement, quoted Dr. Takita on the possibilities: “This research shows that generative AI’s diagnostic capabilities are comparable to non-specialist doctors. It could be used in medical education to support non-specialist doctors and assist in diagnostics in areas with limited medical resources.”
This suggests a future where AI acts as a supplementary tool, augmenting human capabilities rather than supplanting them, a view echoed in broader discussions of AI in medicine, where combined human-AI performance often exceeds that of either alone.
Persistent Hurdles: Bias and Transparency
The enthusiasm for AI’s potential is balanced by notable challenges identified in the analysis. Chief among them is the lack of transparency regarding the training data used for many commercial AI models. This opacity makes it difficult to assess potential biases or determine whether a model’s performance will generalize across different patient populations.
The researchers noted that transparency is essential for understanding a model’s knowledge and limitations. Quality assessment using the PROBAST tool rated 76% of the included studies as having a high risk of bias, often because they evaluated models on small test datasets or provided too little detail about the AI’s training data for external validity to be assessed.
Some experts also worry that AI trained on general health records might inadvertently learn and replicate historical human diagnostic errors present in the data.
The Path Forward for Medical AI
The Osaka study arrives as efforts to build specialized medical AI continue, exemplified by tools like Bioptimus’s H-optimus-0 pathology model released in July 2024. Against that backdrop, the meta-analysis provides a necessary benchmark for how close the diagnostic capabilities of general-purpose models have come to those of human practitioners.
Looking ahead, Dr. Takita stressed the need for further validation in more complex clinical settings and for greater transparency in how AI reaches its conclusions: “Further research, such as evaluations in more complex clinical scenarios, performance evaluations using actual medical records, improving the transparency of AI decision-making, and verification in diverse patient groups, is needed to verify AI’s capabilities.”