
A Comparative Analysis of Next-generation Large Language Models in Neurological Diagnostics
General Neurology
S39 - General Neurology 2 (12:03 PM-12:15 PM)
005
Objective: To evaluate and compare the diagnostic accuracy and confidence calibration of two next-generation Large Language Models (LLMs), Gemini 2.5 Pro and ChatGPT-5, on a diverse set of complex neurological case vignettes.
Background: The application of LLMs in clinical diagnostics is rapidly evolving. However, rigorous comparative evaluation of leading models like Gemini 2.5 Pro and ChatGPT-5 on complex neurological cases with multimodal data remains limited. Such benchmarks are critical to understanding their potential role and limitations in clinical decision support.
Methods: A dataset of 51 neurological case vignettes, sourced from academic materials and encompassing a wide range of diagnoses, was used. Each case included patient history, examination findings, and test results. Both LLMs were presented with the same standardized prompt for each case and asked to provide a final diagnosis. Diagnostic accuracy was determined by comparison to a gold-standard diagnosis. McNemar's test was used to compare accuracy rates between the paired model outputs. Model confidence scores (0-10) were analyzed using t-tests to compare calibration between correct and incorrect diagnoses.
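The paired accuracy comparison described above can be sketched with a minimal, dependency-free McNemar computation. Note the discordant-pair counts used below (5 vs. 4) are hypothetical, since the abstract reports only the marginal accuracies; they are simply one split consistent with a one-case accuracy gap.

```python
from math import erfc, sqrt

def mcnemar_asymptotic(b: int, c: int) -> float:
    """Uncorrected asymptotic McNemar test (chi-square, 1 df).

    b, c -- discordant-pair counts: cases one model diagnosed
    correctly while the other did not, and vice versa.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    return erfc(sqrt(chi2 / 2))

# Hypothetical split: 5 cases only model A got right, 4 only model B
print(round(mcnemar_asymptotic(5, 4), 3))  # → 0.739
```

Because McNemar's test depends only on the discordant pairs, any split with nine discordant cases differing by one produces the same p-value under this uncorrected asymptotic form.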
Results: ChatGPT-5 was correct in 45/50 cases (90.0%, 95% CI 78.6–95.7%); Gemini 2.5 Pro in 44/50 (88.0%, 95% CI 76.2–94.4%). McNemar's test revealed no statistically significant difference in accuracy (p=0.739). However, the models differed significantly in confidence calibration. ChatGPT-5 demonstrated superior calibration, with mean confidence for correct diagnoses significantly higher than for incorrect diagnoses (9.42 vs. 7.80, p=0.04). Gemini 2.5 Pro's confidence was uniformly high and did not differ significantly between correct (9.66) and incorrect (9.67) diagnoses. Qualitative analysis identified instances of critical, high-confidence errors by both models.
Conclusions: Both models achieved high diagnostic accuracy with no significant paired difference, highlighting their potential as supportive tools in neurological education and clinical decision-making. Future studies should focus on real-world clinical validation and the integration of these tools into safe, clinician-led workflows.
Authors/Disclosures
Aditi Agarwal
PRESENTER
Aditi Agarwal has nothing to disclose.
Anshum Patel Dr. Patel has nothing to disclose.
Ipshita Garg, MBBS (University of Texas at Tyler) Dr. Garg has nothing to disclose.