
Performance of Successive Generative Pre-trained Transformer (GPT) Models in Medical Cases and Board-style Questions
Sleep
P9 - Poster Session 9 (5:00 PM-6:00 PM)
14-010
To benchmark the performance trajectory of successive GPT models in sleep medicine, assessing diagnostic capacity with clinical vignettes and domain knowledge with board-style MCQs, and to test the hypothesis that performance gains are plateauing.
Large language models (LLMs) show rapid advances in clinical reasoning, yet their trajectory in specialized domains remains incompletely defined.
We conducted a comparative evaluation of six OpenAI models—GPT-3.5 Turbo, GPT-4-Turbo, GPT-4o, GPT-4.1, GPT-o3, and GPT-5. Performance was benchmarked on two datasets: 78 AASM case vignettes and 897 board-style multiple-choice questions (MCQs). Standardized single-best prompts were used, runs were independent, and default decoding was applied. Pairwise comparisons used McNemar’s exact tests with Holm–Bonferroni correction (two-sided α=0.05). Models were accessed via API from July to September 2025; datasets included the American Academy of Sleep Medicine Case Book and subscription board-review banks to minimize training-data contamination.
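The statistical pipeline described above (an exact McNemar test on paired per-item correctness, with Holm–Bonferroni adjustment across the pairwise model comparisons) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' analysis code; the function names and counts are hypothetical.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pair counts.

    b = items model A answered correctly and model B missed;
    c = the reverse. Under H0 the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Exact binomial tail probability P(X <= k) with X ~ Binom(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(pvals):
    """Holm's step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # (m - rank) is the Holm multiplier; enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

# Hypothetical example: 12 vs. 2 discordant items between two models.
p_raw = mcnemar_exact(12, 2)
p_adj = holm_bonferroni([p_raw, 0.04, 0.30])  # adjust across 3 comparisons
```

A perfectly balanced split (`mcnemar_exact(5, 5)`) returns 1.0, while a lopsided split such as `mcnemar_exact(0, 10)` yields a small p-value, matching the intuition that only asymmetric disagreement between two models is evidence of a performance difference.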
Diagnostic accuracy rose with model generation: 74.4% (58/78) for GPT-3.5 Turbo; 73.1% for GPT-4-Turbo; 78.2% for GPT-4o; 89.7% for GPT-4.1; 93.6% for GPT-o3; and 91.0% for GPT-5. MCQ accuracy increased from 56.9% (510/897) with GPT-3.5 Turbo to 93.0% (834/897) with GPT-5 (GPT-o3, 92.4%; GPT-4.1, 85.4%). Advanced models significantly outperformed earlier iterations on both tasks after adjustment (P<0.05); on MCQs, GPT-5 and GPT-o3 were statistically indistinguishable. By disorder subgroup, GPT-o3 and GPT-4.1 achieved 100% accuracy for insomnia and other sleep disorders, whereas GPT-5 attained 100% accuracy for circadian rhythm and sleep-related movement disorders. Gains were smaller between the most recent generations, suggesting decelerating improvement.
Successive generations of GPT models demonstrate significant and progressive improvements in both diagnostic reasoning and knowledge recall in sleep medicine. The observed performance plateau among state-of-the-art models suggests that while LLMs show promise as clinical decision support tools, future progress toward clinical-grade reliability may necessitate a strategic shift from generalist training to domain-specific fine-tuning with curated medical data.
Authors/Disclosures
Anshum Patel
PRESENTER
Dr. Patel has nothing to disclose.
Het Contractor, MBBS
Dr. Contractor has nothing to disclose.
Hayden Heninger
Mr. Heninger has nothing to disclose.
Sai Krishna Vallamchetla, MBBS (Mayo Clinic, Florida)
Mr. Vallamchetla has nothing to disclose.
Pengze Li, PhD
Dr. Li has nothing to disclose.
Cui Tao, PhD
Dr. Tao has nothing to disclose.
Joseph Cheung, MD (Mayo Clinic)
Dr. Cheung has nothing to disclose.