
Performance of Successive Generative Pre-trained Transformer (GPT) Models in Medical Cases and Board-style Questions
Sleep
P9 - Poster Session 9 (5:00 PM-6:00 PM)
14-010
To benchmark the performance trajectory of successive GPT models in sleep medicine, assessing diagnostic capacity with clinical vignettes and domain knowledge with board-style MCQs, and to test the hypothesis that performance gains are plateauing.
Large language models (LLMs) show rapid advances in clinical reasoning, yet their trajectory in specialized domains remains incompletely defined.
We conducted a comparative evaluation of six OpenAI models—GPT-3.5 Turbo, GPT-4-Turbo, GPT-4o, GPT-4.1, GPT-o3, and GPT-5. Performance was benchmarked on two datasets: 78 AASM case vignettes and 897 board-style multiple-choice questions (MCQs). Standardized single-best prompts were used, runs were independent, and default decoding was applied. Pairwise comparisons used McNemar’s exact tests with Holm–Bonferroni correction (two-sided α=0.05). Models were accessed via API from July to September 2025; datasets included the American Academy of Sleep Medicine Case Book and subscription board-review banks to minimize training-data contamination.
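The statistical pipeline described above (an exact McNemar test on paired per-item correctness, with Holm–Bonferroni adjustment across the pairwise model comparisons) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' analysis code; the function names and counts are hypothetical.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pair counts.

    b = items model A answered correctly and model B missed;
    c = the reverse. Under H0 the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Exact binomial tail probability P(X <= k) with X ~ Binom(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(pvals):
    """Holm's step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # (m - rank) is the Holm multiplier; enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

# Hypothetical example: 12 vs. 2 discordant items between two models.
p_raw = mcnemar_exact(12, 2)
p_adj = holm_bonferroni([p_raw, 0.04, 0.30])  # adjust across 3 comparisons
```

A perfectly balanced split (`mcnemar_exact(5, 5)`) returns 1.0, while a lopsided split such as `mcnemar_exact(0, 10)` yields a small p-value, matching the intuition that only asymmetric disagreement between two models is evidence of a performance difference.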
Diagnostic accuracy rose with model generation: 74.4% (58/78) for GPT-3.5 Turbo; 73.1% for GPT-4-Turbo; 78.2% for GPT-4o; 89.7% for GPT-4.1; 93.6% for GPT-o3; and 91.0% for GPT-5. MCQ accuracy increased from 56.9% (510/897) with GPT-3.5 Turbo to 93.0% (834/897) with GPT-5 (GPT-o3, 92.4%; GPT-4.1, 85.4%). Advanced models significantly outperformed earlier iterations on both tasks after adjustment (P<0.05); on MCQs, GPT-5 and GPT-o3 were statistically indistinguishable. By disorder subgroup, GPT-o3 and GPT-4.1 achieved 100% accuracy for insomnia and other sleep disorders, whereas GPT-5 attained 100% accuracy for circadian rhythm and sleep-related movement disorders. Gains were smaller between the most recent generations, suggesting decelerating improvement.
Successive generations of GPT models demonstrate significant and progressive improvements in both diagnostic reasoning and knowledge recall in sleep medicine. The observed performance plateau among state-of-the-art models suggests that while LLMs show promise as clinical decision support tools, future progress toward clinical-grade reliability may necessitate a strategic shift from generalist training to domain-specific fine-tuning with curated medical data.
Authors/Disclosures
Anshum Patel
PRESENTER
Dr. Patel has nothing to disclose.
Het Contractor, MBBS
Dr. Contractor has nothing to disclose.
Hayden Heninger
Mr. Heninger has nothing to disclose.
Sai Krishna Vallamchetla, MBBS (Mayo Clinic, Florida)
Mr. Vallamchetla has nothing to disclose.
Pengze Li, PhD
Dr. Li has nothing to disclose.
Cui Tao, PhD
Dr. Tao has nothing to disclose.
Joseph Cheung, MD (Mayo Clinic)
Dr. Cheung has nothing to disclose.