Abstract Details

Title Benchmarking Large Language Models for Neurological Imaging Interpretation Using a Multiple Sclerosis Lesion Segmentation Dataset

Topic Multiple Sclerosis

Presentation(s) P6 - Poster Session 6 (5:00 PM-6:00 PM)

Poster/Presentation Number 18-006

Objective

To comprehensively benchmark the ability of general-purpose large language models (LLMs) to interpret structured MRI lesion data from a standardized multiple sclerosis (MS) lesion segmentation dataset (MSLesSeg), including lesion description and clinical interpretation.

Background

While some models (e.g., GPT-4) have exhibited impressive performance on clinical reasoning tasks, the ability of LLMs to interpret complex lesion data derived from neuroimaging has not yet been comprehensively investigated. MSLesSeg is a standardized, expert-annotated MS lesion segmentation dataset that includes lesion metadata for 75 MS cases, which makes it a suitable basis for benchmarking how current LLMs handle structured neurological imaging information.

Design/Methods

Three widely used general-purpose LLMs (GPT, Claude, and Gemini) will be evaluated using standardized text prompts generated from MSLesSeg. Each case will include structured lesion data (volume, count, and anatomical distribution across periventricular, juxtacortical, infratentorial, and spinal regions). Models will be tasked with (1) classifying lesion patterns as typical or atypical for MS and (2) generating structured radiology-style lesion descriptions. Evaluation will include accuracy and F1 scores for the classification task and hallucination/error rate analysis for the generated descriptions. Intra-model consistency across repeated prompts will also be examined.
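The evaluation steps above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `LesionCase` fields, the prompt wording, the volume unit (mL), the choice of "typical" as the positive class for F1, and the majority-agreement definition of consistency are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LesionCase:
    # Hypothetical record layout; these field names are NOT the MSLesSeg schema.
    case_id: str
    total_volume_ml: float  # assumed unit
    lesion_count: int
    region_counts: dict  # e.g. {"periventricular": 6, "juxtacortical": 2}


def build_prompt(case: LesionCase) -> str:
    """Render one case as fixed-format text so every model sees identical wording."""
    regions = ", ".join(
        f"{name}: {n}" for name, n in sorted(case.region_counts.items())
    )
    return (
        f"Lesion count: {case.lesion_count}\n"
        f"Total lesion volume (mL): {case.total_volume_ml:.2f}\n"
        f"Lesions by region: {regions}\n"
        "Task: classify the pattern as 'typical' or 'atypical' for MS."
    )


def accuracy(y_true, y_pred):
    """Fraction of cases where the model label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="typical"):
    """Binary F1 for the assumed positive class; returns 0.0 with no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def consistency(repeated_answers):
    """Share of repeated runs on one prompt that agree with the most common answer."""
    top_count = Counter(repeated_answers).most_common(1)[0][1]
    return top_count / len(repeated_answers)
```

In this sketch, `build_prompt` would be called once per case and the same string sent to each model several times; `accuracy` and `f1_score` compare the returned labels against the reference labels, and `consistency` summarizes agreement across the repeated runs for a single case.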

Results

We anticipate differences in performance across models, with higher accuracy for typical MS lesion patterns than for atypical or complex cases. Hallucination rates are expected to be nontrivial, particularly for infratentorial lesions. The study will provide comparative benchmarking data on model reliability, consistency, and interpretability. The analysis is ongoing, and results will be presented at the Annual Meeting.

Conclusions

This study will establish one of the first benchmarks for evaluating general-purpose LLMs in the context of structured neurological imaging data. By leveraging a publicly available, expert-annotated MS lesion segmentation dataset, this work aims to provide actionable insights into the capabilities and limitations of current LLMs in clinical neuroimaging interpretation.

Authors/Disclosures
Vishrut Thaker, BS PRESENTER	Mr. Thaker has nothing to disclose.
Isheeta Gupta, MBBS	Dr. Gupta has nothing to disclose.
