This study fills an existing gap by exploring the use of ChatGPT for medical image-based questions and case-based teaching scenarios. Our findings demonstrate that ChatGPT exhibits the capability to accurately answer image-based medical questions, with the GPT-4o version achieving numerically higher accuracy. Prompt engineering enables ChatGPT to assist in the design of lesson plans, showcasing significant potential in medical education. Nevertheless, human verification and correction remain essential. These findings may serve as a significant step toward expanding the practical use of advanced ChatGPT versions, which have already demonstrated great potential in medical fields.
Previous studies have widely assessed the performance of ChatGPT on medical questions, predominantly using earlier versions such as GPT-3.5 and GPT-4, which yielded inconsistent accuracy levels [1, 3, 4, 13,14,15, 17, 26, 27]. However, these studies excluded image-based questions because of the limitations of earlier versions, leaving ChatGPT's potential in real-world teaching contexts underexplored. In response, our study specifically targeted medical image-based questions to fill this gap. The accuracy of ChatGPT observed here is consistent with previous studies [14,15,16,17], while the newer version, GPT-4o, achieved greater accuracy (nearly 90%) [18]. Although GPT-4o demonstrated higher accuracy than GPT-4 across the image-based items, the difference did not reach statistical significance, likely owing to the limited sample size. Nonetheless, the consistent directional trend aligns with recent findings on the evolving capabilities of multimodal LLMs and supports further investigation in larger-scale evaluations.
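For illustration only, a paired test such as McNemar's exact test is one way an item-level accuracy comparison of this kind could be evaluated. The sketch below uses hypothetical per-item counts (not the data analyzed in this study) and assumes the statsmodels library is available.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired per-item outcomes for the two models (illustrative counts only, n = 38):
#                       GPT-4o correct   GPT-4o incorrect
# GPT-4 correct               28                  2
# GPT-4 incorrect              6                  2
table = [[28, 2],
         [6, 2]]

# Exact McNemar test: a binomial test on the 8 discordant items.
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.3f}")
```

With counts like these (about 79% vs. 89% accuracy), the exact test remains non-significant, illustrating how few discordant items a 38-question set provides for detecting a difference of this size.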
In addition to evaluating model performance, our findings highlight broader implications for the application of large vision-language models (LVLMs) in medical education—particularly in the areas of assessment design and instructional content development. LVLMs such as GPT-4o may assist in generating multiple-choice questions from image-based course materials, offering opportunities to streamline item development. Furthermore, our study demonstrates that GPT-4o possesses a notable capacity for logical reasoning and analytical processing when evaluating incorrect answers, suggesting its potential utility in developing explanations that reinforce clinical thinking frameworks. However, ensuring the clinical validity, cognitive appropriateness, and pedagogical alignment of AI-generated content remains a critical challenge, underscoring the need for expert oversight. While this study focused on interpreting images embedded within question stems, future work could explore more complex formats in which answer options themselves are visual (e.g., electrocardiogram tracings or radiographs). These tasks require fine-grained image discrimination and multimodal reasoning, areas in which current models still face limitations. As such, advancing the visual acuity and contextual understanding of LVLMs is essential to support their integration into high-stakes assessment environments.
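As a hedged illustration of the item-generation workflow described above, the following sketch submits a locally stored teaching image to GPT-4o through the OpenAI Python SDK and asks for a draft multiple-choice item. The prompt wording, file name, and helper function are hypothetical, and any generated item would still require expert review before use.

```python
import base64
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Return a base64 data URL for a local teaching image (e.g., a chest radiograph)."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

PROMPT = (
    "You are assisting a medical educator. Based on the attached image, draft one "
    "USMLE-style multiple-choice question with five options (A-E), indicate the "
    "correct answer, and briefly explain why each distractor is incorrect."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": encode_image("cxr_example.png")}},
        ],
    }],
)
print(response.choices[0].message.content)  # draft item for faculty review, not direct use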
Beyond assessment, LVLMs like ChatGPT show promise as instructional aids in case-based teaching. As demonstrated in exploratory prompts (Appendix 1 and 2), the model can support educators in structuring lessons around specific clinical concepts and generating adaptive instructional feedback [27, 28]. With its ability to present information logically and adjust content to learner needs, ChatGPT may enhance personalized learning and curricular efficiency [5, 28, 29]. Our findings, in conjunction with prior research [30, 31], support the growing interest in AI-assisted pedagogy. Nonetheless, limitations persist. While GPT-4o exhibits high coherence and insight in reasoning tasks, errors remain—particularly in nuanced clinical contexts involving medical images. These inaccuracies underscore the importance of human oversight and expert validation to ensure instructional reliability and clinical relevance [27, 32,33,34,35]. As such tools continue to evolve, their optimal use will likely depend on integration with faculty-led review and revision mechanisms, ensuring safety, accuracy, and pedagogical value in medical education.
This study has several limitations. First, all model responses were generated via the ChatGPT web interface under default settings rather than through the OpenAI Application Programming Interface (API). Although personalization was disabled and each question was submitted in a newly initiated session to minimize memory effects, API-based deployment would allow for greater control over system parameters and eliminate potential user-specific variability. Second, each question was evaluated only once per model. This approach, consistent with prior LLM evaluation studies, minimized contextual contamination from repeated prompts—particularly relevant in session-based environments—but precluded assessment of intra-model variability. Future studies should consider multi-sample testing under controlled API conditions to examine response stability and reproducibility. Third, the relatively small number of publicly available USMLE-style questions with image content (n = 38) limited the statistical power and generalizability of the findings. Expanding the question pool and including a broader range of visual modalities—such as radiographs and electrocardiograms—would enhance benchmarking rigor. Fourth, the study did not isolate the respective contributions of visual versus textual inputs to model performance. While examples in the Supplementary Appendix suggest engagement with image content, dedicated experimental designs are needed to disentangle multimodal reasoning pathways and assess their relative influence. Finally, the study focused exclusively on GPT-4 and GPT-4o, which were the most accessible and stable vision-capable models at the time of evaluation (late 2024 to early 2025). Comparative studies involving other LVLMs, such as Gemini, LLaVA, or DeepSeek, are warranted to explore model-specific strengths and inform future applications in medical education.
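As one possible design for the multi-sample, API-based testing suggested above, the sketch below issues several independent requests per question at a fixed, explicit temperature and summarizes answer agreement. The model name, sampling count, prompt wording, and helper function are illustrative assumptions rather than a protocol used in this study.

```python
from collections import Counter
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()
N_SAMPLES = 5  # repeated, independent completions per item

def sample_answers(question_text: str, model: str = "gpt-4o") -> Counter:
    """Query the model several times in separate requests and tally the chosen letters."""
    answers = []
    for _ in range(N_SAMPLES):
        resp = client.chat.completions.create(
            model=model,
            temperature=1.0,  # decoding parameter fixed explicitly, unlike the web interface
            messages=[{
                "role": "user",
                "content": question_text + "\nAnswer with a single letter (A-E).",
            }],
        )
        # Naive parse of the leading answer letter; images could be attached as in the
        # item-generation sketch above for image-based questions.
        answers.append(resp.choices[0].message.content.strip()[:1].upper())
    return Counter(answers)

# Example: majority answer and agreement rate as a simple per-item stability measure.
tally = sample_answers("<full question stem and answer options here>")
majority, count = tally.most_common(1)[0]
print(f"majority answer: {majority}, agreement: {count / N_SAMPLES:.0%}")
```

Aggregating such agreement rates across the question pool would give a straightforward estimate of intra-model variability under controlled conditions.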