[关键词]
[摘要]
目的:评估3种不同的大型语言模型(LLM,包括GPT-3.5、GPT-4和PaLM2)在回答眼科专业问题中的表现并与3种不同水平的专业人群(医学本科生、医学硕士、主治医师)进行比较。
方法:分别对3种不同的LLM和3种不同水平的专业人群(包括本科生9名,专业型研究生6名,主治医师3名)进行一项由100道眼科单项选择题组成的测试,问题涵盖眼科基础知识、临床知识、眼科检查诊断方法以及眼病相关治疗手段。从平均得分、答题稳定性和答题自信心等方面综合评估LLM的性能并与人类组进行比较。
结果:在平均测试得分中,每个LLM都在总体上优于本科生(GPT-4:56分,GPT-3.5:42分,PaLM2:47分,本科生:40分),其中GPT-3.5、PaLM2略低于硕士水平(硕士:51分),而GPT-4则表现出与主治医师相当的水平(主治医师:62分)。另外,GPT-4表现出明显高于GPT-3.5和PaLM2的答题稳定性和答题自信心。
结论:以GPT-4为代表的LLM在眼科领域表现得较为出色,LLM可为临床医生的临床决策和医学教育的教学提供辅助。
[Keywords]
[Abstract]
AIM: To evaluate the performance of three large language models (LLMs), namely GPT-3.5, GPT-4, and PaLM2, in answering ophthalmology questions, and to compare them with human participants at three levels of training: medical undergraduates, master's students in medicine, and attending physicians.
METHODS: A test consisting of 100 single-answer multiple-choice ophthalmology questions, covering basic ophthalmic knowledge, clinical knowledge, ophthalmic examination and diagnostic methods, and treatments for ocular disease, was administered to the three LLMs and to human participants at three levels of training (9 undergraduates, 6 professional-degree postgraduates, and 3 attending physicians). LLM performance was evaluated in terms of mean score, response consistency, and response confidence, and was compared with that of the human groups.
RESULTS: In mean test score, every LLM outperformed the undergraduates overall (GPT-4: 56, GPT-3.5: 42, PaLM2: 47, undergraduates: 40). GPT-3.5 and PaLM2 scored slightly below the master's students (51), while GPT-4 performed at a level comparable to that of the attending physicians (62). In addition, GPT-4 showed markedly higher response consistency and response confidence than GPT-3.5 and PaLM2.
CONCLUSION: LLMs, represented by GPT-4, performed well in the field of ophthalmology and may assist clinicians with clinical decision-making and serve as teaching aids in medical education.
[中图分类号]
[基金项目]