arXiv:2503.17599

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Published on Mar 22, 2025

AI-generated summary

A new evaluation framework and benchmark (GPBench) assess the capabilities of large language models in general practice, revealing the need for further optimization and human oversight.

Abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not yet ready for deployment in such settings without human oversight, and further optimization specifically tailored to the daily responsibilities of GPs is essential.
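
This page does not link code, but the evaluation the abstract describes follows a familiar pattern: run each model over expert-annotated items and score its responses. Below is a minimal sketch of such a loop, assuming a JSON-lines file of question/answer pairs and an exact-match scorer; the file name, field names, and stub model are hypothetical illustrations, not GPBench's actual format or the authors' code.

# Hypothetical sketch of a GPBench-style evaluation loop.
# Assumes one JSON object per line: {"question": ..., "answer": ...}.
import json

def load_items(path):
    """Read expert-annotated items from a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def evaluate(model_fn, items):
    """model_fn maps a question string to the model's answer string."""
    correct = 0
    for item in items:
        prediction = model_fn(item["question"])
        # Exact-match scoring for illustration; clinical benchmarks like
        # GPBench typically rely on multiple-choice keys or expert rubrics
        # rather than raw string equality.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    items = load_items("gpbench_sample.jsonl")  # hypothetical file name
    # Stub model that always answers "A"; replace with a real LLM call.
    accuracy = evaluate(lambda q: "A", items)
    print(f"accuracy: {accuracy:.3f}")

In practice the model function would wrap an LLM API call, and the paper's competency-based framework would break the score down by clinical responsibility rather than reporting a single accuracy number.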
