With the introduction of ChatGPT, Large Language Models (LLMs) have received enormous attention in healthcare. Despite potential benefits, researchers have underscored various ethical implications. While individual instances have garnered attention, a systematic and comprehensive overview of practical applications currently researched and ethical issues connected to them is lacking. Against this background, this work maps the ethical landscape surrounding the current deployment of LLMs in medicine and healthcare through a systematic review. Electronic databases and preprint servers were queried using a comprehensive search strategy which generated 796 records. Studies were screened and extracted following a modified rapid review approach. Methodological quality was assessed using a hybrid approach. For 53 records, a meta-aggregative synthesis was performed. Four general fields of applications emerged showcasing a dynamic exploration phase. Advantages of using LLMs are attributed to their capacity in data analysis, information provisioning, support in decision-making or mitigating information loss and enhancing information accessibility. However, our study also identifies recurrent ethical concerns connected to fairness, bias, non-maleficence, transparency, and privacy. A distinctive concern is the tendency to produce harmful or convincing but inaccurate content. Calls for ethical guidance and human oversight are recurrent. We suggest that the ethical guidance debate should be reframed to focus on defining what constitutes acceptable human oversight across the spectrum of applications. This involves considering the diversity of settings, varying potentials for harm, and different acceptable thresholds for performance and certainty in healthcare. Additionally, critical inquiry is needed to evaluate the necessity and justification of LLMs’ current experimental use. 
Large language models (LLMs) have emerged as a transformative force in artificial intelligence (AI), generating significant interest across various sectors. The 2022 launch of OpenAI’s ChatGPT demonstrated their groundbreaking capabilities, revealing the current state of development to a wide audience. Since then, public availability and scientific interest have resulted in a flood of scientific papers exploring possible areas of application^{1} as well as their ethical and social implications from a practical perspective^{2}. A particularly rapid adoption of LLMs is seen in medicine and healthcare, encompassing clinical, educational and research applications^{3-9}. This development may present a case where a general-purpose technology swiftly integrates into specific domains. According to Libsey, such technologies are characterized by their potential for extensive refinement and expansion, a wide array of applications across various processes, and significant synergies with existing technologies^{10,11}. In a brief span, a significant number of publications have investigated the potential uses of LLMs in medicine and healthcare^{12}, indicating a positive trajectory for the integration of medical AI. Present-day LLMs, such as ChatGPT, are considered to have a promising accuracy in clinical decision-making^{13,14}, diagnosis^{15}, symptom assessment, and triage advice^{16}. In patient communication, it has been posited that LLMs can also generate empathetic responses^{17}. LLMs specifically trained on biomedical corpora forebode even further capacities for clinical application and patient care^{18} in the foreseeable future.
Conversely, the adoption of LLMs is entwined with ethical and social concerns^{19}. In their seminal work, Bender et al. anticipated real-world harms that could arise from the deployment of LLMs^{20}. Scholars have delineated potential risks across various application domains^{21,22}. The healthcare and medical fields, being particularly sensitive and heavily regulated, are notably susceptible to ethical dilemmas. This sector is also underpinned by stringent ethical norms, professional commitments, and societal role recognition. Despite the potential benefits of employing advanced AI technology, researchers have underscored various ethical implications associated with using LLMs in healthcare and health-related research^{4,6,7,23-26}. Paramount concerns include the propensity of LLMs to disseminate inadequate information, the input of sensitive health information or patient data, which raises significant privacy issues^{24}, and the perpetuation of harmful gender, cultural or racial biases^{27-30}, well known from machine learning algorithms^{31}, especially in healthcare^{32}. Case reports have documented that ChatGPT has already caused actual damage, potentially life-threatening for patients^{33}.
While individual instances have drawn attention to ethical concerns surrounding the use of LLMs in healthcare, there appears to be a deficit in comprehensive, systematic overviews addressing these ethical considerations. This gap is significant, given the ambitions to rapidly integrate LLMs and foundational models into healthcare systems^{34}. Our intention is to bridge this lacuna by mapping out the ethical landscape surrounding the deployment of LLMs in this field. To this end, we conducted a systematic review of the current literature including relevant databases and preprint servers. Our inquiry was structured around two research questions: Firstly, we sought to delineate the ethically relevant applications, interventions, and contexts where LLMs have been tested or proposed within the realms of medicine and healthcare. Secondly, we aimed to identify the principal outcomes as well as the opportunities, risks, benefits, and potential harms associated with the use of LLMs in these sectors, as deemed significant from an ethical standpoint. Through this, we aspire not only to outline the current ethical discourse but also to inform future dialogue and policy-making at the intersection of LLMs and healthcare ethics.
Results
Our search yielded a total of 796 database hits. After removal of duplicates, 738 records went through title/abstract screening. Of these, 158 full texts were assessed, and 53 records were included in the dataset. The dataset encompasses 23 original articles^{25,35-56}, including theoretical or empirical work, 11 letters^{57-67}, six editorials^{68-73}, four reviews^{8,74-76}, three comments^{24,77,78}, one report^{79} and five unspecified articles^{80-84}. The flow of records through the review process can be seen in Fig. 1. Most works focus on applications utilizing ChatGPT across various healthcare fields, as indicated in Table 1. Regarding the affiliation of the first authors, 25 articles come from North America, 11 from Europe, six from West Asia, four from East Asia, three from South Asia and four from Australia.
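The counts reported above can be cross-checked with a short script. The following is a minimal sketch, not part of the review's methodology: the variable and function names are illustrative, and the stage-to-stage differences are derived here by subtraction rather than reported in the text, so they assume that every record leaving the flow was dropped at the stage shown.

```python
# Counts as reported in the Results paragraph; all names are illustrative.
screening_flow = [
    ("database hits", 796),
    ("records after duplicate removal", 738),
    ("full texts assessed", 158),
    ("records included", 53),
]

record_types = {
    "original articles": 23,
    "letters": 11,
    "editorials": 6,
    "reviews": 4,
    "comments": 3,
    "reports": 1,
    "unspecified articles": 5,
}

first_author_regions = {
    "North America": 25,
    "Europe": 11,
    "West Asia": 6,
    "East Asia": 4,
    "South Asia": 3,
    "Australia": 4,
}


def dropped_between_stages(flow):
    """Return (from_stage, to_stage, difference) for each pair of consecutive stages."""
    return [
        (name_a, name_b, count_a - count_b)
        for (name_a, count_a), (name_b, count_b) in zip(flow, flow[1:])
    ]


included = screening_flow[-1][1]

# Both breakdowns should add up to the number of included records.
assert sum(record_types.values()) == included
assert sum(first_author_regions.values()) == included

# Derived differences (an assumption of this sketch, not figures from the text).
for stage_a, stage_b, difference in dropped_between_stages(screening_flow):
    print(f"{stage_a} -> {stage_b}: {difference} fewer records")
```

Both the breakdown by record type and the breakdown by first-author region sum to the 53 included records, so the figures in the paragraph are internally consistent.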
During analysis, four general themes emerged in our dataset, which we used to structure our reporting. These themes include clinical applications, patient support applications, support of health professionals, and public health perspectives. Table 2 provides exemplary scenarios for each theme derived from the dataset.
Clinical applications
To support initial diagnosis and triaging of patients^{39,52}, several authors discuss the use of LLMs in the context of predictive patient analysis and risk assessment in or prior to clinical situations as a potentially transformative application^{74,80}. The role of LLMs in this scenario is described as that of a “co-pilot” using available patient information to flag areas of concern or to predict diseases and risk factors^{44}.
Currie, in line with most authors, notes that predicting health outcomes and relevant patterns is very likely to improve patient outcomes and contribute to patient benefit^{80}. For example, overcrowded emergency departments present a serious issue worldwide and have a significant impact on patient outcomes. From a perspective of harm avoidance, using LLMs with triage notes could lead to reduced length of stay and a more efficient utilization of time in the waiting room^{52}.
All authors note, however, that such applications might also be problematic and require close human oversight^{39,44,51,80}. Although LLMs might be able to reveal connections between disparate knowledge^{40}, generating inaccurate information would have severe negative consequences^{44,74}.
Fig. 1 | Flow of records through the screening process. Diagram following PRISMA guidelines.
Identification of studies via databases and preprint servers
Table 1 | Overview of the included records