The American Journal of Managed Care
This article presents challenges and solutions regarding health care–focused large language models (LLMs) and summarizes key recommendations from major regulatory and governance bodies for LLM development, implementation, and maintenance.
ABSTRACT
This commentary presents a summary of 8 major regulations and guidelines that have direct implications for the equitable design, implementation, and maintenance of health care–focused large language models (LLMs) deployed in the US. We grouped key equity issues for LLMs into 3 domains: (1) linguistic and cultural bias, (2) accessibility and trust, and (3) oversight and quality control. Solutions shared by these regulations and guidelines are to (1) ensure diverse representation in training data and in teams that develop artificial intelligence (AI) tools, (2) develop techniques to evaluate AI-enabled health care tool performance against real-world data, (3) ensure that AI used in health care is free of discrimination and integrates equity principles, (4) take meaningful steps to ensure access for patients with limited English proficiency, (5) apply AI tools to make workplaces more efficient and reduce administrative burdens, (6) require human oversight of AI tools used in health care delivery, and (7) ensure AI tools are safe, accessible, and beneficial while respecting privacy. There is an opportunity to prevent further embedding of existing disparities and issues in the health care system by enhancing health equity through thoughtfully designed and deployed LLMs.
Am J Manag Care. 2025;31(3):In Press
Takeaway Points
Clinicians and health care organizations face a major inflection point regarding the integration of large language models (LLMs) into health care delivery. As health care adopts more artificial intelligence tools, we risk further exacerbating inequities in health care delivery and outcomes across protected groups if we do not anticipate LLM-specific challenges. This commentary presents key equity challenges for health care–focused LLMs (linguistic and cultural bias, accessibility and trust, and oversight and quality control), recommendations from major regulations and guidelines, and proposed strategies for equitable LLM design, implementation, and maintenance.
As adoption of artificial intelligence (AI) in health care increases, there is growing interest in generative AI enabled by large language models (LLMs).1-3 LLMs are algorithms that recognize, summarize, translate, and generate natural language content to perform a wide range of tasks. When integrated with electronic health records (EHRs), they can be used to streamline clinical documentation and automate responses to patients’ messages.4 Although these AI tools show promise, LLMs also exhibit inherent pitfalls that may exacerbate long-standing health care inequities.5-7 With greater adoption and deployment of LLMs and AI in clinical care on the horizon, we must anticipate these pitfalls and commit to principles of health equity by centering patient voices, experiences, and needs when designing and implementing LLM clinical tools, particularly for patients from historically marginalized backgrounds.8,9
LLM Equity Challenges
Limited studies exist on the impact of LLMs on equitable health care delivery.7,10 Important health equity–related questions include: Are LLM-based tools such as ambient AI scribes used differently based on patient characteristics? Do they positively impact care team workloads by increasing personnel time for patient care across settings? What types of errors that negatively impact quality of care are more common for patients of underrepresented backgrounds?
We grouped key equity issues for LLMs into 3 major domains: (1) linguistic and cultural bias, (2) accessibility and trust, and (3) oversight and quality control, as described below and in the Figure.
Linguistic and Cultural Bias
Limited support for non-English languages. Most clinically applied LLMs were developed primarily for English-language interactions.11 These models frequently encode other languages into English (rather than directly processing the original language) during training or fine-tuning, which compromises their generative performance in non-English languages.11 Further, LLMs developed for additional languages typically lack proper certification for clinical use, whereas human interpreters must be tested for certification.12 Addressing these issues before widespread deployment of health care–focused LLMs may help avoid embedding differences in accessibility for patients of different linguistic and ethnic backgrounds. Compliance with Office for Civil Rights requirements that AI tools not discriminate on the basis of protected legal classes also makes it imperative to prevent future disparities in the quality of LLMs between populations who speak English and those who do not.13
Limited ability to understand nonstandard English. LLMs are predominantly trained and fine-tuned on English texts and sources that may not be representative or include linguistic variation. Preliminary studies show that LLMs struggle to understand regional or cultural variations (eg, English as a second language or generational slang terms).14 Health care–focused LLM performance is likely to suffer because most AI models are trained primarily on US-based data consisting of standard English inputs.11 As with the issue above, this could result in long-lasting differences in the accessibility and quality of LLMs for patients of underrepresented backgrounds.
Accessibility and Trust
Unverified utility for patients with varying abilities and skills. To our knowledge, most LLMs for medical transcription are not validated for use among patients with disabilities (particularly speech-related disabilities) or varying abilities and skills.15 Despite current limitations, there is an opportunity to create tools to improve communication and quality care standards for all patients and use LLMs to increase patients’ access to health information and care (eg, text-to-speech assistants, materials for patients with low health literacy).15
Uncertain acceptability and utility among marginalized and/or underrepresented populations. Many individuals in historically marginalized communities are skeptical of or lack trust in the health care system because of concerns about mistreatment or experiences with discrimination.16,17 Introducing new LLM-based technologies can worsen this trust gap and lead to lower patient satisfaction if these tools are deployed without transparency or adequate safeguards. Lack of adequate representation across groups in training data is also well documented, limiting how useful these models are in diverse settings.18 Alternatively, initial evidence shows that some Black patients perceive AI as a tool to potentially mitigate clinician bias and prejudice.19 To ensure equity, it is imperative that patients from diverse backgrounds have input into the training and usage of LLMs in health care. Previous research also shows it is important to maintain a sense of human connection with patients, align use of AI in health care with community and cultural values, and use AI in ways that benefit patients and improve their care from their vantage point.8
Oversight and Quality Control
Limited human oversight. With proper oversight, LLMs are powerful tools that can enhance clinician and patient experiences; ideally, humans intervene to prevent feedback loops in which AI-generated content is used to train subsequent LLMs. In practice, however, LLMs often lack adequate oversight, leading to downstream effects and further embedding of errors.18,20 Automation bias, in which users do not adequately monitor or verify LLM outputs, has already arisen in clinical settings where clinicians were supported by AI in reading mammography.21 Model collapse—in which LLMs gradually degrade because they are trained on data containing errors generated by previous inaccurate LLMs20—may lead to errors being embedded in EHRs through AI-generated clinical notes or messages and further exacerbated by automation bias or human error.18 Of particular concern are errors stemming from LLMs learning to predict text based on biased and/or prejudiced assumptions in human-generated training data, which reflect broader, ongoing societal inequities. The risk is not theoretical: other health care algorithms have already shown how learning from historically biased patterns can perpetuate discrimination, as when population health management algorithms resulted in Black patients receiving differential treatment because of biased historical patterns of care and resource allocation in the US.22
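One practical safeguard implied by this concern is provenance tracking: flagging which notes were AI generated and whether a human reviewed them, then excluding unreviewed AI output from any future training or fine-tuning corpus. The minimal sketch below illustrates this idea; the record fields (ai_generated, human_reviewed) are hypothetical assumptions, not drawn from any EHR standard.

```python
# Minimal sketch of a provenance check to keep unreviewed AI-generated notes out
# of future training data, one way to interrupt the feedback loop described above.
# The record structure is a hypothetical illustration, not an EHR standard.
from dataclasses import dataclass

@dataclass
class NoteRecord:
    note_id: str
    text: str
    ai_generated: bool    # note drafted by an LLM-based tool
    human_reviewed: bool  # a clinician verified or edited the content

def eligible_for_training(note: NoteRecord) -> bool:
    """Exclude AI-generated notes that were never reviewed by a human."""
    return not note.ai_generated or note.human_reviewed

corpus = [
    NoteRecord("a1", "clinician-authored note ...", ai_generated=False, human_reviewed=True),
    NoteRecord("a2", "AI draft signed off by a clinician ...", ai_generated=True, human_reviewed=True),
    NoteRecord("a3", "AI draft never reviewed ...", ai_generated=True, human_reviewed=False),
]

training_notes = [n for n in corpus if eligible_for_training(n)]
print([n.note_id for n in training_notes])  # ['a1', 'a2']
```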
Lack of well-defined metrics to measure output quality. The field lacks established metrics for the quality, fairness, and accuracy of outputs.23 For instance, confabulation (also commonly referred to as hallucination or delusion)24 is a common problem in which LLMs generate content that is not grounded in real-world data. No standardized method exists to measure confabulation in health care–focused LLMs beyond human evaluation, which can be onerous and resource-intensive. The risks increase when LLM confabulations incorporate and/or propagate societal biases such as racial prejudices, as these could further entrench discriminatory practices within the health care system.
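As a hypothetical illustration of what a standardized confabulation metric might build on, the sketch below aggregates human-assigned confabulation labels for LLM-generated summaries and compares rates across patient language groups; the data schema, column names, and toy values are assumptions, not an existing benchmark.

```python
# Minimal sketch (not an established standard): comparing human-rated
# confabulation rates for LLM-generated summaries across patient language groups.
# Column names and data values are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

# Each row: one LLM-generated summary reviewed by a human rater.
reviews = pd.DataFrame({
    "summary_id": range(8),
    "patient_language": ["English", "English", "English", "Spanish",
                         "Spanish", "Cantonese", "Cantonese", "English"],
    "confabulation_flag": [0, 1, 0, 1, 1, 0, 1, 0],  # 1 = rater found unsupported content
})

# Confabulation rate and sample size by language group.
rates = reviews.groupby("patient_language")["confabulation_flag"].agg(["mean", "count"])
print(rates)

# Simple test of whether flag frequency differs across groups
# (real evaluations would need far larger samples and adjusted analyses).
table = pd.crosstab(reviews["patient_language"], reviews["confabulation_flag"])
chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```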
Recommendations From Key Regulations and Guidelines
Professional, national, and international guidelines also exist to address these issues and other equity-related challenges in the US. The Table13,25-31 summarizes key regulations and guidelines for AI in health care that are applicable to LLMs, with a focus on those with direct implications for equity assessment, and notes their strengths and limitations. (All information is current as of January 19, 2025.) These guidelines and regulations seek to advance responsible and safe development and implementation of LLM tools that work well for patients of diverse linguistic and racial/ethnic/national backgrounds while respecting privacy and enhancing the quality and efficiency of health care delivery.13,25-31 Although these regulations and guidelines are not specific to LLMs, many include items directly relevant and applicable to LLM design, implementation, and maintenance. They share the following recommendations: (1) ensure diverse representation in training data and in the teams that develop AI tools; (2) develop techniques to evaluate AI-enabled health care tool performance against real-world data; (3) ensure that AI used in health care is free of discrimination and integrates equity principles; (4) take meaningful steps to ensure access for patients with limited English proficiency; (5) apply AI tools to make workplaces more efficient and reduce administrative burdens; (6) require human oversight of AI tools used in health care delivery; and (7) ensure AI tools are safe, accessible, and beneficial while respecting privacy.
Proposed Strategies to Ensure Equity in the Era of LLMs
To translate these conceptual goals into practical strategies that mitigate these issues and preempt future health care inequities stemming from the design, implementation, and maintenance of LLMs, we propose the following.
First, health care organization leadership and administration should demand that LLM developers incorporate diverse patient voices in design, evaluation, and implementation to mitigate representation biases and build trust.8,27 When integrating diverse patient voices, developers should ensure adequate representation of perspectives by eliciting feedback from patients across racial/ethnic groups, age groups, genders, and levels of English language proficiency (including non-English speakers), as well as from patients with disabilities.
Second, LLM-based tools must be continuously evaluated throughout their life cycle. Special attention should be paid to tools purchased from outside vendors whose performance metrics are opaque or not publicly released. Creating industry standards for basic metrics to evaluate equity-related performance would provide data to understand potential impacts on care delivery and create shared targets for academic-industrial collaboration. Efforts to develop some of these metrics include a modified 9-item Physician Documentation Quality Instrument for grading summaries generated by ambient AI scribes2,32 and a scoring rubric for clinical implementation of AI algorithms developed primarily within radiology, the specialty with the most FDA-cleared AI tools.30,33,34 However, no single metric has achieved widespread adoption, and none are tailored specifically for LLMs. There remains a need to develop standardized assessments of LLM performance across patient groups and settings.
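As one possible shape such a standardized, stratified assessment could take, the following sketch tracks total scores from a 9-item documentation-quality rubric (in the spirit of a modified PDQI-9, though not the validated instrument itself) by month and patient language group, flagging months in which any group falls below an assumed threshold. Column names, scale, and threshold are illustrative assumptions.

```python
# Minimal sketch of continuous, stratified monitoring of rubric scores
# (eg, a modified 9-item documentation-quality instrument) for AI scribe notes.
# Column names, scale, and thresholds are hypothetical assumptions.
import pandas as pd

item_cols = [f"item_{i}" for i in range(1, 10)]  # 9 rubric items, each scored 1-5

scores = pd.DataFrame({
    "note_id": [101, 102, 103, 104],
    "month": ["2025-01", "2025-01", "2025-02", "2025-02"],
    "patient_language": ["English", "Spanish", "English", "Spanish"],
    **{col: [4, 3, 5, 3] for col in item_cols},  # toy ratings
})

# Total rubric score per note, then mean by month and patient group.
scores["total_score"] = scores[item_cols].sum(axis=1)
trend = scores.groupby(["month", "patient_language"])["total_score"].mean().unstack()
print(trend)

# Flag months in which any group falls below an (assumed) minimum acceptable mean.
MIN_ACCEPTABLE = 36  # hypothetical threshold out of 45
print(trend[trend.lt(MIN_ACCEPTABLE).any(axis=1)])
```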
Third, equity metrics should include measurement of LLM usage across protected identity classes (eg, gender, language) and examination of whether adoption is linked to differences in downstream outcomes for patients’ health and care team workplace burden.23
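A hypothetical example of such an equity metric is sketched below: usage rates of an ambient AI scribe by patient language group, alongside a crude downstream proxy for care team burden (after-hours documentation time) stratified by group and tool use. Field names and values are assumptions, and a real analysis would require risk adjustment, clinical input, and far larger samples.

```python
# Minimal sketch of one possible equity metric: LLM tool usage rates by patient
# group, plus a crude look at a downstream proxy outcome. All fields are hypothetical.
import pandas as pd

encounters = pd.DataFrame({
    "encounter_id": range(6),
    "patient_language": ["English", "English", "Spanish", "Spanish", "English", "Spanish"],
    "ai_scribe_used": [1, 1, 0, 1, 0, 0],
    "after_hours_documentation_min": [10, 8, 25, 12, 22, 30],  # proxy for workplace burden
})

# Usage rate of the LLM-based tool by patient group.
usage = encounters.groupby("patient_language")["ai_scribe_used"].mean()
print(usage)

# Downstream proxy outcome by group and by whether the tool was used.
burden = encounters.groupby(["patient_language", "ai_scribe_used"])[
    "after_hours_documentation_min"
].mean().unstack()
print(burden)
```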
Finally, potential biases may be introduced into LLMs throughout their development and life cycle as new inputs are used to train the model over time.18 Currently, no standardized solution incorporates the many proposed human and technological strategies for verifying data from multiple sources, including the tool itself (eg, human editing of LLM-generated notes and summaries), outcome metrics (patient health outcomes and quality of care), and qualitative feedback from care teams and patients.
Conclusions
Clinicians and health care organizations face a major inflection point regarding the integration of LLMs into health care delivery. As health care adopts more AI tools, we risk further exacerbating inequities in health care delivery and outcomes across protected groups if we ignore LLM-specific challenges. This goes beyond simply increasing access to and usage of LLMs. It includes being transparent with patients about the capabilities and use of LLMs, both to prevent further sociotechnical structural inequities and to build trust in the health care system. This is an opportunity to prevent exacerbating existing disparities in the health care system by enhancing AI-related health equity through inclusive training data, development, and assessment of health care–focused LLMs. We must meet this moment by ensuring equity and fairness in the design, implementation, maintenance, and evaluation of these AI models.
Author Affiliations: Kaiser Permanente Northern California Division of Research (AAT, MER, RWG, VXL), Pleasanton, CA; Department of Diagnostic Radiology & Nuclear Medicine, University of Maryland School of Medicine (FXD), Baltimore, MD; University of Maryland Institute for Health Computing (FXD), Bethesda, MD; Department of Health, Society, & Behavior, UC Irvine Joe C. Wen School of Population & Public Health (DDP), Irvine, CA.
Source of Funding: This work was supported in part by the Association of Academic Radiology Clinical Effectiveness in Radiology Research Academic Fellowship Award. This work was also partially supported by a grant from the Johns Hopkins Mid-Atlantic Center for Cardiometabolic Health Equity (MACCHE). MACCHE is supported by the National Institute on Minority Health and Health Disparities of the National Institutes of Health (NIH) under award No. P50MD017348. This work is also supported in part by NIH award No. R35GM128672.
The content is solely the responsibility of the authors and does not necessarily represent the official views of MACCHE or the NIH. This work was also supported in part by the University of Maryland Baltimore Institute for Clinical & Translational Research K12 Award.
Author Disclosures: Drs Tierney, Reed, Grant, and Liu are employed by Kaiser Permanente, which deploys large language model–based health care tools for care delivery. Dr Doo has received grants and honoraria for lectures on health equity and artificial intelligence, although there is no link or conflict of interest with this article. Dr Doo also receives cloud credits from Microsoft Azure, Amazon Web Services, and Google Cloud Computing; however, no cloud computing was used for this review. Dr Doo is also supported by Montgomery County, Maryland, and The University of Maryland Strategic Partnership: MPowering the State, a formal collaboration between the University of Maryland, College Park, and the University of Maryland, Baltimore. Dr Payán reports no relationship or financial interest with any entity that would pose a conflict of interest with the subject matter of this article.
Authorship Information: Concept and design (AAT, FXD, VXL); analysis and interpretation of policy recommendations (DDP); drafting of the manuscript (AAT, MER, RWG, FXD, DDP, VXL); critical revision of the manuscript for important intellectual content (AAT, MER, RWG, FXD, DDP, VXL); administrative, technical, or logistic support (MER, RWG, FXD, VXL); and supervision (MER, RWG, VXL).
Address Correspondence to: Aaron A. Tierney, PhD, Kaiser Permanente Northern California Division of Research, 4480 Hacienda Dr, Pleasanton, CA 94588. Email: aaron.a.tierney@kp.org.
REFERENCES
1. Wu K, Wu E, Theodorou B, et al. Characterizing the clinical adoption of medical AI devices through U.S. insurance claims. NEJM AI. 2023;1(1). doi:10.1056/AIoa2300030
2. Tierney AA, Gayre G, Hoberman B, et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catal Innov Care Deliv. 2024;5(3). doi:10.1056/cat.23.0404
3. Rao A, Pang M, Kim J, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv. Published online February 26, 2023. doi:10.1101/2023.02.21.23285886
4. Blank IA. What are large language models supposed to model? Trends Cogn Sci. 2023;27(11):987-989. doi:10.1016/j.tics.2023.08.006
5. Doo FX, Cook TS, Siegel EL, et al. Exploring the clinical translation of generative models like ChatGPT: promise and pitfalls in radiology, from patients to population health. J Am Coll Radiol. 2023;20(9):877-885. doi:10.1016/j.jacr.2023.07.007
6. Abràmoff MD, Tarver ME, Loyo-Berrios N, et al; Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group of the Collaborative Community for Ophthalmic Imaging Foundation, Washington, D.C. Considerations for addressing bias in artificial intelligence for health equity. NPJ Digit Med. 2023;6(1):170. doi:10.1038/s41746-023-00913-9
7. Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6(1):e12-e22. doi:10.1016/S2589-7500(23)00225-X
8. Adams SJ, Tang R, Babyn P. Patient perspectives and priorities regarding artificial intelligence in radiology: opportunities for patient-centered radiology. J Am Coll Radiol. 2020;17(8):1034-1036. doi:10.1016/j.jacr.2020.01.007
9. Haltaufderheide J, Ranisch R. The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs). NPJ Digit Med. 2024;7(1):183. doi:10.1038/s41746-024-01157-x
10. Small WR, Wiesenfeld B, Brandfield-Harvey B, et al. Large language model-based responses to patients’ in-basket messages. JAMA Netw Open. 2024;7(7):e2422399. doi:10.1001/jamanetworkopen.2024.22399
11. Wendler C, Veselovsky V, Monea G, West R. Do llamas work in English? on the latent language of multilingual transformers. arXiv. Published online February 16, 2024. doi:10.48550/arXiv.2402.10588
12. The National Board of Certification for Medical Interpreters. Accessed October 30, 2024. https://www.certifiedmedicalinterpreters.org/
13. Office for Civil Rights, Office of the Secretary, HHS; CMS, HHS. Nondiscrimination in health programs and activities. Fed Regist. 2024;89(88):37522-37703.
14. Liang W, Yuksekgonul M, Mao Y, Wu E, Zou J. GPT detectors are biased against non-native English writers. Patterns (N Y). 2023;4(7):100779. doi:10.1016/j.patter.2023.100779
15. Gadiraju V, Kane S, Dev S, et al. “I wouldn’t say offensive but...”: disability-centered perspectives on large language models. In: FAccT ’23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery; 2023:205-216.
16. Cuevas AG, O’Brien K, Saha S. African American experiences in healthcare: “I always feel like I’m getting skipped over.” Health Psychol. 2016;35(9):987-995. doi:10.1037/hea0000368
17. Lagu T, Haywood C, Reimold K, DeJong C, Walker Sterling R, Iezzoni LI. ‘I am not the doctor for you’: physicians’ attitudes about caring for people with disabilities. Health Aff (Millwood). 2022;41(10):1387-1395. doi:10.1377/hlthaff.2022.00475
18. Tejani AS, Ng YS, Xi Y, Rayan JC. Understanding and mitigating bias in imaging artificial intelligence. Radiographics. 2024;44(5):e230067. doi:10.1148/rg.230067
19. Lee MK, Rich K. Who is included in human perceptions of AI?: trust and perceived fairness around healthcare AI and cultural mistrust. In: CHI ’21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery; 2021:1-14.
20. Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, Gal Y. AI models collapse when trained on recursively generated data. Nature. 2024;631(8022):755-759. doi:10.1038/s41586-024-07566-y
21. Dratsch T, Chen X, Rezazade Mehrizi M, et al. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology. 2023;307(4):e222176. doi:10.1148/radiol.222176
22. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
23. Gichoya JW, McCoy LG, Celi LA, Ghassemi M. Equity in essence: a call for operationalising fairness in machine learning for healthcare. BMJ Health Care Inform. 2021;28(1):e100289. doi:10.1136/bmjhci-2020-100289
24. Daneshvar N, Pandita D, Erickson S, Snyder Sulmasy L, DeCamp M; ACP Medical Informatics Committee and the Ethics, Professionalism and Human Rights Committee. Artificial intelligence in the provision of health care: an American College of Physicians policy position paper. Ann Intern Med. 2024;177(7):964-967. doi:10.7326/M24-0146
25. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. The White House. October 30, 2023. Accessed July 17, 2024. https://web.archive.org/web/20240717103722/https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
26. Adams L, Fontaine E, Lin S, Crowell T, Chung VCH, Gonzales AA. Artificial intelligence in health, health care, and biomedical science: an AI code of conduct principles and commitments discussion draft. NAM Perspect. 2024. doi:10.31478/202403a
27. Cordovano G, deBronkart D, Downing A, et al. AI Rights for Patients. Light Collective. March 22, 2024. Accessed July 17, 2024. https://lightcollective.org/wp-content/uploads/2024/03/Collective-Digital-Rights-For-Patients_v1.0.pdf
28. Harnessing artificial intelligence for health. World Health Organization. Accessed July 17, 2024. https://www.who.int/teams/digital-health-and-innovation/harnessing-artificial-intelligence-for-health
29. CHAI assurance standards guide: AI that serves all of us. Coalition for Health AI. Accessed July 17, 2024. https://chai.org/wp-content/uploads/2024/06/CHAI_AssuranceGuide_062624.pdf
30. Brady AP, Allen B, Chong J, et al. Developing, purchasing, implementing and monitoring AI tools in radiology: practical considerations. A multi-society statement from the ACR, CAR, ESR, RANZCR & RSNA. J Am Coll Radiol. 2024;21(8):1292-1310. doi:10.1016/j.jacr.2023.12.005
31. Office of the Spokesperson. United Nations General Assembly adopts by consensus U.S.-led resolution on seizing the opportunities of safe, secure and trustworthy artificial intelligence systems for sustainable development: fact sheet. US Department of State. Accessed October 18, 2024. https://2021-2025.state.gov/united-nations-general-assembly-adopts-by-consensus-u-s-led-resolution-on-seizing-the-opportunities-of-safe-secure-and-trustworthy-artificial-intelligence-systems-for-sustainable-development/
32. Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing electronic note quality using the Physician Documentation Quality Instrument (PDQI-9). Appl Clin Inform. 2012;3(2):164-174. doi:10.4338/aci-2011-11-ra-0070
33. Daye D, Wiggins WF, Lungren MP, et al. Implementation of clinical artificial intelligence in radiology: who decides and how? Radiology. 2022;305(3):555-563. doi:10.1148/radiol.212151
34. Larson DB, Doo FX, Allen B Jr, Mongan J, Flanders AE, Wald C. Proceedings from the 2022 ACR-RSNA workshop on safety, effectiveness, reliability, and transparency in AI. J Am Coll Radiol. 2024;21(7):1119-1129. doi:10.1016/j.jacr.2024.01.024