Abstract

Since the release of ChatGPT, large language models (LLMs) have rapidly expanded into professional domains, including medicine. These models, trained on extensive text corpora, including the medical literature, have demonstrated remarkable capabilities in tasks such as clinical decision support, research assistance, and education. This review focuses on LLM applications in otolaryngology-head and neck surgery (Ear, Nose, and Throat [ENT]). We analyzed 25 studies published between January 2022 and March 2025 in ENT journals ranked in the top 25% (Q1) according to the 2023 Journal Citation Reports. Furthermore, we categorized these studies by use case and systematically examined the models, datasets, and evaluation methods employed. Despite increasing adoption of LLMs in the ENT field, several challenges remain, including limited model diversity, inconsistent evaluation standards, and ongoing issues with accuracy and fairness. We also contextualized LLM research trends within the broader medical domain. Five key areas were identified for advancing clinical-grade LLMs: robust evaluation frameworks, external source-based generation, multimodal integration, agent-based reasoning, and model explainability. Our findings provide ENT clinicians and researchers with a practical foundation for understanding, evaluating, and implementing LLMs or their advanced successors (e.g., large multimodal models, agents) in clinical and research settings.
INTRODUCTION

Since the public release of ChatGPT by OpenAI in November 2022 [1], chat-based artificial intelligence (AI) models have become integral to daily life. OpenAI’s ChatGPT has surpassed 400 million active weekly users [2]. These models are not only used for casual conversation and information retrieval but have also found wide application in professional fields, including code generation, data analysis, report writing, and education [3-5].
Such models, including ChatGPT [1], Gemini [6], and LLaMA [7], are collectively termed large language models (LLMs) because they are trained on massive text corpora spanning terabytes in size and consist of billions to trillions of parameters (see Table 1 for a glossary of terms). Notably, the extensive corpora used for pretraining include publicly accessible medical textbooks, research papers, and clinical documents, enabling these models to acquire substantial domain-specific medical knowledge [8].
Building on this foundation, the development of domain-specialized LLMs tailored to medicine has accelerated. While early medical LLMs, such as BioGPT [9], DRAGON [10], LinkBERT [11], BiomedBERT [12], and PubMedGPT [13], were trained on biomedical corpora using relatively small base models, recent evidence suggests that general-purpose LLMs trained on large-scale datasets outperform these specialized models [8]. According to the Hugging Face leaderboard [14], current state-of-the-art (SOTA) models in the medical domain are either general LLMs (e.g., GPT-4 and GPT-4o) or general models further refined with instruction tuning and additional medical datasets (e.g., Med-PaLM 2 [8] and OpenBioLLM-70B [15]). Fig. 1 presents a comparison of model performance on the MedQA dataset [16]. Data were obtained from the Hugging Face Open Medical LLM leaderboard [14,17], which primarily features academic models, and the MedQA leaderboard from VALS.ai [18], which showcases leading industry models.
Evaluating the clinical competence of LLMs remains a complex challenge. No single, definitive metric can capture a model’s proximity to an ideal “clinical expert.” Instead, a high-performing model must demonstrate a blend of skills, including medical comprehension, knowledge retrieval, and logical clinical reasoning [17]. Therefore, most benchmarking efforts assess accuracy across multiple-choice question datasets, such as licensing exam-style evaluations (MedQA [16], MedMCQA [19]) and domain-specific tests (PubMedQA [20] and Massive Multitask Language Understanding [21] subsets, e.g., medical genetics and college biology).
Recent models have significantly outperformed average medical students on these benchmarks [22]. Moreover, attempts have been made to apply these models in clinical settings. UC San Diego Health is piloting GPT-4 integration into Epic’s MyChart to improve patient communication [23]. Other centers are testing virtual-first models in which LLMs assist with triage and diagnosis [24]. Google has also introduced the Articulate Medical Intelligence Explorer, an AI research system for diagnostic reasoning [25].
In this rapidly evolving landscape, understanding the specific utilization of LLMs within clinical specialties is essential. This study focused on otolaryngology–head and neck surgery (Ear, Nose, and Throat [ENT]), conducting a comprehensive analysis of recent developments. Specifically, we reviewed and analyzed 25 studies published between January 2022 and March 2025 in journals ranked in the top 25% (Q1) according to the 2023 Journal Citation Reports [26] in the field of otolaryngology. We categorized these studies based on LLM use cases, such as clinical decision support, medical research assistance, and educational applications, and examined the models employed, datasets used, and evaluation methods in detail.
Our analysis revealed that, although various LLMs have been applied across a broad spectrum of ENT-related tasks, several limitations persist. These include limited model diversity, inconsistent evaluation metrics, and insufficient validation with real-world clinical data. To address these gaps and guide future development, we discuss the main limitations of current LLMs and highlight ongoing research aimed at overcoming these challenges. This discussion is organized around five core dimensions critical for the advancement of clinical-grade LLMs: evaluation frameworks, factual accuracy, multimodal integration, agent-based systems, and explainability.
This study aims to provide greater clarity regarding the application of rapidly advancing medical LLMs in the field. We hope this encourages broader adoption and further development of LLM-powered tools for clinical, research, and educational contexts within otolaryngology.
OVERVIEW OF LLM DEVELOPMENT

We provide a high-level overview of the development of LLMs, with a particular focus on their application strengths. All modern LLMs are built on the Transformer architecture [27], first introduced in 2017. The core innovation of the Transformer is the self-attention mechanism, which enables the model to identify the most important words in a sentence for understanding its meaning. For example, in the sentence, “She prescribed the medication because the patient was in pain,” the model recognizes that “she” refers to the doctor and that “because” signifies a causal relationship. This capacity to weigh word importance is crucial for understanding and generating contextually coherent language.
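To make the mechanism concrete, the following is a minimal self-attention sketch in Python with NumPy; the token count, embedding size, and random weights are illustrative only, not drawn from any model discussed in this review.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise relevance of each token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights  # context-mixed representations, attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
```

Each row of the attention map `w` sums to 1 and encodes how strongly one token attends to every other token; it is this weighting that lets the model link a pronoun such as “she” to its referent.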
Initially, the transformer architecture utilized both an encoder (to process input) and a decoder (to generate output), which proved highly effective for tasks such as translation. However, as LLMs were increasingly applied to a broader range of tasks—such as question answering—research demonstrated that a decoder-only architecture offers greater efficiency. This led to the development of models like GPT-2 [28] and GPT-3 [1], which exclusively focus on generating text that is coherent and contextually appropriate.
Two primary types of transformer-based models are in common use. (1) Encoder-only models (e.g., BERT [29]) are trained using masked language modeling (MLM), in which certain words within a sentence are masked (hidden) and the model learns to predict them. This approach suits tasks such as classification or sentence understanding, where the objective is to interpret meaning rather than to generate new text. (2) Decoder-only models (e.g., GPT), in contrast, are trained using next-token prediction (NTP), in which the model learns to predict the next word in a sequence, one word at a time. As a result, decoder models are ideal for text generation tasks, including composing discharge summaries or answering patient queries.
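The two training objectives can be illustrated with a toy sentence (a hypothetical example, not drawn from any dataset cited here):

```python
tokens = ["the", "patient", "reported", "hearing", "loss"]

# Masked language modeling (encoder-style, e.g., BERT):
# hide a token and predict it from context on BOTH sides.
mlm_input = ["the", "patient", "[MASK]", "hearing", "loss"]
mlm_target = "reported"

# Next-token prediction (decoder-style, e.g., GPT):
# predict each token from its LEFT context only, one step at a time.
ntp_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g., the second pair is (["the", "patient"], "reported")
```

The MLM example explains why encoder models excel at understanding (they see the whole sentence), while the NTP pairs show why decoder models naturally generate text left to right.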
After a model is trained with NTP, it does not inherently understand how to follow explicit instructions such as “summarize this paragraph” or “suggest a differential diagnosis.” To address this, a process called instruction tuning is implemented, in which the model is exposed to numerous examples of tasks paired with specific prompts and suitable responses. For example, a prompt may state, “You are a medical assistant. Summarize the following patient notes,” followed by a clinical note. Through this process, the model learns how to respond appropriately and to adopt the expected style from human users. Instruction tuning is thus essential for transforming a general-purpose language model into a practical assistant in clinical or research settings and laying the groundwork for further refinement and alignment with complex human preferences.
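A single instruction-tuning example might be structured as below; the field names and the prompt template are illustrative assumptions, not the actual format of any specific model.

```python
# One instruction-tuning example: a task instruction, optional input,
# and the desired response the model is trained to produce.
example = {
    "instruction": "You are a medical assistant. Summarize the following patient notes.",
    "input": "62-year-old male with 3 weeks of hoarseness; 30 pack-year smoker.",
    "output": "Summary: 62M smoker with subacute hoarseness warranting laryngoscopy.",
}

def format_example(ex):
    """Concatenate the fields into the single text sequence used for fine-tuning."""
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Input:\n{ex['input']}\n\n"
        f"### Response:\n{ex['output']}"
    )
```

Training on many such formatted sequences is what teaches a base model to treat the instruction as a task specification rather than as text to be continued arbitrarily.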
Reinforcement learning from human feedback (RLHF) is often employed to further refine a model’s ability to generate honest and helpful responses [30]. While instruction tuning aligns the model with the expected formats of specific tasks, RLHF goes a step further by tailoring outputs to human preferences [31]. In RLHF, human annotators rank multiple LLM-generated responses for a given prompt based on quality, relevance, and safety. This data is then used to train a reward model that predicts human preferences, which is subsequently used to fine-tune the LLM through reinforcement learning [32]. This iterative process allows the model to capture nuanced aspects of human judgment—a feature especially important in sensitive domains like ENT, where precision and patient safety are paramount [33,34].
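The reward-model step can be sketched as a pairwise (Bradley-Terry style) loss: the human-preferred response should receive a higher scalar reward than the rejected one. The reward values below are made up for illustration.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)). The loss is small when the
    reward model already scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A ranked pair from annotators: the safer, more accurate answer should score higher.
loss_agree = preference_loss(r_chosen=2.0, r_rejected=-1.0)    # small: model agrees with humans
loss_disagree = preference_loss(r_chosen=-1.0, r_rejected=2.0)  # large: model disagrees
```

Once trained on many such comparisons, the reward model serves as a stand-in for human judgment during the reinforcement learning stage.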
Even LLMs trained with RLHF may encounter challenges such as hallucinations [35], in which they produce plausible-sounding but factually incorrect information [36]. To address these issues, various grounding techniques have been proposed that leverage external resources, including documents [37], images [38], and videos [39]. For instance, retrieval-augmented generation (RAG) enables LLMs to determine when supplementary information is needed and retrieve verified data from external sources, typically via vector similarity-based searches in databases [40]. Researchers are also exploring the integration of pretrained models with images and videos, allowing LLMs to utilize multiple data modalities beyond text [41,42]. These advances are especially promising for healthcare applications, where accuracy is essential.
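The vector similarity search at the core of RAG can be sketched in a few lines; the two-dimensional “embeddings” below are toy stand-ins for the high-dimensional vectors a real text encoder would produce over, say, guideline snippets.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity of each document to the query
    return np.argsort(-sims)[:k]      # indices of the k closest documents

# Toy embeddings standing in for encoded document chunks.
docs = np.array([[0.9, 0.1],
                 [0.1, 0.9],
                 [0.7, 0.3]])
query = np.array([1.0, 0.0])
top = retrieve(query, docs, k=2)      # the chunks that would ground the LLM's answer
```

The retrieved chunks are then prepended to the prompt so the model generates its answer from verified text rather than from parametric memory alone.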
CURRENT RESEARCH TRENDS IN THE APPLICATIONS OF LLMs IN OTOLARYNGOLOGY-HEAD AND NECK SURGERY

The integration of LLMs into ENT is accelerating rapidly, with an expanding body of research investigating their potential across a range of clinical and non-clinical domains. To better characterize the current research landscape, we reviewed studies published between January 2022 and March 2025 in journals ranked in the top 25% (Q1) in the otolaryngology category, as determined by the 2023 edition of the Journal Citation Reports [26]. Studies were identified using a fixed set of keywords: artificial intelligence, AI, GPT, ChatGPT, LLMs, and LLM. Studies that employed AI techniques without language models, or that focused solely on administrative automation in healthcare, were excluded. Based on this screening process, 34 studies were initially identified; of these, nine were excluded, yielding a final set of 25 studies for analysis. Among the selected journals, Otolaryngology-Head and Neck Surgery published the greatest number of LLM-based studies in the ENT domain, with 11 publications. Supplementary Table 1 provides a comprehensive list of the journals and articles that were included.
The selected studies were organized into three major application areas: (1) clinical decision support, (2) medical research assistance, and (3) educational applications. For each category, we examined the current applications of LLMs in ENT, identified common limitations, and highlighted their potential for supporting various specialized tasks. The relative distribution of studies in these areas is summarized in Fig. 2A. Further analysis of model types, modalities, inference settings, and methodological trends is presented in section “Future Perspectives” (Fig. 2B-I) and linked to key challenges discussed later in this paper.
Clinical decision support

LLMs are increasingly being used as clinical decision support tools in ENT, with research broadly categorizing their use as either information search engines or autonomous decision-making agents. The accuracy and readability of LLM-generated responses to clinical questions have been evaluated. For example, ChatGPT outperformed traditional web searches in providing postoperative instructions for common ENT operations [43] and in simplifying general medical information [44]. However, it continues to struggle with personalized or in-depth clinical recommendations. For instance, in the context of benign paroxysmal positional vertigo, its responses were rated lower in readability and informational quality [45]. This suggests that LLMs can be valuable for patients with limited health literacy but have limited utility as standalone medical information tools [46-50].
RAG, which integrates verified external databases, has been adopted to enhance the reliability of LLMs. In otolaryngology, for example, ChatENT—a RAG-based platform that uses indexed open-access resources—produces more accurate and context-aware responses [51]. Similarly, another chatbot grounded in International Consensus on Allergy and Rhinology: Rhinosinusitis (ICAR-RS) guidelines demonstrated improved clinical relevance [52]. While these examples highlight the benefits of leveraging expert-verified databases, most current ENT research relies on relatively small knowledge sources. This constrains our understanding of how database diversity influences RAG performance. Considering evidence from the natural language processing (NLP) field that factors such as document chunk granularity, indexing density, and retriever type significantly affect performance [53], further research is required to identify optimal database structures for medical applications.
LLMs have also been assessed as autonomous agents for clinical decision-making. Trecca et al. [54] showed that five LLMs produced judgments comparable to those of medical experts in standardized clinical cases, but limitations remained in addressing patient-centered questions and providing justifications. ChatGPT-3.5 has delivered expert-level recommendations regarding surgical indications and alternatives in ENT procedures [55-57]. However, persistent challenges include outdated knowledge [58], inadequate follow-up recommendations [59], and inaccuracies for specific indications [60]. Multimodal LLMs such as ChatGPT-4o are now being evaluated for their ability to interpret complex medical data, including diagnosing laryngeal malignancies from fiberoptic videos [61] and differential diagnosis from videolaryngostroboscopy images [62]. Despite this, these models remain primarily text-dependent and struggle with non-textual inputs, raising concerns about whether their outputs reflect genuine clinical reasoning. While LLMs may substitute for certain tasks in constrained settings [63], ongoing concerns regarding reliability and safety limit their broader adoption in real-world clinical practice [64,65].
Medical research assistance

LLMs are showing increasing promise as assistants in medical research, particularly in otolaryngology. Studies have examined their utility at various stages of the research workflow. For example, ChatGPT-4 has demonstrated effectiveness in proofreading academic manuscripts, accurately suggesting revisions for most errors, although human oversight remains essential for more challenging error types [66]. When generating references, both ChatGPT-3.5 and 4 frequently produced inaccurate or fabricated citations, highlighting the ongoing need for careful validation due to hallucination risk [67]. Beyond editing and referencing, LLMs have also been evaluated for more complex tasks. Dang and Hanba [68] reported that ChatGPT-3.5 could assess the methodological quality of head and neck surgery articles, indicating potential for preliminary peer review. Additionally, ChatGPT was used to generate new research ideas on dysphagia; experts received these ideas moderately well, despite some limitations in creativity [69]. These findings indicate that LLMs can provide valuable assistance in editing, preliminary evaluation, and ideation; however, outputs should always be critically reviewed by human researchers rather than treated as a replacement for human judgment.
Educational applications

LLMs are being applied in medical education for both professional development and patient education. They have also demonstrated the capacity to improve the readability and accessibility of health information, especially for those with limited health literacy. Studies have found that ChatGPT can simplify educational materials to a seventh-grade reading level without compromising accuracy [70,71]. LLMs have also performed well in patient counseling for rhinoplasty [72], head and neck cancer surgeries [73], and sinusitis management [74]. Nonetheless, inconsistencies remain: for example, some oropharyngeal cancer education materials exceeded the 11th-grade reading level [75]. Trustworthiness is also a consideration; while ENT specialists found ChatGPT’s responses to be 98.3% accurate, only 79.8% of laypersons reported trust in AI-generated information [76]. This suggests that LLM effectiveness in patient education should be evaluated not only for clarity but also for patient trust.
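Reading-level claims such as “seventh-grade” are typically grounded in formulas like the Flesch-Kincaid grade level; a rough sketch follows, using a simple vowel-group heuristic for syllable counting (real readability tools use more careful syllabification, so treat these numbers as approximations).

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per run of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

simple = "The doctor will check your ears today."
complex_text = ("Postoperative pharmacological management necessitates "
                "meticulous adherence to guidelines.")
```

Short sentences with short words (`simple`) score at an early-elementary grade, whereas dense clinical phrasing (`complex_text`) scores far above the recommended level for patient materials, which is the gap the simplification studies above measure.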
For medical trainee education, LLMs have been evaluated for residency and board exam preparation. ChatGPT correctly answered 57% of questions from the German Otolaryngology Society training platform, suggesting only a limited supplementary role [77]. While it produced plausible responses for a head and neck surgery question bank, its reliability was still insufficient for clinical education [78]. ChatGPT achieved a passing score of 75% on official sample questions from the Royal College of Physicians and Surgeons of Canada, although partial correctness in some answers raises concerns about over-reliance [79].
CHALLENGES IN APPLYING LLMs TO OTOLARYNGOLOGY–HEAD AND NECK SURGERY: INSIGHTS FROM RECENT RESEARCH TRENDS

Despite growing interest in the application of LLMs across medical domains, their integration into clinical practice remains limited by several critical challenges. In this section, we examine the key obstacles that hinder the adoption of LLMs within clinical workflows.
A key limitation is that LLM applications in ENT have largely been evaluated in narrow methodological settings, which may restrict their generalizability to broader clinical contexts. As shown in Fig. 2B-F, most studies used zero-shot inference without retrieval mechanisms, relied on single-modality inputs, and predominantly adopted GPT-based models. Furthermore, none of the studies incorporated multiple agents or explainable AI (XAI) approaches to enhance trustworthiness. Expanding research to include multimodal inputs, RAG, external knowledge sources, XAI, and alternative model architectures could produce insights that are more representative of real-world clinical complexity.
Another significant limitation is the lack of standardized evaluation methods. Approximately 78% of the studies reviewed relied primarily on human evaluation (Fig. 2G). Nearly 80% of studies used fewer than 100 clinical cases or questions, which limits the statistical robustness of their findings (Fig. 2H). In addition, most studies used custom datasets rather than standardized, certified datasets (such as board exams), further undermining the generalizability of results (Fig. 2I). These limitations underscore the urgent need for more rigorous, large-scale, and standardized evaluation frameworks to validate the clinical utility of LLMs in ENT.
A third major challenge concerns hallucinations and the risk of outdated knowledge. LLMs are known to occasionally generate factually incorrect or fabricated responses—so-called hallucinations—even when prompted with medically relevant queries. Moreover, unless LLMs are routinely retrained on the latest medical literature or integrated with dynamic external knowledge sources, they risk producing responses based on outdated guidelines or obsolete clinical practices. This temporal staleness further compromises their reliability, particularly in rapidly evolving fields such as otolaryngology.
Fourth, most existing LLMs lack the capacity for comprehensive multimodal reasoning, which is often necessary in real-world clinical scenarios. Clinicians regularly synthesize information from diverse sources—structured lab data, radiologic images, free-text notes, audio recordings, and procedural videos. However, most LLMs currently used in research are unimodal and text-based, making them inadequate for integrated clinical decision-making. While recent advances have introduced multimodal architectures, their application to complex otolaryngologic workflows remains limited. Consequently, LLMs struggle to replicate the diagnostic reasoning and contextual understanding required in real-world cases.
Fifth, there remains a considerable gap between the reasoning abilities of LLMs and the cognitive demands of clinical practice. Clinical reasoning is highly nuanced, context-dependent, and often involves counterfactual thinking, longitudinal analysis, and tacit knowledge—areas in which current LLMs still underperform. For instance, although LLMs can retrieve and summarize textbook knowledge or answer straightforward factual questions, they frequently struggle with diagnostic ambiguity and reconciling conflicting information over time. This shortfall limits their usefulness in high-stakes clinical decision-making, where subtle judgment and temporal reasoning are essential.
Sixth, data privacy continues to be a major concern, particularly since most LLMs in the ENT domain are accessed via online application programming interfaces (APIs) such as ChatGPT. This raises important questions about the transmission of sensitive medical information to external servers. As LLMs are increasingly applied to patient data analysis, robust anonymization, privacy protection, and API security protocols will be essential and warrant further research.
Finally, issues of transparency and fairness are critical and cannot be overlooked. The decision-making processes of LLMs are largely opaque, making it challenging for clinicians to interpret or justify the outputs, an essential requirement for accountability in clinical practice. Additionally, because these models are trained on large-scale datasets that may contain demographic biases, there is a risk that they may inadvertently perpetuate or exacerbate existing healthcare disparities. Without mechanisms for fairness auditing and interpretability, the widespread clinical adoption of LLMs may reinforce inequities rather than mitigate them.
FUTURE PERSPECTIVES

LLMs present promising opportunities to advance clinical practice, research, and education within the field of ENT. However, their clinical adoption remains in its infancy. To guide the responsible evolution of this rapidly advancing field, this section outlines future research directions that may bridge the gap between current capabilities and meaningful clinical impact. Major challenges and strategic directions are summarized in Table 2.
Developing clinically representative datasets and fine-tuned models with reliable evaluation frameworks

The development of clinically representative datasets and robust evaluation frameworks is fundamental for the safe integration of LLMs into otolaryngology. Although recent efforts have produced patient-specific and multimodal datasets (Table 3), most remain limited in scale and diversity, with fewer than ten thousand text-only QA pairs [16,18,20,21,80-85]. Additionally, current benchmark datasets for medical LLMs tend to evaluate overall performance within the broader medical domain but lack fine-grained labeling for specific subspecialties. As a result, detailed performance analyses for these subspecialties have rarely been performed. For example, although the MedMCQA dataset includes fine-grained specialty labels, and BiomedBERT achieved 47% accuracy specifically for ENT questions [20], most recent LLM studies have not reported subspecialty-level performance, highlighting a clear need for improvement in this area.
Notably, most current LLM applications in ENT, as illustrated in Fig. 2B and F, predominantly use zero-shot settings with general-purpose models such as GPT. This highlights a field-level limitation: the underdevelopment of fine-tuned or domain-specific models tailored to the complexities of ENT, likely owing to the scarcity of large, diverse domain-specific datasets needed for effective fine-tuning or adaptation.
To address these limitations, future research should prioritize the development of datasets that accurately reflect real-world clinical complexities, including longitudinal data, multimodal inputs, and specialty-specific nuances. These datasets can then be used to fine-tune domain-specific LLMs, moving beyond general models to versions specialized for otolaryngology knowledge and tasks. There is also an increasing need for standardized evaluation frameworks that go beyond superficial metrics such as accuracy, and instead assess clinically meaningful dimensions, including communication skills, transparency, and adherence to guidelines. Incorporating certified resources, such as board examination items, structured diagnostic pathways, and expert-validated clinical summaries, can further enhance objectivity and reproducibility. Collectively, these efforts are essential for building accurate and clinically trustworthy LLM systems aligned with real-world decision-making.
RAG for reducing hallucinations and providing up-to-date knowledge

To address hallucinations in current LLM applications, recent research has investigated RAG as a promising strategy to improve clinical predictions [51,52]. RAG combines the generative capabilities of LLMs with the precision of information retrieval, dynamically sourcing relevant external knowledge to ground model outputs in factual, domain-specific information. Unlike static, parametric models, RAG systems can flexibly integrate multimodal or textual data from structured databases and biomedical literature.
Several recent studies have demonstrated the utility of RAG in clinical settings. For example, Retrieval AugMentation pipeline to improve clinical predictions on Electronic Health Records (RAM-EHR) augments local EHR representations and employs consistency regularization to better capture complementary information from external sources [86]. Personalized Graph-based HealthCare Prediction (GRAPHCARE) and Knowledge Aware Reasoning-Enhanced Healthcare Prediction (KARE) extend this approach by building personalized or community-level knowledge graphs from biomedical sources, enriching patient contexts with structured external knowledge for more accurate and interpretable predictions [87,88]. ClinicAl Reasoning-Enhanced Representation (CARER) further advances the field by aligning local patient data with global LLM-derived clinical reasoning, closely mimicking the contextualization performed by human clinicians [89].
Large multimodal models designed to process diverse types of medical data

Large multimodal models (LMMs) process a variety of data types, including images, videos, audio, and physiological signals. This is a critical advancement for medicine, where data extend far beyond text. Integrating these modalities allows models to more closely emulate the analytic capabilities of human experts. With the advent of vision-capable models such as GPT-4V, current commercial LMMs are now able to accept multimodal input. These models, trained on extensive datasets of text, images, and videos, generalize across multiple domains, including pathology, dermatology, ophthalmology, and radiology [90-93]. Evidence suggests that generalist multimodal LLMs often outperform specialist models trained solely on medical images [94], indicating the effectiveness of fine-tuning generalist models with specific medical data.
In practice, multimodal LLMs are increasingly utilized for a range of medical tasks. Examples include analyzing rare diseases [95], leveraging EHRs for clinical predictions [96], and performing visual question answering in computational pathology [97]. In ENT, research has explored the use of LLMs to analyze laryngostroboscopy images and videos for malignancy detection and differential diagnosis [61,62].
Despite these advances, hallucinations remain a concern, even for advanced models like GPT-4o, which may misinterpret visual features. Several mitigation strategies are being explored: post-hoc correction with the Large Vision-Language Model Hallucination Revisor (LURE) [98]; visual contrastive decoding [99], which reduces bias by comparing outputs from clean and distorted images; and the Over-trust Penalty and Retrospection-Allocation strategy (OPERA) [100], which penalizes overconfident tokens. Incorporating these techniques is vital to improve the reliability of multimodal LLMs for clinical use, where diagnostic precision is critical. Ongoing refinement is essential to enable safe, effective deployment in real-world healthcare settings.
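As a sketch of the idea behind visual contrastive decoding, the next-token logits obtained from the clean image can be contrasted against those from a distorted image, down-weighting tokens the model would emit regardless of what it actually sees; the two-token logit vectors and the α value below are purely illustrative.

```python
import numpy as np

def contrastive_decode(logits_clean, logits_distorted, alpha=1.0):
    """Contrastive adjustment of next-token logits (sketch):
    amplify what the clean image supports and subtract what the model
    would say anyway given a distorted image, then renormalize."""
    adjusted = (1 + alpha) * logits_clean - alpha * logits_distorted
    exp = np.exp(adjusted - adjusted.max())
    return exp / exp.sum()

# Token 0 is favored even with a distorted image (a language-prior bias);
# token 1 is supported only by the clean image.
logits_clean = np.array([1.0, 1.0])
logits_distorted = np.array([1.0, 0.0])
p = contrastive_decode(logits_clean, logits_distorted, alpha=1.0)
```

After the adjustment, the visually grounded token (index 1) becomes more probable than the prior-driven one, which is the intended de-biasing effect.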
Agent systems beyond LLMs

While definitions of an “LLM agent” vary across domains [90,91], from a practitioner’s perspective, if conventional LLMs are regarded as knowledgeable chatbots, LLM agents are systems capable of autonomously executing tasks within a given environment. LLM agents are characterized by their ability to dynamically direct their own processes and utilize external tools as needed, maintaining control over the completion of a given task [101]. Consider a clinical setting where such an agent is implemented within a hospital’s EHR system. In this scenario, the hospital’s EHR acts as the environment in which the agent operates. The agent can access a wide array of patient information stored in the system (serving as its “sensor”) and can act in this environment—for example, by placing medication orders or scheduling consultations—through predefined interfaces or tools.
When a practitioner enters a command such as “recommend an appropriate intervention for this patient,” the agent exhibits autonomy by decomposing the task into a series of subtasks through planning. For instance, it may review the patient’s radiological and laboratory findings. Based on this information, the agent determines whether surgical intervention is necessary and, if so, proposes an appropriate timeline. This type of task-oriented autonomy is increasingly implemented using multi-agent frameworks [102], wherein multiple specialized LLM agents collaborate to fulfill distinct roles. For example, one agent may interpret imaging data, another may analyze laboratory results, and a third may handle surgical scheduling. These agents communicate and coordinate with each other to complete the clinical workflow, reflecting a modular and scalable approach to agent-based system design in healthcare [103].
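The plan-delegate-aggregate pattern described above can be sketched with stub functions standing in for LLM-backed agents; all names, record fields, and the follow-up rule here are hypothetical illustrations, not a real framework's API.

```python
# Each "agent" is a routine that would wrap an LLM call in a real system;
# here they are stubbed to show the coordination pattern only.

def imaging_agent(record):
    return f"Imaging: reviewed {record['scan']}"

def labs_agent(record):
    return f"Labs: reviewed {record['labs']}"

def scheduling_agent(record, needs_followup):
    if needs_followup:
        return f"Scheduling: booked follow-up for {record['patient_id']}"
    return "Scheduling: no follow-up required"

def coordinator(record):
    """Plan -> delegate to specialists -> aggregate their outputs."""
    findings = [imaging_agent(record), labs_agent(record)]
    needs_followup = "mass" in record["scan"]  # stand-in for the LLM's clinical judgment
    findings.append(scheduling_agent(record, needs_followup))
    return findings

report = coordinator({"patient_id": "P001",
                      "scan": "CT neck: 2 cm mass",
                      "labs": "CBC normal"})
```

In a production system, each stub would issue its own model call with tool access (image viewer, lab API, scheduling interface), and the coordinator itself would be an LLM deciding which specialists to invoke and in what order.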
XAI for trustworthy and fair decision making

XAI provides interpretable representations of AI decision-making for humans, which is essential in clinical contexts where decisions have serious consequences and require ethical justification. Clinicians need to understand not only what a model predicts, but also why. Without transparent reasoning, even highly accurate models may struggle to earn trust and achieve widespread adoption. As LLMs become increasingly utilized in healthcare, aligning their reasoning with human clinical logic is a key research objective.
Recent advancements in XAI have enhanced the transparency of LLM-powered clinical decision support. Hong et al. [104] introduced ArgMed-Agents, a multi-agent system that simulates structured clinical discussions, yielding stepwise justifications that mirror human reasoning. Similarly, Ravichandran et al. [105] developed a multimodal framework using XAI to disentangle the contributions of different data sources. These systems provide clinicians with interpretable, auditable rationales, acting as transparent second opinions and fostering trust in AI-assisted care by aligning with clinical reasoning and safety expectations.
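As a toy illustration of source-level attribution in the spirit of such frameworks (the weights, sources, and scores below are invented for illustration, not taken from the cited work or clinically validated):

```python
# A linear risk score whose per-source contributions double as an
# auditable rationale: the prediction and its explanation come from
# the same arithmetic, so the rationale is faithful by construction.

WEIGHTS = {"imaging": 0.5, "labs": 0.3, "notes": 0.2}  # illustrative only

def predict_with_rationale(source_scores):
    # Contribution of each data source to the final score.
    contributions = {src: WEIGHTS[src] * s for src, s in source_scores.items()}
    risk = sum(contributions.values())
    # Rationale: sources ranked by how much they drove the prediction.
    rationale = sorted(contributions.items(), key=lambda kv: -kv[1])
    return risk, rationale

risk, rationale = predict_with_rationale(
    {"imaging": 0.9, "labs": 0.4, "notes": 0.1}
)
```

Real multimodal LLM pipelines use far richer attribution methods, but the design goal is the same: every prediction ships with a decomposition a clinician can audit.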
CONCLUSION
The rapid advancement of artificial intelligence is driving significant transformation across many fields, and healthcare is no exception. In this article, we provided a comprehensive guide for healthcare professionals interested in adopting LLMs in practice, covering how LLMs are developed, which models are commonly used in clinical and research contexts, and how to select appropriate models for specific needs. We also quantitatively analyzed published studies applying LLMs within the field of ENT, elucidating the current state of adoption, existing gaps, and areas for improvement. Additionally, we explored recent trends in LLM research across the broader medical field, highlighting ongoing efforts to integrate these models into real-world clinical workflows. By outlining both the current limitations of LLMs and emerging research strategies to address them, this review provides a foundation for incorporating these insights into future medical AI research. Through this article, we aim to help healthcare professionals better understand ongoing developments in LLM-based research within healthcare and ENT, and to support the application of LLMs or LMMs in their future research endeavors.
HIGHLIGHTS
▪ In otolaryngology–head and neck surgery, large language models (LLMs) have demonstrated promising results across decision support, research assistance, and education.
▪ Despite their impressive capabilities, LLMs continue to suffer from narrow research scope, unreliable evaluation methods, limitations in accuracy and reasoning, and ongoing concerns regarding fairness.
▪ Techniques such as retrieval-augmented generation (RAG), multimodal learning, LLM agents, and explainable AI are under active investigation to address these challenges.
CONFLICTS OF INTEREST
Munyoung Chang is an editorial board member of the journal but was not involved in the peer reviewer selection, evaluation, or decision process of this article. No other potential conflicts of interest relevant to this article were reported.
ACKNOWLEDGMENTS
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under a Leading Generative AI Human Resources Development (IITP-2025-RS-2024-00397085) grant funded by the Korean government (MSIT); a grant from the MD-PhD/Medical Scientist Training Program through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea; National Research Foundation of Korea (NRF) grants funded by the Korean government (Ministry of Science and ICT, MSIT) (2022R1A3B1077720); an IITP grant funded by the Korean government (MSIT) (No. RS-2021-II211343, Artificial Intelligence Graduate School Program [Seoul National University]); and the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University.
AUTHOR CONTRIBUTIONS
Conceptualization: JA, BGK, MC, SY. Methodology: JA, BGK, MC, SY. Validation: JA, BGK, MC, SY. Formal analysis: JA, BGK, MC, SY. Investigation: JA, BGK, MC, SY. Data curation: JA, BGK, MC, SY. Writing–original draft: JA, BGK, MC, SY. Writing–review & editing: JA, BGK, MC, SY. All authors read and agreed to the published version of the manuscript.
SUPPLEMENTARY MATERIALS
Supplementary materials can be found online at https://doi.org/10.21053/ceo.2025-00121.
Supplementary Table 1. List of otolaryngology studies included in the literature search (JCR 2023)
Fig. 1. Temporal trend illustrating the rapid performance improvement of major large language models (LLMs) on the MedQA benchmark. Newly released server-scale models now achieve accuracy rates exceeding 90%, while mobile- and desktop-scale models surpass the United States Medical Licensing Examination (USMLE) pass threshold. Unspecified results reflect zero-shot performance. Data are derived from the Hugging Face Open Medical LLM Leaderboard [17,18] and the MedQA leaderboard from VALS.ai [19]. CoT, chain-of-thought prompting; SC, self-consistency decoding; n-shot, number of in-context examples.
Fig. 2. Distribution of 25 large language model (LLM)-based studies published between January 2022 and March 2025 in journals ranked in the top 25% of the Otolaryngology-Head and Neck Surgery category, according to the 2023 Journal Citation Reports (Q1). This figure summarizes key study characteristics related to LLM application in the field. (A) Distribution of studies by research area. (B) Inference settings, categorized as zero-shot or retrieval-augmented generation (RAG). (C) Modalities employed in each study. (D) Number of agents, representing the independent LLMs used for task execution. (E) Use of explainable artificial intelligence (AI) techniques to increase trustworthiness. (F) LLMs utilized. (G) Evaluation methodologies applied. (H) Sample sizes, denoting the number of clinical cases or questions assessed. (I) Type of dataset used: certified (e.g., board exams) vs. custom-built by study authors.
Table 1. Definitions of common terms in NLP
Table 2. Limitations of current LLMs and measures to address them
Table 3. Benchmark datasets developed for LLMs
LLM, large language model; AIIMS, All India Institute of Medical Sciences; NEET, National Eligibility cum Entrance Test; MIMIC-IV, Medical Information Mart for Intensive Care; ICU, intensive care unit; NLP, natural language processing; CT, computed tomography; MRI, magnetic resonance imaging; VQA, Visual Question Answering.
REFERENCES
1. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-901.
2. Reuters. OpenAI’s weekly active users surpass 400 million [Internet]. Reuters; 2025 [cited 2025 Jun 30]. Available from: https://www.reuters.com/technology/artificial-intelligence/openais-weekly-active-users-surpass-400-million-2025-02-20/.
3. Gu Q. LLM-based code generation method for Golang compiler testing. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE); 2023 Sep 11-15; San Francisco, CA, USA. Association for Computing Machinery; 2023. p. 2201-3.
4. de Miranda BA, Campelo CE. How effective is an LLM-based data analysis automation tool? A case study with ChatGPT’s data analyst. In: Proceedings of the 39th Simpósio Brasileiro de Banco de Dados (SBBD); 2024 Sep; Florianópolis, Brazil. Sociedade Brasileira de Computação; 2024. p. 287-99.
5. Salminen J, Jung S, Medina J, Aldous K, Azem J, Akhtar W, et al. Using Cipherbot: an exploratory analysis of student interaction with an LLM-based educational chatbot. In: Proceedings of the 11th ACM Conference on Learning@Scale; 2024 Jun 18-20; Atlanta, GA, USA. Association for Computing Machinery; 2024. p. 279-83.
6. Georgiev P, Lei VI, Burnell R, Bai L, Gulati A, Tanzer G, et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv [Preprint]. 2024;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2403.05530.
7. Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, et al. The Llama 3 herd of models. arXiv [Preprint]. 2024;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2407.21783.
8. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025 Mar;31(3):943-50.
9. Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022 Nov;23(6):bbac409.
10. Yasunaga M, Bosselut A, Ren H, Zhang X, Manning CD, Liang P, et al. Deep bidirectional language-knowledge graph pretraining. Adv Neural Inf Process Syst. 2022;35:37309-23.
11. Yasunaga M, Leskovec J, Liang P. Linkbert: pretraining language models with document links. arXiv [Preprint]. 2022;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2203.15827.
12. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc. 2021 Oct;3(1):1-23.
13. Yasunaga M, Lee T, Liang P. Stanford CRFM introduces PubMedGPT 2.7B. Stanford Institute for Human-Centered AI (HAI); 2022.
14. Pal A, Minervini P, Fourrier C. The open medical-LLM leaderboard: benchmarking large language models in healthcare. Hugging Face; 2024.
15. Saama’s AI Research Lab. Introducing OpenBioLLM-Llama3-70B & 8B: Saama’s AI research lab released the most openly available medical-domain LLMs to date! Saama; 2024.
16. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl Sci. 2021 Jul;11(14):6421.
17. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172-80.
18. VALS. VALS MedQA benchmark (2025-01-30). VALS; 2025.
19. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In: Flores G, Chen GH, Pollard T, Ho JC, Naumann T, editors. Proceedings of the Conference on Health, Inference, and Learning. 2022 Apr 8; Virtual Event. PMLR; 2022. p. 248-60.
20. Jin Q, Dhingra B, Liu Z, Cohen W, Lu X. PubMedQA: a dataset for biomedical research question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3-7; Hong Kong, China. Association for Computational Linguistics; 2019. p. 2567-77.
21. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring massive multitask language understanding. arXiv [Preprint]. 2021;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2009.03300.
22. Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, et al. ChatGPT-4 Omni performance in USMLE disciplines and clinical skills: comparative analysis. JMIR Med Educ. 2024 Nov;10:e63430.
23. UC San Diego. Introducing Dr. Chatbot: UC San Diego health pilots GPT-4 in Epic’s MyChart [Internet]. UC San Diego Today; 2023 [cited 2025 Jun 30]. Available from: https://today.ucsd.edu/story/introducing-dr-chatbot.
24. Levine DM, Tuwani R, Kompa B, Varma A, Finlayson SG, Mehrotra A, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv [Preprint]. 2023;[cited 2025 Jun 30]. Available from: https://doi.org/10.1101/2023.01.30.23285067.
25. Tu T, Palepu A, Schaekermann M, Saab K, Freyberg J, Tanno R, et al. Towards conversational diagnostic AI. arXiv [Preprint]. 2024;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2401.05654.
26. Clarivate Analytics. Journal citation reports: otolaryngology category. 2023 ed. Clarivate; 2023.
27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, editors. Advances in neural information processing systems. Curran Associates Inc; 2017. p. 5998-6008.
28. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI. 2019;1(8):9.
29. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1 (long and short papers); 2019 Jun 2-7; Minneapolis, MN, USA. Association for Computational Linguistics; 2019. p. 4171-86.
30. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022;35:27730-44.
31. Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D. Deep reinforcement learning from human preferences. Adv Neural Inf Process Syst. 2017;30:4302-10.
32. Stiennon N, Ouyang L, Wu J, Ziegler DM, Lowe R, Voss C, et al. Learning to summarize from human feedback. Adv Neural Inf Process Syst. 2020;33:3008-21.
33. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. Gpt-4 technical report. arXiv [Preprint]. 2024;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2303.08774.
34. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. arXiv [Preprint]. 2023;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2307.09288.
35. Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst. 2025 Jan;43(2):1-55.
36. Zhang Z, Rossi RA, Kveton B, Shao Y, Yang D, Zamani H, et al. Personalization of large language models: a survey. Transactions on Machine Learning Research; 2025.
37. Ram O, Levine Y, Dalmedigos I, Muhlgay D, Shashua A, Leyton-Brown K, et al. In-context retrieval-augmented language models. Trans Assoc Comput Linguist. 2023 Nov;11:1316-31.
38. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. In: Liu H, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in neural information processing systems. Vol 36 (NeurIPS 2023). Curran Associates Inc.; 2023. p. 34892-916.
39. Ren X, Xu L, Xia L, Wang S, Yin D, Huang C, et al. VideoRAG: retrieval-augmented generation with extreme long-context videos. arXiv [Preprint]. 2025;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2502.01549.
40. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems. Vol 33. Curran Associates Inc.; 2020. p. 9459-74.
41. Li M, Miao S, Li P. Simple is effective: the roles of graphs and large language models in knowledge-graph-based retrieval-augmented generation. arXiv [Preprint]. 2025;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2410.20724.
42. Ma S, Xu C, Jiang X, Li M, Qu H, Yang C, et al. Think-on-Graph 2.0: deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation. arXiv [Preprint]. 2025;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2407.10805.
43. Ayoub NF, Lee YJ, Grimm D, Balakrishnan K. Comparison between ChatGPT and Google search as sources of postoperative patient instructions. JAMA Otolaryngol Head Neck Surg. 2023 Jun;149(6):556-8.
44. Ayoub NF, Lee YJ, Grimm D, Divi V. Head-to-head comparison of ChatGPT versus Google search for medical knowledge acquisition. Otolaryngol Head Neck Surg. 2024 Jun;170(6):1484-91.
45. Bellinger JR, De La Chapa JS, Kwak MW, Ramos GA, Morrison D, Kesser BW. BPPV information on Google versus AI (ChatGPT). Otolaryngol Head Neck Surg. 2024 Jun;170(6):1504-11.
46. Colasacco CJ, Born HL. A case of artificial intelligence chatbot hallucination. JAMA Otolaryngol Head Neck Surg. 2024 Jun;150(6):457-8.
47. Gudis DA, McCoul ED, Marino MJ, Patel ZM. Avoiding bias in artificial intelligence. Int Forum Allergy Rhinol. 2023 Mar;13(3):193-5.
48. Shen SA, Perez-Heydrich CA, Xie DX, Nellis JC. ChatGPT vs. web search for patient questions: what does ChatGPT do better? Eur Arch Otorhinolaryngol. 2024 Jun;281(6):3219-25.
49. Yoshiyasu Y, Wu F, Dhanda AK, Gorelik D, Takashima M, Ahmed OG. GPT-4 accuracy and completeness against international consensus statement on allergy and rhinology: rhinosinusitis. Int Forum Allergy Rhinol. 2023 Dec;13(12):2231-4.
50. Bellinger JR, Kwak MW, Ramos GA, Mella JS, Mattos JL. Quantitative comparison of chatbots on common rhinology pathologies. Laryngoscope. 2024 Oct;134(10):4225-31.
51. Long C, Subburam D, Lowe K, Dos Santos A, Zhang J, Hwang S, et al. ChatENT: augmented large language model for expert knowledge retrieval in otolaryngology-head and neck surgery. Otolaryngol Head Neck Surg. 2024 Oct;171(4):1042-51.
52. Workman AD, Rathi VK, Lerner DK, Palmer JN, Adappa ND, Cohen NA. Utility of a LangChain and OpenAI GPT-powered chatbot based on the international consensus statement on allergy and rhinology: rhinosinusitis. Int Forum Allergy Rhinol. 2024 Jun;14(6):1101-9.
53. Ammar A, Koubaa A, Nacar O, Boulila W. Optimizing retrieval-augmented generation: analysis of hyperparameter impact on performance and efficiency. arXiv [Preprint]. 2025;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2505.08445.
54. Trecca EM, Caponio VC, Turri-Zanoni M, di Lullo AM, Gaffuri M, Lechien JR, et al. Comparative analysis of information quality in pediatric otorhinolaryngology: clinicians, residents, and large language models. Otolaryngol Head Neck Surg. 2025 Jul;173(1):228-36.
55. Langlie J, Kamrava B, Pasick LJ, Mei C, Hoffer ME. Artificial intelligence and ChatGPT: an otolaryngology patient’s ally or foe? Am J Otolaryngol. 2024 May-Jun;45(3):104220.
56. Moise A, Centomo-Bozzo A, Orishchak O, Alnoury MK, Daniel SJ. Can ChatGPT guide parents on tympanostomy tube insertion? Children (Basel). 2023 Sep;10(10):1634.
57. Radulesco T, Saibene AM, Michel J, Vaira LA, Lechien JR. ChatGPT-4 performance in rhinology: a clinical case series. Int Forum Allergy Rhinol. 2024 Jun;14(6):1123-30.
58. Ye F, Zhang H, Luo X, Wu T, Yang Q, Shi Z. Evaluating ChatGPT’s performance in answering questions about allergic rhinitis and chronic rhinosinusitis. Otolaryngol Head Neck Surg. 2024 Aug;171(2):571-7.
59. Lechien JR, Naunheim MR, Maniaci A, Radulesco T, Saibene AM, Chiesa-Estomba CM, et al. Performance and consistency of ChatGPT-4 versus otolaryngologists: a clinical case series. Otolaryngol Head Neck Surg. 2024 Jun;170(6):1519-26.
60. Dronkers EA, Geneid A, Al Yaghchi C, Lechien JR. Evaluating the potential of AI chatbots in treatment decision-making for acquired bilateral vocal fold paralysis in adults. J Voice. 2025 Jul;39(4):871-81.
61. Chiesa-Estomba CM, Andueza-Guembe M, Maniaci A, Mayo-Yanez M, Betances-Reinoso F, Vaira LA. Accuracy of ChatGPT-4o in text and video analysis of laryngeal malignant and premalignant diseases. J Voice. 2025 Mar;S0892-1997(25)00100-6.
62. Maniaci A, Chiesa-Estomba CM, Lechien JR. ChatGPT-4 consistency in interpreting laryngeal clinical images of common lesions and disorders. Otolaryngol Head Neck Surg. 2024 Oct;171(4):1106-13.
63. Chiesa-Estomba CM, Speth MM, Mayo-Yanez M, Liu DT, Maniaci A, Borsetto D. Is the evolving role of artificial intelligence and chatbots in the field of otolaryngology embracing the future? Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2179-80.
64. Chiesa-Estomba CM, Lechien JR, Vaira LA, Brunet A, Cammaroto G, Mayo-Yanez M, et al. Exploring the potential of Chat-GPT as a supportive tool for sialendoscopy clinical decision making and patient information support. Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2081-6.
65. Vaira LA, Lechien JR, Abbate V, Allevi F, Audino G, Beltramini GA, et al. Accuracy of ChatGPT-generated information on head and neck and oromaxillofacial surgery: a multicenter collaborative analysis. Otolaryngol Head Neck Surg. 2024 Jun;170(6):1492-503.
66. Lechien JR, Gorton A, Robertson J, Vaira LA. Is ChatGPT-4 accurate in proofread a manuscript in otolaryngology-head and neck surgery? Otolaryngol Head Neck Surg. 2024 Jun;170(6):1527-30.
67. Lechien JR, Briganti G, Vaira LA. Accuracy of ChatGPT-3.5 and -4 in providing scientific references in otolaryngology-head and neck surgery. Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2159-65.
68. Dang R, Hanba C. A large language model’s assessment of methodology reporting in head and neck surgery. Am J Otolaryngol. 2024 Mar-Apr;45(2):104145.
69. Nachalon Y, Broer M, Nativ-Zeltzer N. Using ChatGPT to generate research ideas in dysphagia: a pilot study. Dysphagia. 2024 Jun;39(3):407-11.
70. Patel EA, Fleischer L, Filip P, Eggerstedt M, Hutz M, Michaelides E, et al. The use of artificial intelligence to improve readability of otolaryngology patient education materials. Otolaryngol Head Neck Surg. 2024 Aug;171(2):603-8.
71. Swisher AR, Wu AW, Liu GC, Lee MK, Carle TR, Tang DM. Enhancing health literacy: evaluating the readability of patient handouts revised by ChatGPT’s large language model. Otolaryngol Head Neck Surg. 2024 Dec;171(6):1751-7.
72. Capelleras M, Soto-Galindo GA, Cruellas M, Apaydin F. ChatGPT and rhinoplasty recovery: an exploration of AI’s role in postoperative guidance. Facial Plast Surg. 2024 Oct;40(5):628-31.
73. Lee JC, Hamill CS, Shnayder Y, Buczek E, Kakarala K, Bur AM. Exploring the role of artificial intelligence chatbots in preoperative counseling for head and neck cancer surgery. Laryngoscope. 2024 Jun;134(6):2757-61.
74. Hill GS, Fischer JL, Watson NL, Riley CA, Tolisano AM. Assessing the quality of artificial intelligence-generated patient counseling for rhinosinusitis. Int Forum Allergy Rhinol. 2024 Oct;14(10):1634-7.
75. Davis RJ, Ayo-Ajibola O, Lin ME, Swanson MS, Chambers TN, Kwon DI, et al. Evaluation of oropharyngeal cancer information from revolutionary artificial intelligence chatbot. Laryngoscope. 2024 May;134(5):2252-7.
76. Zalzal HG, Abraham A, Cheng J, Shah RK. Can ChatGPT help patients answer their otolaryngology questions? Laryngoscope Investig Otolaryngol. 2023 Dec;9(1):e1193.
77. Hoch CC, Wollenberg B, Luers JC, Knoedler S, Knoedler L, Frank K, et al. ChatGPT’s quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur Arch Otorhinolaryngol. 2023 Sep;280(9):4271-8.
78. Mahajan AP, Shabet CL, Smith J, Rudy SF, Kupfer RA, Bohm LA. Assessment of artificial intelligence performance on the otolaryngology residency in-service exam. OTO Open. 2023 Nov;7(4):e98.
79. Long C, Lowe K, Zhang J, Santos AD, Alanazi A, O’Brien D, et al. A novel evaluation model for assessing ChatGPT on otolaryngology-head and neck surgery certification examinations: performance study. JMIR Med Educ. 2024 Jan;10:e49970.
80. Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023 Jan;10(1):1.
81. Abacha AB, Agichtein E, Pinter Y, Demner-Fushman D. Overview of the medical question answering task at TREC 2017 LiveQA. In: Proceedings of the 26th Text REtrieval Conference (TREC 2017); 2017 Nov 15-17; Gaithersburg, MD, USA. National Institute of Standards and Technology; 2017.
82. Pampari A, Raghavan P, Liang J, Peng J. emrQA: a large corpus for question answering on electronic medical records. In: Riloff E, Chiang D, Hockenmaier J, Tsujii J, editors. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; 2018 Oct 31-Nov 4; Brussels, Belgium. Association for Computational Linguistics; 2018. p. 2357-68.
83. Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018 Nov;5:180251.
84. He X, Zhang Y, Mou L, Xing E, Xie P. PathVQA: 30000+ questions for medical visual question answering. arXiv [Preprint]. 2020;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2003.10286.
85. Gupta D, Attal K, Demner-Fushman D. A dataset for medical instructional video classification and question answering. Sci Data. 2023 Mar;10(1):158.
86. Xu R, Shi W, Yu Y, Zhuang Y, Jin B, Wang MD, et al. RAM-EHR: retrieval augmentation meets clinical predictions on electronic health records. In: Ku LW, Martins A, Srikumar V, editors. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 2024 Aug 11-16; Bangkok, Thailand. Association for Computational Linguistics; 2024. p. 754-65.
87. Jiang P, Xiao C, Cross A, Sun J. GraphCare: enhancing healthcare predictions with personalized knowledge graphs. In: Proceedings of the 12th International Conference on Learning Representations (ICLR); 2024 May 7-11; Vienna, Austria. OpenReview; 2024. p. 1-12.
88. Jiang P, Xiao C, Jiang M, Bhatia P, Kass Hout T, Sun J, et al. Reasoning enhanced healthcare predictions with knowledge graph community retrieval. In: Proceedings of the 13th International Conference on Learning Representations (ICLR); 2025 Jan 22-26; Kigali, Rwanda. OpenReview; 2025.
89. Nguyen TD, Huynh TT, Phan MH, Nguyen QV, Nguyen PL. CARER - ClinicAl Reasoning Enhanced Representation for Temporal Health Risk Prediction. In: Al-Onaizan Y, Bansal M, Chen YN, editors. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; 2024 Nov 12-16; Miami, FL, USA. Association for Computational Linguistics; 2024. p. 10392-407.
90. Alzaid E, Pergola G, Evans H, Snead D, Minhas F. Large multimodal model-based standardisation of pathology reports with confidence and its prognostic significance. J Pathol Clin Res. 2024 Nov;10(6):e70010.
91. Cirone K, Akrout M, Abid L, Oakley A. Assessing the utility of multimodal large language models (GPT-4 Vision and Large Language and Vision Assistant) in identifying melanoma across different skin tones. JMIR Dermatol. 2024 Mar;7:e55508.
92. Qin Z, Yin Y, Campbell D, Wu X, Zou K, Tham YC, et al. large multimodal ophthalmology dataset and benchmark for large vision-language models. arXiv [Preprint]. 2025;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2410.01620.
93. Shen Y, Xu Y, Ma J, Rui W, Zhao C, Heacock L, et al. Multi-modal large language models in radiology: principles, applications, and potential. Abdom Radiol (NY). 2025 Jun;50(6):2745-57.
94. Han T, Adams LC, Nebelung S, Kather JN, Bressem KK, Truhn D. Multimodal large language models are generalist medical image interpreters. medRxiv [Preprint]. 2023;[cited 2025 Jun 30]. Available from: https://doi.org/10.1101/2023.12.21.23300146.
95. Liu F, Zhu T, Wu X, Yang B, You C, Wang C, et al. A medical multimodal large language model for future pandemics. NPJ Digit Med. 2023 Dec;6(1):226.
96. Zhu Y, Ren C, Wang Z, Zheng X, Xie S, Feng J, et al. EMERGE: enhancing multimodal electronic health records predictive modeling with retrieval-augmented generation. In: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management; 2024 Oct 21-25; Boise, ID, USA. Association for Computing Machinery; 2024. p. 3549-59.
97. Lu MY, Chen B, Williamson DF, Chen RJ, Liang I, Ding T, et al. A visual-language foundation model for computational pathology. Nat Med. 2024 Mar;30(3):863-74.
98. Zhou Y, Cui C, Yoon J, Zhang L, Deng Z, Finn C, et al. Analyzing and mitigating object hallucination in large vision-language models. arXiv [Preprint]. 2024;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2310.00754.
99. Leng S, Zhang H, Chen G, Li X, Lu S, Miao C, et al. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024 Jun 17-21; Seattle, WA, USA. IEEE Computer Society; 2024. p. 13872-82.
100. Huang Q, Dong X, Zhang P, Wang B, He C, Wang J, et al. Opera: alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024 Jun 17-21; Seattle, WA, USA. IEEE Computer Society; 2024. p. 13418-27.
101. Anthropic. Building effective agents. Anthropic; 2024.
102. Guo T, Chen X, Wang Y, Chang R, Pei S, Chawla NV, et al. Large language model based multi-agents: a survey of progress and challenges. arXiv [Preprint]. 2024;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2402.01680.
103. Wang W, Ma Z, Wang Z, Wu C, Ji J, Chen W, et al. A survey of LLM-based agents in medicine: how far are we from Baymax? arXiv [Preprint]. 2025;[cited 2025 Jun 30]. Available from: https://doi.org/10.48550/arXiv.2502.11211.
104. Hong S, Xiao L, Zhang X, Chen J. ArgMed-agents: explainable clinical decision reasoning with LLM discussion via argumentation schemes. In: Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2024 Oct; Bangkok, Thailand. IEEE Computer Society; 2024. p. 5486-93.
105. Ravichandran AM, Grune J, Feldhus N, Burchardt A, Roller R, Moller S. XAI for better exploitation of text in medical decision support. In: Demner-Fushman D, Ananiadou S, Miwa M, Roberts K, Tsujii J, editors. Proceedings of the 23rd Workshop on Biomedical Natural Language Processing; 2024 Aug 16; Bangkok, Thailand. Association for Computational Linguistics; 2024. p. 506-13.