
Standards for use of LLM in medical diagnosis

The study on the Standards for use of LLM in medical diagnosis explores the strengths and limitations of large language models (LLMs) in medicine, with a focus on how they compare to existing systems like clinical decision support systems (CDSS) and experienced human clinicians. It also evaluates how well these models align with outcomes from randomized controlled trials (RCTs), which remain the gold standard in medical decision-making.

While LLMs like GPT-3 and GPT-4 have demonstrated impressive performance in simulated tasks, such as passing the United States Medical Licensing Examination, they still fall short of the accuracy and reliability required in clinical environments. In some cases, their recommendations included potentially harmful advice. 

Most current studies rely on narrow datasets or artificial benchmarks that don't reflect the variability seen in everyday clinical practice. We find that comparative studies between LLMs and human physicians, using identical clinical scenarios, offer more clinically meaningful insight than benchmark scores alone. Another issue is that generative AI models are regularly updated and fine-tuned, which creates a moving target for clinical validation. The study therefore emphasizes continuous evaluation loops: LLMs should be built, implemented, and then constantly assessed using real-world clinician feedback.
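To make the idea of a continuous evaluation loop concrete, here is a minimal sketch in Python. It is not from the study itself; the record fields, the hypothetical `review_case` helper, and the agreement-rate metric are assumptions chosen only to illustrate how clinician feedback could be logged per model version so that performance drift after an update becomes visible.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackRecord:
    """One clinician verdict on one model suggestion for one case."""
    case_id: str
    model_version: str
    suggestion: str
    clinician_agrees: bool
    comment: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class EvaluationLog:
    """Accumulates clinician feedback and reports agreement per model version."""
    records: list = field(default_factory=list)

    def add(self, record: FeedbackRecord) -> None:
        self.records.append(record)

    def agreement_rate(self, model_version: str) -> float:
        relevant = [r for r in self.records if r.model_version == model_version]
        if not relevant:
            return float("nan")
        return sum(r.clinician_agrees for r in relevant) / len(relevant)

def review_case(case_id, model_version, suggestion, clinician_agrees, comment, log):
    """Record a clinician's review of a single model suggestion."""
    log.add(FeedbackRecord(case_id, model_version, suggestion, clinician_agrees, comment))

# Example usage with made-up data:
log = EvaluationLog()
review_case("case-001", "model-2024-06", "community-acquired pneumonia", True, "matches imaging", log)
review_case("case-002", "model-2024-06", "viral gastroenteritis", False, "missed appendicitis red flags", log)
print(f"Agreement rate: {log.agreement_rate('model-2024-06'):.0%}")
```

Tracking agreement per model version is the key point of the sketch: because the underlying model is a moving target, each new release would accumulate its own feedback records and be re-validated rather than inheriting the previous version's track record.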

While large language models hold significant potential to reduce clinician workload and improve patient outcomes, they are not yet ready for unsupervised clinical use. Developers and healthcare professionals must work together to establish rigorous, standardized evaluation protocols and ensure these systems are safe, fair, and effective.
