
Standards for use of LLM in medical diagnosis

The study on the Standards for use of LLM in medical diagnosis explores the strengths and limitations of large language models (LLMs) in medicine, with a focus on how they compare to existing systems like clinical decision support systems (CDSS) and experienced human clinicians. It also evaluates how well these models align with outcomes from randomized controlled trials (RCTs), which remain the gold standard in medical decision-making.

While LLMs like GPT-3 and GPT-4 have demonstrated impressive performance in simulated tasks, such as passing the United States Medical Licensing Examination, they still fall short of the accuracy and reliability required in clinical environments. In some cases, their recommendations included potentially harmful advice. 

Most current studies rely on narrow datasets or artificial benchmarks that don't reflect the variability seen in everyday clinical practice. We find that comparative studies between LLMs and human physicians, using identical clinical scenarios, offer more clinically meaningful insight than benchmark scores alone. Another issue is that generative AI models are regularly updated and fine-tuned, which creates a moving target for clinical validation. The study therefore emphasizes continuous evaluation loops: LLMs should be built, implemented, and then constantly assessed using real-world clinician feedback.
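To make the idea of a continuous evaluation loop concrete, here is a minimal sketch in Python. It is not from the study itself; the record fields, the hypothetical `review_case` helper, and the agreement-rate metric are assumptions chosen only to illustrate how clinician feedback could be logged per model version so that performance drift after an update becomes visible.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackRecord:
    """One clinician verdict on one model suggestion for one case."""
    case_id: str
    model_version: str
    suggestion: str
    clinician_agrees: bool
    comment: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class EvaluationLog:
    """Accumulates clinician feedback and reports agreement per model version."""
    records: list = field(default_factory=list)

    def add(self, record: FeedbackRecord) -> None:
        self.records.append(record)

    def agreement_rate(self, model_version: str) -> float:
        relevant = [r for r in self.records if r.model_version == model_version]
        if not relevant:
            return float("nan")
        return sum(r.clinician_agrees for r in relevant) / len(relevant)

def review_case(case_id, model_version, suggestion, clinician_agrees, comment, log):
    """Record a clinician's review of a single model suggestion."""
    log.add(FeedbackRecord(case_id, model_version, suggestion, clinician_agrees, comment))

# Example usage with made-up data:
log = EvaluationLog()
review_case("case-001", "model-2024-06", "community-acquired pneumonia", True, "matches imaging", log)
review_case("case-002", "model-2024-06", "viral gastroenteritis", False, "missed appendicitis red flags", log)
print(f"Agreement rate: {log.agreement_rate('model-2024-06'):.0%}")
```

Tracking agreement per model version is the key point of the sketch: because the underlying model is a moving target, each new release would accumulate its own feedback records and be re-validated rather than inheriting the previous version's track record.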

While large language models hold significant potential to reduce clinician workload and improve patient outcomes, they are not yet ready for unsupervised clinical use. Developers and healthcare professionals must work together to establish rigorous, standardized evaluation protocols and ensure these systems are safe, fair, and effective.
