This paper addresses the growing importance of validating Large Language Models (LLMs) in the medical domain, focusing on prompt engineering. It proposes a structured methodology that uses combinatorial testing to systematically evaluate LLM responses to medical queries. The approach generates test cases by combining sets of symptoms with various prompt components, using pairwise combinatorial testing to efficiently cover a wide range of prompt variations without causing a combinatorial explosion.
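As a minimal sketch of the idea, the following Python snippet greedily builds a pairwise (2-way) covering set of test cases from a handful of prompt-component dimensions. The dimensions and values shown (persona, detail level, output format, language) are illustrative assumptions, not the parameters used in the paper, and the greedy strategy is one common way to realize pairwise coverage rather than the paper's specific tooling.

```python
from itertools import combinations, product

# Hypothetical prompt-component dimensions; values are illustrative only.
parameters = {
    "persona":       ["none", "general practitioner", "medical specialist"],
    "detail_level":  ["brief", "detailed"],
    "output_format": ["free text", "ranked list"],
    "language":      ["English", "German"],
}

def pairwise_cases(params):
    """Greedily pick combinations until every pair of values from any two
    dimensions appears in at least one test case (2-way coverage)."""
    names = list(params)
    # All value pairs that must be covered at least once.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in params[a] for vb in params[b]
    }
    # Candidate pool: full Cartesian product (small for few dimensions).
    candidates = [dict(zip(names, vals)) for vals in product(*params.values())]
    cases = []
    while uncovered:
        # Choose the candidate that covers the most still-uncovered pairs.
        best = max(
            candidates,
            key=lambda c: sum(
                ((a, c[a]), (b, c[b])) in uncovered
                for a, b in combinations(names, 2)
            ),
        )
        cases.append(best)
        uncovered -= {
            ((a, best[a]), (b, best[b])) for a, b in combinations(names, 2)
        }
    return cases

for i, case in enumerate(pairwise_cases(parameters), 1):
    print(i, case)
```

Each resulting case would then be combined with a symptom set and rendered into a concrete prompt; the pairwise set is typically much smaller than the full Cartesian product while still exercising every pair of component values.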
The proposed validation pipeline implements a semi-automated scoring system based on a “golden model” that provides diagnoses curated by medical professionals. The methodology aims to reduce costs and increase efficiency compared to testing all possible combinations of prompt parameters, which is particularly crucial when evaluating LLMs on large medical corpora. The approach was demonstrated using GPT-4o as the system under test and NetDoktor’s Symptom-Checker as the golden model.
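The paper does not spell out the exact scoring formula in this section; one plausible reading of the "full overlap" criterion is a set-overlap score between the LLM's diagnoses and the golden model's diagnoses. The sketch below assumes such a recall-style metric and a deliberately naive string normalisation, both of which are assumptions for illustration only.

```python
def normalise(diagnosis: str) -> str:
    """Very light normalisation; a real pipeline would map synonyms or ICD codes."""
    return diagnosis.strip().lower()

def overlap_score(llm_diagnoses: list[str], golden_diagnoses: list[str]) -> float:
    """Fraction of golden-model diagnoses also present in the LLM answer.
    A score of 1.0 corresponds to a full overlap with the golden model."""
    llm = {normalise(d) for d in llm_diagnoses}
    golden = {normalise(d) for d in golden_diagnoses}
    if not golden:
        return 0.0
    return len(llm & golden) / len(golden)

# Example: two of three golden diagnoses recovered -> score ~0.67
print(overlap_score(
    ["Migraine", "Tension headache"],
    ["migraine", "tension headache", "cluster headache"],
))
```

In a semi-automated setup, scores like this can flag low-overlap cases for manual review by medical professionals rather than requiring every response to be inspected by hand.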
A preliminary study found significant differences in output across prompt variations for the same set of symptoms. Of the 24 test cases generated for an exemplary symptom set, only one achieved full overlap with the golden model when using GPT-4o. This finding underscores how strongly results depend on well-formulated prompts and the need for thorough testing strategies, especially in critical domains such as medicine. The study highlights the potential of combinatorial testing for prompt engineering and emphasizes the importance of rigorous validation methodologies for LLMs in healthcare applications.