- LiveClin is a contamination-free, biannually updated clinical benchmark for evaluating large vision-language models on realistic, multi-stage clinical case reasoning with medical images and tables.
- Each case presents a clinical scenario followed by a sequence of multiple-choice questions (MCQs) that mirror the progressive diagnostic workflow a clinician would follow — from initial presentation through diagnosis, treatment, complication management, and follow-up.
- [2026.02.27] Evaluation framework refactored.
- [2026.02.21] Paper released.
- [2026.02.15] LiveClin is published!
```
LiveClin/
├── evaluate.py # CLI entry-point
├── liveclin/ # Core package
│ ├── __init__.py # EvalConfig dataclass
│ ├── client.py # Async API client (shared connection pool)
│ ├── runner.py # Multi-turn evaluation engine
│ ├── analyzer.py # Fine-grained results analysis
│ ├── data.py # HuggingFace download & JSONL loading
│ └── utils.py # Prompt formatting & answer extraction
├── scripts/
│ ├── serve_sglang.py # SGLang deployment helper
│ └── test_vision.py # Vision capability smoke test
├── requirements.txt
└── README.md
```
Overall case accuracy, with models grouped by family and ordered reverse-chronologically. Bar textures indicate model type and dashed lines represent physician reference levels.
An example simulating the entire clinical pathway. The case progresses from initial assessment to long-term management, with new clinical information and diverse imaging modalities (e.g., X-ray, MRI, pathology, CT) progressively introduced at each key decision point.
```bash
git clone https://github.com/AQ-MedAI/LiveClin.git
cd LiveClin
pip install -r requirements.txt
```

A single command downloads the dataset (on first run) and runs the full pipeline:
```bash
# Remote API — images sent as URLs
python evaluate.py \
--model gpt-5 \
--api-base https://api.openai.com/v1 \
--api-key sk-xxx \
--image-mode url
```
For locally-served models (e.g. via SGLang), `--api-key` can be omitted:

```bash
python evaluate.py \
--model Qwen2.5-VL-7B-Instruct \
--api-base http://localhost:8000/v1 \
--image-mode local
```
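The two `--image-mode` settings control how images reach the model. A minimal sketch of the OpenAI-style content parts each mode would produce, assuming `url` passes the hosted link through and `local` embeds the file as a base64 data URI (our reading of the flag names, not necessarily `client.py`'s exact code; paths and URLs are illustrative):

```python
import base64

# --image-mode url: the hosted image URL is passed through unchanged.
url_part = {"type": "image_url",
            "image_url": {"url": "https://example.com/case1_xray.png"}}

# --image-mode local: read the image from disk and embed it as a data URI.
with open("data/images/case1_xray.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
local_part = {"type": "image_url",
              "image_url": {"url": f"data:image/png;base64,{b64}"}}
```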
The evaluation pipeline will:

- Auto-download the dataset from HuggingFace (only the requested config, cached for future runs)
- Evaluate all cases concurrently via multi-turn conversation (see the sketch after this list)
- Print a structured summary to the terminal
- Save detailed results with fine-grained analysis to JSON
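Concretely, each case runs as one growing conversation: the scenario opens the dialogue, each MCQ is appended in order, and the model's earlier answers stay in context. A minimal sketch of that loop with hypothetical helper names (not `runner.py`'s actual internals):

```python
import re
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token")

def parse_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-E) out of a free-text reply."""
    m = re.search(r"\b([A-E])\b", reply)
    return m.group(1) if m else None

async def eval_case(case: dict, model: str) -> list[bool]:
    """Ask one case's MCQs in order, keeping all prior turns in context."""
    messages, correct = [], []
    for i, mcq in enumerate(case["mcqs"]):
        # The clinical scenario is presented once, with the first question.
        text = mcq["question"] if i else f"{case['scenario']}\n\n{mcq['question']}"
        messages.append({"role": "user", "content": text})
        resp = await client.chat.completions.create(model=model, messages=messages)
        reply = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        correct.append(parse_choice(reply) == mcq["correct_answer"])
    return correct
```

The actual `runner.py` additionally attaches each turn's images and tables and fans per-case coroutines out under the `--concurrency` bound, which this sketch omits.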
Example terminal output:
```
============================================================
LiveClin Results: GPT-5 (2025_H1)
============================================================
Question Accuracy: 5179/6605 (78.4%)
Case Accuracy: 433/1407 (30.8%)
------------------------------------------------------------
By Chapter (Top-5 Case Accuracy):
Chapter 4: Endocrine, nutritional ... ( 79 cases) C-Acc 45.6% Q-Acc 83.3%
Chapter 12: Diseases of the skin ... ( 40 cases) C-Acc 45.0% Q-Acc 81.7%
...
By Chapter (Bottom-5 Case Accuracy):
Chapter 14: Diseases of the geni... ( 80 cases) C-Acc 22.5% Q-Acc 76.5%
Chapter 11: Diseases of the dige... (131 cases) C-Acc 24.4% Q-Acc 74.6%
...
------------------------------------------------------------
By Subcategory (Top-5 Case Accuracy):
Mental disorders due to substance... ( 10 cases) C-Acc 60.0% Q-Acc 89.1%
Dermatitis and eczema (L20-L30) ( 10 cases) C-Acc 60.0% Q-Acc 84.8%
...
By Subcategory (Bottom-5 Case Accuracy):
Glomerular diseases (N00-N08) ( 20 cases) C-Acc 10.0% Q-Acc 72.3%
Renal tubulo-interstitial diseases... ( 20 cases) C-Acc 15.0% Q-Acc 73.1%
...
------------------------------------------------------------
By Rarity:
Rare (1181 cases) Q-Acc 78.5% C-Acc 31.0%
Unrare ( 226 cases) Q-Acc 78.0% C-Acc 29.6%
------------------------------------------------------------
By Clinical Stage:
Presentation & Assessment (1618 MCQs) Q-Acc 77.8%
Diagnosis & Interpretation (2168 MCQs) Q-Acc 75.0%
Therapeutic Strategy (1601 MCQs) Q-Acc 83.6%
Complication Management ( 184 MCQs) Q-Acc 76.1%
Follow-up ( 391 MCQs) Q-Acc 86.2%
------------------------------------------------------------
By Question Position:
Q1 (1407 MCQs) Q-Acc 78.5% Err 0.1%
Q2 (1407 MCQs) Q-Acc 76.5% Err 0.4%
...
------------------------------------------------------------
By Image Modality:
CT ( 832 MCQs) Q-Acc 76.4%
MRI ( 621 MCQs) Q-Acc 78.2%
Clinical Photo ( 504 MCQs) Q-Acc 74.1%
...
------------------------------------------------------------
By Table Modality:
Lab Results (1023 MCQs) Q-Acc 79.8%
Medications ( 412 MCQs) Q-Acc 82.3%
...
============================================================
```
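The gap between question accuracy (78.4%) and case accuracy (30.8%) comes from the stricter case-level metric: a case appears to count as correct only when every one of its MCQs is answered correctly. A sketch of that aggregation over the saved results, where the per-case field name is an assumption rather than the exact output schema:

```python
import json

with open("results/gpt-5_2025_H1.json") as f:  # illustrative output path
    results = json.load(f)

cases = results["cases"]
# "mcq_correct" (a list of per-question booleans) is an assumed field name.
n_correct = sum(all(c["mcq_correct"]) for c in cases)
print(f"Case Accuracy: {n_correct}/{len(cases)} ({n_correct / len(cases):.1%})")
```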
Verify the model can perceive images before running a full evaluation:
```bash
# Remote API
python scripts/test_vision.py \
--model gpt-5 \
--api-base https://api.openai.com/v1 \
--api-key sk-xxx
# Local deployment (--api-key can be omitted)
python scripts/test_vision.py \
--model your-model \
--api-base http://localhost:8000/v1
```
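In spirit, the smoke test just sends one image and checks that the model can describe it; something along these lines (an illustrative request, not the script's actual contents):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
resp = client.chat.completions.create(
    model="your-model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Briefly describe this image."},
            # Any reachable test image works here (illustrative URL).
            {"type": "image_url", "image_url": {"url": "https://example.com/test.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # should mention the image content
```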
Deploy your own model with SGLang to expose an OpenAI-compatible API:

```bash
# Terminal 1 — launch the model server
python scripts/serve_sglang.py \
--model-path /path/to/your-model \
--tp 2 --dp 4 --port 8000
# Terminal 2 — run evaluation
python evaluate.py \
--model your-model-name \
--api-base http://localhost:8000/v1 \
--image-mode local
```

| Flag | Description | Default |
|---|---|---|
| `--model` | Model identifier (required) | — |
| `--api-base` | API base URL (required) | — |
| `--api-key` | API key (omit for local deployments) | `token` |
| `--image-mode` | `url` or `local` (required) | — |
| `--dataset` | Dataset config name | `2025_H1` |
| `--concurrency` | Max concurrent case evaluations | `100` |
| `--output` | Output JSON path | auto |
| `--resume` | Resume and retry failed cases | off |
| `--max-retries` | Max retries per API call | `5` |
| `--temperature` | Sampling temperature | `0.0` |
| `--max-tokens` | Max tokens per response | `16384` |
| `--verbose` | Print per-MCQ retry details | off |
| `--data-dir` | Root directory for auto-downloaded data | `data` |
| `--jsonl-path` | Override: direct path to JSONL file | — |
| `--image-root` | Override: direct path to image directory | — |
No extra steps needed. On first run, only the requested dataset config (e.g. 2025_H1) is downloaded from HuggingFace and cached locally in data/.
For offline use or shared storage, download the dataset yourself:
```bash
# Via git (requires git-lfs)
git lfs install
git clone https://huggingface.co/datasets/AQ-MedAI/LiveClin /path/to/liveclin-data
# Or via Python
python -c "from huggingface_hub import snapshot_download; snapshot_download('AQ-MedAI/LiveClin', repo_type='dataset', local_dir='/path/to/liveclin-data')"
```

Then point the evaluator to your local copy:
```bash
# Set the data root (auto-resolves internal structure)
python evaluate.py ... --data-dir /path/to/liveclin-data
# Or point directly to specific files (highest priority)
python evaluate.py ... --jsonl-path /path/to/2025_H1.jsonl --image-root /path/to/image/
```

Path priority: `--jsonl-path` / `--image-root` > `--data-dir` > default (`data/`).
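That priority order reduces to a short cascade; an illustrative rendering (the internal `<data-dir>/<dataset>.jsonl` layout is an assumption, not `data.py`'s actual logic):

```python
from pathlib import Path

def resolve_jsonl(jsonl_path: str | None = None,
                  data_dir: str = "data",
                  dataset: str = "2025_H1") -> Path:
    """Explicit --jsonl-path wins; otherwise fall back to the data root."""
    if jsonl_path:                                  # highest priority: direct override
        return Path(jsonl_path)
    return Path(data_dir) / f"{dataset}.jsonl"      # --data-dir, defaulting to data/
```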
To explore the data directly with Hugging Face `datasets`:

```python
from datasets import load_dataset
ds = load_dataset("AQ-MedAI/LiveClin", "2025_H1", split="test")
case = ds[0]
fp = case["exam_creation"]["final_policy"]
print(fp["scenario"])
for mcq in fp["mcqs"]:
    print(f"[{mcq['stage']}] {mcq['question'][:80]}...")
    print(f"  Answer: {mcq['correct_answer']}")
```
The framework applies a three-layer retry strategy for robust evaluation under unstable network conditions:

| Layer | Scope | Behavior |
|---|---|---|
| API | Single API call | Retries on timeouts, connection errors, rate limits, and 5xx errors, with exponential backoff |
| MCQ | Single question | If all API retries fail, retries the whole question before abandoning the case |
| Run | `--resume` flag | Re-runs only failed cases; successfully completed cases are preserved |
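At the API layer this is ordinary jittered exponential backoff; a minimal sketch of the idea (a hypothetical helper, not `client.py`'s implementation):

```python
import asyncio
import random

async def call_with_backoff(fn, max_retries: int = 5):
    """Retry an async API call on transient errors, doubling the wait each time."""
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:  # real code would match timeouts, 429s, and 5xx specifically
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt + random.random())  # 1-2s, 2-3s, 4-5s, ...
```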
```bash
# Resume after a run with transient failures
python evaluate.py --model gpt-5 --api-base ... --api-key ... --image-mode url --resume
```

Results are saved as a single JSON file (default: `results/<model>_<dataset>.json`):
```json
{
"meta": {
"model": "gpt-5",
"dataset": "2025_H1",
"image_mode": "url",
"started_at": "...",
"finished_at": "..."
},
"summary": {
"total_cases": 1407,
"total_mcqs": 6605,
"question_accuracy": ...,
"case_accuracy": ...,
"by_rarity": { "rare": {...}, "unrare": {...} },
"by_chapter": { "Chapter 2: Neoplasms": {...}, ... },
"by_subcategory": { "Chapter 2: Neoplasms": { "Subcategory A": {...}, ... }, ... },
"by_stage": { "Presentation & Assessment": {...}, ... },
"by_position": { "Q1": {...}, "Q2": {...}, ... },
"by_image_modality": { "CT": {...}, "MRI": {...}, ... },
"by_table_modality": { "Lab Results": {...}, "Medications": {...}, ... }
},
"cases": [...]
}
```
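To pull numbers back out of that file, e.g. the headline scores and the weakest chapters (the key names inside each `by_*` entry, such as `case_accuracy`, are assumptions; check your own output):

```python
import json

with open("results/gpt-5_2025_H1.json") as f:
    summary = json.load(f)["summary"]

# Assumes accuracies are stored as fractions in [0, 1].
print(f"Q-Acc {summary['question_accuracy']:.1%}  C-Acc {summary['case_accuracy']:.1%}")

# Rank chapters by case accuracy, worst first (per-entry keys are assumed).
chapters = sorted(summary["by_chapter"].items(), key=lambda kv: kv[1]["case_accuracy"])
for name, stats in chapters[:5]:
    print(f"{name}: C-Acc {stats['case_accuracy']:.1%}")
```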
The analyzer breaks results down along these dimensions:

| Dimension | Granularity | Categories |
|---|---|---|
| Rarity | 2 groups | Rare (84%), Non-rare (16%) |
| ICD-10 Chapter | 16 chapters | Disease-system-level breakdown |
| ICD-10 Subcategory | 48 groups | Nested under chapters — fine-grained ICD-10 category breakdown |
| Clinical Stage | 5 categories | Presentation & Assessment, Diagnosis & Interpretation, Therapeutic Strategy, Complication Management, Follow-up |
| Question Position | Q1–Q6 | Accuracy and error rate by position within each case |
| Image Modality | 11 types | X-ray, CT, MRI, Ultrasound, Clinical Photo, Endoscopy, Angiography, PET & SPECT, Pathology, Biosignals, Diagram & Plot |
| Table Modality | 9 types | Lab Results, Medications, Demographics, Monitoring, Literature, Genomics, Pathology & IHC, Procedures, Staging Systems |
If you find LiveClin useful, please cite:
```bibtex
@misc{wang2026liveclinliveclinicalbenchmark,
title={LiveClin: A Live Clinical Benchmark without Leakage},
author={Xidong Wang and Shuqi Guo and Yue Shen and Junying Chen and Jian Wang and Jinjie Gu and Ping Zhang and Lei Liu and Benyou Wang},
year={2026},
eprint={2602.16747},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.16747},
}
```

