
Commit e643525

Luodian and pufanyi authored
docs: restructure README and v0.6 release notes (#1086)
* docs: restructure README and v0.6 release notes
  - Restructure v0.6 doc Section 1 from bottom-up to top-down architecture overview: Pipeline (1.1) -> Model Interface (1.2) -> API Concurrency (1.3)
  - Restore original "Why lmms-eval?" voice with v0.6 insights integrated naturally (statistical rigor, evaluation as infrastructure)
  - Reorder README sections by user journey: Why -> Quickstart -> Usage -> Advanced
  - Collapse i18n language links into expandable details tag
  - Simplify quick links from 3 rows to 2 rows
  - Update title to "LMMs-Eval: Probing Intelligence in the Real World"
  - Add comprehensive CHANGELOG.md covering v0.6 highlights and ~182 commits
* fix: add CITATION.cff and remove dead repr_scripts.sh link
  - Copy CITATION.cff from feat/api-model-concurrency (enables GitHub's "Cite this repository" button and fixes broken FAQ reference)
  - Remove stale miscs/repr_scripts.sh link (file was renamed in #544, then deleted in #644; README was never updated)
* Fix formatting of evaluation components diagram
* Revise mermaid diagram for async pipeline with cache
  Updated mermaid diagram to improve clarity and formatting.

Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
1 parent 9bac529 commit e643525

File tree

4 files changed: +491 -264 lines

CHANGELOG.md

Lines changed: 110 additions & 0 deletions
# Changelog
## v0.6 (2026-02-16)
### Highlights

- **~7.5x API throughput improvement** via adaptive concurrency control, refill scheduling, prefix-aware queueing, and retry/backoff decoupling
- **Statistical analysis framework**: confidence intervals, clustered standard errors, paired comparison, power analysis, and model stability metrics
- **Evaluation as a Service**: HTTP eval server for async job submission decoupled from training loops
- **Model Registry V2**: manifest-driven unified model resolution with backward-compatible aliasing
- **50+ new evaluation tasks** and **10+ new model integrations**
- **Minimum Python version**: raised to 3.10
### Architecture & Performance

- **Adaptive concurrency control** for API-backed evaluation (`async_openai`, `openai`). The controller adjusts the in-flight request count using three online signals: failure rate, rate-limit hit rate (429/throttling), and p95 latency against a target budget. Measured ~7.5x throughput gain over a static single-concurrency baseline on `mme` with `LIMIT=100`. (#1080, #1082)
- **Refill-style scheduling**: completed requests immediately release slots for new work, eliminating the full-window barrier where the slowest request gates the entire batch.
- **Prefix-aware queueing**: reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (#1080)
- **Retry/backoff decoupling**: `retry_backoff_s` is explicitly separate from the request timeout, so retries don't sleep for long timeouts and tie up worker slots.
- **Throughput metrics in results table**: final output now includes requests/sec and wall time for each task. (#1078)
- **`--offset` option**: skip the first N samples in a dataset, useful for resuming partial runs or debugging specific subsets. (#1042)
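The adaptive controller described above can be sketched as a small AIMD-style feedback loop. This is an illustrative sketch only: the class name, window size, thresholds, and halving/increment policy are assumptions, not lmms-eval's actual implementation.

```python
from collections import deque


class AdaptiveConcurrencyController:
    """Adjust an in-flight request limit from three online signals:
    failure rate, rate-limit (429) hit rate, and p95 latency vs. a budget.
    Illustrative sketch; names and thresholds are assumptions."""

    def __init__(self, limit=8, min_limit=1, max_limit=256,
                 p95_budget_s=30.0, window=50):
        self.limit = limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.p95_budget_s = p95_budget_s
        self.events = deque(maxlen=window)  # (latency_s, ok, throttled)

    def record(self, latency_s, ok, throttled=False):
        """Report one finished request; re-tune once the window is full."""
        self.events.append((latency_s, ok, throttled))
        if len(self.events) == self.events.maxlen:
            self._adjust()

    def _p95(self):
        latencies = sorted(e[0] for e in self.events)
        return latencies[int(0.95 * (len(latencies) - 1))]

    def _adjust(self):
        n = len(self.events)
        fail_rate = sum(1 for e in self.events if not e[1]) / n
        throttle_rate = sum(1 for e in self.events if e[2]) / n
        if throttle_rate > 0.05 or fail_rate > 0.1 or self._p95() > self.p95_budget_s:
            # back off multiplicatively when any signal trips
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # probe additively while everything is healthy (AIMD)
            self.limit = min(self.max_limit, self.limit + 1)
```

Under this policy healthy windows grow the limit by one, while a burst of 429s halves it, which is the same additive-increase/multiplicative-decrease shape TCP congestion control uses.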
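Prefix-aware reordering amounts to a stable sort of the dispatch queue on a prefix hash. A minimal sketch of the idea, where the `prefix_key` helper and the `prefix_len` cutoff are hypothetical, not the actual #1080 code:

```python
import hashlib


def prefix_key(request, prefix_len=256):
    # Hash only the first prefix_len characters, so requests sharing a
    # system prompt / few-shot prefix map to the same bucket.
    return hashlib.sha256(request["prompt"][:prefix_len].encode()).hexdigest()


def reorder_by_prefix(requests, prefix_len=256):
    # Stable sort: same-prefix requests become adjacent in dispatch order,
    # raising prefill-cache hit rates on providers with prefix caching.
    return sorted(requests, key=lambda r: prefix_key(r, prefix_len))
```

Because the sort is stable, requests with the same prefix keep their relative order while distinct prefixes are grouped into contiguous runs.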
### Model Registry V2

- **Manifest-driven model resolution** (`ModelRegistryV2`): all model names resolve through a single path. Two dicts in `models/__init__.py` (`AVAILABLE_SIMPLE_MODELS`, `AVAILABLE_CHAT_TEMPLATE_MODELS`) declare available models, merged into `ModelManifest` objects at startup. Chat is always preferred over simple unless `force_simple=True`. (#1070)
- **Unified OpenAI model naming**: canonical names shortened from `openai_compatible` / `async_openai_compatible` to `openai` / `async_openai`. Old names continue to work as aliases via `MODEL_ALIASES`. File renames: `chat/openai_compatible.py` -> `chat/openai.py`, `simple/openai_compatible.py` -> `simple/openai.py`. (#1083, #1084)
- **Simple mode deprecation**: the `doc_to_visual` + `doc_to_text` interface for API models is deprecated. New integrations should use `doc_to_messages` + `ChatMessages`.
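The resolution order can be sketched in a few lines. The dict and alias names below come from the changelog; the dict contents and import paths are placeholders, and the real `ModelRegistryV2` merges these into `ModelManifest` objects rather than returning strings:

```python
# Placeholder entries; the real dicts live in models/__init__.py.
AVAILABLE_SIMPLE_MODELS = {"openai": "simple.openai.OpenAI"}
AVAILABLE_CHAT_TEMPLATE_MODELS = {"openai": "chat.openai.OpenAIChat"}
MODEL_ALIASES = {
    "openai_compatible": "openai",
    "async_openai_compatible": "async_openai",
}


def resolve(name, force_simple=False):
    # Old names keep working: map aliases to canonical names first.
    name = MODEL_ALIASES.get(name, name)
    # Chat is always preferred over simple unless force_simple=True.
    if not force_simple and name in AVAILABLE_CHAT_TEMPLATE_MODELS:
        return AVAILABLE_CHAT_TEMPLATE_MODELS[name]
    if name in AVAILABLE_SIMPLE_MODELS:
        return AVAILABLE_SIMPLE_MODELS[name]
    raise KeyError(f"unknown model: {name}")
```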
### Statistical Analysis

- **CLT and clustered standard error estimation**: for benchmarks with correlated questions (e.g., multiple questions per video), specify `cluster_key` in task YAML to apply cluster-robust SE correction. Clustered SE can be 3x larger than naive estimates. Output format: `score +/- 1.96 x SE` (95% CI). (#989)
- **Baseline-anchored paired comparison**: paired t-test on per-question differences `d_i = score_A - score_B`, removing question difficulty variance to isolate the model difference signal. Reports `mean_diff`, CI, and p-value. (#1006)
- **Power analysis**: compute the minimum sample size to detect a given effect size (e.g., 2% improvement) before running an evaluation. Rule of thumb: reliable benchmarks need `n > 1000`. (#1007)
- **Model stability metrics**: run N samples per question (temp=0.7), report expected accuracy (EA), consensus accuracy (CA), internal variance (IV), and consistency rate (CR).
- **Decontamination probing**: settings for detecting potential data contamination in video benchmarks. (#990)
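The clustered estimator can be sketched as follows. This is a minimal illustration of a cluster-robust SE for a mean score; the function names and the absence of small-sample corrections are simplifying assumptions, not the #989 implementation:

```python
import math
from collections import defaultdict


def clustered_mean_se(scores, clusters):
    """Mean score with a cluster-robust standard error.

    Residuals are summed within each cluster (e.g. all questions from one
    video) before squaring, so positively correlated questions inflate the
    SE instead of being treated as independent.
    """
    n = len(scores)
    mean = sum(scores) / n
    cluster_resid = defaultdict(float)
    for score, cluster in zip(scores, clusters):
        cluster_resid[cluster] += score - mean
    se = math.sqrt(sum(r * r for r in cluster_resid.values())) / n
    return mean, se


def ci95(mean, se):
    # Reported as score +/- 1.96 x SE, i.e. a 95% confidence interval.
    return mean - 1.96 * se, mean + 1.96 * se
```

With one question per cluster this reduces to the naive SE; with correlated questions grouped into clusters, it is strictly larger, which is exactly the inflation the changelog warns about.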
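The minimum-n computation behind the power analysis can be sketched with the standard two-proportion formula under a normal approximation. The function name and defaults are assumptions; the #1007 implementation may use a different variance model:

```python
import math
from statistics import NormalDist


def min_samples(p_base, effect, alpha=0.05, power=0.8):
    """Minimum per-model sample size to detect an accuracy gain of `effect`
    over baseline accuracy `p_base` (two-sided test, normal approximation,
    independent samples)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power=0.8
    p1, p2 = p_base, p_base + effect
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / effect ** 2)
```

Plugging in a 70%-accurate baseline and a 2% target improvement yields a required n in the thousands, which is where the `n > 1000` rule of thumb comes from; larger effects need far fewer samples.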
### Evaluation as a Service

- **HTTP eval server**: FastAPI-based server with endpoints for job submission (`/evaluate`), status polling (`/jobs/{id}`), queue management (`/queue`), and resource discovery (`/tasks`, `/models`). Includes `JobScheduler` for sequential GPU resource management. (#972)
- **Client libraries**: `EvalClient` (sync) and `AsyncEvalClient` (async) for programmatic job submission from training loops.
- **Web UI**: React + FastAPI web interface replacing the terminal TUI, with model/task selection, real-time command preview, and live output streaming. (#1001)
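A sequential scheduler of this shape can be sketched in a few lines. This is illustrative only: the real `JobScheduler` from #972 manages GPU resources behind FastAPI endpoints, whereas this sketch just runs callables one at a time on a worker thread:

```python
import queue
import threading


class JobScheduler:
    """Run submitted jobs strictly one at a time (illustrative sketch)."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.results = {}
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, job_id, fn):
        # Mark the job queued before enqueueing so status is never missing.
        self.results[job_id] = "queued"
        self.jobs.put((job_id, fn))
        return job_id

    def _run(self):
        while True:
            job_id, fn = self.jobs.get()
            self.results[job_id] = "running"
            try:
                self.results[job_id] = ("done", fn())
            except Exception as exc:  # a failed job must not kill the worker
                self.results[job_id] = ("failed", str(exc))
            self.jobs.task_done()

    def wait(self):
        # Block until every submitted job has finished.
        self.jobs.join()
```

The single worker thread is what makes the scheduling sequential: a second submitted job cannot start until the first releases the (implicit) GPU by returning.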
### New Tasks

**Spatial & 3D reasoning**:

- 3DSR (#1072), Spatial457 (#1031), SpatialTreeBench (#994), ViewSpatial (#983), OmniSpatial (#896)
- SiteBench (#984, multi-image #996), VSIBench (debiased & pruned #975, multi-image #993)
- Blink, CV_Bench, Embspatial, ERQA (#927), RefSpatial, Where2Place (#940)
- SpatialViz (#894)

**Knowledge & reasoning**:

- CoreCognition (#1064), MMSU (#1058), Uni-MMMU (#1029), Geometry3K (#1030)
- AuxSolidMath (#1034), MindCube (#876), MMVP (#1028), RealUnify (#1033)
- IllusionBench (#1035), MME-SCI (#878), VLMs are Blind (#931), VLMs are Biased (#928)
- Reasoning task versions for multiple benchmarks (#926, #1038)
- VLMEvalKit-compatible Qwen task variants for MMMU and MMStar (#1021)

**Video & streaming**:

- MMSI-Video-Bench (#1053), OVOBench (#957), Mantis-Eval (#978)
- LongVT for long video with tool calling (#944), SciVideoBench (#875)

**Multimodal & other**:

- PRISMM-Bench (#1063), OSI-bench (#1068), mmar (#1057), PAIBench-U (#1050)
- SPAR-bench (#1011), BabyVision Gen (#1010) + Und (#1015)
- AV-SpeakerBench (#943), imgedit bench (#941), MMSearch-Plus (#1054)
- CaptionQA (#1004), StructEditBench (#1016), kris_bench (#1017)
- FALCON-Bench (#942), UEval (#890), SeePhys (#903), SNSBench (#930)
- STARE (#893), GroundingMe (#949), GEditBench (#939), JMMMU-Pro (#937)
- WenetSpeech test_net split (#1027)
### New Models

- **GLM4V, LLaMA 4** (#1056)
- **OmniVinci, MiniCPM-o-2_6** (#1060)
- **Uni-MoE-2.0-Omni, Baichuan-Omni-1d5** (#1059)
- **Audio Flamingo 3, Kimi Audio** (#1055)
- **InternVL-HF** (#1039), **InternVL3, InternVL3.5** (#963)
- **Bagel UMM** (#1012), **Cambrian-S** (#977)
- **Qwen3-VL** (#883), **Qwen3-Omni, Video-Salmonn-2** (#955)
- **LLaVA-OneVision-1.5** chat interface (#887)
- **Multi-round generation** (`generate_until_multi_round`) for Qwen2.5-VL and Qwen2-VL (#960)
### Bug Fixes

- Raise minimum supported Python version to 3.10 (#1079)
- Fix video loader memory leaks via resource cleanup (#1026)
- Replace hardcoded `.cuda()` with `.to(self._device)` for multi-GPU support (#1024)
- Fix Qwen2.5-VL nframes edge case (#992, #987)
- Fix multi-image token insertion for the Cambrian-S model (#1075)
- Add dynamic `max_num` calculation to InternVL3 (#1069)
- Fix `partial` support in VSIBench metric calculation (#1041)
- Fix Qwen2-Audio parameter name error (#1081)
- Fix InternVL3 duplicate `<image>` token issue (#999)
- Fix hallusionbench processing for distributed eval (#885)
- Fix COCO Karpathy test data loading (#884)
- Fix nested dictionary input for vLLM `mm_processor_kwargs` (#915)
- Fix log_samples missing fields in doc (#731)
- Fix catastrophic backtracking in Charades eval regex
- Filter multimodal content from log samples while preserving metadata (#962)
- Fix Qwen2.5-VL batch size > 1 visual alignment (#971)
### Infrastructure & Documentation

- Developer guidance for AI agents and contributors (AGENTS.md) (#1085)
- Restructured v0.6 release notes: top-down architecture overview
- README: reordered sections by user journey, simplified header
- Added CITATION.cff, FAQ, quickstart guide
- i18n README translations for 18 languages (#979)
- Scalable choice selection for evaluation (#1005)
- Use dependency lower bounds for broader compatibility (#969)

CITATION.cff

Lines changed: 49 additions & 0 deletions
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "LMMs-Eval: Accelerating the Development of Large Multimodal Models"
type: software
authors:
  - name: "LMMs-Eval Team"
license: MIT
version: "0.5.0"
date-released: "2024-03-01"
url: "https://github.com/EvolvingLMMs-Lab/lmms-eval"
repository-code: "https://github.com/EvolvingLMMs-Lab/lmms-eval"
keywords:
  - multimodal evaluation
  - large language models
  - vision language models
  - benchmark
  - evaluation framework
preferred-citation:
  type: article
  title: "LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models"
  authors:
    - family-names: "Zhang"
      given-names: "Kaichen"
    - family-names: "Li"
      given-names: "Bo"
    - family-names: "Zhang"
      given-names: "Peiyuan"
    - family-names: "Pu"
      given-names: "Fanyi"
    - family-names: "Cahyono"
      given-names: "Joshua Adrian"
    - family-names: "Hu"
      given-names: "Kairui"
    - family-names: "Liu"
      given-names: "Shuai"
    - family-names: "Zhang"
      given-names: "Yuanhan"
    - family-names: "Yang"
      given-names: "Jingkang"
    - family-names: "Li"
      given-names: "Chunyuan"
    - family-names: "Liu"
      given-names: "Ziwei"
  year: 2024
  url: "https://arxiv.org/abs/2407.12772"
  identifiers:
    - type: other
      value: "arXiv:2407.12772"
      description: "arXiv preprint"
