Releases: EvolvingLMMs-Lab/lmms-eval
v0.7.1
Changes since v0.7
Features
- SGLang model wrapper refactor with distributed eval (TP+DP) support (#1253)
- Qwen3.5 model implementation and registration (#1249)
- Layered response cache with sealed segment promotion
- Batch watchdog for eval heartbeat monitoring
- OmniDocBench language subset support (#1250)
- OmniDocBench normalized Levenshtein distance metric (#1246)
- VisRes Bench benchmark (CVPR 2026) (#1245)
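The normalized Levenshtein distance metric listed above can be sketched as follows. This is an illustrative implementation of the standard formulation (edit distance scaled by the longer string's length), not the exact lmms-eval code:

```python
# Illustrative sketch of a normalized Levenshtein distance, as used for
# OCR/document comparison metrics; not the actual lmms-eval implementation.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

Lower is better: 0.0 means an exact match, 1.0 means no character survives.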
Fixes
- Cross-class filelock deadlock in datasets loading (#1253)
- CharXiv lazy-init OpenAI client and configurable model version (#1252)
- Qwen3 VL video metadata preservation for video tasks (#1248)
- FPS initialization for qwen3_vl and qwen2_5_vl chat path (#1239)
- MMMU Pro unbound variable fix
- Dataset loading stabilization (num_proc default)
v0.7 - Operational Simplicity & Pipeline Maturity
Highlights
- 25+ new benchmark tasks spanning document, video, math, spatial, AGI, audio, and safety domains
- Unified video decode — single `read_video` entry point with TorchCodec backend (up to 3.58x faster), DALI GPU decode, and LRU caching
- Lance-backed video distribution — MINERVA videos in a single Lance table on Hugging Face
- YAML config-driven evaluation — `--config` replaces fragile CLI one-liners with validated, reproducible YAML files
- Reasoning tag stripping — pipeline-level `<think>` block removal for reasoning models, configurable via `--reasoning_tags`
- Safety & red-teaming baselines — JailbreakBench with ASR, refusal rate, toxicity, and over-refusal metrics
- Token efficiency metrics — per-sample input/output/reasoning token counts and run-level throughput
- Agentic task evaluation — `generate_until_agentic` output type with iterative tool-call loops and deterministic simulators
- Async OpenAI `message_format` — replaces `is_qwen3_vl` flag with extensible format system
- Flattened JSONL logs — cleaner output format for `generate_until` responses
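The reasoning tag stripping highlight can be illustrated with a minimal sketch. The real pipeline makes the tag pair configurable via `--reasoning_tags`; this hard-codes `<think>` defaults for clarity and is not the actual lmms-eval implementation:

```python
import re

# Illustrative sketch: remove <think>...</think> reasoning blocks from a
# model response before scoring. Tag pair defaults are an assumption here;
# lmms-eval exposes them via --reasoning_tags.

def strip_reasoning(text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> str:
    pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
    return re.sub(pattern, "", text, flags=re.DOTALL).strip()
```

Non-greedy matching with `re.DOTALL` lets the pattern span multi-line reasoning traces while leaving the final answer intact.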
New Model Backends
- NanoVLM — lightweight local inference backend
- Async multi-GPU HF — parallel inference across GPUs with HuggingFace Transformers
Full Release Notes
See the complete v0.7 release notes for detailed documentation of every feature, migration guides, and architecture decisions.
Install
pip install lmms-eval==0.7
# or
uv pip install lmms-eval==0.7
v0.6.1: Response Cache Fix & Cleanup
What's Changed
Bug Fixes
- Fix response cache returning identical results when `temperature > 0` and `repeat > 1` — the legacy per-model JSONL cache (Layer 1) used only `doc_id` as cache key without checking determinism. When running stochastic sampling with multiple repeats, all repeats silently returned the same cached response. This was a data corruption bug.
- Fix multi-GPU metric gather key ordering (#1089)
- Fix simple `qwen3_vl` inference when `batch_size > 1` (#1090)
- Fix PyPI publish workflow: version auto-sync from git tag, version bump to 0.6.1
Features
- Response-level caching system (`ResponseCache`) — new SQLite + JSONL write-ahead log architecture (`--use_cache ./eval_cache`). Determinism-aware: automatically bypasses cache for `temperature > 0`, `do_sample=True`, `n > 1`. Per-rank files for distributed safety. Crash recovery via JSONL replay.
- JSONL audit log records all responses — both deterministic and non-deterministic responses are logged to JSONL for real-time observability (`tail -f rank0.jsonl`). Each record includes a `deterministic` field. Only deterministic responses are stored in SQLite for cache reuse.
- SAM3 model + SA-Co/Gold benchmark (#1088)
- GitHub Actions PyPI publish workflow (#1087)
- Qwen3.5 runtime compatibility docs and examples (#1094)
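The determinism-aware bypass described above can be sketched as a simple predicate over the generation kwargs. The parameter names mirror the release notes; the real `ResponseCache` logic lives in lmms-eval:

```python
# Illustrative sketch of the v0.6.1 cache determinism check: only cache
# responses when sampling is deterministic, otherwise bypass the cache
# entirely (fixing the stochastic-repeat corruption bug).

def is_deterministic(gen_kwargs: dict) -> bool:
    if gen_kwargs.get("temperature", 0) > 0:
        return False
    if gen_kwargs.get("do_sample", False):
        return False
    if gen_kwargs.get("n", 1) > 1:
        return False
    return True
```

With this predicate, `temperature > 0`, `do_sample=True`, or `n > 1` each force a cache miss, so repeated stochastic rollouts can never silently return a stale response.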
Cleanup
- Remove dead code: `CachingLMM`, `hash_args`, `SqliteDict` import from `api/model.py`
- Remove buggy per-model JSONL cache: `LMMS_EVAL_USE_CACHE` env var, `load_cache()`, `get_response_from_cache()`, `add_request_response_to_cache()`, and calls in 4 models (vllm, vllm_generate, async_openai, longvila)
- Remove `sqlitedict` dependency from `pyproject.toml`
- Simplify `CacheHook` to a no-op stub (50+ models still reference `self.cache_hook.add_partial(...)`)
Tests
- 34 cache tests covering: determinism detection, cache key collision prevention, hit/miss behavior, non-deterministic bypass with repeats, JSONL audit log observability, crash recovery via JSONL replay, multi-rank isolation and shard merging, model fingerprint isolation, stats accuracy across close/reopen, large batch sanity (1000 requests)
- Note: `loglikelihood` end-to-end execute flow is not yet covered
Docs
- `docs/caching.md` rewritten for the new `ResponseCache` implementation
- Agent skill for lmms-eval (#1092)
Full Changelog: v0.6...v0.6.1
v0.6: Re-engineered Pipeline for Scalable and Statistically-Principled Evaluation
Introduction
Key Highlights:
- Evaluation as a Service: evaluation runs as a standalone service, decoupled from training, serving queue-based eval requests
- Statistical analysis: statistically grounded results that capture real model improvements rather than a single accuracy score (confidence intervals, clustered standard errors, paired comparison with t-test)
- Runtime throughput: optimizations to max out your model runtime's capacity (~7.5x over previous versions)
- Model Registry V2: manifest-driven unified model resolution with backward-compatible aliasing
- 50+ new evaluation tasks across spatial reasoning, knowledge, video, and multimodal domains
- 10+ new model integrations including Qwen3-VL, LLaMA 4, GLM4V, InternVL3/3.5
- Minimum Python version: raised to 3.10
Table of Contents
- Introduction
- Table of Contents
- Major Features
- Bug Fixes
- Infrastructure & Documentation
- Full Documentation
Major Features
1. Evaluation as a Service
When training produces checkpoints every N steps, evaluation becomes a scheduling problem: either pause training to evaluate (wasting GPU cycles), or build custom orchestration to run it elsewhere. v0.6 solves this with a disaggregated eval service — the training loop sends an HTTP request and moves on, the eval server handles the rest on its own GPUs. The service can be deployed as a standalone cluster, accepting eval requests from multiple training runs, whether mid-training checkpoint evaluations or post-training comprehensive benchmarks.
Training Loop ──POST /evaluate──> Eval Server ──queue──> Job Worker (GPU)
<──poll /jobs/{id}── <──result──
The server uses a JobScheduler that queues jobs and processes them sequentially, preventing GPU resource conflicts without manual coordination. It also supports FSDP2 sharded checkpoint merging via /merge, so distributed training checkpoints can be evaluated directly.
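The sequential scheduling idea can be sketched with a queue and a single worker thread. This is a minimal illustration, not the actual server's `JobScheduler` (which adds job IDs, status polling, and `/merge` checkpoint handling):

```python
import queue
import threading

# Sketch of a sequential job scheduler: eval jobs are queued and executed
# one at a time on a worker thread, so concurrent eval requests never
# contend for the same GPUs. Illustrative only.

class JobScheduler:
    def __init__(self):
        self.jobs = queue.Queue()
        self.results = {}
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, job_id, fn):
        """Enqueue a job; returns immediately (non-blocking for callers)."""
        self.jobs.put((job_id, fn))

    def _worker(self):
        while True:
            job_id, fn = self.jobs.get()
            self.results[job_id] = fn()   # run the eval job to completion
            self.jobs.task_done()
```

Because the worker drains the queue serially, a second training run's request simply waits behind the first rather than oversubscribing GPU memory.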
Integration:
from lmms_eval.entrypoints import EvalClient
client = EvalClient("http://eval-server:8000")
# Inside training loop — non-blocking
job = client.evaluate(
model="qwen2_5_vl",
tasks=["mmmu_val", "mme"],
model_args={"pretrained": "Qwen/Qwen2.5-VL-7B-Instruct"},
)
# Retrieve results when needed
result = client.wait_for_job(job["job_id"])
Both sync (`EvalClient`) and async (`AsyncEvalClient`) clients are available. Full API: `/evaluate`, `/jobs/{id}`, `/queue`, `/tasks`, `/models`, `/merge`. (#972, #1001)
2. Statistical Analysis
- Standard error & confidence intervals: output format `score ± 1.96 × SE` (95% CI) (#989)
- Clustered standard errors: for benchmarks with correlated questions (e.g., multiple questions per video), specify `cluster_key` in task YAML. Clustered SE can be 3x larger than naive estimates (#989)
- Paired comparison with t-test: per-question differences remove question difficulty variance, isolating real model differences. Reports `mean_diff`, CI, and p-value (#1006)
- Power analysis: compute minimum sample size to detect a given effect size before running an evaluation (#1007)
- Model stability metrics: run N samples per question (temp=0.7), report expected accuracy (EA), consensus accuracy (CA), internal variance (IV), and consistency rate (CR) (#997)
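The paired comparison above can be sketched with per-question score differences. This is an illustrative stdlib-only version using a normal-approximation 95% CI; the real implementation also computes an exact p-value from the t distribution:

```python
import math
from statistics import mean, stdev

# Sketch of a paired model comparison: differencing per-question scores
# removes shared question-difficulty variance, so the CI reflects the
# *model* difference. Normal-approximation CI; illustrative only.

def paired_comparison(scores_a, scores_b):
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    return {
        "mean_diff": mean_diff,
        "ci95": (mean_diff - 1.96 * se, mean_diff + 1.96 * se),
        "t_stat": mean_diff / se if se else float("inf"),
    }
```

Because both models answer the same questions, a hard question hurts both scores equally and cancels in the difference, which is why paired CIs are tighter than comparing two independent accuracy numbers.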
# Enable clustered SE in task YAML
task: videomme
cluster_key: video_id
3. Runtime Throughput
To reach the maximum throughput your model runtime endpoint can provide, v0.6 adds an adaptive scheduling layer that controls the concurrency of requests sent to the runtime, achieving roughly a 7.5x speedup over the baseline.
Key mechanisms:
- Adaptive concurrency control: adjusts in-flight requests using failure rate, rate-limit hits, and p95 latency
- Refill scheduling: completed requests immediately release slots (no full-window barrier)
- Prefix-aware queueing: dispatch by prefix hash for better prefill-cache hits
- Retry/backoff decoupling: retry sleep is separate from request timeout
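The adaptive concurrency mechanism can be sketched as an AIMD (additive-increase, multiplicative-decrease) controller over the in-flight limit. The thresholds and field names here are illustrative assumptions, not the actual scheduler's internals:

```python
# Sketch of adaptive concurrency control in the spirit of the v0.6
# scheduler: grow the in-flight limit additively while the endpoint is
# healthy; halve it on failures, rate-limit hits, or high p95 latency.
# The 5% failure threshold is an illustrative assumption.

class AdaptiveConcurrency:
    def __init__(self, min_c=1, max_c=64, target_latency_s=15.0):
        self.min_c, self.max_c = min_c, max_c
        self.target_latency_s = target_latency_s
        self.limit = min_c

    def update(self, failure_rate: float, rate_limited: bool,
               p95_latency_s: float) -> int:
        unhealthy = (failure_rate > 0.05 or rate_limited
                     or p95_latency_s > self.target_latency_s)
        if unhealthy:
            self.limit = max(self.min_c, self.limit // 2)   # multiplicative decrease
        else:
            self.limit = min(self.max_c, self.limit + 1)    # additive increase
        return self.limit
```

Combined with refill scheduling (a completed request immediately frees its slot), this keeps the endpoint near its sustainable capacity without manual tuning.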
Performance:
| Run Type | Concurrency | RPS | Wall Time (s) | Speedup |
|---|---|---|---|---|
| baseline | 1 | 0.33 | 305s | 1.0x |
| static | 24 | 1.93 | 52s | 5.9x |
| adaptive (v2) | 16 | 2.46 | 41s | 7.5x |
Benchmark: mme task, LIMIT=100, bytedance-seed/seed-1.6-flash via OpenRouter.
Usage (async, recommended):
python -m lmms_eval \
--model async_openai \
--model_args model_version=<model>,num_cpus=16,adaptive_concurrency=true,\
adaptive_min_concurrency=1,adaptive_max_concurrency=64,\
adaptive_target_latency_s=15.0,retry_backoff_s=1.0,\
prefix_aware_queue=true,prefix_hash_chars=256
4. Model Registry V2
Previously, each model name was wired to a single implementation through scattered import logic. Registry V2 replaces this with a manifest system: each model declares its available implementations (chat and/or simple) and any backward-compatible aliases in one place. When you pass --model openai_compatible, the registry resolves the alias to the canonical openai model, finds both chat and simple implementations, and picks chat by default. This means old scripts keep working while new integrations automatically use the better interface.
- Unified resolution: `--model X` resolves through `ModelManifest` -> chat (preferred) or simple (fallback, or `force_simple=True`)
- Aliasing: old names like `openai_compatible`, `async_openai_compatible` map to canonical `openai`/`async_openai` transparently (#1083, #1084)
- Plugin support: external packages can register models via Python entry points without modifying core code
- Simple mode deprecated: `doc_to_visual` + `doc_to_text` for API models is deprecated. Use `doc_to_messages` + `ChatMessages` (#1070)
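The resolution flow can be sketched as a manifest lookup: aliases resolve to a canonical name first, then the chat implementation is preferred over simple. The class names in the manifest below are illustrative placeholders, not the actual lmms-eval classes:

```python
# Sketch of Registry-V2-style resolution. MANIFESTS maps canonical model
# names to their chat/simple implementations and backward-compatible
# aliases; implementation names here are hypothetical.

MANIFESTS = {
    "openai": {
        "chat": "OpenAIChat",       # hypothetical class name
        "simple": "OpenAISimple",   # hypothetical class name
        "aliases": ["openai_compatible"],
    },
}

# Invert alias lists once so lookup is O(1).
ALIASES = {alias: name for name, m in MANIFESTS.items() for alias in m["aliases"]}

def resolve(model_name: str, force_simple: bool = False) -> str:
    canonical = ALIASES.get(model_name, model_name)   # old names keep working
    manifest = MANIFESTS[canonical]
    if not force_simple and manifest.get("chat"):
        return manifest["chat"]                        # chat preferred by default
    return manifest["simple"]
```

This is why `--model openai_compatible` in an old script transparently lands on the canonical `openai` chat implementation.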
5. New Tasks
Spatial & 3D reasoning:
- 3DSR (#1072), Spatial457 (#1031), SpatialTreeBench (#994), ViewSpatial (#983), OmniSpatial (#896)
- SiteBench (#984, multi-image #996), VSIBench (debiased & pruned #975, multi-image #993)
- Blink, CV_Bench, Embspatial, ERQA (#927), RefSpatial, Where2Place (#940), SpatialViz (#894)
Knowledge & reasoning:
- CoreCognition (#1064), MMSU (#1058), Uni-MMMU (#1029), Geometry3K (#1030)
- AuxSolidMath (#1034), MindCube (#876), MMVP (#1028), RealUnify (#1033)
- IllusionBench (#1035), MME-SCI (#878), VLMs are Blind (#931), VLMs are Biased (#928)
- Reasoning task versions for multiple benchmarks (#926, #1038)
- VLMEvalKit-compatible Qwen task variants for MMMU and MMStar (#1021)
Video & streaming:
- MMSI-Video-Bench (#1053), OVOBench (#957), Mantis-Eval (#978)
- LongVT for long video with tool calling (#944), SciVideoBench (#875)
- Decontamination probing for VideoMME, VideoMMMU, LongVideoBench, LVBench, LongVT (#990)
Multimodal & other:
- PRISMM-Bench (#1063), OSI-bench (#1068), mmar (#1057), PAIBench-U (#1050)
- SPAR-bench (#1011), BabyVision Gen (#1010) + Und (#1015)
- AV-SpeakerBench (#943), imgedit bench (#941), MMSearch-Plus (#1054)
- CaptionQA (#1004), StructEditBench (#1016), kris_bench (#1017)
- FALCON-Bench (#942), UEval (#890), SeePhys (#903), SNSBench (#930)
- STARE (#893), GroundingMe (#949), GEditBench (#939), JMMMU-Pro (#937)
- WenetSpeech test_net split (#1027)
6. New Models
- GLM4V, LLaMA 4 (#1056)
- OmniVinci, MiniCPM-o-2_6 (#1060)
- Uni-MoE-2.0-Omni, Baichuan-Omni-1d5 (#1059)
- Audio Flamingo 3, Kimi Audio (#1055)
- InternVL-HF (#1039), InternVL3, InternVL3.5 (#963)
- Bagel UMM (#1012), Cambrian-S (#977)
- Qwen3-VL (#883), Qwen3-Omni, Video-Salmonn-2 (#955)
- LLaVA-OneVision-1.5 chat interface (#887)
- Multi-round generation (`generate_until_multi_round`) for Qwen2.5-VL and Qwen2-VL (#960)
Bug Fixes
- Raise minimum supported Python version to 3.10 (#1079)
- Fix video loader memory leaks via resource cleanup (#1026)
- Replace hardcoded `.cuda()` with `.to(self._device)` for multi-GPU support (#1024)
- Fix Qwen2.5-VL nframes edge case (#992, #987)
- Fix multi-image token insertion for Cambrian models (#1075)
- Add dynamic `max_num` calculation to InternVL3 (#1069)
- Fix `partial` support in VSIBench metric calculation (#1041)
- Fix Qwen2-Audio parameter name error (#1081)
- Fix InternVL3 duplicate `<image>` token issue (#999)
- Fix hallusionbench processing for distributed eval (#885)
- Fix COCO Karpathy test data loading (#884)
- Fix nested dictionary input for vLLM `mm_processor_kwargs` (#915)
- Fix log_samples missing fields in doc (#731)
- Fix catastrophic backtracking in Charades eval regex
- Filter multimodal content from log samples while preserving metadata (#962)
- Fix Qwen2.5-VL batch size > 1 visual alignment (#971)
Infrastructure & Documentation
- Developer guidance for AI agents and contributors (AGENTS.md) (#1085)
- Restructured v0.6 release notes: top-down architecture overview
- README: reordered sections by user journey, simplified header
- Added CITATION.cff, FAQ, quickstart guide
- i18n README translations for 18 languages (#979)
- Scalable choice selection for evaluation (#1005)
- Use dependency lower bounds for broader compatibility (#969)
- New CLI options: `--offset` to skip first N samples (#1042)
- Throughput metrics (requests/sec, wall time) in results table (#1078)
Full Documentation
...
v0.5: Better Coverage of Audio Evaluations and Alignment Checks on STEM/Reasoning Benchmarks
Introduction
Key Highlights:
- Audio-First: Comprehensive audio evaluation with paralinguistic analysis
- Response Caching: Production-ready caching system for faster re-evaluation
- 5 New Models: Including audio-capable GPT-4o, LongViLA, Gemma-3
- 50+ New Benchmark Variants: Audio, vision, coding, and STEM tasks
- MCP Integration: Model Context Protocol client support
Table of Contents
- Introduction
- Major Features
- Usage Examples
- Technical Details
- Migration Guide
- Bug Fixes and Improvements
- Deprecated Features
- Contributing
- Acknowledgments
- Getting Help
Major Features
1. Response Caching System
A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:
Key Features:
- Per-document caching: Cached at `(task_name, doc_id)` level
- Distributed-safe: Separate cache files per rank/world size
- Zero-overhead: Automatic cache hits with no code changes
- Multi-backend: Works with async OpenAI, vLLM, and custom models
Enable Caching:
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root" # optional
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
--tasks mmmu_val \
--batch_size 1 \
--output_path ./logs/
Cache Location:
- Default: `~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Each line: `{"doc_id": <doc_id>, "response": <string>}`
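Given the per-line record shape above, loading a rank's cache file back into memory is a few lines. This is a minimal sketch, assuming only the documented `doc_id`/`response` fields:

```python
import json

# Sketch: load a per-rank JSONL cache file into a doc_id -> response map.
# Each line has the {"doc_id": ..., "response": ...} shape documented in
# the release notes; any extra fields are simply ignored here.

def load_rank_cache(path: str) -> dict:
    cache = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            cache[record["doc_id"]] = record["response"]
    return cache
```

One record per line means a crashed run leaves at worst one truncated trailing line, which is what makes append-only JSONL a safe cache format for distributed eval.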
API Integration:
def generate_until(self, requests):
    self.load_cache()
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
See full documentation in `docs/caching.md`.
2. Audio Evaluation Suite
Comprehensive audio understanding capabilities with three major benchmark families:
Step2 Audio Paralinguistic (11 tasks)
Fine-grained paralinguistic feature evaluation:
- Acoustic Features: pitch, rhythm, speed, voice_tone, voice_styles
- Speaker Attributes: age, gender, emotions
- Environmental: scene, event, vocalsound
- Semantic Match metrics
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic \
--batch_size 1
VoiceBench (9 main categories, 30+ subtasks)
Comprehensive voice and speech evaluation:
- Instruction Following: ifeval, alpacaeval, advbench
- Reasoning: bbh (Big Bench Hard), commoneval
- Knowledge: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- Q&A: openbookqa
- Accent Diversity: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- Expressiveness: wildvoice
- Metrics vary by task type, including accuracy (1-5), failure rate, LLM eval, etc.
# Full VoiceBench
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks voicebench \
--batch_size 1
# Specific accent evaluation
python -m lmms_eval \
--tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
--batch_size 1
WenetSpeech (2 splits)
Large-scale ASR and speech evaluation:
- dev: Development set for validation
- test_meeting: Meeting domain evaluation
- MER (Mixed Error Rate) metrics
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks wenet_speech_dev,wenet_speech_test_meeting \
--batch_size 1
Audio Pipeline Features:
- HuggingFace audio dataset integration
- Unified audio message format
- Multiple metric support (Accuracy, WER, GPT-4 Judge)
- Task grouping for multi-subset benchmarks
3. New Model Support
Five new model integrations expanding audio and vision capabilities:
| Model | Type | Key Features | Usage Example |
|---|---|---|---|
| GPT-4o Audio Preview | Audio+Text | Paralinguistic understanding, multi-turn audio | --model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17 |
| Gemma-3 | Vision+Text | Enhanced video handling, efficient architecture | --model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it |
| LLaVA-OneVision 1.5 | Vision+Text | Improved vision understanding, latest LLaVA | --model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b |
| LongViLA-R1 | Video+Text | Long-context video, efficient video processing | --model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B |
| Thyme | Vision+Text | Reasoning-focused, enhanced image handling | --model thyme --model_args pretrained=thyme-ai/thyme-7b |
Example Usage:
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench \
--batch_size 1
# LongViLA for video understanding
python -m lmms_eval \
--model longvila \
--model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
--tasks videomme,egoschema \
--batch_size 1
4. New Benchmarks
Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage into specialized domains:
Vision & Reasoning Benchmarks
| Benchmark | Variants | Focus | Metrics |
|---|---|---|---|
| CSBench | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
| SciBench | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
| MedQA | 1 | Medical question answering | Accuracy |
| SuperGPQA | 1 | Graduate-level science Q&A | Accuracy |
| Lemonade | 1 | Video action recognition | Accuracy |
| CharXiv | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |
Example Usage:
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1
# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1
# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
Reproducibility Validation
We validated our benchmark implementations against official results using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:
| Model | Task | lmms-eval | Reported | Δ | Status |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| Qwen-2.5-7B-Instruct | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| Qwen-2.5-7B-Instruct | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| Qwen-2.5-7B-Instruct | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| Llama-3.1-8B | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| Llama-3.1-8B | SciBench | 15.35 | 10.78 | +4.57 | ± |
| Llama-3.1-8B | CSBench | 62.49 | 57.87 | +4.62 | ± |
| Llama-3.1-8B | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |
Status Legend: ✓ = Strong agreement (Δ ≤ 2.5%) | ± = Acceptable variance (2.5% < Δ ≤ 5%)
5. Model Context Protocol (MCP) Integration
Support for MCP-enabled models with tool calling:
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
--tasks mmmu_val \
--batch_size 1
Features:
- Tool call parsing and execution
- Multi-step reasoning with tools
- Custom MCP server integration
- See `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for templates
6. Async OpenAI Improvements
Enhanced async API integration:
- Better rate limit handling
- Configurable retry logic with delays
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints
Common Args Support:
# Now supports additional parameters
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
--tasks mmstar
Usage Examples
Audio Evaluation with Caching
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench \
--batch_size 8 \
--output_path ./audio_results/ \
--log_samples
# Second run will use cache - much faster!
Multi-Benchmark Evaluation
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20 \
--tasks voicebench_mmsu,csbench,scibench_math,charxiv \
--batch_size 4 \
--output_path ./multimodal_results/
Distributed Evaluation with Caching
export LMMS_EVAL_USE_CACHE=True
torchrun --nproc_per_node=8 -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks step2_audio_paralinguistic,csbench,scibench \
--batch...
v0.4.1: Tool Calling Evaluation, Cache API, and More Models and Benchmarks
Main Features
- Tool calling evaluation through MCP and an OpenAI-compatible server
- A unified cache API for resuming responses
Tool Calling Examples
We now support tool calling evaluation for models served through an OpenAI-compatible server and an MCP server. To start, first set up an OpenAI-compatible server through vLLM, SGLang, or any similar framework.
Then, write your own MCP server for our client to connect to. An example launch command:
accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \
--model async_openai \
--model_args model_version=$CKPT_PATH,mcp_server_path=/path/to/mcp_server.py \
--tasks $TASK_NAME \
--batch_size 1 \
--output_path ./logs/ \
--log_samples
Cache API
To handle cases where an evaluation is terminated early, we have created a cache API so you can resume the evaluation instead of starting a completely new one. Example of using the cache API in your `generate_until`:
def generate_until(self, requests):
    self.load_cache()
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
More information can be found in `caching.md`.
What's Changed
- [NEW TASK] Add video task support for LSDBench by @taintaintainu in #778
- feat: Add max_frame_num parameter to encode_video by @Luodian in #783
- fix: Fix video loading logic and add protocol for loading by @kcz358 in #788
- Fix broken references by @yaojingguo in #787
- fix(qwen2_5_vl): Ensure unique frame indices for videos with few frames by @Luodian in #789
- New tasks supported: EMMA by @Devininthelab in #790
- Add httpx_trust_env arg for openai_compatible model by @yaojingguo in #791
- [TASK] MMRefine Benchmark by @skyil7 in #793
- Fix accuracy computation for VQAv2 by @oscmansan in #794
- docs: Update installation instructions to use uv package manager by @Luodian in #799
- Remove the duplicated Aero-1-Audio content by @yaojingguo in #803
- feat: Add Charxiv, videomme long, and vllm threadpool for decoding inputs by @kcz358 in #802
- [Feature] Add GPT-4o Audio by @YichenG170 in #798
- [Feature] Add Thyme Model by @xjtupanda in #811
- [Fix] Fix tools and add mcp client by @kcz358 in #812
- [Feat] Adding cache api for model by @kcz358 in #814
- fix: Async OpenAI caching order and more common args by @kcz358 in #816
- [Feature] Support for Gemma-3 Models by @RadhaGulhane13 in #821
- feat: Add longvila-r1 and benchmarks by @kcz358 in #819
- [bugfix] fix bug in srt_api.py by @zzhbrr in #826
- fix(gemma3): use Gemma3ForConditionalGeneration to load by @Luodian in #827
- feat: add llava_onevision1_5 by @mathCrazyy in #825
- [Feature] Add VoiceBench by @YichenG170 in #809
- add script of LLaVA-OneVision1_5 by @mathCrazyy in #828
- add scibench(math) task by @KelvinDo183 in #834
New Contributors
- @taintaintainu made their first contribution in #778
- @oscmansan made their first contribution in #794
- @YichenG170 made their first contribution in #798
- @xjtupanda made their first contribution in #811
- @zzhbrr made their first contribution in #826
- @mathCrazyy made their first contribution in #825
- @KelvinDo183 made their first contribution in #834
Full Changelog: v0.4...v0.4.1
v0.4: multi-node, tp + dp parallel, unified llm-as-judge api, `doc_to_message` support
😻 LMMs-Eval upgrades to v0.4, better evals for better models.
- multi-node evals, tp+dp parallel.
- new `doc_to_message` support for interleaved modality inputs, fully compatible with the OpenAI official message format, suitable for evaluating more complicated tasks.
- unified `llm-as-judge` API to support more versatile metric functions, with async mode support for high concurrency and throughput.
- more features:
  - tool use for agentic tasks
  - programmatic API for supporting more third-party training frameworks like nanoVLM; now call LMMs-Eval in your training loop to inspect your models on more tasks.
This upgrade focuses on accelerating evaluation and improving consistency, addressing the needs of reasoning models with longer outputs, multiple rollouts, and scenarios where LLM-as-judge is required for general-domain tasks.
With LMMs-Eval, we are dedicated to building the frontier evaluation toolkit to accelerate the development of better multimodal models.
More at: https://github.com/EvolvingLMMs-Lab/lmms-eval
Meanwhile, we are currently building the next frontier fully open multimodal models and new supporting frameworks.
Vibe check with us: https://lmms-lab.com
What's Changed
- [Improvement] Accept chat template string in vLLM models by @VincentYCYao in #768
- [Feat] fix tasks and vllm to reproduce better results. by @Luodian in #774
- Remove the deprecated tasks related to the nonexistent lmms-lab/OlympiadBench dataset by @yaojingguo in #776
- [Feat] LMMS-Eval 0.4 by @Luodian in #721
Full Changelog: v0.3.5...v0.4
v0.3.5
What's Changed
- pip 0.3.4 by @pufanyi in #697
- [Fix] Minor fix on some warning messages by @kcz358 in #704
- [FIX] Add macro metric to task xlrs-lite by @nanocm in #700
- [Fix] Fix evaluator crash with accelerate backend when num_processes=1 by @miikatoi in #699
- [Fix] Enable the ignored API_URL in the MathVista evaluation. by @MoyusiteruIori in #705
- Adds VideoMathQA - Task Designed to Evaluate Mathematical Reasoning in Real-World Educational Videos by @hanoonaR in #702
- Update sentencepiece dependency and add new parameters to mathvista_t… by @Luodian in #716
- [fix ] Refactor Accelerator initialization by @Luodian in #717
- [Minor] typo fixed in task_guide.md by @JulyanZhu in #720
- add mmsi-bench (https://arxiv.org/abs/2505.23764) by @sihany077 in #715
- add mmvu task by @pbcong in #713
- Dev/tomato by @Devininthelab in #709
- [fix] update korean benchmark's post_prompt by @jujeongho0 in #719
- [fix] ensure synchronization not be used without distributed execution by @debugdoctor in #714
- [FIX] Resolve MMMU-test submission file generation issue by @xyyandxyy in #724
- Add CameraBench_VQA by @chancharikmitra in #725
- [vLLM] centralize VLLM_WORKER_MULTIPROC_METHOD by @kylesayrs in #728
- [fix] cli_evaluate to properly handle Namespace arguments by @Luodian in #733
- Fix three bugs in the codebase by @Luodian in #734
- [Bug] fix a bug in post processing stage of ScienceQA. by @ashun989 in #723
- fix: add `max_frames_num` to `OpenAICompatible` by @loongfeili in #740
- [Bugfix] Add min image resolution requirement for vLLM Qwen-VL models by @zch42 in #737
- Revert "Pass in the 'cache_dir' to use local cache" by @kcz358 in #741
- [New Benchmark] Add Video-TT Benchmark by @dongyh20 in #742
- Add claude GitHub actions 1752118403023 by @Luodian in #749
- [Bugfix] Fix handling of encode_video output in vllm.py so each frame’s Base64 by @LiamLian0727 in #754
- [New Benchmark] Request for supporting TimeScope by @ruili33 in #756
- Remove Claude GitHub workflows for code review by @Luodian in #757
- [fix] Fixed applying process_* twice on resAns for VQAv2 by @Avelina9X in #760
- [fix] update korean benchmark's post_prompt by @jujeongho0 in #759
- Title: Add Benchmark from "Vision-Language Models Can’t See the Obvious" (ICCV 2025) by @dunghuynhandy in #744
- [fix] vqav2 evaluation yaml by @mletrasdl in #764
- [New Task] Add support for benchmark PhyX by @wutaiqiang in #766
New Contributors
- @miikatoi made their first contribution in #699
- @MoyusiteruIori made their first contribution in #705
- @hanoonaR made their first contribution in #702
- @sihany077 made their first contribution in #715
- @debugdoctor made their first contribution in #714
- @xyyandxyy made their first contribution in #724
- @chancharikmitra made their first contribution in #725
- @loongfeili made their first contribution in #740
- @zch42 made their first contribution in #737
- @LiamLian0727 made their first contribution in #754
- @ruili33 made their first contribution in #756
- @Avelina9X made their first contribution in #760
- @dunghuynhandy made their first contribution in #744
- @mletrasdl made their first contribution in #764
- @wutaiqiang made their first contribution in #766
Full Changelog: v0.3.4...v0.3.5
v0.3.4
What's Changed
- Support VSI-Bench Evaluation by @vealocia in #511
- [Fix] Better Qwen omni and linting by @kcz358 in #647
- Fix the bug in issue #648 by @ashun989 in #649
- [New Model] Aero-1-Audio by @kcz358 in #658
- [improve]: catch import error; remove unused modules by @VincentYCYao in #650
- [Fix] fixing the video path of MVBench & adding default hf_home to percepti… by @jihanyang in #655
- Update vllm.py by @VincentYCYao in #652
- [FIX]: Fix question_for_eval key in MathVerse evaluator for Vision-Only data by @ForJadeForest in #657
- [Task] Add new benchmark: CAPability by @lntzm in #656
- Mathvision bug fixes, Reproduce Qwen2.5VL results by @RadhaGulhane13 in #660
- Fix issue with killing process in sglang by @ravi03071991 in #666
- Fixes Metadata Reading from Released PLM Checkpoints by @mmaaz60 in #665
- [fix] modify the GPT evaluation model by @jujeongho0 in #668
- [Fix] Correct rating logic for VITATECS benchmark by @erfanbsoula in #671
- Update README.md by @pufanyi in #675
- delete unused test_parse.py file by @pbcong in #676
- [fix] add reminder for `interleave_visual` for Qwen2.5-VL, update version control. by @Luodian in #678
- [fix] Fix task listing in CLI evaluation by updating to use 'all_tasks' instead of 'list_all_tasks' for improved clarity. by @Luodian in #687
- [Task] V*-Bench (Visual Star Benchmark) by @Luodian in #683
- support distributed executor backend - torchrun by @kaiyuyue in #680
- [Task] Add new task: XLRS-Bench-lite by @nanocm in #684
- Added direction for locally cached dataset in task_guide.md by @JulyanZhu in #691
- Pass in the 'cache_dir' to use local cache by @JulyanZhu in #690
- [FIX] Fix parameter name in qwen25vl.sh by @MasterBeeee in #693
- [TASK & FIX] add task VideoEval-Pro and fix tar file concat by @iamtonymwt in #694
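One item above adds a torchrun-based distributed executor backend (#680). Processes launched by torchrun discover their coordinates through the environment variables torchrun injects (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`); a minimal sketch of reading them, with single-process fallbacks (the helper name is illustrative, not from the codebase):

```python
import os

def dist_info():
    """Read the coordinates torchrun injects; fall back to single-process defaults."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }

info = dist_info()
# Convention: only rank 0 writes aggregated results to disk.
is_main_process = info["rank"] == 0
```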
New Contributors
- @vealocia made their first contribution in #511
- @ashun989 made their first contribution in #649
- @VincentYCYao made their first contribution in #650
- @jihanyang made their first contribution in #655
- @ForJadeForest made their first contribution in #657
- @lntzm made their first contribution in #656
- @RadhaGulhane13 made their first contribution in #660
- @ravi03071991 made their first contribution in #666
- @erfanbsoula made their first contribution in #671
- @kaiyuyue made their first contribution in #680
- @nanocm made their first contribution in #684
- @JulyanZhu made their first contribution in #691
- @MasterBeeee made their first contribution in #693
- @iamtonymwt made their first contribution in #694
Full Changelog: v0.3.3...v0.3.4
v0.3.3 Fix models and add model examples
What's Changed
- [Fix] Add padding_side="left" for Qwen2.5 to enable flash_attention by @robinhad in #620
- Add ability to pass options to VLLM by @robinhad in #621
- Fix Qwen by @Devininthelab in #633
- Whisper + vLLM: FLEURS Evaluation Fixes and Language Prompt Injection by @shubhra in #624
- Fix loading datasets from disk by @CLARKBENHAM in #629
- Cache stringifies where not needed by @CLARKBENHAM in #631
- openai chat.completions uses max_completion_tokens by @CLARKBENHAM in #630
- MAC decord equivalent by @CLARKBENHAM in #632
- [Task] Add support for VisualPuzzles by @yueqis in #637
- Adds PerceptionLM and PLM-VideoBench by @mmaaz60 in #638
- [Fix] Aria and LLama Vision and OpenAI compatible models by @Luodian in #641
- [Feat] Enhance Qwen model with additional parameters and improved visual handling by @Luodian in #639
- [Fix] add more model examples by @Luodian in #644
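The `padding_side="left"` fix above (#620) matters because decoder-only generation samples from each row's last position; with right padding that position would hold pad tokens for shorter sequences. A minimal illustration of the two padding modes on plain token-id lists (no library assumed):

```python
def pad_batch(seqs, pad_id=0, side="left"):
    """Pad variable-length token-id sequences to a rectangle.

    Left padding keeps every sequence's final real token in the last
    column, which is the position decoder-only generation reads from.
    """
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        padded.append(pad + s if side == "left" else s + pad)
    return padded

batch = [[101, 7, 8], [101, 9]]
left = pad_batch(batch, side="left")    # [[101, 7, 8], [0, 101, 9]]
right = pad_batch(batch, side="right")  # [[101, 7, 8], [101, 9, 0]]
```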
New Contributors
- @shubhra made their first contribution in #624
- @CLARKBENHAM made their first contribution in #629
- @yueqis made their first contribution in #637
- @mmaaz60 made their first contribution in #638
Full Changelog: v0.3.2...v0.3.3
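The `max_completion_tokens` change in this release (#630) tracks OpenAI's chat API, where newer models deprecate `max_tokens` in favor of `max_completion_tokens`. A hedged sketch of building such a request body (no API call is made; the model name is illustrative and the helper is not from the codebase):

```python
def chat_payload(model, messages, max_completion_tokens=256):
    """Build a chat.completions request body using the newer token-limit key."""
    return {
        "model": model,
        "messages": messages,
        # Newer OpenAI chat models reject max_tokens; the replacement key
        # bounds the completion instead.
        "max_completion_tokens": max_completion_tokens,
    }

payload = chat_payload("gpt-4o", [{"role": "user", "content": "Hi"}])
```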