
Upgrade vLLM from 0.10.1.1 to 0.14.1 #1173

Open
NathanHB wants to merge 10 commits into main from upgrade/vllm-0.14.1

Conversation

@NathanHB
Member

  • Update pyproject.toml to vllm>=0.11.0
  • Fix deprecated import: vllm.transformers_utils.tokenizer -> vllm.tokenizers
  • Add comprehensive test suite for V1 engine compatibility
  • Add smoke tests for quick validation

Changes:

  • pyproject.toml: Updated vllm version constraint
  • vllm_model.py: Updated get_tokenizer import path
  • llm_as_judge.py: Updated get_tokenizer import path
  • Added smoke_test_vllm_v11.py: Quick validation tests
  • Added test_vllm_v1_compatibility.py: Comprehensive compatibility tests

All tests passing - V1 engine compatible, basic inference working.
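The tokenizer import move listed above can be absorbed with a fallback so the same code runs against both vLLM ranges. A minimal sketch — both module paths are taken from this PR's description, and `get_tokenizer` degrades to `None` when vLLM is not installed:

```python
# Compatibility shim for the tokenizer import move described above.
# Both module paths come from this PR's description; if neither exists
# (e.g. vLLM is not installed at all), fall back to None instead of crashing.
try:
    from vllm.tokenizers import get_tokenizer  # vLLM >= 0.11
except ImportError:
    try:
        from vllm.transformers_utils.tokenizer import get_tokenizer  # older vLLM
    except ImportError:
        get_tokenizer = None  # vLLM not available in this environment
```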


…ng cleanup

The vLLM slow tests were failing with OOM errors when running after
accelerate tests. The issue was:
1. vLLM V1 engine requires a specific amount of free GPU memory at startup
2. After accelerate tests, only 5.89 GiB was free (out of 14.74 GiB)
3. vLLM with gpu_memory_utilization=0.6 wanted 8.84 GiB

Fixes:
- Reduce gpu_memory_utilization from 0.6 to 0.35 in test config (needs 5.16 GiB)
- Add GPU memory cleanup fixture in conftest.py that runs before/after slow tests
- Improve AsyncVLLMModel.cleanup() to properly delete model object

The gpu_memory_utilization parameter only affects KV cache allocation and
does not impact model outputs with temperature=0.0, so this change is safe.
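The arithmetic behind those figures, as a quick sanity check (numbers copied from the CI logs quoted above; vLLM's real accounting also covers model weights and activations):

```python
# gpu_memory_utilization is the fraction of total device memory vLLM may
# claim at startup. All figures below are from the failing CI run above.
total_gib = 14.74                 # Tesla T4 total, as reported in the logs
free_after_accelerate_gib = 5.89  # what was left after the accelerate tests

wanted_at_0_60 = round(0.60 * total_gib, 2)  # what util=0.6 tries to claim
needed_at_0_35 = round(0.35 * total_gib, 2)  # what util=0.35 needs

fits_before = wanted_at_0_60 <= free_after_accelerate_gib  # False -> OOM
fits_after = needed_at_0_35 <= free_after_accelerate_gib   # True
```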
The CI test was failing with 'ValueError: To serve at least one request
with the model's max seq len (8192), 1.5 GiB KV cache is needed, which
is larger than the available KV cache memory (1.42 GiB).'

Root cause:
- Tesla T4 GPU (15.36 GB) in CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB available for KV cache
- Required 1.5 GiB for max_seq_len=8192
- Shortfall: 80 MB

Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for KV cache (sufficient for 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)
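Checking the shortfall arithmetic, with all values taken from the error message and logs above:

```python
required_kv_gib = 1.5          # needed for max_seq_len=8192, per the error
available_at_0_35_gib = 1.42   # KV cache budget at util=0.35
available_at_0_40_gib = 1.62   # approximate budget at util=0.40

# ~82 MiB, reported as "80 MB" in the commit message above
shortfall_mib = round((required_kv_gib - available_at_0_35_gib) * 1024)

# util=0.35 falls short of the requirement; util=0.40 covers it
assert available_at_0_35_gib < required_kv_gib <= available_at_0_40_gib
```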
This commit addresses two issues:

1. Fix vLLM engine initialization failure in CI
   - Root cause: Triton library requires Python.h headers to compile CUDA utilities
   - Solution: Install python3.10-dev package in CI workflow
   - Error was: 'fatal error: Python.h: No such file or directory'

2. Add comprehensive GPU memory monitoring for slow tests
   - Add _log_gpu_memory() helper function in conftest.py
   - Log GPU memory before/after each slow test (device, total, allocated, reserved, free)
   - Add memory logging to model cleanup methods:
     * VLLMModel.cleanup()
     * AsyncVLLMModel.cleanup()
     * TransformersModel.cleanup()
   - Shows memory freed during cleanup operations

This will help diagnose OOM issues and verify proper memory cleanup between tests.

Changes:
- .github/workflows/slow_tests.yaml: Add python3.10-dev installation step
- tests/conftest.py: Add GPU memory monitoring helper + enhanced fixture
- src/lighteval/models/vllm/vllm_model.py: Add memory logging to cleanup methods
- src/lighteval/models/transformers/transformers_model.py: Add memory logging to cleanup
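A stdlib-only sketch of the log line such a `_log_gpu_memory()` helper might emit — the real helper in `conftest.py` would pull these numbers from `torch.cuda`; the function and field names here are illustrative, not the PR's actual code:

```python
def format_gpu_memory(device: str, total_gib: float,
                      allocated_gib: float, reserved_gib: float) -> str:
    """Render one GPU-memory log line: device, total, allocated, reserved, free."""
    free_gib = total_gib - reserved_gib
    return (f"[{device}] total={total_gib:.2f} GiB "
            f"allocated={allocated_gib:.2f} GiB "
            f"reserved={reserved_gib:.2f} GiB "
            f"free={free_gib:.2f} GiB")

line = format_gpu_memory("cuda:0", 14.74, 0.01, 0.03)
```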
The vLLM test was failing because FlashInfer needs nvcc (CUDA compiler)
for JIT kernel compilation during warmup. The error was:
'RuntimeError: Could not find nvcc and default cuda_home="/usr/local/cuda" doesn't exist'

Fixes:
- Set CUDA_HOME=/usr/local/cuda-12.4 environment variable
- Add /usr/local/cuda-12.4/bin to PATH for nvcc access
- This allows FlashInfer to JIT-compile custom attention kernels

Previous fixes in this PR:
- ✅ Installed python3.10-dev for Python.h headers (Triton compilation)
- ✅ Increased gpu_memory_utilization from 0.35 to 0.4 for KV cache
- ✅ Added comprehensive GPU memory monitoring

GPU memory stats show plenty of free memory (14.71 GiB of 14.74 GiB),
so the issue is purely build-time tooling for JIT compilation.

The vLLM v1 engine spawns subprocesses that don't inherit environment
variables set only in the workflow step configuration. The previous fix set
CUDA_HOME in the GitHub Actions environment, but the vLLM EngineCore
subprocess couldn't access it, causing:

'/bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found'

Fix:
- Set CUDA_HOME and PATH directly in the test run command
- This ensures the environment variables are inherited by all subprocesses
- Now nvcc will be found during FlashInfer JIT compilation

The issue was subprocess environment isolation, not the parent environment.
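The inheritance rule can be demonstrated without vLLM: a child process only sees a variable that is present in its environment table, the equivalent of `export`-ing it before the test command (`MY_CUDA_HOME` below is a stand-in name, not the real variable):

```python
import os
import subprocess
import sys

# Simulate a variable known only to the parent process (not exported):
env = dict(os.environ)
env.pop("MY_CUDA_HOME", None)
probe = [sys.executable, "-c",
         "import os; print(os.environ.get('MY_CUDA_HOME', 'unset'))"]

# Child spawned without the variable in its environment table: it sees nothing.
unexported = subprocess.run(probe, env=env, capture_output=True,
                            text=True).stdout.strip()

# Equivalent of `export MY_CUDA_HOME=...` in the test run command:
env["MY_CUDA_HOME"] = "/usr/local/cuda-12.4"
exported = subprocess.run(probe, env=env, capture_output=True,
                          text=True).stdout.strip()
```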
- Add CUDA Toolkit 12.8 installation step to match nvidia-cuda-runtime-cu12==12.8.90
- Cache /usr/local/cuda-12.8 to speed up subsequent CI runs
- Add verification step to check nvcc availability
- Update CUDA_HOME and PATH to use CUDA 12.8
- Use export in test run to ensure subprocess inherits environment variables

This fixes the issue where vLLM v0.15.x with FlashInfer backend requires
nvcc at runtime for JIT compilation of CUDA kernels on Tesla T4 (SM 7.5).

Resolves: /bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found

…seq_len_to_capture

- Replace model.llm_engine.model_config.max_seq_len_to_capture with max_model_len
- Replace model.model_config.max_seq_len_to_capture with max_model_len for async model
- This attribute was renamed in vLLM v0.15.x

Fixes: AttributeError: 'ModelConfig' object has no attribute 'max_seq_len_to_capture'
…mpt_token_ids

- Replace prompt_token_ids= with prompts= in LLM.generate() calls
- Update both VLLMModel and AsyncVLLMModel
- Update llm_as_judge.py for VLLM backend

In vLLM v0.15.x, the LLM.generate() method signature changed:
- Old: generate(prompt_token_ids=..., sampling_params=...)
- New: generate(prompts=..., sampling_params=...)

Fixes: TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids'
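The call-site change, sketched against a stub: `StubLLM` stands in for `vllm.LLM`, and the dict-shaped token prompt mirrors the commit message rather than a verified vLLM type.

```python
class StubLLM:
    """Stand-in for vllm.LLM with the new-style generate() signature."""
    def generate(self, prompts, sampling_params=None):
        # Token IDs now travel inside each prompt object.
        return [p["prompt_token_ids"] for p in prompts]

llm = StubLLM()

# Old call (rejected by vLLM v0.15.x with the TypeError quoted above):
#   llm.generate(prompt_token_ids=[[1, 2, 3]], sampling_params=params)
# New call:
outputs = llm.generate(prompts=[{"prompt_token_ids": [1, 2, 3]}],
                       sampling_params=None)
```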
NathanHB force-pushed the upgrade/vllm-0.14.1 branch from 6d4c9ea to 62e28f4 on February 20, 2026 at 14:07.
…structure

In vLLM v0.15.x, the prompt_logprobs structure changed:
- Now returns dict[int, Logprob] at each position (FlatLogprobs class)
- Only contains top-k tokens (default was 1, causing KeyError for continuation tokens)
- Need to access logprobs_at_position[token] instead of direct dict access

Changes:
1. Increase prompt_logprobs from 1 to 20 to ensure continuation tokens are included
2. Add defensive error handling with helpful message if token not found
3. Update variable names for clarity (logprobs -> logprobs_at_position)
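The defensive lookup from points 2–3, sketched with plain dicts standing in for vLLM's per-position logprob mappings (the structure is assumed from the description above, and `continuation_logprob` is an illustrative name):

```python
def continuation_logprob(prompt_logprobs, position, token_id):
    """Fetch a continuation token's logprob from per-position top-k dicts."""
    logprobs_at_position = prompt_logprobs[position]
    if logprobs_at_position is None or token_id not in logprobs_at_position:
        raise KeyError(
            f"token {token_id} missing from top-k logprobs at position {position}; "
            "increase prompt_logprobs so continuation tokens are included"
        )
    return logprobs_at_position[token_id]

# Position 0 carries no logprobs (no preceding context); later positions map
# token id -> logprob for the top-k candidate tokens only.
fake_prompt_logprobs = [None, {42: -0.7, 7: -1.3}]
```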