
[llm] Forward lora_request to vLLM engine in vLLMEngineWrapper#62609

Open
Zerui18 wants to merge 2 commits into ray-project:master from Zerui18:llm-forward-lora-request

Conversation


@Zerui18 Zerui18 commented Apr 14, 2026

Description

vLLMEngineWrapper._prepare_llm_request in python/ray/llm/_internal/batch/stages/vllm_engine_stage.py correctly resolves per-row LoRA adapters into a vllm.lora.request.LoRARequest via _maybe_get_lora_request(row) and stores it on vLLMEngineRequest.lora_request. However, _generate_async never forwarded that field to either self.engine.generate(...) or self.engine.encode(...) — the lora_request kwarg was simply omitted from both call sites.

The effect: any user who dispatches LoRA adapters per row (e.g. by stamping row["model"] with an adapter path, which is the documented Ray Data LLM pattern for multi-adapter batch inference) had their adapters silently dropped. Every request fell back to base-model weights even though the engine was initialized with enable_lora=True, the adapter was downloaded, and the LoRARequest object was constructed.
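The per-row dispatch pattern in question can be illustrated roughly as follows (a minimal self-contained sketch; `BASE_MODEL`, `resolve_lora_request`, and the `LoRARequest` dataclass are illustrative stand-ins, not the actual Ray Data or vLLM code):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical base model name, for illustration only.
BASE_MODEL = "base-model"

@dataclass
class LoRARequest:
    # Stand-in for vllm.lora.request.LoRARequest (illustrative).
    lora_name: str
    lora_int_id: int
    lora_path: str

def resolve_lora_request(row: dict) -> Optional[LoRARequest]:
    # Mimics the spirit of _maybe_get_lora_request: a row whose "model"
    # differs from the base model is treated as a LoRA adapter path.
    model = row.get("model")
    if model is None or model == BASE_MODEL:
        return None  # base-model request: no adapter
    return LoRARequest(lora_name=model, lora_int_id=1, lora_path=model)

rows = [
    {"model": BASE_MODEL, "prompt": "hello"},
    {"model": "adapters/my-lora", "prompt": "hello"},  # per-row adapter
]
resolved = [resolve_lora_request(r) for r in rows]
```

With the bug, the second row's resolved `LoRARequest` was constructed but never reached the engine call, so both rows ran on base-model weights.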

Fix

Pass request.lora_request to both the generate and encode code paths in vLLMEngineWrapper._generate_async:

stream = self.engine.encode(
    request_id=str(request.request_id),
    prompt=llm_prompt,
    pooling_params=request.params,
    tokenization_kwargs=request.tokenization_kwargs,
    lora_request=request.lora_request,   # NEW
)
# ...
stream = self.engine.generate(
    request_id=str(request.request_id),
    prompt=llm_prompt,
    sampling_params=request.params,
    lora_request=request.lora_request,   # NEW
)

Both AsyncLLMEngine.generate and AsyncLLMEngine.encode accept an optional lora_request kwarg, so passing None (when no adapter is resolved for a row) is a no-op and fully backward-compatible.

Regression test

Added test_vllm_wrapper_forwards_lora_request_to_generate and test_vllm_wrapper_forwards_lora_request_to_encode (parametrized over EMBED / CLASSIFY / SCORE) in python/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py. The tests bypass __init__, mock self.engine, build a vLLMEngineRequest with a sentinel lora_request, drive it through _generate_async, and assert the mocked engine was called with lora_request=<sentinel>. They run as pure Python unit tests (no GPU, no model download).
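The shape of those tests is roughly the following (a condensed sketch; `WrapperUnderTest` is an illustrative stand-in, whereas the real tests bypass `__init__` on the actual `vLLMEngineWrapper` class):

```python
import asyncio
from unittest.mock import MagicMock

class WrapperUnderTest:
    # Minimal stand-in for vLLMEngineWrapper's generate path.
    def __init__(self, engine):
        self.engine = engine

    async def _generate_async(self, request):
        # The fixed call site: lora_request is now forwarded.
        self.engine.generate(
            request_id=str(request.request_id),
            prompt=request.prompt,
            sampling_params=request.params,
            lora_request=request.lora_request,
        )

engine = MagicMock()
sentinel_lora = object()
request = MagicMock(
    request_id=1, prompt="hi", params=MagicMock(), lora_request=sentinel_lora
)
asyncio.run(WrapperUnderTest(engine)._generate_async(request))
# Identity check: the exact sentinel object must reach the engine call.
forwarded = engine.generate.call_args.kwargs["lora_request"]
```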

The existing test_vllm_wrapper_lora GPU integration test continues to exercise the actual LoRA loading path; this fix makes its dispatch effective end-to-end.

Related issues

None filed — the PR itself documents the bug.

Additional information

  • Files changed: 2 (2-line source fix + 84-line regression test)
  • Both generate() and encode() paths are fixed in the same PR to keep the bug fully closed; the encode() path has the same omission and LoRA is a legitimate use case for pooled embeddings / classification.
  • No BUILD / CODEOWNERS / docs changes required.

vLLMEngineWrapper._prepare_llm_request correctly resolves per-row LoRA
adapters into a vllm.lora.request.LoRARequest and stores it on
vLLMEngineRequest.lora_request, but _generate_async never forwarded it
to self.engine.generate(...) or self.engine.encode(...). Per-row LoRA
adapters (e.g. stamping row["model"] with an adapter path) were
silently dropped and every request fell back to base-model weights.

Pass request.lora_request to both the generate and encode code paths.
Both AsyncLLMEngine.generate and AsyncLLMEngine.encode accept an
optional lora_request kwarg, so passing None (when no adapter is
resolved) is a no-op.

Add a regression test that mocks the engine and asserts
request.lora_request is forwarded for the GENERATE path and for the
EMBED / CLASSIFY / SCORE pooling paths.

Signed-off-by: Zerui Chen <[email protected]>
@Zerui18 Zerui18 requested a review from a team as a code owner April 14, 2026 15:47
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request ensures that lora_request is correctly forwarded to the vLLM engine for both generation and pooling tasks, and adds corresponding regression tests. Review feedback identifies a missing request_id in the encode call and an invalid task argument in the PoolingParams constructor within the new tests.

Comment on lines 561 to +564
prompt=llm_prompt,
pooling_params=request.params,
tokenization_kwargs=request.tokenization_kwargs,
lora_request=request.lora_request,
Contributor


Severity: high

The self.engine.encode call is missing the request_id argument. In vLLM's AsyncLLMEngine.encode, request_id is a required argument. This is also inconsistent with the self.engine.generate call below (line 568), which correctly passes the request_id.

                request_id=str(request.request_id),
                prompt=llm_prompt,
                pooling_params=request.params,
                tokenization_kwargs=request.tokenization_kwargs,
                lora_request=request.lora_request,

Author


False positive: request_id is still being passed on line 560 (unchanged by this PR).

multimodal_data=None,
mm_processor_kwargs=None,
multimodal_uuids=None,
params=vllm.PoolingParams(task=task_type),
Contributor


Severity: medium

vllm.PoolingParams does not accept a task keyword argument in its constructor. Passing this will cause a TypeError during test execution. Since the task type is already correctly set on the wrapper object at line 320, you can instantiate PoolingParams without arguments.

Suggested change
params=vllm.PoolingParams(task=task_type),
params=vllm.PoolingParams(),

Author


Good catch. Fixed in the follow-up commit - switched both tests' params to MagicMock(). Since _generate_async passes it through opaquely to the mocked engine, a real PoolingParams isn't needed and this avoids depending on vLLM constructor signatures across versions.

vllm.PoolingParams does not accept a `task` keyword argument in its
constructor. _generate_async passes request.params through opaquely
to the mocked engine.encode / engine.generate, so a MagicMock is
sufficient and avoids depending on the vLLM PoolingParams /
SamplingParams constructor signatures. Drop the now-unused
`import vllm`.

Signed-off-by: Zerui Chen <[email protected]>
@jeffreywang-anyscale jeffreywang-anyscale self-assigned this Apr 14, 2026
@jeffreywang-anyscale
Contributor

Hey @Zerui18 thanks for your contribution. Is it possible to have an integration test rather than unit tests asserting the arguments being invoked with?

@Zerui18
Author

Zerui18 commented Apr 14, 2026

Hey @Zerui18 thanks for your contribution. Is it possible to have an integration test rather than unit tests asserting the arguments being invoked with?

Hi Jeffrey! Happy to discuss this. Since the fix is a localised missing kwarg, the unit tests are deliberately scoped to assert exactly that forwarding. A behavioural integration test would need to assert differences in LoRA-applied outputs, which in turn requires a LoRA adapter fixture whose effect on generation is known and stable; that feels out of scope for this fix.

@jeffreywang-anyscale
Contributor

jeffreywang-anyscale commented Apr 14, 2026

@Zerui18 vLLM tends to modify their APIs quite a bit, so I think it's better to have an integrate test validating the end user behavior to make sure that we don't miss important use cases when we adapt to vLLM's APIs. It seems like since the introduction of v1 vLLM engine, LoRA has never been supported. Could you update this release test to validate that LoRA is exercised by inspecting the output somehow? Great catch btw.

wrapper.shutdown()


@pytest.mark.asyncio
Contributor


Can we parametrize the test over the engine method being validated (generate or encode) to reduce code duplication?
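One way to collapse the two tests into a single parametrized one (a sketch assuming pytest and pytest-asyncio are available; `WrapperUnderTest` is an illustrative stand-in for the real wrapper, not the actual class under test):

```python
import asyncio
from unittest.mock import MagicMock

import pytest

class WrapperUnderTest:
    # Stand-in exposing both engine call sites (illustrative only).
    def __init__(self, engine):
        self.engine = engine

    async def _generate_async(self, request, is_pooling):
        if is_pooling:
            self.engine.encode(
                request_id=str(request.request_id),
                prompt=request.prompt,
                pooling_params=request.params,
                lora_request=request.lora_request,
            )
        else:
            self.engine.generate(
                request_id=str(request.request_id),
                prompt=request.prompt,
                sampling_params=request.params,
                lora_request=request.lora_request,
            )

@pytest.mark.asyncio
@pytest.mark.parametrize(
    "engine_method,is_pooling", [("generate", False), ("encode", True)]
)
async def test_forwards_lora_request(engine_method, is_pooling):
    engine = MagicMock()
    sentinel = object()
    request = MagicMock(
        request_id=1, prompt="hi", params=MagicMock(), lora_request=sentinel
    )
    await WrapperUnderTest(engine)._generate_async(request, is_pooling)
    # The sentinel must reach whichever engine method was exercised.
    call = getattr(engine, engine_method).call_args
    assert call.kwargs["lora_request"] is sentinel
```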

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core data Ray Data-related issues llm community-contribution Contributed by the community labels Apr 14, 2026
