
[llm] Forward lora_request to vLLM engine in vLLMEngineWrapper#62609

Open
Zerui18 wants to merge 2 commits into ray-project:master from Zerui18:llm-forward-lora-request

Conversation


@Zerui18 Zerui18 commented Apr 14, 2026

Description

vLLMEngineWrapper._prepare_llm_request in python/ray/llm/_internal/batch/stages/vllm_engine_stage.py correctly resolves per-row LoRA adapters into a vllm.lora.request.LoRARequest via _maybe_get_lora_request(row) and stores it on vLLMEngineRequest.lora_request. However, _generate_async never forwarded that field to either self.engine.generate(...) or self.engine.encode(...) — the lora_request kwarg was simply omitted from both call sites.

The effect: any user who dispatches LoRA adapters per row (e.g. by stamping row["model"] with an adapter path, which is the documented Ray Data LLM pattern for multi-adapter batch inference) had their adapters silently dropped. Every request fell back to base-model weights even though the engine was initialized with enable_lora=True, the adapter was downloaded, and the LoRARequest object was constructed.
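The per-row dispatch pattern in question can be illustrated roughly as follows (a minimal self-contained sketch; `BASE_MODEL`, `resolve_lora_request`, and the `LoRARequest` dataclass are illustrative stand-ins, not the actual Ray Data or vLLM code):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical base model name, for illustration only.
BASE_MODEL = "base-model"

@dataclass
class LoRARequest:
    # Stand-in for vllm.lora.request.LoRARequest (illustrative).
    lora_name: str
    lora_int_id: int
    lora_path: str

def resolve_lora_request(row: dict) -> Optional[LoRARequest]:
    # Mimics the spirit of _maybe_get_lora_request: a row whose "model"
    # differs from the base model is treated as a LoRA adapter path.
    model = row.get("model")
    if model is None or model == BASE_MODEL:
        return None  # base-model request: no adapter
    return LoRARequest(lora_name=model, lora_int_id=1, lora_path=model)

rows = [
    {"model": BASE_MODEL, "prompt": "hello"},
    {"model": "adapters/my-lora", "prompt": "hello"},  # per-row adapter
]
resolved = [resolve_lora_request(r) for r in rows]
```

With the bug, the second row's resolved `LoRARequest` was constructed but never reached the engine call, so both rows ran on base-model weights.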

Fix

Pass request.lora_request to both the generate and encode code paths in vLLMEngineWrapper._generate_async:

stream = self.engine.encode(
    request_id=str(request.request_id),
    prompt=llm_prompt,
    pooling_params=request.params,
    tokenization_kwargs=request.tokenization_kwargs,
    lora_request=request.lora_request,   # NEW
)
# ...
stream = self.engine.generate(
    request_id=str(request.request_id),
    prompt=llm_prompt,
    sampling_params=request.params,
    lora_request=request.lora_request,   # NEW
)

Both AsyncLLMEngine.generate and AsyncLLMEngine.encode accept an optional lora_request kwarg, so passing None (when no adapter is resolved for a row) is a no-op and fully backward-compatible.

Regression test

Added test_vllm_wrapper_forwards_lora_request_to_generate and test_vllm_wrapper_forwards_lora_request_to_encode (parametrized over EMBED / CLASSIFY / SCORE) in python/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py. The tests bypass __init__, mock self.engine, build a vLLMEngineRequest with a sentinel lora_request, drive it through _generate_async, and assert the mocked engine was called with lora_request=<sentinel>. They run as pure Python unit tests (no GPU, no model download).
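The shape of those tests is roughly the following (a condensed sketch; `WrapperUnderTest` is an illustrative stand-in, whereas the real tests bypass `__init__` on the actual `vLLMEngineWrapper` class):

```python
import asyncio
from unittest.mock import MagicMock

class WrapperUnderTest:
    # Minimal stand-in for vLLMEngineWrapper's generate path.
    def __init__(self, engine):
        self.engine = engine

    async def _generate_async(self, request):
        # The fixed call site: lora_request is now forwarded.
        self.engine.generate(
            request_id=str(request.request_id),
            prompt=request.prompt,
            sampling_params=request.params,
            lora_request=request.lora_request,
        )

engine = MagicMock()
sentinel_lora = object()
request = MagicMock(
    request_id=1, prompt="hi", params=MagicMock(), lora_request=sentinel_lora
)
asyncio.run(WrapperUnderTest(engine)._generate_async(request))
# Identity check: the exact sentinel object must reach the engine call.
forwarded = engine.generate.call_args.kwargs["lora_request"]
```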

The existing test_vllm_wrapper_lora GPU integration test continues to exercise the actual LoRA loading path; this fix makes its dispatch effective end-to-end.

Related issues

None filed — the PR itself documents the bug.

Additional information

  • Files changed: 2 (2-line source fix + 84-line regression test)
  • Both generate() and encode() paths are fixed in the same PR to keep the bug fully closed; the encode() path has the same omission and LoRA is a legitimate use case for pooled embeddings / classification.
  • No BUILD / CODEOWNERS / docs changes required.

vLLMEngineWrapper._prepare_llm_request correctly resolves per-row LoRA
adapters into a vllm.lora.request.LoRARequest and stores it on
vLLMEngineRequest.lora_request, but _generate_async never forwarded it
to self.engine.generate(...) or self.engine.encode(...). Per-row LoRA
adapters (e.g. stamping row["model"] with an adapter path) were
silently dropped and every request fell back to base-model weights.

Pass request.lora_request to both the generate and encode code paths.
Both AsyncLLMEngine.generate and AsyncLLMEngine.encode accept an
optional lora_request kwarg, so passing None (when no adapter is
resolved) is a no-op.

Add a regression test that mocks the engine and asserts
request.lora_request is forwarded for the GENERATE path and for the
EMBED / CLASSIFY / SCORE pooling paths.

Signed-off-by: Zerui Chen <[email protected]>
@Zerui18 Zerui18 requested a review from a team as a code owner April 14, 2026 15:47
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request ensures that lora_request is correctly forwarded to the vLLM engine for both generation and pooling tasks, and adds corresponding regression tests. Review feedback identifies a missing request_id in the encode call and an invalid task argument in the PoolingParams constructor within the new tests.

Comment on lines 561 to +564
prompt=llm_prompt,
pooling_params=request.params,
tokenization_kwargs=request.tokenization_kwargs,
lora_request=request.lora_request,
Contributor


Severity: high

The self.engine.encode call is missing the request_id argument. In vLLM's AsyncLLMEngine.encode, request_id is a required argument. This is also inconsistent with the self.engine.generate call below (line 568), which correctly passes the request_id.

                request_id=str(request.request_id),
                prompt=llm_prompt,
                pooling_params=request.params,
                tokenization_kwargs=request.tokenization_kwargs,
                lora_request=request.lora_request,

Author


False positive: request_id is still being passed on line 560 (unchanged by this PR).

multimodal_data=None,
mm_processor_kwargs=None,
multimodal_uuids=None,
params=vllm.PoolingParams(task=task_type),
Contributor


Severity: medium

vllm.PoolingParams does not accept a task keyword argument in its constructor. Passing this will cause a TypeError during test execution. Since the task type is already correctly set on the wrapper object at line 320, you can instantiate PoolingParams without arguments.

Suggested change
params=vllm.PoolingParams(task=task_type),
params=vllm.PoolingParams(),

Author


Good catch. Fixed in the follow-up commit - switched both tests' params to MagicMock(). Since _generate_async passes it through opaquely to the mocked engine, a real PoolingParams isn't needed and this avoids depending on vLLM constructor signatures across versions.

vllm.PoolingParams does not accept a `task` keyword argument in its
constructor. _generate_async passes request.params through opaquely
to the mocked engine.encode / engine.generate, so a MagicMock is
sufficient and avoids depending on the vLLM PoolingParams /
SamplingParams constructor signatures. Drop the now-unused
`import vllm`.

Signed-off-by: Zerui Chen <[email protected]>
@jeffreywang-anyscale jeffreywang-anyscale self-assigned this Apr 14, 2026
@jeffreywang-anyscale
Contributor

Hey @Zerui18 thanks for your contribution. Is it possible to have an integration test rather than unit tests asserting the arguments being invoked with?

@Zerui18
Author

Zerui18 commented Apr 14, 2026

Hey @Zerui18 thanks for your contribution. Is it possible to have an integration test rather than unit tests asserting the arguments being invoked with?

Hi Jeffrey! Happy to discuss this. Since the fix is a localised missing kwarg, the unit tests are deliberately scoped to assert exactly that forwarding. A behavioural integration test would need to assert differences in LoRA-applied outputs, which in turn requires a LoRA adapter fixture whose effect on generation is known and stable; that feels out of scope for this fix.

@jeffreywang-anyscale
Contributor

jeffreywang-anyscale commented Apr 14, 2026

@Zerui18 vLLM tends to modify their APIs quite a bit, so I think it's better to have an integrate test validating the end user behavior to make sure that we don't miss important use cases when we adapt to vLLM's APIs. It seems like since the introduction of v1 vLLM engine, LoRA has never been supported. Could you update this release test to validate that LoRA is exercised by inspecting the output somehow? Great catch btw.

wrapper.shutdown()


@pytest.mark.asyncio
Contributor


Can we parametrize the test over the engine method being validated (generate or encode) to reduce code duplication?
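One way to collapse the two tests into a single parametrized one (a sketch assuming pytest and pytest-asyncio are available; `WrapperUnderTest` is an illustrative stand-in for the real wrapper, not the actual class under test):

```python
import asyncio
from unittest.mock import MagicMock

import pytest

class WrapperUnderTest:
    # Stand-in exposing both engine call sites (illustrative only).
    def __init__(self, engine):
        self.engine = engine

    async def _generate_async(self, request, is_pooling):
        if is_pooling:
            self.engine.encode(
                request_id=str(request.request_id),
                prompt=request.prompt,
                pooling_params=request.params,
                lora_request=request.lora_request,
            )
        else:
            self.engine.generate(
                request_id=str(request.request_id),
                prompt=request.prompt,
                sampling_params=request.params,
                lora_request=request.lora_request,
            )

@pytest.mark.asyncio
@pytest.mark.parametrize(
    "engine_method,is_pooling", [("generate", False), ("encode", True)]
)
async def test_forwards_lora_request(engine_method, is_pooling):
    engine = MagicMock()
    sentinel = object()
    request = MagicMock(
        request_id=1, prompt="hi", params=MagicMock(), lora_request=sentinel
    )
    await WrapperUnderTest(engine)._generate_async(request, is_pooling)
    # The sentinel must reach whichever engine method was exercised.
    call = getattr(engine, engine_method).call_args
    assert call.kwargs["lora_request"] is sentinel
```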

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core data Ray Data-related issues llm community-contribution Contributed by the community labels Apr 14, 2026
