[llm] Forward lora_request to vLLM engine in vLLMEngineWrapper #62609
Zerui18 wants to merge 2 commits into ray-project:master
Conversation
vLLMEngineWrapper._prepare_llm_request correctly resolves per-row LoRA adapters into a vllm.lora.request.LoRARequest and stores it on vLLMEngineRequest.lora_request, but _generate_async never forwarded it to self.engine.generate(...) or self.engine.encode(...). Per-row LoRA adapters (e.g. stamping row["model"] with an adapter path) were silently dropped and every request fell back to base-model weights.

Pass request.lora_request to both the generate and encode code paths. Both AsyncLLMEngine.generate and AsyncLLMEngine.encode accept an optional lora_request kwarg, so passing None (when no adapter is resolved) is a no-op.

Add a regression test that mocks the engine and asserts request.lora_request is forwarded for the GENERATE path and for the EMBED / CLASSIFY / SCORE pooling paths.

Signed-off-by: Zerui Chen <[email protected]>
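The shape of the fix can be sketched with stand-in names (this is illustrative code, not the actual Ray/vLLM source; `EngineRequest` and `EngineWrapper` are hypothetical simplifications of `vLLMEngineRequest` and `vLLMEngineWrapper`). The point is simply that the already-resolved `lora_request` field must be forwarded as a kwarg instead of being dropped:

```python
# Minimal sketch of the forwarding fix. Stand-in names, not the real
# Ray/vLLM classes: the wrapper resolves a per-row adapter earlier and
# must now pass it through to the engine call.
import asyncio
from dataclasses import dataclass
from typing import Any, Optional
from unittest.mock import AsyncMock


@dataclass
class EngineRequest:
    request_id: int
    prompt: str
    params: Any
    lora_request: Optional[Any]  # resolved adapter, or None for base model


class EngineWrapper:
    def __init__(self, engine):
        self.engine = engine

    async def generate_async(self, request: EngineRequest):
        # The bug was omitting lora_request here, which silently dropped
        # per-row adapters. Forwarding it fixes that; engines that accept
        # lora_request=None treat the base-model case as a no-op.
        return await self.engine.generate(
            prompt=request.prompt,
            sampling_params=request.params,
            request_id=str(request.request_id),
            lora_request=request.lora_request,
        )


async def demo():
    engine = AsyncMock()
    wrapper = EngineWrapper(engine)
    sentinel = object()
    req = EngineRequest(1, "hi", params=None, lora_request=sentinel)
    await wrapper.generate_async(req)
    # The mocked engine records the kwargs it was called with.
    kwargs = engine.generate.call_args.kwargs
    assert kwargs["lora_request"] is sentinel
    return kwargs


kwargs = asyncio.run(demo())
```

A mocked engine is enough to demonstrate the contract: the assertion fails on the pre-fix code path (no `lora_request` kwarg) and passes once the kwarg is forwarded.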
Code Review
This pull request ensures that lora_request is correctly forwarded to the vLLM engine for both generation and pooling tasks, and adds corresponding regression tests. Review feedback identifies a missing request_id in the encode call and an invalid task argument in the PoolingParams constructor within the new tests.
    prompt=llm_prompt,
    pooling_params=request.params,
    tokenization_kwargs=request.tokenization_kwargs,
    lora_request=request.lora_request,
The self.engine.encode call is missing the request_id argument. In vLLM's AsyncLLMEngine.encode, request_id is a required argument. This is also inconsistent with the self.engine.generate call below (line 568), which correctly passes the request_id.
    request_id=str(request.request_id),
    prompt=llm_prompt,
    pooling_params=request.params,
    tokenization_kwargs=request.tokenization_kwargs,
    lora_request=request.lora_request,

False positive. request_id is still being passed on line 560 (unchanged by this PR).
    multimodal_data=None,
    mm_processor_kwargs=None,
    multimodal_uuids=None,
    params=vllm.PoolingParams(task=task_type),
vllm.PoolingParams does not accept a task keyword argument in its constructor. Passing this will cause a TypeError during test execution. Since the task type is already correctly set on the wrapper object at line 320, you can instantiate PoolingParams without arguments.
-    params=vllm.PoolingParams(task=task_type),
+    params=vllm.PoolingParams(),
Good catch. Fixed in the follow-up commit - switched both tests' params to MagicMock(). Since _generate_async passes it through opaquely to the mocked engine, a real PoolingParams isn't needed and this avoids depending on vLLM constructor signatures across versions.
vllm.PoolingParams does not accept a `task` keyword argument in its constructor. _generate_async passes request.params through opaquely to the mocked engine.encode / engine.generate, so a MagicMock is sufficient and avoids depending on the vLLM PoolingParams / SamplingParams constructor signatures. Drop the now-unused `import vllm`. Signed-off-by: Zerui Chen <[email protected]>
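The pass-through reasoning in that commit message can be illustrated with a small stand-alone sketch (hypothetical stand-in code, not the real test suite): because the code under test never introspects `params` and only hands it to the mocked engine, a `MagicMock` is sufficient and no real `PoolingParams` constructor is needed.

```python
# Illustration (hypothetical stand-in code): when params is passed
# through opaquely to a mocked engine, a MagicMock can stand in for
# PoolingParams, avoiding any dependency on vLLM constructor signatures.
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def forward(engine, params):
    # The wrapper never inspects params; it only forwards it.
    await engine.encode(pooling_params=params, lora_request=None)


engine = AsyncMock()
fake_params = MagicMock(name="PoolingParams")
asyncio.run(forward(engine, fake_params))

# The mocked engine received the exact same object we passed in.
passed = engine.encode.call_args.kwargs["pooling_params"]
assert passed is fake_params
```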
Hey @Zerui18, thanks for your contribution. Is it possible to have an integration test rather than unit tests asserting the arguments the engine is invoked with?
Hi Jeffrey! Happy to discuss this. Since the fix is a localised missing kwarg, the unit tests are deliberately scoped to assert exactly the forwarding. A behavioural integration test would need to assert differences in LoRA-applied outputs, which in turn requires a LoRA adapter fixture whose effect on generation is known and stable; that feels out of scope for this fix.
@Zerui18 vLLM tends to modify their APIs quite a bit, so I think it's better to have an integration test validating the end-user behavior to make sure that we don't miss important use cases when we adapt to vLLM's APIs. It seems that since the introduction of the v1 vLLM engine, LoRA has never been supported. Could you update this release test to validate that LoRA is exercised by inspecting the output somehow? Great catch btw.
    wrapper.shutdown()

@pytest.mark.asyncio
can we parametrize the method to validate against (generate or encode) to reduce code duplication?
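The parametrization the reviewer suggests can be sketched as follows. This is illustrative stand-in code, not the real test suite: `exercise` is a hypothetical simplification of the wrapper's dispatch, and the loop stands in for what `@pytest.mark.parametrize("method", ["generate", "encode"])` would express in the actual tests.

```python
# Sketch of parametrizing one check over both engine methods instead of
# duplicating two near-identical tests. Illustrative stand-ins only.
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def exercise(engine, method: str, lora_request):
    # Stand-in for the code path under test: dispatch to the named
    # engine method and forward lora_request either way.
    fn = getattr(engine, method)
    await fn(params=MagicMock(), lora_request=lora_request)


def check(method: str) -> bool:
    engine = AsyncMock()
    sentinel = object()
    asyncio.run(exercise(engine, method, sentinel))
    # Verify the sentinel adapter object reached the engine untouched.
    kwargs = getattr(engine, method).call_args.kwargs
    return kwargs["lora_request"] is sentinel


# Under pytest this loop would collapse into a single test decorated
# with @pytest.mark.parametrize("method", ["generate", "encode"]).
results = {m: check(m) for m in ("generate", "encode")}
assert all(results.values())
```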
Description
`vLLMEngineWrapper._prepare_llm_request` in `python/ray/llm/_internal/batch/stages/vllm_engine_stage.py` correctly resolves per-row LoRA adapters into a `vllm.lora.request.LoRARequest` via `_maybe_get_lora_request(row)` and stores it on `vLLMEngineRequest.lora_request`. However, `_generate_async` never forwarded that field to either `self.engine.generate(...)` or `self.engine.encode(...)`: the `lora_request` kwarg was simply omitted from both call sites.

The effect: any user who dispatches LoRA adapters per row (e.g. by stamping `row["model"]` with an adapter path, which is the documented Ray Data LLM pattern for multi-adapter batch inference) had their adapters silently dropped. Every request fell back to base-model weights even though the engine was initialized with `enable_lora=True`, the adapter was downloaded, and the `LoRARequest` object was constructed.

Fix

Pass `request.lora_request` to both the generate and encode code paths in `vLLMEngineWrapper._generate_async`. Both `AsyncLLMEngine.generate` and `AsyncLLMEngine.encode` accept an optional `lora_request` kwarg, so passing `None` (when no adapter is resolved for a row) is a no-op and fully backward-compatible.

Regression test
Added `test_vllm_wrapper_forwards_lora_request_to_generate` and `test_vllm_wrapper_forwards_lora_request_to_encode` (parametrized over `EMBED` / `CLASSIFY` / `SCORE`) in `python/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py`. The tests bypass `__init__`, mock `self.engine`, build a `vLLMEngineRequest` with a sentinel `lora_request`, drive it through `_generate_async`, and assert the mocked engine was called with `lora_request=<sentinel>`. They run as pure Python unit tests (no GPU, no model download).

The existing `test_vllm_wrapper_lora` GPU integration test continues to exercise the actual LoRA loading path; this fix makes its dispatch effective end-to-end.

Related issues
None filed — the PR itself documents the bug.
Additional information
The `generate()` and `encode()` paths are fixed in the same PR to keep the bug fully closed; the `encode()` path has the same omission, and LoRA is a legitimate use case for pooled embeddings / classification.