Description
Goal: Help developers identify where a regression originates in Ray Data LLM jobs.
Approach: In Ray Data LLM's release tests, we run a benchmark with a single-GPU generation workload. When a regression occurs, it is not immediately clear whether it originates from the vLLM engine or from Ray Data LLM. Once #60385 lands, we should be able to profile the vLLM engine (e.g. TPOT, token throughput, E2E request latency) and print those metrics in the benchmark output, so it is clear whether the regression comes from the vLLM engine. We can also profile the Ray Data LLM job itself.
Relevant file: https://github.com/ray-project/ray/blob/ad1b87448fec4db7ef11f1697f9bc02ae6a7ba09/release/llm_tests/batch/test_batch_single_node_vllm.py
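A minimal sketch of the kind of aggregation this could add to the release test: reduce per-request engine stats into summary metrics (TPOT, token throughput, E2E latency) and print them alongside the existing benchmark output. The row field names (`e2e_latency_s`, `decode_time_s`, `num_generated_tokens`) and the helper itself are hypothetical, not the actual schema emitted by Ray Data LLM or the vLLM engine.

```python
# Hypothetical sketch: aggregate per-request engine stats into benchmark output.
# Field names below are assumptions, not the real vLLM / Ray Data LLM schema.
import statistics
from typing import Dict, List


def summarize_engine_metrics(
    rows: List[Dict[str, float]], wall_clock_s: float
) -> Dict[str, float]:
    """Reduce per-request stats into summary metrics to print with the benchmark."""
    e2e = [r["e2e_latency_s"] for r in rows]
    tokens = [r["num_generated_tokens"] for r in rows]
    # TPOT approximated as decode time divided by generated tokens per request.
    tpot = [r["decode_time_s"] / max(r["num_generated_tokens"], 1) for r in rows]
    return {
        "mean_e2e_latency_s": statistics.mean(e2e),
        "p99_e2e_latency_s": sorted(e2e)[int(0.99 * (len(e2e) - 1))],
        "mean_tpot_s": statistics.mean(tpot),
        "token_throughput_tok_per_s": sum(tokens) / wall_clock_s,
    }


if __name__ == "__main__":
    # Dummy rows standing in for what the profiled vLLM engine would report.
    fake_rows = [
        {"e2e_latency_s": 1.2, "decode_time_s": 0.9, "num_generated_tokens": 128},
        {"e2e_latency_s": 1.5, "decode_time_s": 1.1, "num_generated_tokens": 160},
    ]
    print(summarize_engine_metrics(fake_rows, wall_clock_s=2.0))
```

Printing a breakdown like this in the benchmark output would make it straightforward to tell whether a throughput drop is accompanied by a change in engine-level latency (pointing at vLLM) or not (pointing at the Ray Data LLM layer).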
Hardware requirement: You will need a GPU for this task.
Use case
No response