[data][llm] Profile the vLLM engine in data LLM release benchmarks #60935

@jeffreywang-anyscale

Description

Goal: Help developers identify where a regression originates in Ray Data LLM jobs.

Approach: In Ray Data LLM's release tests, we run a benchmark with a single-GPU generation workload. When there is a regression, it is not immediately clear whether it originates from the vLLM engine or from Ray Data LLM. Once #60385 lands, we should be able to profile the vLLM engine (e.g., TPOT, token throughput, E2E request latency) and print those metrics in the benchmark output, so it is clear whether a regression comes from the vLLM engine. We can also profile the Ray Data LLM job itself (see the sketches below).
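
For reference, a minimal sketch of the engine-side measurement, assuming a plain vLLM `LLM.generate` call and that the installed vLLM version populates `RequestOutput.metrics` (the model name, prompt set, and request count below are placeholders, not what the release test uses):

```python
import time

from vllm import LLM, SamplingParams

# Placeholder workload: 64 identical prompts, up to 256 output tokens each.
prompts = ["Summarize the benefits of batch inference."] * 64
sampling_params = SamplingParams(max_tokens=256)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"engine wall time: {elapsed:.2f}s")
print(f"output token throughput: {total_output_tokens / elapsed:.1f} tok/s")

# Per-request engine metrics, when RequestOutput.metrics is populated
# (it can be None depending on the vLLM version / engine in use).
for o in outputs:
    m = o.metrics
    if m is None or m.first_token_time is None or m.finished_time is None:
        continue
    ttft = m.first_token_time - m.arrival_time   # time to first token
    e2e = m.finished_time - m.arrival_time       # end-to-end request latency
    n_out = len(o.outputs[0].token_ids)
    tpot = (e2e - ttft) / max(n_out - 1, 1)      # time per output token
    print(f"TTFT={ttft:.3f}s  TPOT={tpot * 1000:.1f}ms  E2E={e2e:.3f}s")
```

Printing these engine-only numbers next to the existing benchmark output should make it clear which layer regressed.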

Relevant file: https://github.com/ray-project/ray/blob/ad1b87448fec4db7ef11f1697f9bc02ae6a7ba09/release/llm_tests/batch/test_batch_single_node_vllm.py
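
A sketch of timing the Ray Data LLM side of the same workload, assuming the `ray.data.llm` processor API (`vLLMEngineProcessorConfig` / `build_llm_processor`); exact config field names such as `model_source` and the output column name can differ between Ray versions, and the dataset here is a placeholder:

```python
import time

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Placeholder config: same model as the engine-only sketch above.
config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-0.5B-Instruct",
    concurrency=1,
    batch_size=64,
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items(
    [{"prompt": "Summarize the benefits of batch inference."}] * 256
)

start = time.perf_counter()
result = processor(ds).materialize()
elapsed = time.perf_counter() - start

print(f"Ray Data LLM wall time: {elapsed:.2f}s")
print(f"rows per second: {result.count() / elapsed:.1f}")
```

Comparing the end-to-end wall time and row throughput here against the engine-only numbers above indicates how much overhead Ray Data LLM adds on top of vLLM.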

Hardware requirement: You will need a GPU for this task.

Use case

No response

Labels

community-backlog; enhancement (Request for new feature and/or capability); good-first-issue (Great starter issue for someone just starting to contribute to Ray)
