
[RFC] SGLang Backend Support - 27% Throughput Improvement #1615

@pmukeshreddy

Description


From the technical report:

"Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines... and straightforward integration of alternative inference engines (e.g. SGLang [56], Tokasaurus [18])."

Decided to take a shot at this integration. Here's where I'm at.

Benchmark Results (2x H100)

| Metric | vLLM | SGLang | Improvement |
|---|---|---|---|
| Throughput | 300k tok/s | 382k tok/s | +27% |
| Latency | 1,705 ms | 1,341 ms | -21% |

Current State

The SGLang server passes health checks, handles /v1/chat/completions, and the weight-update endpoints (/update_weights, /reload_weights) work. I ran inference benchmarks against both backends with identical configs.

Blocker for End-to-End RL Training

The verifiers library expects prompt_ids, completion_ids, and is_truncated in the response (see orchestrator/trajectories.py). vLLM returns these when you pass extra_body={"return_token_ids": True}.

SGLang's chat completions endpoint only returns token counts in usage, not the actual IDs. This breaks the training loop when verifiers tries to process the trajectory.
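To make the schema gap concrete, here's a minimal sketch of the two response shapes and a check for the fields verifiers needs. The field names come from the description above; the exact payload layouts are simplified assumptions, not the real API responses.

```python
# Fields that orchestrator/trajectories.py expects on each response.
REQUIRED_FIELDS = ("prompt_ids", "completion_ids", "is_truncated")

def missing_fields(response: dict) -> list[str]:
    """Return the trajectory fields verifiers needs but the response lacks."""
    return [f for f in REQUIRED_FIELDS if f not in response]

# vLLM with extra_body={"return_token_ids": True} returns the IDs directly
# (payload trimmed to the relevant fields).
vllm_response = {
    "choices": [{"message": {"content": "..."}}],
    "prompt_ids": [1, 2, 3],
    "completion_ids": [4, 5, 6],
    "is_truncated": False,
}

# SGLang's chat completions endpoint only reports token *counts* in usage.
sglang_response = {
    "choices": [{"message": {"content": "..."}}],
    "usage": {"prompt_tokens": 3, "completion_tokens": 3},
}

print(missing_fields(vllm_response))    # []
print(missing_fields(sglang_response))  # ['prompt_ids', 'completion_ids', 'is_truncated']
```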

Questions

**Token ID gap** - Which approach would you prefer?

| Approach | Drawback |
|---|---|
| Post-tokenize in the SGLang proxy using a HuggingFace tokenizer | Adds ~10-20 ms latency per request; potential token mismatch if tokenizer versions drift |
| Use SGLang's /generate endpoint (returns output_ids natively) | Requires a chat-to-raw-prompt translation layer; loses chat template handling |
| Upstream return_token_ids to SGLang | Depends on their roadmap; not a short-term fix |

**NCCL weight broadcast** - I have /init_broadcaster implemented but haven't tested multi-node. Does your CI have multi-GPU runners, or should I validate on my own cluster first?

**Backend abstraction** - Looking at env_worker.py, the tokenization is coupled to vLLM's response format. Any appetite to abstract this into a tokenizer callback, or should SGLang just match vLLM's schema exactly?
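One possible shape for that callback, sketched below. All names here are hypothetical, not the actual prime-rl API: each backend registers an extractor that pulls (prompt_ids, completion_ids, is_truncated) out of its own response format, so env_worker.py never touches backend-specific fields.

```python
from typing import Callable

# (prompt_ids, completion_ids, is_truncated)
TokenExtractor = Callable[[dict], tuple[list[int], list[int], bool]]

def vllm_extract(resp: dict) -> tuple[list[int], list[int], bool]:
    # vLLM returns the IDs directly when return_token_ids is set.
    return resp["prompt_ids"], resp["completion_ids"], resp["is_truncated"]

def sglang_extract(resp: dict) -> tuple[list[int], list[int], bool]:
    # Whichever gap-filling strategy wins above (proxy re-tokenization,
    # /generate, or upstreamed return_token_ids) plugs in here.
    raise NotImplementedError

EXTRACTORS: dict[str, TokenExtractor] = {
    "vllm": vllm_extract,
    "sglang": sglang_extract,
}
```

The alternative (SGLang matching vLLM's schema exactly) keeps env_worker.py untouched but bakes vLLM's response format in as the de facto interface.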

Happy to iterate. Would love to get this working end-to-end and open a PR.
Code so far: https://github.com/pmukeshreddy/prime-rl/tree/sglang-swap
