From the technical report:
"Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines... and straightforward integration of alternative inference engines (e.g. SGLang [56], Tokasaurus [18])."
Decided to take a shot at this integration. Here's where I'm at.
Benchmark Results (2x H100)
| Metric | vLLM | SGLang | Improvement |
|---|---|---|---|
| Throughput | 300k tok/s | 382k tok/s | +27% |
| Latency | 1,705 ms | 1,341 ms | -21% |
Current State
SGLang server passes health checks, handles /v1/chat/completions, and the weight update endpoints (/update_weights, /reload_weights) work. Ran inference benchmarks against both backends with identical configs.
Blocker for End-to-End RL Training
The verifiers library expects prompt_ids, completion_ids, and is_truncated in the response (see orchestrator/trajectories.py). vLLM returns these when you pass extra_body={"return_token_ids": True}.
SGLang's chat completions endpoint only returns token counts in usage, not the actual IDs. This breaks the training loop when verifiers tries to process the trajectory.
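To make the gap concrete, here's a sketch of the two response shapes as verifiers sees them (field names are from the behavior described above; the exact payloads are illustrative, not copied from either server):

```python
# Shape vLLM returns when called with extra_body={"return_token_ids": True}
vllm_resp = {
    "choices": [{"message": {"content": "..."}}],
    "prompt_ids": [1, 2, 3],       # actual token IDs
    "completion_ids": [4, 5],
    "is_truncated": False,
}

# Shape SGLang's /v1/chat/completions returns: counts only, no IDs
sglang_resp = {
    "choices": [{"message": {"content": "..."}}],
    "usage": {"prompt_tokens": 3, "completion_tokens": 2},
}

# The fields orchestrator/trajectories.py expects
required = {"prompt_ids", "completion_ids", "is_truncated"}
print(required - vllm_resp.keys())    # empty: vLLM satisfies the contract
print(required - sglang_resp.keys())  # all three missing: training loop breaks here
```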
Questions
Token ID gap - Which approach would you prefer?
| Approach | Drawback |
|---|---|
| Post-tokenize in SGLang proxy using a HuggingFace tokenizer | Adds ~10-20 ms latency per request; potential token mismatch if tokenizer versions drift |
| Use SGLang's /generate endpoint (returns output_ids natively) | Requires a chat-to-raw-prompt translation layer; loses chat template handling |
| Upstream return_token_ids to SGLang | Depends on their roadmap; not a short-term fix |
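For the first option, the proxy-side shim could look roughly like this (a minimal sketch: `inject_token_ids` is a hypothetical helper, `encode` stands in for a HuggingFace tokenizer's `.encode`, and mapping `finish_reason == "length"` to `is_truncated` is my assumption, not verified against verifiers' semantics):

```python
def inject_token_ids(response: dict, prompt_text: str, encode) -> dict:
    """Add vLLM-style token-ID fields to an SGLang chat completion response.

    `encode` is any callable mapping text -> list[int], e.g. the .encode of
    a tokenizer loaded to match the served model. This is where version
    drift between server and proxy tokenizers could cause mismatches.
    """
    choice = response["choices"][0]
    response["prompt_ids"] = encode(prompt_text)
    response["completion_ids"] = encode(choice["message"]["content"])
    # Assumption: a "length" finish reason means the completion was truncated.
    response["is_truncated"] = choice.get("finish_reason") == "length"
    return response
```

The extra encode calls on every request are where the ~10-20 ms estimate comes from.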
NCCL weight broadcast - I have /init_broadcaster implemented but haven't tested multi-node. Does your CI have multi-GPU runners, or should I validate on my own cluster first?
Backend abstraction - Looking at env_worker.py, the tokenization is coupled to vLLM's response format. Any appetite to abstract this into a tokenizer callback, or should SGLang just match vLLM's schema exactly?
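Roughly what I have in mind for the callback, if there's appetite for it (all names here are hypothetical, not the current env_worker.py API):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Trajectory:
    prompt_ids: list[int]
    completion_ids: list[int]
    is_truncated: bool


class ResponseParser(Protocol):
    """Per-backend hook that turns a raw chat response into trajectory fields."""

    def parse(self, response: dict, prompt_text: str = "") -> Trajectory: ...


class VLLMParser:
    # vLLM already returns the IDs when extra_body={"return_token_ids": True}
    def parse(self, response: dict, prompt_text: str = "") -> Trajectory:
        return Trajectory(
            response["prompt_ids"],
            response["completion_ids"],
            response["is_truncated"],
        )


class SGLangParser:
    def __init__(self, encode):
        self.encode = encode  # e.g. a HuggingFace tokenizer's .encode

    def parse(self, response: dict, prompt_text: str = "") -> Trajectory:
        choice = response["choices"][0]
        return Trajectory(
            prompt_ids=self.encode(prompt_text),
            completion_ids=self.encode(choice["message"]["content"]),
            # Assumption: "length" finish reason signals truncation.
            is_truncated=choice.get("finish_reason") == "length",
        )
```

The worker would hold a `ResponseParser` instead of hardcoding vLLM's field names; the alternative (SGLang proxy matching vLLM's schema exactly) keeps env_worker.py untouched but bakes the vLLM contract in.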
Happy to iterate. Would love to get this working end-to-end and open a PR.
Code so far: https://github.com/pmukeshreddy/prime-rl/tree/sglang-swap