From the technical report:
"Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines... and straightforward integration of alternative inference engines (e.g. SGLang [56], Tokasaurus [18])."
Decided to take a shot at this integration. Here's where I'm at.
Benchmark Results (2x H100)
| Metric | vLLM | SGLang | Improvement |
|---|---|---|---|
| Throughput | 300k tok/s | 382k tok/s | +27% |
| Latency | 1,705 ms | 1,341 ms | -21% |
Current State
SGLang server passes health checks, handles /v1/chat/completions, and the weight update endpoints (/update_weights, /reload_weights) work. Ran inference benchmarks against both backends with identical configs.
Blocker for End-to-End RL Training
The verifiers library expects prompt_ids, completion_ids, and is_truncated in the response (see orchestrator/trajectories.py). vLLM returns these when you pass extra_body={"return_token_ids": True}.
SGLang's chat completions endpoint only returns token counts in usage, not the actual IDs. This breaks the training loop when verifiers tries to process the trajectory.
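To make the gap concrete, here's a sketch of the two response shapes as verifiers sees them (field names are from the behavior described above; the exact payloads are illustrative, not copied from either server):

```python
# Shape vLLM returns when called with extra_body={"return_token_ids": True}
vllm_resp = {
    "choices": [{"message": {"content": "..."}}],
    "prompt_ids": [1, 2, 3],       # actual token IDs
    "completion_ids": [4, 5],
    "is_truncated": False,
}

# Shape SGLang's /v1/chat/completions returns: counts only, no IDs
sglang_resp = {
    "choices": [{"message": {"content": "..."}}],
    "usage": {"prompt_tokens": 3, "completion_tokens": 2},
}

# The fields orchestrator/trajectories.py expects
required = {"prompt_ids", "completion_ids", "is_truncated"}
print(required - vllm_resp.keys())    # empty: vLLM satisfies the contract
print(required - sglang_resp.keys())  # all three missing: training loop breaks here
```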
Questions
Token ID gap - Which approach would you prefer?
| Approach | Drawback |
|---|---|
| Post-tokenize in SGLang proxy using a HuggingFace tokenizer | Adds ~10-20 ms latency per request; potential token mismatch if tokenizer versions drift |
| Use SGLang's /generate endpoint (returns output_ids natively) | Requires a chat-to-raw-prompt translation layer; loses chat template handling |
| Upstream return_token_ids to SGLang | Depends on their roadmap; not a short-term fix |
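For the first option, the proxy-side shim could look roughly like this (a minimal sketch: `inject_token_ids` is a hypothetical helper, `encode` stands in for a HuggingFace tokenizer's `.encode`, and mapping `finish_reason == "length"` to `is_truncated` is my assumption, not verified against verifiers' semantics):

```python
def inject_token_ids(response: dict, prompt_text: str, encode) -> dict:
    """Add vLLM-style token-ID fields to an SGLang chat completion response.

    `encode` is any callable mapping text -> list[int], e.g. the .encode of
    a tokenizer loaded to match the served model. This is where version
    drift between server and proxy tokenizers could cause mismatches.
    """
    choice = response["choices"][0]
    response["prompt_ids"] = encode(prompt_text)
    response["completion_ids"] = encode(choice["message"]["content"])
    # Assumption: a "length" finish reason means the completion was truncated.
    response["is_truncated"] = choice.get("finish_reason") == "length"
    return response
```

The extra encode calls on every request are where the ~10-20 ms estimate comes from.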
NCCL weight broadcast - I have /init_broadcaster implemented but haven't tested multi-node. Does your CI have multi-GPU runners, or should I validate on my own cluster first?
Backend abstraction - Looking at env_worker.py, the tokenization is coupled to vLLM's response format. Any appetite to abstract this into a tokenizer callback, or should SGLang just match vLLM's schema exactly?
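Roughly what I have in mind for the callback, if there's appetite for it (all names here are hypothetical, not the current env_worker.py API):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Trajectory:
    prompt_ids: list[int]
    completion_ids: list[int]
    is_truncated: bool


class ResponseParser(Protocol):
    """Per-backend hook that turns a raw chat response into trajectory fields."""

    def parse(self, response: dict, prompt_text: str = "") -> Trajectory: ...


class VLLMParser:
    # vLLM already returns the IDs when extra_body={"return_token_ids": True}
    def parse(self, response: dict, prompt_text: str = "") -> Trajectory:
        return Trajectory(
            response["prompt_ids"],
            response["completion_ids"],
            response["is_truncated"],
        )


class SGLangParser:
    def __init__(self, encode):
        self.encode = encode  # e.g. a HuggingFace tokenizer's .encode

    def parse(self, response: dict, prompt_text: str = "") -> Trajectory:
        choice = response["choices"][0]
        return Trajectory(
            prompt_ids=self.encode(prompt_text),
            completion_ids=self.encode(choice["message"]["content"]),
            # Assumption: "length" finish reason signals truncation.
            is_truncated=choice.get("finish_reason") == "length",
        )
```

The worker would hold a `ResponseParser` instead of hardcoding vLLM's field names; the alternative (SGLang proxy matching vLLM's schema exactly) keeps env_worker.py untouched but bakes the vLLM contract in.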
Happy to iterate. Would love to get this working end-to-end and open a PR.
Code so far: https://github.com/pmukeshreddy/prime-rl/tree/sglang-swap