
Commit 5c3008b

samsja, mikasenghaas, and claude authored
Remove reload_weights_on_start dead code (#1829)
* dont reload weights by default

* Remove reload_weights_on_start config and dead code path

  The reload_weights mechanism was a no-op in practice: vLLM servers start with base weights already loaded, and LoRA runs already skipped the reload. Remove the config field, server endpoint, client function, and orchestrator branching.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove helo/orchestrator.toml from branch

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add changelog entry for reload_weights_on_start removal

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 54b3e4c commit 5c3008b

6 files changed: 3 additions & 43 deletions


CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ Documenting changes which affect configuration usage patterns (added/moved/remov
 - **`orchestrator.env.log`**: Removed. Use `orchestrator.log` for env worker logging instead (2026-01-15)
 - **`orchestrator.eval.retry.reraise`**: Changed default from `True` to `False`. When `False`, raises `tenacity.RetryError` after retries are exhausted instead of the original exception, allowing failed eval environments to be skipped with a warning (#1586, 2026-01-14)
 - **`model.ep`**: Expert parallelism now supported (with auto/custom impl only); changed from the old behaviour, where `ep>1` was a no-op, to a proper parallelization of the MoE layers (#1595, 2026-01-15)
-- **`orchestrator.reload_weights_on_start`**: Added flag to control resetting inference weights to the base model when starting from scratch (default: True) (2026-01-21)
+- **`orchestrator.reload_weights_on_start`**: Removed. The reload was a no-op in practice since vLLM servers already start with base weights, and LoRA runs skipped it. (#1829, 2026-02-19)
 - **`orchestrator.client.elastic`**: Added elastic inference pool with DNS-based service discovery. Supports dynamic server scaling via any DNS hostname with multiple A records (Kubernetes headless services, Consul, Route53, etc.). Automatically syncs LoRA adapters on new servers and only exposes ready servers to workers (#1617, 2026-01-19)
 - **`model.fused_lm_head_chunk_size`**: Replaced chunk size `int | None` setting with `int | Literal["auto", "disabled"]` setting. `auto` auto-sets to 2048 if possible. `disabled` explicitly disables chunked loss (use vanilla LM head). Default behaviour is to use `auto` for RL training and `disabled` for SFT training. (not changed from previous version) (#1649, 2026-01-23)
 - **`client.skip_model_check`**: Added configuration to skip checking if the model is available in the inference pool. Useful for external APIs or API keys that don't support the /models endpoint (default: False) (#1543, 2026-01-06)

docs/entrypoints.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ The trainer is responsible for producing an updated policy model given rollouts
 
 ### Inference
 
-The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with two custom endpoints to enable updating the server with the latest policy: `update_weights` is used to reload model weights from a HF-compatible checkpoint on disk, and `reload_weights` is used to reset the weights to the base model in between experiments. Otherwise, we rely on vLLM's optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference.
+The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with a custom `update_weights` endpoint to reload model weights from a HF-compatible checkpoint on disk. Otherwise, we rely on vLLM's optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference.
 
 ### RL
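To make the surviving endpoint concrete: the orchestrator drives `update_weights` with a plain HTTP POST from its admin client. Below is a minimal sketch of such a call using `httpx`; the server address and the `{"weight_dir": ...}` payload field are illustrative assumptions, not the project's confirmed wire format.

```python
import asyncio

import httpx


async def update_weights(base_url: str, weight_dir: str) -> None:
    """POST to the custom update_weights endpoint of a vLLM inference server.

    The endpoint path matches the docs above; the JSON body shape
    ({"weight_dir": ...}) is a hypothetical example only.
    """
    async with httpx.AsyncClient(base_url=base_url, timeout=120.0) as client:
        response = await client.post("/update_weights", json={"weight_dir": weight_dir})
        response.raise_for_status()  # surface 4xx/5xx instead of failing silently


if __name__ == "__main__":
    # Hypothetical admin address and checkpoint path.
    asyncio.run(update_weights("http://localhost:8000", "/ckpts/step_100/weights"))
```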

src/prime_rl/inference/vllm/server.py

Lines changed: 0 additions & 8 deletions
@@ -76,14 +76,6 @@ async def update_weights(request: Request):
     return {"status": "ok"}
 
 
-@router.post("/reload_weights")
-async def reload_weights(request: Request):
-    await engine_client(request).collective_rpc("reload_weights")
-    # Reset prefix cache to invalidate KV states computed with old weights
-    await engine_client(request).reset_prefix_cache()
-    return {"status": "ok"}
-
-
 @router.post("/load_lora_adapter")
 async def load_lora_adapter(lora_request: LoadLoRAAdapterRequest, raw_request: Request):
     """Load a LoRA adapter and reset the prefix cache.

src/prime_rl/orchestrator/config.py

Lines changed: 0 additions & 6 deletions
@@ -720,12 +720,6 @@ class OrchestratorConfig(BaseSettings):
     # The checkpoint configuration
     ckpt: CheckpointConfig | None = None
 
-    # Whether to reset inference weights to base model when starting from scratch
-    reload_weights_on_start: Annotated[
-        bool,
-        Field(description="Whether to reset inference weights to the base model when starting from scratch."),
-    ] = True
-
     # The validation configuration
     val: ValConfig | None = None
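The removed field shows how the orchestrator declares config options: a type wrapped in `Annotated` with a pydantic `Field` carrying the description, plus an inline default. A minimal runnable sketch of the same pattern (the class and field names here are illustrative, not the full `OrchestratorConfig`):

```python
from typing import Annotated

from pydantic import Field
from pydantic_settings import BaseSettings


class MiniConfig(BaseSettings):
    # Same declaration style as the removed reload_weights_on_start field:
    # Annotated[type, Field(...)] with an inline default.
    dry_run: Annotated[
        bool,
        Field(description="Whether to skip side effects and only log actions."),
    ] = False


config = MiniConfig()  # reads overrides from the environment, e.g. DRY_RUN=true
print(config.dry_run)  # -> False unless overridden
```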

src/prime_rl/orchestrator/orchestrator.py

Lines changed: 1 addition & 9 deletions
@@ -50,7 +50,6 @@
 )
 from prime_rl.utils.client import (
     init_nccl_broadcast,
-    reload_weights,
     setup_inference_pool,
 )
 from prime_rl.utils.heartbeat import Heartbeat
@@ -321,14 +320,7 @@ async def orchestrate(config: OrchestratorConfig):
         lora_name = config.model.lora.name if config.model.lora else None
         await inference_pool.update_weights(weights_path, lora_name=lora_name, step=scheduler.ckpt_step)
     else:
-        if config.reload_weights_on_start:
-            if config.model.lora is None:
-                logger.info("Training from scratch. Resetting weights to base model")
-                await reload_weights(inference_pool.admin_clients)
-            else:
-                logger.info("Training from scratch. Skipping base weight reload because LoRA is enabled")
-        else:
-            logger.info("Training from scratch. Skipping base weight reload")
+        logger.info("Training from scratch")
 
     # Iterate over dataset in batches
     max_steps = config.max_steps or int(1e9)
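After this change the start-up logic reduces to a single branch: resumed runs push the checkpointed policy to the inference pool, while from-scratch runs do nothing, since the servers already boot with base weights. A condensed, runnable sketch of that control flow (`StubPool`, the checkpoint path, and the step number are simplified stand-ins for the orchestrator's actual state):

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orchestrator")


class StubPool:
    """Stand-in for the inference pool; only mimics update_weights."""

    async def update_weights(self, weights_path: str, lora_name: str | None, step: int) -> None:
        logger.info("pushing %s (lora=%s) at step %d", weights_path, lora_name, step)


async def sync_weights_on_start(resuming: bool, pool: StubPool) -> None:
    """Condensed view of the orchestrator's start-up weight sync after this change."""
    if resuming:
        # Resumed runs must push the checkpointed policy to the servers.
        await pool.update_weights("/ckpts/step_100/weights", lora_name=None, step=100)
    else:
        # Fresh runs need no sync: vLLM servers boot with base weights loaded.
        logger.info("Training from scratch")


asyncio.run(sync_weights_on_start(resuming=False, pool=StubPool()))
```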

src/prime_rl/utils/client.py

Lines changed: 0 additions & 18 deletions
@@ -249,24 +249,6 @@ async def _update_weights(admin_client: AsyncClient, weight_dir: str | None) ->
     await asyncio.gather(*[_update_weights(admin_client, weight_dir_posix) for admin_client in admin_clients])
 
 
-async def reload_weights(admin_clients: list[AsyncClient]) -> None:
-    """Make a HTTP post request to the vLLM server to reload weights (reset to base model)."""
-    logger = get_logger()
-
-    async def _reload_weights(admin_client: AsyncClient) -> None:
-        logger.debug("Sending request to reload weights (reset to base model)")
-        try:
-            response = await admin_client.post("/reload_weights", json={})
-            response.raise_for_status()
-        except httpx.HTTPStatusError as e:
-            if e.response.status_code == 404:
-                logger.warning("The route /reload_weights does not exist. Skipping weight reload.")
-                return
-            raise
-
-    await asyncio.gather(*[_reload_weights(admin_client) for admin_client in admin_clients])
-
-
 def _is_retryable_lora_error(exception: BaseException) -> bool:
     """Check if an exception should trigger a retry for LoRA loading."""
     if isinstance(exception, httpx.HTTPStatusError):
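The deleted helper illustrates the client-side pattern the module still uses elsewhere: build one coroutine per admin client, fan them out with `asyncio.gather`, and tolerate a 404 so that servers lacking the route do not crash the run. A generic, self-contained sketch of that pattern (the host addresses and route name are placeholders, not endpoints this project is known to expose):

```python
import asyncio

import httpx


async def post_to_all(base_urls: list[str], route: str) -> None:
    """Fan a POST out to every server, skipping servers that lack the route."""

    async def _post(client: httpx.AsyncClient) -> None:
        try:
            response = await client.post(route, json={})
            response.raise_for_status()
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 404:
                # Older servers may not expose the route; degrade gracefully.
                print(f"{client.base_url}{route} missing, skipping")
                return
            raise  # any other HTTP error is a real failure

    clients = [httpx.AsyncClient(base_url=url) for url in base_urls]
    try:
        await asyncio.gather(*[_post(client) for client in clients])
    finally:
        await asyncio.gather(*[client.aclose() for client in clients])


# Hypothetical admin endpoints of two inference servers.
asyncio.run(post_to_all(["http://10.0.0.1:8000", "http://10.0.0.2:8000"], "/reset_prefix_cache"))
```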
