Remove reload_weights_on_start dead code (#1829)

samsja · mikasenghaas · claude · web-flow · commit 5c3008b3a54d · 2026-02-18T17:35:18.000-08:00
* Don't reload weights by default

* Remove reload_weights_on_start config and dead code path

The reload_weights mechanism was a no-op in practice: vLLM servers
start with base weights already loaded, and LoRA runs already
skipped the reload. Remove the config field, server endpoint,
client function, and orchestrator branching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove helo/orchestrator.toml from branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add changelog entry for reload_weights_on_start removal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
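
Migration sketch (not part of this commit): since the `reload_weights_on_start` field is removed from `OrchestratorConfig`, existing config files that still set it may need cleaning up. Whether a leftover key is rejected or silently ignored depends on how the settings class handles unknown fields, which this diff does not show. The helper below is a hypothetical, minimal way to locate the key in an already-parsed config; the function names `find_removed_key` and `find_removed_key_in_toml` are illustrative and do not exist in the repository.

```python
from typing import Any

# Name of the config field removed in this commit.
REMOVED_KEY = "reload_weights_on_start"


def find_removed_key(config: Any, path: str = "") -> list[str]:
    """Return the dotted path of every place the removed key is still set.

    `config` is a parsed config (nested dicts/lists), e.g. the result of
    loading a TOML file. This helper is hypothetical and not part of prime-rl.
    """
    hits: list[str] = []
    if isinstance(config, dict):
        for key, value in config.items():
            here = f"{path}.{key}" if path else str(key)
            if key == REMOVED_KEY:
                hits.append(here)
            # Recurse so the key is found no matter which table holds it.
            hits.extend(find_removed_key(value, here))
    elif isinstance(config, list):
        for index, item in enumerate(config):
            hits.extend(find_removed_key(item, f"{path}[{index}]"))
    return hits


def find_removed_key_in_toml(toml_text: str) -> list[str]:
    """Convenience wrapper for TOML text (needs Python 3.11+ for tomllib)."""
    import tomllib

    return find_removed_key(tomllib.loads(toml_text))
```

For example, `find_removed_key({"orchestrator": {"reload_weights_on_start": True}})` reports `orchestrator.reload_weights_on_start`, and a config without the key yields an empty list. Deleting the reported keys should be behavior-preserving, given that the commit message describes the reload as a no-op in practice.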
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -37,7 +37,7 @@ Documenting changes which affect configuration usage patterns (added/moved/remov
 - **`orchestrator.env.log`**: Removed. Use `orchestrator.log` for env worker logging instead (2026-01-15)
 - **`orchestrator.eval.retry.reraise`**: Changed default from `True` to `False`. When `False`, raises `tenacity.RetryError` after retries are exhausted instead of the original exception, allowing failed eval environments to be skipped with a warning (#1586, 2026-01-14)
 - **`model.ep`**: Expert parallelism now supported (with auto/custom impl only), changed from the old behaviour when `ep>1` was a no-op to a proper parallelization of the MoE layers. (#1595, 2026-01-15)
-- **`orchestrator.reload_weights_on_start`**: Added flag to control resetting inference weights to the base model when starting from scratch (default: True) (2026-01-21)
+- **`orchestrator.reload_weights_on_start`**: Removed. The reload was a no-op in practice since vLLM servers already start with base weights, and LoRA runs skipped it. (#1829, 2026-02-19)
 - **`orchestrator.client.elastic`**: Added elastic inference pool with DNS-based service discovery. Supports dynamic server scaling via any DNS hostname with multiple A records (Kubernetes headless services, Consul, Route53, etc.). Automatically syncs LoRA adapters on new servers and only exposes ready servers to workers (#1617, 2026-01-19)
 - **`model.fused_lm_head_chunk_size`**: Replaced chunk size `int | None` setting with `int | Literal["auto", "disabled"]` setting. `auto` auto-sets to 2048 if possible. `disabled` explicitly disables chunked loss (use vanilla LM head). Default behaviour is to use `auto` for RL training and `disabled` for SFT training. (not changed from previous version) (#1649, 2026-01-23)
 - **`client.skip_model_check`**: Added configuration to skip checking if the model is available in the inference pool. Useful for external APIs or API keys that don't support the /models endpoint (default: False) (#1543, 2026-01-06)
diff --git a/docs/entrypoints.md b/docs/entrypoints.md
@@ -16,7 +16,7 @@ The trainer is responsible for producing an updated policy model given rollouts
 
 ### Inference
 
-The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with two custom endpoints to enable updating the server with the latest policy: `update_weights` is used to reload model weights from a HF-compatible checkpoint on disk, and `reload_weights` is used to reset the weights to the base model in between experiments. Otherwise, we rely on vLLM's optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference. 
+The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with a custom `update_weights` endpoint to reload model weights from a HF-compatible checkpoint on disk. Otherwise, we rely on vLLM's optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference.
 
 ### RL
 
diff --git a/src/prime_rl/inference/vllm/server.py b/src/prime_rl/inference/vllm/server.py
@@ -76,14 +76,6 @@ async def update_weights(request: Request):
     return {"status": "ok"}
 
 
-@router.post("/reload_weights")
-async def reload_weights(request: Request):
-    await engine_client(request).collective_rpc("reload_weights")
-    # Reset prefix cache to invalidate KV states computed with old weights
-    await engine_client(request).reset_prefix_cache()
-    return {"status": "ok"}
-
-
 @router.post("/load_lora_adapter")
 async def load_lora_adapter(lora_request: LoadLoRAAdapterRequest, raw_request: Request):
     """Load a LoRA adapter and reset the prefix cache.
diff --git a/src/prime_rl/orchestrator/config.py b/src/prime_rl/orchestrator/config.py
@@ -720,12 +720,6 @@ class OrchestratorConfig(BaseSettings):
     # The checkpoint configuration
     ckpt: CheckpointConfig | None = None
 
-    # Whether to reset inference weights to base model when starting from scratch
-    reload_weights_on_start: Annotated[
-        bool,
-        Field(description="Whether to reset inference weights to the base model when starting from scratch."),
-    ] = True
-
     # The validation configuration
     val: ValConfig | None = None
 
diff --git a/src/prime_rl/orchestrator/orchestrator.py b/src/prime_rl/orchestrator/orchestrator.py
@@ -50,7 +50,6 @@
 )
 from prime_rl.utils.client import (
     init_nccl_broadcast,
-    reload_weights,
     setup_inference_pool,
 )
 from prime_rl.utils.heartbeat import Heartbeat
@@ -321,14 +320,7 @@ async def orchestrate(config: OrchestratorConfig):
         lora_name = config.model.lora.name if config.model.lora else None
         await inference_pool.update_weights(weights_path, lora_name=lora_name, step=scheduler.ckpt_step)
     else:
-        if config.reload_weights_on_start:
-            if config.model.lora is None:
-                logger.info("Training from scratch. Resetting weights to base model")
-                await reload_weights(inference_pool.admin_clients)
-            else:
-                logger.info("Training from scratch. Skipping base weight reload because LoRA is enabled")
-        else:
-            logger.info("Training from scratch. Skipping base weight reload")
+        logger.info("Training from scratch")
 
     # Iterate over dataset in batches
     max_steps = config.max_steps or int(1e9)
diff --git a/src/prime_rl/utils/client.py b/src/prime_rl/utils/client.py
@@ -249,24 +249,6 @@ async def _update_weights(admin_client: AsyncClient, weight_dir: str | None) ->
         await asyncio.gather(*[_update_weights(admin_client, weight_dir_posix) for admin_client in admin_clients])
 
 
-async def reload_weights(admin_clients: list[AsyncClient]) -> None:
-    """Make a HTTP post request to the vLLM server to reload weights (reset to base model)."""
-    logger = get_logger()
-
-    async def _reload_weights(admin_client: AsyncClient) -> None:
-        logger.debug("Sending request to reload weights (reset to base model)")
-        try:
-            response = await admin_client.post("/reload_weights", json={})
-            response.raise_for_status()
-        except httpx.HTTPStatusError as e:
-            if e.response.status_code == 404:
-                logger.warning("The route /reload_weights does not exist. Skipping weight reload.")
-                return
-            raise
-
-    await asyncio.gather(*[_reload_weights(admin_client) for admin_client in admin_clients])
-
-
 def _is_retryable_lora_error(exception: BaseException) -> bool:
     """Check if an exception should trigger a retry for LoRA loading."""
     if isinstance(exception, httpx.HTTPStatusError):