What happened?
On a Dynamo DynamoGraphDeployment running with optimization_target=latency, the Dynamo Planner emits multiple back-to-back replica scale-up patches against the DGD (typical: 3–6 patches over 20–30 s while load is sustained). Each patch is translated by the Dynamo operator into a PodCliqueSet spec update. Grove appears to treat this rapid sequence of spec updates as a full rollout rather than an in-place replica bump, which has these consequences:
- Every pod in the gang (Frontend, Planner, PrefillWorker, DecodeWorker) is killed and recreated inside a ~5 s window.
- The race between the operator applying replicas=N and Grove applying a new spec hash produces intermediate prefill pods that are created and then re-deleted mid-reconcile (e.g. prefill-5ckwr, prefill-zkxzw in the run below).
- The final replicas value on the impacted PodClique is 1, while updatedReplicas reaches 2 and is never reset; the DGD is left in Ready=False / state: pending indefinitely:
```
conditions:
- message: 'Resources not ready: tc-2-2-3-late-da27: podclique/tc-2-2-3-late-da27-0-vllmprefillworker:
    desired=1, updated=2'
  reason: some_resources_are_not_ready
  status: "False"
  type: Ready
services:
  VllmPrefillWorker:
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 2
```
The same workload with optimization_target=throughput (which produces ~1 patch per scale-up cycle instead of a burst) takes the in-place PodCreateSuccessful path and does not trigger a gang roll until minutes later.
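For concreteness, each patch in the burst is a replica-only change. A minimal sketch of its shape is below, assuming the Planner bumps the per-service replica count on the DGD spec; the field paths are illustrative and are not copied from the failing manifests.

```
# Illustrative only: the shape of a single Planner scale-up patch against
# the DGD. Field paths are assumed, not taken from the actual cluster objects.
spec:
  services:
    VllmPrefillWorker:
      replicas: 2   # bumped from 1; nothing else in the spec changes
```

The Dynamo operator folds each such patch into the PodCliqueSet spec, so Grove sees several spec updates in quick succession that differ only in a single replica count.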
Observed pod lifecycle during the burst (run 2, ~2 s window)
```
UpdatePodCliqueSet fires →
CREATE prefill-8gbqv (scale-up attempt #1)
CREATE decode-wprtt (Grove roll)
CREATE planner-bxn7v (Grove roll)
CREATE frontend-58rql (Grove roll)
CREATE prefill-5ckwr (scale-up attempt #2)
DELETE planner-r9mcf (Gen 0 gone)
DELETE prefill-8stjg
DELETE frontend-cbflf
DELETE decode-fdqgs
DELETE prefill-5ckwr ← killed ~instantly
CREATE prefill-zkxzw (scale-up attempt #3)
... ~100 s later ...
DELETE prefill-zkxzw ← killed again
Final: 1 prefill, 1 decode, 1 planner, 1 frontend
```
Two independent runs reproduced the same failure mode (88% and 96% client-request failure during the roll window).
What did you expect to happen?
A series of replica-only spec updates on a PodCliqueSet should be reconciled as in-place replica changes (scale up/down only the affected PodClique), not as a full gang rollout. New replicas requested by an in-flight scale-up patch should not be created and then immediately deleted by a competing reconcile pass. updatedReplicas should not be left greater than replicas after reconciliation settles.
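Concretely, once the patch burst settles, the affected PodClique status would be expected to converge so that all three counters agree, along the lines of the sketch below (values illustrative; the final count depends on the last patch the Planner applied):

```
# Illustrative expected end state: updatedReplicas matches replicas once
# reconciliation settles, instead of being left higher indefinitely.
services:
  VllmPrefillWorker:
    readyReplicas: 2
    replicas: 2
    updatedReplicas: 2
```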
Environment
- Kubernetes version: not captured in the report (Dynamo rc5 test cluster)
- Grove version: shipped with nvcr.io/nvstaging/ai-dynamo/dynamo-operator 1.1.0rc5; please advise which Grove SHA this corresponds to
- Scheduler: Grove (grove-operator, PodCliqueSet) + run:ai PodGroup sidecar
- Workload: Dynamo DynamoGraphDeployment (vLLM disagg, Qwen3-8B), Planner driving replica patches via the Dynamo operator
- Reproduces consistently on optimization_target=latency; does not reproduce on optimization_target=throughput (same cluster, same fixture, only that field differs)
Original NVIDIA bug report: https://nvbugspro.nvidia.com/bug/6109874 (internal). Full attachments — event timelines, planner logs, DGD snapshot — are linked from there; happy to mirror excerpts into this issue on request.