
Rapid PodCliqueSet spec updates trigger full gang roll instead of in-place replica scale #565

@gflarity

Description

What happened?

On a Dynamo DynamoGraphDeployment running with optimization_target=latency, the Dynamo Planner emits multiple back-to-back replica scale-up patches against the DGD (typical: 3–6 patches over 20–30 s while load is sustained). Each patch is translated by the Dynamo operator into a PodCliqueSet spec update. Grove appears to treat this rapid sequence of spec updates as a full rollout rather than an in-place replica bump, which has these consequences:

  1. Every pod in the gang (Frontend, Planner, PrefillWorker, DecodeWorker) is killed and recreated inside a ~5 s window.

  2. The race between the operator applying replicas=N and Grove applying a new spec hash produces intermediate prefill pods that are created and then re-deleted mid-reconcile (e.g. prefill-5ckwr, prefill-zkxzw in the run below).

  3. The final replicas value on the impacted PodClique is 1, while updatedReplicas reaches 2 and is never reset — the DGD is left in Ready=False / state: pending indefinitely:

```
conditions:
- message: 'Resources not ready: tc-2-2-3-late-da27: podclique/tc-2-2-3-late-da27-0-vllmprefillworker:
    desired=1, updated=2'
  reason: some_resources_are_not_ready
  status: "False"
  type: Ready
services:
  VllmPrefillWorker:
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 2
```

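For context, the burst can be approximated with a handful of replica-only JSON patches applied a few seconds apart. The sketch below is a hypothetical reproduction, not the Dynamo operator's actual code: the GroupVersionResource, namespace, object name, and the JSON-patch field path are all assumptions about the installed CRD and would need adjusting to the real schema.

```go
// Hypothetical reproduction of the Planner burst: several replica-only
// patches against one PodCliqueSet in quick succession. Everything named
// here (GVR, namespace, object name, field path) is an assumption.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Assumed GVR for Grove's PodCliqueSet; check the CRD on your cluster.
	pcs := schema.GroupVersionResource{
		Group:    "grove.io",
		Version:  "v1alpha1",
		Resource: "podcliquesets",
	}

	// Three back-to-back replica bumps a few seconds apart, mimicking what
	// the latency Planner produces over a sustained-load window.
	for replicas := 2; replicas <= 4; replicas++ {
		// The JSON-patch path is illustrative; the real index/path of the
		// prefill clique's replicas field depends on the CRD layout.
		patch := fmt.Sprintf(
			`[{"op": "replace", "path": "/spec/template/cliques/2/spec/replicas", "value": %d}]`,
			replicas)
		if _, err := client.Resource(pcs).Namespace("default").Patch(
			context.TODO(), "tc-2-2-3-late-da27", types.JSONPatchType,
			[]byte(patch), metav1.PatchOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(5 * time.Second)
	}
}
```

Each individual patch in this sequence is a no-op on the pod template; only the replica count changes.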
The same workload with optimization_target=throughput (which produces ~1 patch per scale-up cycle instead of a burst) takes the in-place PodCreateSuccessful path and does not trigger a gang roll until minutes later.

Observed pod lifecycle during the burst (run 2, ~2 s window)

```
UpdatePodCliqueSet fires →
CREATE prefill-8gbqv (scale-up attempt #1)
CREATE decode-wprtt (Grove roll)
CREATE planner-bxn7v (Grove roll)
CREATE frontend-58rql (Grove roll)
CREATE prefill-5ckwr (scale-up attempt #2)
DELETE planner-r9mcf (Gen 0 gone)
DELETE prefill-8stjg
DELETE frontend-cbflf
DELETE decode-fdqgs
DELETE prefill-5ckwr ← killed ~instantly
CREATE prefill-zkxzw (scale-up attempt #3)
... ~100 s later ...
DELETE prefill-zkxzw ← killed again
Final: 1 prefill, 1 decode, 1 planner, 1 frontend
```

Two independent runs reproduced the same failure mode (88% and 96% client-request failure during the roll window).

What did you expect to happen?

A series of replica-only spec updates on a PodCliqueSet should be reconciled as in-place replica changes (scale up/down only the affected PodClique), not as a full gang rollout. New replicas requested by an in-flight scale-up patch should not be created and then immediately deleted by a competing reconcile pass. updatedReplicas should not be left greater than replicas after reconciliation settles.
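One way to express the expected behavior: the rollout decision should key off a spec hash computed with replica counts masked out, so a replica-only diff can never flip the hash. The sketch below is a minimal illustration of that idea under an invented CliqueSpec type; it is not Grove's actual hashing or reconciler code.

```go
// A minimal sketch (not Grove's implementation) of classifying a spec update
// as a replica-only scale versus a rollout: hash the spec with every
// replicas field zeroed, and only gang-roll when that hash changes.
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// CliqueSpec is a stand-in for a PodClique template; field names are
// assumptions for illustration only.
type CliqueSpec struct {
	Name     string `json:"name"`
	Replicas int32  `json:"replicas"`
	Image    string `json:"image"`
}

// specHashIgnoringReplicas hashes the cliques with Replicas normalized to 0,
// so two specs that differ only in replica counts hash identically.
func specHashIgnoringReplicas(cliques []CliqueSpec) string {
	normalized := make([]CliqueSpec, len(cliques))
	copy(normalized, cliques)
	for i := range normalized {
		normalized[i].Replicas = 0
	}
	raw, _ := json.Marshal(normalized)
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

func main() {
	old := []CliqueSpec{{"prefill", 1, "vllm:rc5"}, {"decode", 1, "vllm:rc5"}}
	scaled := []CliqueSpec{{"prefill", 2, "vllm:rc5"}, {"decode", 1, "vllm:rc5"}}

	if specHashIgnoringReplicas(old) == specHashIgnoringReplicas(scaled) {
		fmt.Println("replica-only change: scale in place, no gang roll")
	} else {
		fmt.Println("template changed: full rollout required")
	}
}
```

Under that comparison, the burst above would collapse into a single in-place scale of the prefill PodClique, and updatedReplicas would never diverge from replicas.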

Environment

  • Kubernetes version: (Dynamo rc5 test cluster — version not captured in report)
  • Grove version: shipped with nvcr.io/nvstaging/ai-dynamo/dynamo-operator 1.1.0rc5 — please advise which Grove SHA this corresponds to
  • Scheduler: Grove (grove-operator, PodCliqueSet) + run:ai PodGroup sidecar
  • Workload: Dynamo DynamoGraphDeployment (vLLM disagg, Qwen3-8B), Planner driving replica patches via the Dynamo operator
  • Reproduces consistently on optimization_target=latency; does not reproduce on optimization_target=throughput (same cluster, same fixture, only that field differs)

Original NVIDIA bug report: https://nvbugspro.nvidia.com/bug/6109874 (internal). Full attachments — event timelines, planner logs, DGD snapshot — are linked from there; happy to mirror excerpts into this issue on request.
