
Rapid PodCliqueSet spec updates trigger full gang roll instead of in-place replica scale #565

@gflarity

Description

What happened?

On a Dynamo DynamoGraphDeployment running with optimization_target=latency, the Dynamo Planner emits multiple back-to-back replica scale-up patches against the DGD (typical: 3–6 patches over 20–30 s while load is sustained). Each patch is translated by the Dynamo operator into a PodCliqueSet spec update. Grove appears to treat this rapid sequence of spec updates as a full rollout rather than an in-place replica bump, which has these consequences:

  1. Every pod in the gang (Frontend, Planner, PrefillWorker, DecodeWorker) is killed and recreated inside a ~5 s window.

  2. The race between the operator applying replicas=N and Grove applying a new spec hash produces intermediate prefill pods that are created and then re-deleted mid-reconcile (e.g. prefill-5ckwr, prefill-zkxzw in the run below).

  3. The final replicas value on the impacted PodClique is 1, while updatedReplicas reaches 2 and is never reset — the DGD is left in Ready=False / state: pending indefinitely:

```
conditions:
- message: 'Resources not ready: tc-2-2-3-late-da27: podclique/tc-2-2-3-late-da27-0-vllmprefillworker:
    desired=1, updated=2'
  reason: some_resources_are_not_ready
  status: "False"
  type: Ready
services:
  VllmPrefillWorker:
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 2
```

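For context, the burst can be approximated with a handful of replica-only JSON patches applied a few seconds apart. The sketch below is a hypothetical reproduction, not the Dynamo operator's actual code: the GroupVersionResource, namespace, object name, and the JSON-patch field path are all assumptions about the installed CRD and would need adjusting to the real schema.

```go
// Hypothetical reproduction of the Planner burst: several replica-only
// patches against one PodCliqueSet in quick succession. Everything named
// here (GVR, namespace, object name, field path) is an assumption.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Assumed GVR for Grove's PodCliqueSet; check the CRD on your cluster.
	pcs := schema.GroupVersionResource{
		Group:    "grove.io",
		Version:  "v1alpha1",
		Resource: "podcliquesets",
	}

	// Three back-to-back replica bumps a few seconds apart, mimicking what
	// the latency Planner produces over a sustained-load window.
	for replicas := 2; replicas <= 4; replicas++ {
		// The JSON-patch path is illustrative; the real index/path of the
		// prefill clique's replicas field depends on the CRD layout.
		patch := fmt.Sprintf(
			`[{"op": "replace", "path": "/spec/template/cliques/2/spec/replicas", "value": %d}]`,
			replicas)
		if _, err := client.Resource(pcs).Namespace("default").Patch(
			context.TODO(), "tc-2-2-3-late-da27", types.JSONPatchType,
			[]byte(patch), metav1.PatchOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(5 * time.Second)
	}
}
```

Each individual patch in this sequence is a no-op on the pod template; only the replica count changes.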
The same workload with optimization_target=throughput (which produces ~1 patch per scale-up cycle instead of a burst) takes the in-place PodCreateSuccessful path and does not trigger a gang roll until minutes later.

Observed pod lifecycle during the burst (run 2, ~2 s window)

```
UpdatePodCliqueSet fires →
CREATE prefill-8gbqv (scale-up attempt #1)
CREATE decode-wprtt (Grove roll)
CREATE planner-bxn7v (Grove roll)
CREATE frontend-58rql (Grove roll)
CREATE prefill-5ckwr (scale-up attempt #2)
DELETE planner-r9mcf (Gen 0 gone)
DELETE prefill-8stjg
DELETE frontend-cbflf
DELETE decode-fdqgs
DELETE prefill-5ckwr ← killed ~instantly
CREATE prefill-zkxzw (scale-up attempt #3)
... ~100 s later ...
DELETE prefill-zkxzw ← killed again
Final: 1 prefill, 1 decode, 1 planner, 1 frontend
```

Two independent runs reproduced the same failure mode (88% and 96% client-request failure during the roll window).

What did you expect to happen?

A series of replica-only spec updates on a PodCliqueSet should be reconciled as in-place replica changes (scale up/down only the affected PodClique), not as a full gang rollout. New replicas requested by an in-flight scale-up patch should not be created and then immediately deleted by a competing reconcile pass. updatedReplicas should not be left greater than replicas after reconciliation settles.
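One way to express the expected behavior: the rollout decision should key off a spec hash computed with replica counts masked out, so a replica-only diff can never flip the hash. The sketch below is a minimal illustration of that idea under an invented CliqueSpec type; it is not Grove's actual hashing or reconciler code.

```go
// A minimal sketch (not Grove's implementation) of classifying a spec update
// as a replica-only scale versus a rollout: hash the spec with every
// replicas field zeroed, and only gang-roll when that hash changes.
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// CliqueSpec is a stand-in for a PodClique template; field names are
// assumptions for illustration only.
type CliqueSpec struct {
	Name     string `json:"name"`
	Replicas int32  `json:"replicas"`
	Image    string `json:"image"`
}

// specHashIgnoringReplicas hashes the cliques with Replicas normalized to 0,
// so two specs that differ only in replica counts hash identically.
func specHashIgnoringReplicas(cliques []CliqueSpec) string {
	normalized := make([]CliqueSpec, len(cliques))
	copy(normalized, cliques)
	for i := range normalized {
		normalized[i].Replicas = 0
	}
	raw, _ := json.Marshal(normalized)
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

func main() {
	old := []CliqueSpec{{"prefill", 1, "vllm:rc5"}, {"decode", 1, "vllm:rc5"}}
	scaled := []CliqueSpec{{"prefill", 2, "vllm:rc5"}, {"decode", 1, "vllm:rc5"}}

	if specHashIgnoringReplicas(old) == specHashIgnoringReplicas(scaled) {
		fmt.Println("replica-only change: scale in place, no gang roll")
	} else {
		fmt.Println("template changed: full rollout required")
	}
}
```

Under that comparison, the burst above would collapse into a single in-place scale of the prefill PodClique, and updatedReplicas would never diverge from replicas.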

Environment

  • Kubernetes version: (Dynamo rc5 test cluster — version not captured in report)
  • Grove version: shipped with nvcr.io/nvstaging/ai-dynamo/dynamo-operator 1.1.0rc5 — please advise which Grove SHA this corresponds to
  • Scheduler: Grove (grove-operator, PodCliqueSet) + run:ai PodGroup sidecar
  • Workload: Dynamo DynamoGraphDeployment (vLLM disagg, Qwen3-8B), Planner driving replica patches via the Dynamo operator
  • Reproduces consistently on optimization_target=latency; does not reproduce on optimization_target=throughput (same cluster, same fixture, only that field differs)

Original NVIDIA bug report: https://nvbugspro.nvidia.com/bug/6109874 (internal). Full attachments — event timelines, planner logs, DGD snapshot — are linked from there; happy to mirror excerpts into this issue on request.
