
fix(kubernetes): harden Deployment + Namespace reconcile + add lifecycle convergence tests#220

Closed
sam-goodwin wants to merge 1 commit into main from claude/harden-k8s-deployment

Conversation

@sam-goodwin
Contributor

Replace the single KubernetesApiError envelope with status-specific tagged errors so retries are scoped instead of blanket-catching auth/validation failures, wait for Deployment rollouts to converge, and wait for Namespace finalizers before recreating downstream objects.

Reconciler changes

// status-specific tagged errors — callers can catchTag precisely
KubernetesNotFound       // 404
KubernetesConflict       // 409  (resourceVersion race / namespace terminating)
KubernetesThrottled      // 429
KubernetesNetworkError   // transport (TLS, ECONNRESET, DNS)
KubernetesDeploymentNotReady
KubernetesDeleteNotComplete
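
As a rough illustration of how the status-specific tags above might be derived from a response, here is a minimal sketch. Only the tag names and status codes come from the PR description; the `classifyStatus` and `isRetryableOnApply` helpers, the field shapes, and the `KubernetesOtherError` fallback are hypothetical. The PR mentions catchTag; this sketch uses a plain discriminated union so it runs without extra dependencies.

```typescript
// Hypothetical shapes: only the tag names and status codes come from the PR text.
type KubernetesError =
  | { _tag: "KubernetesNotFound"; status: number }
  | { _tag: "KubernetesConflict"; status: number }
  | { _tag: "KubernetesThrottled"; status: number }
  | { _tag: "KubernetesNetworkError"; cause: unknown }
  | { _tag: "KubernetesOtherError"; status: number }; // assumed fallback, not from the PR

// Map an HTTP status code to a tagged error. Auth and validation failures
// (401, 403, 422, ...) land in the fallback so they are never retried by accident.
function classifyStatus(status: number): KubernetesError {
  switch (status) {
    case 404:
      return { _tag: "KubernetesNotFound", status };
    case 409:
      return { _tag: "KubernetesConflict", status };
    case 429:
      return { _tag: "KubernetesThrottled", status };
    default:
      return { _tag: "KubernetesOtherError", status };
  }
}

// Tags that an apply call could reasonably retry, per the description:
// throttling, transport failures, and 409 conflicts.
function isRetryableOnApply(err: KubernetesError): boolean {
  return (
    err._tag === "KubernetesThrottled" ||
    err._tag === "KubernetesNetworkError" ||
    err._tag === "KubernetesConflict"
  );
}
```

With this shape, a 403 is classified as the fallback tag and `isRetryableOnApply` returns false for it, which is the scoping the description is after.
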
  • 429 + transport errors retry with bounded exponential backoff (8 retries).
  • 409 on apply retries with bounded backoff (10 retries); this covers both resourceVersion races and "namespace is being terminated" rejections.
  • After applying a Deployment, poll until status.observedGeneration === metadata.generation and readyReplicas/updatedReplicas reach spec.replicas (5 minute cap). Surfaces KubernetesDeploymentNotReady instead of returning before pods come up.
  • After deleting a Namespace, poll GET until 404 so a subsequent apply doesn't race the finalizer. Surfaces KubernetesDeleteNotComplete if the namespace stays stuck.
  • DELETE is idempotent on 404 (already gone) and bounded-retry on 409.
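
The retry behaviour in the list above could be sketched roughly as follows. The retry count of 8 comes from the description; the helper names (`retryTagged`, `backoffDelayMs`), the 100 ms base delay, and the 5 s cap are assumptions for illustration and not the PR's actual values.

```typescript
// Tags retried here, per the description: throttling and transport errors.
const RETRYABLE_TAGS = new Set(["KubernetesThrottled", "KubernetesNetworkError"]);

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
// The base and cap are assumed values.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 5_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `op`, retrying only errors whose `_tag` is retryable, at most `maxRetries`
// times. Any other error (auth, validation, ...) is rethrown immediately.
async function retryTagged<A>(
  op: () => Promise<A>,
  maxRetries = 8,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<A> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const tag = (err as { _tag?: string } | null)?._tag;
      const retryable = tag !== undefined && RETRYABLE_TAGS.has(tag);
      if (!retryable || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The `sleep` parameter is injectable so the loop can be exercised in unit tests without real delays. The 409 path on apply would follow the same pattern with its own retry budget.
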

New lifecycle tests

packages/alchemy/test/Kubernetes/client.test.ts — pure unit coverage:

  • buildKubernetesObjectPath — core vs apis group, cluster-scoped vs namespaced, missing-namespace throws
  • kubernetesObjectKey / toKubernetesObjectRef — identity encoding, _cluster sentinel
  • chunkByApplyRank / sortRefsForDelete — Namespace before Deployment on apply, reverse on delete
  • isDeploymentReady — covers observedGeneration lag, readyReplicas lag, updatedReplicas lag (rolling), fresh-status, and spec.replicas default
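
For orientation, the readiness predicate those cases exercise might look something like the sketch below. It follows the rule stated in the description (observedGeneration equal to metadata.generation, with readyReplicas and updatedReplicas reaching spec.replicas). The `DeploymentLike` type and the `isDeploymentReadySketch` name are hypothetical, and the actual isDeploymentReady in the PR may differ in its edge cases.

```typescript
// Hypothetical minimal shape; a real Deployment object carries many more fields.
interface DeploymentLike {
  metadata?: { generation?: number };
  spec?: { replicas?: number };
  status?: {
    observedGeneration?: number;
    readyReplicas?: number;
    updatedReplicas?: number;
  };
}

function isDeploymentReadySketch(d: DeploymentLike): boolean {
  // Kubernetes defaults spec.replicas to 1 when it is omitted.
  const desired = d.spec?.replicas ?? 1;
  const status = d.status;
  // Fresh object: the controller has not written any status yet.
  if (status === undefined) return false;
  // The controller has not yet observed the latest spec.
  const observed = status.observedGeneration;
  if (observed === undefined || observed !== d.metadata?.generation) return false;
  // During a rolling update, updatedReplicas can lag while old pods are still ready.
  const ready = status.readyReplicas ?? 0;
  const updated = status.updatedReplicas ?? 0;
  return ready >= desired && updated >= desired;
}
```

A reconciler would call a predicate like this inside its polling loop and surface KubernetesDeploymentNotReady once the time cap is exceeded.
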

Live-cluster lifecycle scenarios (redeploy no-op, out-of-band (OOB) drift recovery for replicas/image/env, OOB-delete recovery, rename-triggers-replace, double-destroy idempotency for both Deployment and Namespace) are stubbed as describe.skip blocks. This repo has no kind/minikube fixture yet, so they will be wired up once an EKS test cluster lands.

…cle convergence tests

Replace the single KubernetesApiError envelope with status-specific tagged
errors so retries are scoped instead of blanket-catching auth/validation
failures, wait for Deployment rollouts to converge, and wait for Namespace
finalizers before recreating downstream objects.

- Tagged errors: KubernetesNotFound (404), KubernetesConflict (409),
  KubernetesThrottled (429), KubernetesNetworkError (transport),
  KubernetesDeploymentNotReady, KubernetesDeleteNotComplete
- 429 + transport errors retry with bounded exponential backoff
- 409 on apply retries with bounded backoff (resourceVersion races and
  "namespace is being terminated" both resolve here without looping forever)
- After applying a Deployment, poll until status.observedGeneration catches
  metadata.generation and ready/updatedReplicas reach spec.replicas
- After deleting a Namespace, poll until GET 404s so a subsequent apply
  doesn't race the finalizer
- Add unit tests for path building, key/sort/chunk helpers, and
  isDeploymentReady covering observedGeneration lag, ready/updated lag,
  fresh-status, and replicas defaulting
- Stub describe.skip suites for live-cluster lifecycle (redeploy no-op,
  OOB drift recovery, OOB delete recovery, rename-replace, double-destroy
  idempotency) — wire up once an EKS test fixture lands

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alchemy-version-bot
Contributor

alchemy-version-bot Bot commented May 5, 2026

Website Preview Deployed

URL: https://alchemyeffectwebsite-worker-pr-220-2pq6zd3sikqxayhw.testing-2b2.workers.dev

Built from commit b3fe70f.


This comment updates automatically with each push.

@sam-goodwin
Contributor Author

Superseded by #249 (consolidated hardening sweep). Closing — the equivalent commit landed on claude/harden-all.

@sam-goodwin sam-goodwin closed this May 6, 2026
