fix(aws/autoscaling): harden AutoScalingGroup reconcile + add lifecycle convergence tests #227

Open
sam-goodwin wants to merge 1 commit into main from claude/harden-autoscaling-group

@sam-goodwin
Contributor

Hardens the AutoScalingGroup resource as part of the per-resource hardening sweep on top of the reconcile migration.

Reconciler changes

- Effect.catch((error: any) =>
-   error?._tag === "AlreadyExistsFault"
-     ? Effect.void
-     : Effect.fail(error),
- ),
+ Effect.catchTag("AlreadyExistsFault", () => Effect.void),

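The blanket Effect.catch intercepted every error and relied on an untyped error?._tag check to re-fail everything else. catchTag now handles AlreadyExistsFault only (the race / restart-after-crash recovery case), so auth, throttling, and validation errors propagate untouched.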
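To illustrate the intent of tag-scoped recovery without depending on Effect, here is a minimal, dependency-free sketch. The `Result` type, the `AccessDenied` tag, and `ignoreAlreadyExists` are hypothetical names used only for illustration; they are not part of the PR or of Effect.

```typescript
// Hypothetical error shapes; only the _tag discriminant mirrors the PR.
type AlreadyExistsFault = { readonly _tag: "AlreadyExistsFault" };
type AccessDenied = { readonly _tag: "AccessDenied" }; // illustrative stand-in for an auth error
type CreateError = AlreadyExistsFault | AccessDenied;

type Result<A, E> =
  | { readonly ok: true; readonly value: A }
  | { readonly ok: false; readonly error: E };

// Recover from exactly one tag. The return type no longer contains
// AlreadyExistsFault, so callers can see that every other error survives.
function ignoreAlreadyExists(
  result: Result<void, CreateError>,
): Result<void, AccessDenied> {
  if (result.ok) return result;
  if (result.error._tag === "AlreadyExistsFault") {
    return { ok: true, value: undefined };
  }
  return { ok: false, error: result.error };
}
```

Because the handler names the tag it recovers from, the compiler narrows the remaining error type instead of falling back to `any`.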

  existing = yield* describeGroup(autoScalingGroupName).pipe(
-   Effect.filterOrFail(Boolean, () => new Error(...)),
+   Effect.flatMap((group) =>
+     group ? Effect.succeed(group)
+           : Effect.fail(new AutoScalingGroupNotReadyAfterCreate()),
+   ),
    Effect.retry({
-     while: () => true,
+     while: (e) => e._tag === "AutoScalingGroupNotReadyAfterCreate",
      schedule: Schedule.recurs(8).pipe(...),
    }),
  );

The post-create read previously retried on any error, including auth and validation failures. It now retries only the actual "still missing in describe" case.
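The effect of narrowing the retry predicate can be sketched in plain TypeScript. This is an assumption-laden illustration, not the reconciler's code: `retryWhile`, the synchronous `describeGroup` stub, and the `AccessDenied` class are hypothetical, and the backoff schedule elided in the diff is left out entirely.

```typescript
class AutoScalingGroupNotReadyAfterCreate extends Error {
  readonly _tag = "AutoScalingGroupNotReadyAfterCreate";
}
class AccessDenied extends Error {
  readonly _tag = "AccessDenied"; // hypothetical stand-in for an auth failure
}

// Re-run `attempt` while `shouldRetry` accepts the thrown error,
// giving up after `maxRetries` retries (so at most maxRetries + 1 attempts).
function retryWhile<A>(
  attempt: () => A,
  shouldRetry: (error: unknown) => boolean,
  maxRetries: number,
): A {
  for (let retries = 0; ; retries++) {
    try {
      return attempt();
    } catch (error) {
      if (retries >= maxRetries || !shouldRetry(error)) throw error;
    }
  }
}

// Predicate keyed on the tag, mirroring the `while` clause in the diff.
const isNotReady = (error: unknown): boolean =>
  typeof error === "object" &&
  error !== null &&
  (error as { _tag?: unknown })._tag === "AutoScalingGroupNotReadyAfterCreate";

// Stub: the group becomes visible on the third describe call.
let describeCalls = 0;
const describeGroup = (): { name: string } | undefined =>
  ++describeCalls < 3 ? undefined : { name: "demo-asg" };

const group = retryWhile(
  () => {
    const found = describeGroup();
    if (!found) throw new AutoScalingGroupNotReadyAfterCreate();
    return found;
  },
  isNotReady,
  8,
);
```

With this predicate, an `AccessDenied` thrown by the read fails on the first attempt instead of consuming the whole retry budget.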

+ .pipe(
+   Effect.retry({
+     while: (e) =>
+       e._tag === "ScalingActivityInProgressFault" ||
+       e._tag === "ResourceContentionFault",
+     schedule: Schedule.fixed("2 seconds").pipe(
+       Schedule.both(Schedule.recurs(15)),
+     ),
+   }),
+ );

updateAutoScalingGroup and deleteAutoScalingGroup now retry on transient ScalingActivityInProgressFault and ResourceContentionFault. Delete also retries on ResourceInUseFault. Attach/Detach LB target groups retry on the errors their distilled unions actually carry (ResourceContention, InstanceRefreshInProgress).
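One way to picture the per-operation retry policy is as a small lookup of retryable tags. The sketch below is illustrative only: `RETRYABLE_TAGS`, `isRetryable`, and the constant names are hypothetical, and attach/detach are omitted because their exact tag names are not spelled out above.

```typescript
type Operation = "update" | "delete";

// Tags taken from the description above; the table itself is hypothetical.
const RETRYABLE_TAGS: Record<Operation, readonly string[]> = {
  update: ["ScalingActivityInProgressFault", "ResourceContentionFault"],
  delete: [
    "ScalingActivityInProgressFault",
    "ResourceContentionFault",
    "ResourceInUseFault",
  ],
};

// True when the error's tag is considered transient for this operation.
function isRetryable(op: Operation, error: { readonly _tag: string }): boolean {
  return RETRYABLE_TAGS[op].some((tag) => tag === error._tag);
}

// Retry budget from the diff: 15 recurrences at a fixed 2-second interval,
// i.e. roughly 30 seconds of waiting, not counting the calls themselves.
const RETRY_INTERVAL_MS = 2000;
const MAX_RETRIES = 15;
const maxWaitMs = RETRY_INTERVAL_MS * MAX_RETRIES;
```

Keeping the tags per operation makes it visible that ResourceInUseFault is treated as transient for delete but not for update.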

+ const attrs = toAttributes(group);
+ return (yield* hasAlchemyTags(id, attrs.tags))
+   ? attrs
+   : Unowned(attrs);

read now marks foreign ASGs as Unowned, gating takeover behind adopt(true) instead of silently re-tagging.
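The ownership gate can be sketched as a pure function. Everything here is an assumption made for illustration: the `OWNER_TAG` key, the `classifyOwnership` and `mayManage` helpers, and the `ReadResult` shape are hypothetical and do not reflect how hasAlchemyTags or Unowned are actually implemented.

```typescript
type Tags = Readonly<Record<string, string>>;

interface GroupAttributes {
  readonly autoScalingGroupName: string;
  readonly tags: Tags;
}

type ReadResult =
  | { readonly _tag: "Owned"; readonly attrs: GroupAttributes }
  | { readonly _tag: "Unowned"; readonly attrs: GroupAttributes };

// Hypothetical tag key; the real ownership tags are not shown in this PR.
const OWNER_TAG = "example:logical-id";

// A group counts as owned only if it carries this resource's ownership tag.
function classifyOwnership(id: string, attrs: GroupAttributes): ReadResult {
  return attrs.tags[OWNER_TAG] === id
    ? { _tag: "Owned", attrs }
    : { _tag: "Unowned", attrs };
}

// Foreign groups are managed only when adoption is explicitly requested.
function mayManage(result: ReadResult, adopt: boolean): boolean {
  return result._tag === "Owned" || adopt;
}
```

The point of returning a distinct Unowned value rather than plain attributes is that the caller has to make the adoption decision explicitly.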

New lifecycle tests

The tests live in packages/alchemy/test/AWS/AutoScaling/AutoScalingGroup.test.ts and are skipped unless TEST_LAUNCH_TEMPLATE_ID and TEST_SUBNET_IDS are set, since they require a real EC2 fleet.

  • redeploy with same props is a no-op
  • reconcile resets desiredCapacity mutated out-of-band via setDesiredCapacity
  • changing autoScalingGroupName triggers replace
  • in-place modification of minSize/maxSize/desiredCapacity
  • destroying an already-deleted ASG is a no-op
  • adopt(true) re-tags a foreign ASG
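The skip condition for these tests could be expressed as a small config parser. This is a sketch under stated assumptions: `lifecycleTestConfig` is a hypothetical helper, and treating TEST_SUBNET_IDS as a comma-separated list is a guess, since the format is not specified here.

```typescript
type Env = Readonly<Record<string, string | undefined>>;

interface LifecycleTestConfig {
  readonly launchTemplateId: string;
  readonly subnetIds: readonly string[];
}

// Returns undefined when either variable is missing or empty, which a test
// file could use to decide whether to skip the whole suite.
function lifecycleTestConfig(env: Env): LifecycleTestConfig | undefined {
  const launchTemplateId = (env["TEST_LAUNCH_TEMPLATE_ID"] ?? "").trim();
  // Assumption: subnet IDs are supplied as a comma-separated list.
  const subnetIds = (env["TEST_SUBNET_IDS"] ?? "")
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  if (launchTemplateId.length === 0 || subnetIds.length === 0) {
    return undefined;
  }
  return { launchTemplateId, subnetIds };
}
```

Taking the environment as a parameter keeps the check testable without touching the real process environment.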

Distilled patch

No patch needed for this pass. ScalingActivityInProgressFault is genuinely transient and a candidate for withRetryableError in distilled, but the reconciler-level retry handles it for now without depending on the AWS retry layer. InvalidInstanceLifecycleStateException is context-dependent and stays BadRequestError + ConflictError only.

@alchemy-version-bot
Contributor

Install the packages built from this commit:

alchemy

bun add alchemy@https://pkg.ing/alchemy/0f419cd

@alchemy.run/better-auth

bun add @alchemy.run/better-auth@https://pkg.ing/@alchemy.run/better-auth/0f419cd

@alchemy.run/pr-package

bun add @alchemy.run/pr-package@https://pkg.ing/@alchemy.run/pr-package/0f419cd

@alchemy-version-bot
Contributor

alchemy-version-bot Bot commented May 5, 2026

Website Preview Deployed

URL: https://alchemyeffectwebsite-worker-pr-227-eog3u7er6dtvr2om.testing-2b2.workers.dev

Built from commit 0f419cd.


This comment updates automatically with each push.
