Skip to content

fix(gen2-migration): handle lock rollback#14764

Open
9pace wants to merge 8 commits intogen2-migrationfrom
paceben/14570-include-nested-stacks-adr
Open

fix(gen2-migration): handle lock rollback#14764
9pace wants to merge 8 commits intogen2-migrationfrom
paceben/14570-include-nested-stacks-adr

Conversation

@9pace
Copy link
Copy Markdown

@9pace 9pace commented Apr 8, 2026

Description of changes

Template drift detection creates changesets with IncludeNestedStacks: true, but currently discards any changeset that comes back as FAILED. After gen2-migration refactor moves resources between stacks, nested changesets fail EarlyValidation (ResourceExistenceCheck) or Template format error (duplicate exports from Fn::Join). CFN still populates the Changes array before these validation steps fail so the data is there, we were just throwing it away.

This PR reads changes from those FAILED changesets instead of discarding them. It also handles the race condition where the root changeset fails immediately but some nested changesets are still CREATE_IN_PROGRESS we poll those to terminal status before reading their changes. Stacks that genuinely can't be analyzed are tracked as incompleteStacks so callers know the result is partial.

On the lock side, rollback() now validates drift before reverting DeletionPolicy changes. The lock step intentionally adds DeletionPolicy: Retain to DynamoDB tables, so those changes (and the cascading IAM Policy re-evaluations CFN generates) are expected. The filter distinguishes these from real drift that would indicate the environment was modified outside the migration flow.

See ADR-005 for the full design rationale and 8 empirical findings.

Issue #, if available

#14570

Description of how you validated changes

  • 30 unit tests (lock rollback drift validation, DeletionPolicy filter, cascading IAM, nested tree walk, edge cases)
  • 7 E2E scenarios against a deployed Amplify app (FAILED changeset reading, filter correctness, negative tests with injected drift)
  • TypeScript compiles clean

Checklist

  • PR description included
  • yarn test passes
  • Tests are changed or added
  • Relevant documentation is changed or added (and PR referenced)
  • New AWS SDK calls or CloudFormation actions have been added to relevant test and service IAM policies
  • Pull request labels are added

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@9pace 9pace marked this pull request as ready for review April 8, 2026 20:48
@9pace 9pace requested a review from a team as a code owner April 8, 2026 20:48
@9pace 9pace requested a review from sai-ray April 8, 2026 20:49
9pace added 3 commits April 9, 2026 17:59
…et links

The changeset console URL was using an incorrect path-based format that
doesn't resolve in the CloudFormation console. Switch to the standard
query-parameter format: #/stacks/changesets/changes?stackId=&changeSetId=

Also thread the stack ARN through flattenChanges so changeset URLs
include the stackId parameter needed for proper navigation.
…estedStacks

Template drift detection now reads changes from EarlyValidation-failed
nested changesets instead of discarding them. This enables drift detection
on post-migration stacks where resources have been moved between stacks,
causing CFN EarlyValidation::ResourceExistenceCheck failures.

Changes:
- Root changeset FAILED status falls through to analyzeChangeSet (except
  "didn't contain changes" which remains the no-drift happy path)
- Nested changesets classify themselves: EarlyValidation failures have
  their Changes read; other failures are treated as genuine errors
- Partial results returned with incompleteStacks tracking instead of
  all-or-nothing discard
- Poll nested changesets to terminal status before describing (fixes
  race condition where root fails before all nested changesets complete)
- Lock rollback validates drift with DeletionPolicy filter: recursive
  hasRealDrift walks the change tree, filtering out expected lock drift
  (Scope: ['DeletionPolicy']). Blocks rollback if real drift or
  incomplete stacks exist.

ADR-005 documents the decision, empirical findings, and risks.
…rors, and cascading IAM drift

Three fixes discovered during E2E lock + rollback testing:

1. Root changeset fall-through: Root changeset FAILED status references nested
   failures without containing "EarlyValidation". Add isRoot parameter so root
   always falls through to classify nested changesets individually.

2. Template format error as recoverable: IncludeNestedStacks fails on Amplify
   AppSync API stacks where model sub-stacks export names via Fn::Join. CFN
   reports "duplicate Export names". Changes are still populated — treat as
   recoverable like EarlyValidation.

3. Cascading IAM Policy changes: When DeletionPolicy changes on a DynamoDB
   table, CFN flags IAM policies referencing TableName.Arn as Dynamic
   ResourceAttribute re-evaluations. These are harmless and must be filtered
   alongside direct DeletionPolicy drift in lock rollback validation.
@9pace 9pace force-pushed the paceben/14570-include-nested-stacks-adr branch from 6729bf0 to aa2522e Compare April 9, 2026 19:05
9pace added 4 commits April 9, 2026 19:11
…m CR review

Safety-critical (false negatives in rollback gate):
1. CloudFormation::Stack changes without ChangeSetId tracked as incomplete
2. Recoverable failures with empty Changes return skipped:true
3. DELETE_COMPLETE/DELETE_FAILED nested changesets tracked as incomplete

Robustness:
4. extractStackNameFromArn in catch block wrapped in try-catch
5. validateLockRollbackDrift catch block properly typed
6. DELETE_FAILED added to terminal statuses for polling
…drift

hasRealDrift silently skipped CloudFormation::Stack entries without
nestedChanges regardless of Action. If a nested stack was added or
removed outside Amplify, this produced a false negative allowing
rollback when it should be blocked.
…E2E findings

CR follow-up:

- isExpectedLockDrift now verifies CausingEntity matches *Table.Arn or
  *Table.StreamArn (prevents false positives from non-DDB dynamic re-evals)
- IAM Policy path requires Scope exactly ['Properties']
- ADR-005 updated: empirical findings #7 (Template format error) and #8
  (cascading IAM), isRecoverableFailure pseudocode, recursive hasRealDrift,
  R3/R4 risk assessments expanded, console URL fix noted
Remove DELETE_COMPLETE/DELETE_FAILED handling and Stack-without-ChangeSetId
guard. Both defended against scenarios that can't happen in practice
(nested changesets don't get deleted during analysis, and
IncludeNestedStacks always creates changeset IDs).
@9pace 9pace force-pushed the paceben/14570-include-nested-stacks-adr branch from 6d6f1e9 to a3c8a0a Compare April 9, 2026 19:44
};
}
print.warn(`Nested changeset FAILED (recoverable): ${changeSet.StatusReason}`);
} else if (isRoot) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the root changeset fail for a root-level reason rather than a nested changeset failure? If so, the else if (isRoot) branch would fall through, find zero changes, and return { changes: [], skipped: false }, which would report "no drift" when we actually couldn't check. Should we verify the StatusReason mentions a nested changeset before falling through, and treat other failures as skipped: true?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes i assume this is possible and should be accounted for.

);
const terminalStatuses = ['CREATE_COMPLETE', 'FAILED'];
const NESTED_POLL_MAX_ATTEMPTS = 30;
const NESTED_POLL_INTERVAL_MS = 2000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, how long did this polling take in your testing? Just wondering if the sequential approach adds noticeable wait time with multiple nested stacks.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a couple of seconds max (from what I can remember, i didn't really measure it). But I don't think that it's necessarily sequential, includeNestedStacks will try to create multiple changesets at once (I assume its like BFS) and poll for completion just exists because I started to see that sometimes if i instantly checked for the nested stack changeset completion, they would be in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants