fix(gen2-migration): handle lock rollback#14764
fix(gen2-migration): handle lock rollback#147649pace wants to merge 8 commits intogen2-migrationfrom
Conversation
…et links The changeset console URL was using an incorrect path-based format that doesn't resolve in the CloudFormation console. Switch to the standard query-parameter format: #/stacks/changesets/changes?stackId=&changeSetId= Also thread the stack ARN through flattenChanges so changeset URLs include the stackId parameter needed for proper navigation.
…estedStacks Template drift detection now reads changes from EarlyValidation-failed nested changesets instead of discarding them. This enables drift detection on post-migration stacks where resources have been moved between stacks, causing CFN EarlyValidation::ResourceExistenceCheck failures. Changes: - Root changeset FAILED status falls through to analyzeChangeSet (except "didn't contain changes" which remains the no-drift happy path) - Nested changesets classify themselves: EarlyValidation failures have their Changes read; other failures are treated as genuine errors - Partial results returned with incompleteStacks tracking instead of all-or-nothing discard - Poll nested changesets to terminal status before describing (fixes race condition where root fails before all nested changesets complete) - Lock rollback validates drift with DeletionPolicy filter: recursive hasRealDrift walks the change tree, filtering out expected lock drift (Scope: ['DeletionPolicy']). Blocks rollback if real drift or incomplete stacks exist. ADR-005 documents the decision, empirical findings, and risks.
…rors, and cascading IAM drift Three fixes discovered during E2E lock + rollback testing: 1. Root changeset fall-through: Root changeset FAILED status references nested failures without containing "EarlyValidation". Add isRoot parameter so root always falls through to classify nested changesets individually. 2. Template format error as recoverable: IncludeNestedStacks fails on Amplify AppSync API stacks where model sub-stacks export names via Fn::Join. CFN reports "duplicate Export names". Changes are still populated — treat as recoverable like EarlyValidation. 3. Cascading IAM Policy changes: When DeletionPolicy changes on a DynamoDB table, CFN flags IAM policies referencing TableName.Arn as Dynamic ResourceAttribute re-evaluations. These are harmless and must be filtered alongside direct DeletionPolicy drift in lock rollback validation.
6729bf0 to
aa2522e
Compare
…m CR review Safety-critical (false negatives in rollback gate): 1. CloudFormation::Stack changes without ChangeSetId tracked as incomplete 2. Recoverable failures with empty Changes return skipped:true 3. DELETE_COMPLETE/DELETE_FAILED nested changesets tracked as incomplete Robustness: 4. extractStackNameFromArn in catch block wrapped in try-catch 5. validateLockRollbackDrift catch block properly typed 6. DELETE_FAILED added to terminal statuses for polling
…drift hasRealDrift silently skipped CloudFormation::Stack entries without nestedChanges regardless of Action. If a nested stack was added or removed outside Amplify, this produced a false negative allowing rollback when it should be blocked.
…E2E findings CR follow-up: - isExpectedLockDrift now verifies CausingEntity matches *Table.Arn or *Table.StreamArn (prevents false positives from non-DDB dynamic re-evals) - IAM Policy path requires Scope exactly ['Properties'] - ADR-005 updated: empirical findings #7 (Template format error) and #8 (cascading IAM), isRecoverableFailure pseudocode, recursive hasRealDrift, R3/R4 risk assessments expanded, console URL fix noted
Remove DELETE_COMPLETE/DELETE_FAILED handling and Stack-without-ChangeSetId guard. Both defended against scenarios that can't happen in practice (nested changesets don't get deleted during analysis, and IncludeNestedStacks always creates changeset IDs).
6d6f1e9 to
a3c8a0a
Compare
| }; | ||
| } | ||
| print.warn(`Nested changeset FAILED (recoverable): ${changeSet.StatusReason}`); | ||
| } else if (isRoot) { |
There was a problem hiding this comment.
Can the root changeset fail for a root-level reason rather than a nested changeset failure? If so, the else if (isRoot) branch would fall through, find zero changes, and return { changes: [], skipped: false }, which would report "no drift" when we actually couldn't check. Should we verify the StatusReason mentions a nested changeset before falling through, and treat other failures as skipped: true?
There was a problem hiding this comment.
Yes i assume this is possible and should be accounted for.
| ); | ||
| const terminalStatuses = ['CREATE_COMPLETE', 'FAILED']; | ||
| const NESTED_POLL_MAX_ATTEMPTS = 30; | ||
| const NESTED_POLL_INTERVAL_MS = 2000; |
There was a problem hiding this comment.
Curious, how long did this polling take in your testing? Just wondering if the sequential approach adds noticeable wait time with multiple nested stacks.
There was a problem hiding this comment.
Maybe a couple of seconds max (from what I can remember, i didn't really measure it). But I don't think that it's necessarily sequential, includeNestedStacks will try to create multiple changesets at once (I assume its like BFS) and poll for completion just exists because I started to see that sometimes if i instantly checked for the nested stack changeset completion, they would be in progress.
Description of changes
Template drift detection creates changesets with
IncludeNestedStacks: true, but currently discards any changeset that comes back as FAILED. After gen2-migration refactor moves resources between stacks, nested changesets fail EarlyValidation (ResourceExistenceCheck) or Template format error (duplicate exports fromFn::Join). CFN still populates the Changes array before these validation steps fail so the data is there, we were just throwing it away.This PR reads changes from those FAILED changesets instead of discarding them. It also handles the race condition where the root changeset fails immediately but some nested changesets are still
CREATE_IN_PROGRESSwe poll those to terminal status before reading their changes. Stacks that genuinely can't be analyzed are tracked asincompleteStacksso callers know the result is partial.On the lock side,
rollback()now validates drift before reverting DeletionPolicy changes. The lock step intentionally addsDeletionPolicy: Retainto DynamoDB tables, so those changes (and the cascading IAM Policy re-evaluations CFN generates) are expected. The filter distinguishes these from real drift that would indicate the environment was modified outside the migration flow.See ADR-005 for the full design rationale and 8 empirical findings.
Issue #, if available
#14570
Description of how you validated changes
Checklist
yarn testpassesBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.