CFN Failure Rollback Behavior - Nestor G Pestelos Jr (ngpestelos)

CloudFormation's failure model has two modes: the default is "roll everything back", and the override is "preserve what worked". Knowing the difference matters because the default sometimes destroys hours of partial progress on a complex stack.

## Default behavior: full rollback

When any resource in a stack operation fails:

- During **create**: CFN deletes every resource it successfully provisioned. Stack ends in `ROLLBACK_COMPLETE`. The next create attempt starts from zero.
- During **update**: CFN reverts every resource it modified back to pre-update config. Stack ends in `UPDATE_ROLLBACK_COMPLETE`.

This is safe but expensive. A stack with 80 resources where the 79th fails throws away 78 successful operations.

## The override: Preserve successfully provisioned resources

CLI flag: `--disable-rollback` (or `--on-failure DO_NOTHING` for `create-stack`).
Console: "Preserve successfully provisioned resources" checkbox in stack options.

With this enabled, on failure:

- Successful resources stay (`CREATE_COMPLETE` / `UPDATE_COMPLETE`)
- Failed resources stay in `CREATE_FAILED` / `UPDATE_FAILED`
- Stack status: `CREATE_FAILED` or `UPDATE_FAILED` (not rolled back)

Then you have three operational choices:

| Action | When | What happens |
|--------|------|--------------|
| **Retry** | Failure was transient or external (IAM eventual consistency, throttling) | CFN retries failed resources, no template change |
| **Update** | Failure was a template error you've now fixed | CFN re-attempts with new template; failed resources retry, successful ones update if the template differs |
| **Roll back** | Give up on this attempt | CFN reverts to last known stable state |

**Constraint**: when updating a stack already in `CREATE_FAILED` or `UPDATE_FAILED`, you must also pass `--disable-rollback` again, or CFN will roll back instead.

**Constraint**: change sets can be created against `CREATE_FAILED` or `UPDATE_FAILED` stacks, but **not** against `UPDATE_ROLLBACK_FAILED`.

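The three follow-up actions for a preserved stack can be sketched as AWS CLI wrappers. This is a minimal sketch, not a definitive workflow: the function names are hypothetical, the retry path assumes that "retry" from the CLI amounts to an `update-stack` with the previous template, and a real stack may also need `--parameters` or `--capabilities`.

```bash
#!/usr/bin/env bash
# Sketch: follow-up actions for a stack left in CREATE_FAILED or UPDATE_FAILED.
# Function names are illustrative, not part of the AWS CLI.

# Retry: same template, failed resources are attempted again.
# Assumption: retry is expressed as update-stack with --use-previous-template.
retry_failed_stack() {
  local stack_name="$1"
  aws cloudformation update-stack \
    --stack-name "$stack_name" \
    --use-previous-template \
    --disable-rollback
}

# Update: apply a corrected template. --disable-rollback is passed again,
# otherwise a second failure triggers a normal rollback.
update_failed_stack() {
  local stack_name="$1" template_file="$2"
  aws cloudformation update-stack \
    --stack-name "$stack_name" \
    --template-body "file://$template_file" \
    --disable-rollback
}

# Roll back: give up and return to the last known stable state.
rollback_failed_stack() {
  local stack_name="$1"
  aws cloudformation rollback-stack --stack-name "$stack_name"
}
```

Usage would look like `update_failed_stack MyStack fixed-template.yaml`, where the stack name and file name are placeholders.
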
## Parallel provisioning, independent failure paths

CFN identifies dependency relationships and provisions independent paths in parallel. A failure in one path does not stop other paths; they continue until completion or their own failure.

So a "failed" stack operation may have several resources still in `IN_PROGRESS` when the first failure is reported. Wait for all paths to settle before deciding on the next action.

## Alarm-driven rollback (rollback triggers)

You can attach **CloudWatch alarms** as rollback triggers. If any specified alarm enters `ALARM` state during the monitoring window, CFN aborts and rolls back. This catches application-level regressions invisible to CFN itself (latency spike, error rate jump).

CLI:

```bash
aws cloudformation update-stack --stack-name MyStack \
  --use-previous-template \
  --rollback-configuration \
  "RollbackTriggers=[{Arn=arn:aws:cloudwatch:us-east-1:...:alarm:MyAlarm,Type=AWS::CloudWatch::Alarm}],MonitoringTimeInMinutes=10"
```

- Monitoring window: 0–180 minutes (default 0 for stack ops, 5 for change sets)
- Composite alarms (`AWS::CloudWatch::CompositeAlarm`) are supported, which is useful for "any of N signals trip"
- **If a referenced alarm is missing, the entire stack op fails immediately**, so keep the alarm ARNs current
- CFN needs IAM permission to read CloudWatch metric data

## The bad state: `UPDATE_ROLLBACK_FAILED`

This happens when CFN's rollback itself fails. Classic cause: CFN tries to roll back to a previous resource state that no longer exists (someone deleted the original RDS instance out-of-band).

A stack in `UPDATE_ROLLBACK_FAILED` cannot be updated. Recovery path:

1. Fix the underlying error if possible (recreate the missing resource manually)
2. Run `continue-update-rollback`
3. If specific resources can't be rolled back, use `--resources-to-skip` to mark them `UPDATE_COMPLETE` and proceed

```bash
aws cloudformation continue-update-rollback --stack-name WebInfra \
  --resources-to-skip myCustom WebInfra-Compute-Asg.myAsg
```

For nested-stack resources, the format is `NestedStackName.ResourceLogicalID`. To skip a nested stack itself (`AWS::CloudFormation::Stack`), the embedded stack must be in `DELETE_IN_PROGRESS`, `DELETE_COMPLETE`, or `DELETE_FAILED`.

**Critical caveat**: skipped resources are *marked* `UPDATE_COMPLETE` but their actual state diverges from the template. Subsequent updates may fail and the stack may become permanently unrecoverable. Always reconcile skipped resources before the next stack operation: either update the template to match reality, or fix the resource to match the template.

## Root-cause detection

The console's "Detect root cause" button on the Events tab analyzes failures and tags the most likely root-cause event with a "Likely root cause" label, with optional CloudTrail event linkage. It doesn't work for nested stacks.

For non-console workflows, parse stack events programmatically and walk dependency relationships from the first `CREATE_FAILED` / `UPDATE_FAILED`.

## Decision: when to enable preserve-on-failure

Default it on for:

- Long-running stack creates (>10 minutes)
- Stacks with stateful resources where partial success is valuable
- Iterative template development (you'll re-run many times)

Leave default rollback for:

- Production updates where partial state would be worse than no update
- Small stacks where re-create cost is low

## Related

- [[CFN Update Behaviors and the Replacement Trap]]
- [[CFN Change Sets Preview-Execute]]
- [[CFN Drift Detection Mechanics and Limits]]
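
As a starting point for the non-console root-cause workflow described under "Root-cause detection", the earliest failed event can be pulled from the stack's event history. This is a rough sketch: `first_failed_event` is a hypothetical helper name, it only finds the starting event and does not walk dependencies, and it assumes the oldest failure in the retained history belongs to the operation being investigated, which may not hold for a stack with earlier failed operations.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the earliest failed event recorded for a stack.
first_failed_event() {
  local stack_name="$1"
  # describe-stack-events returns newest events first, so reverse the list
  # before filtering, then take the first match.
  local query="reverse(StackEvents)[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED'] | [0].[Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason]"
  # JSON output is used so the query runs once over the full, paginated result.
  aws cloudformation describe-stack-events \
    --stack-name "$stack_name" \
    --query "$query" \
    --output json
}
```

Calling `first_failed_event WebInfra` would print the timestamp, logical ID, status, and status reason of that event, or `null` if no failure is recorded.
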