CloudFormation's failure model has two modes — the default is "roll everything back", and the override is "preserve what worked". Knowing the difference matters because the default sometimes destroys hours of partial progress on a complex stack.
## Default behavior: full rollback
When any resource in a stack operation fails:
- During **create**: CFN deletes every resource it successfully provisioned. Stack ends in `ROLLBACK_COMPLETE`, a state from which the stack can only be deleted, so the next create attempt starts from zero.
- During **update**: CFN reverts every resource it modified back to pre-update config. Stack ends in `UPDATE_ROLLBACK_COMPLETE`.
This is safe but expensive. A stack with 80 resources where the 79th fails throws away 78 successful operations.
## The override: Preserve successfully provisioned resources
CLI flag: `--disable-rollback` (or `--on-failure DO_NOTHING` for `create-stack`). Console: "Preserve successfully provisioned resources" checkbox in stack options.
With this enabled, on failure:
- Successful resources stay (`CREATE_COMPLETE` / `UPDATE_COMPLETE`)
- Failed resources stay in `CREATE_FAILED` / `UPDATE_FAILED`
- Stack status: `CREATE_FAILED` or `UPDATE_FAILED` (not rolled back)
Then you have three operational choices:
| Action | When | What happens |
|--------|------|--------------|
| **Retry** | Failure was transient or external (IAM eventual consistency, throttling) | CFN retries failed resources, no template change |
| **Update** | Failure was a template error you've now fixed | CFN re-attempts with new template; failed resources retry, successful ones update if the template differs |
| **Roll back** | Give up on this attempt | CFN reverts to last known stable state |
**Constraint**: when updating a stack that is already in `CREATE_FAILED` or `UPDATE_FAILED`, you must pass `--disable-rollback` again, or CFN will roll back instead.
**Constraint**: change sets can be created against `CREATE_FAILED` or `UPDATE_FAILED` stacks — but **not** against `UPDATE_ROLLBACK_FAILED`.
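
The three choices in the table map onto CLI calls roughly as follows. This is a minimal sketch, assuming AWS CLI v2 with credentials configured and assuming that "retry" corresponds to an `update-stack` call that reuses the template already on the stack. The function names are hypothetical, and stacks with parameters or IAM resources would also need `--parameters` / `--capabilities`.

```bash
# Sketch: follow-up actions for a stack in CREATE_FAILED / UPDATE_FAILED.
# Function names are illustrative, not part of the AWS CLI.

# Retry: re-run the failed resources against the template already on the stack.
cfn_retry() {
  aws cloudformation update-stack \
    --stack-name "$1" \
    --use-previous-template \
    --disable-rollback
}

# Update: submit a corrected template; --disable-rollback is passed again.
cfn_update_fixed() {
  aws cloudformation update-stack \
    --stack-name "$1" \
    --template-body "file://$2" \
    --disable-rollback
}

# Roll back: give up on this attempt and return to the last known stable state.
cfn_roll_back() {
  aws cloudformation rollback-stack --stack-name "$1"
}
```

For example, `cfn_update_fixed MyStack fixed.yaml` would resubmit a corrected template (the file name is a placeholder).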
## Parallel provisioning, independent failure paths
CFN identifies dependency relationships and provisions independent paths in parallel. A failure in one path does not stop the others: they continue until they complete or fail on their own. So a "failed" stack operation may still have several resources in `IN_PROGRESS` when the first failure is reported. Wait for all paths to settle before deciding on the next action.
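
One way to wait for every path to settle is to poll the resource list until no status ends in `IN_PROGRESS`. The sketch below assumes AWS CLI v2; the function names, the `CFN_POLL_SECONDS` variable, and the 10-second default interval are illustrative choices, not CloudFormation features.

```bash
# Sketch: list resources whose status still ends in IN_PROGRESS.
cfn_in_progress() {
  aws cloudformation list-stack-resources \
    --stack-name "$1" \
    --query "StackResourceSummaries[?ends_with(ResourceStatus, 'IN_PROGRESS')].LogicalResourceId" \
    --output text
}

# Poll until nothing is in flight; return non-zero if the CLI call itself fails.
cfn_wait_settled() {
  local pending
  while true; do
    pending=$(cfn_in_progress "$1") || return 1
    if [ -z "$pending" ]; then
      return 0
    fi
    echo "still in progress: $pending" >&2
    sleep "${CFN_POLL_SECONDS:-10}"
  done
}
```

Running `cfn_wait_settled MyStack` before choosing retry, update, or roll back avoids acting on a stack that is still changing.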
## Alarm-driven rollback (rollback triggers)
You can attach **CloudWatch alarms** as rollback triggers. If any specified alarm enters `ALARM` state during the stack operation or the monitoring window that follows it, CFN aborts and rolls back. This catches application-level regressions that CFN itself cannot see, such as a latency spike or a jump in error rate.
CLI:
```bash
# Keep the shorthand value on one line: a line break inside the quotes
# is passed to the CLI as part of the argument.
aws cloudformation update-stack --stack-name MyStack \
  --use-previous-template \
  --rollback-configuration \
  'RollbackTriggers=[{Arn=arn:aws:cloudwatch:us-east-1:...:alarm:MyAlarm,Type=AWS::CloudWatch::Alarm}],MonitoringTimeInMinutes=10'
```
- Monitoring window: 0–180 minutes (default 0 for stack ops, 5 for change sets)
- Composite alarms (`AWS::CloudWatch::CompositeAlarm`) are supported — useful for "any of N signals trip"
- **If a referenced alarm is missing, the entire stack op fails immediately** — keep the alarm ARNs current
- CFN needs IAM permission to read CloudWatch metric data
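
Because a missing alarm fails the whole operation, a pre-flight existence check can be worthwhile. The following is a rough sketch, assuming AWS CLI v2 with its default pretty-printed JSON output; both helper names are hypothetical, and matching on the raw JSON text is a shortcut rather than a robust parser.

```bash
# Sketch: extract the alarm name from an alarm ARN
# (arn:aws:cloudwatch:<region>:<account>:alarm:<name>).
alarm_name_from_arn() {
  echo "${1#*:alarm:}"
}

# Succeeds only if a metric or composite alarm with exactly this name exists.
cfn_alarm_exists() {
  local name="$1"
  aws cloudwatch describe-alarms \
    --alarm-names "$name" \
    --alarm-types MetricAlarm CompositeAlarm \
    --output json \
    | grep -qF "\"AlarmName\": \"${name}\""
}
```

A deployment script could call `cfn_alarm_exists "$(alarm_name_from_arn "$arn")" || exit 1` for each trigger before starting the stack operation.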
## The bad state: `UPDATE_ROLLBACK_FAILED`
This happens when CFN's rollback itself fails. Classic cause: CFN tries to roll back to a previous resource state that no longer exists (someone deleted the original RDS instance out-of-band).
A stack in `UPDATE_ROLLBACK_FAILED` cannot be updated. Recovery path:
1. Fix the underlying error if possible (recreate the missing resource manually)
2. Run `continue-update-rollback`
3. If specific resources can't be rolled back, use `--resources-to-skip` to mark them `UPDATE_COMPLETE` and proceed
```bash
aws cloudformation continue-update-rollback --stack-name WebInfra \
--resources-to-skip myCustom WebInfra-Compute-Asg.myAsg
```
For nested-stack resources: format is `NestedStackName.ResourceLogicalID`. To skip a nested stack itself (`AWS::CloudFormation::Stack`), the embedded stack must be in `DELETE_IN_PROGRESS`, `DELETE_COMPLETE`, or `DELETE_FAILED`.
**Critical caveat**: skipped resources are *marked* `UPDATE_COMPLETE` but their actual state diverges from the template. Subsequent updates may fail and the stack may become permanently unrecoverable. Always reconcile skipped resources before the next stack operation — either update the template to match reality, or fix the resource to match the template.
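
Before choosing what to pass to `--resources-to-skip`, it helps to list which resources are actually stuck. This is a minimal sketch, assuming AWS CLI v2; the function name is hypothetical. Only resources that are in `UPDATE_FAILED` because the rollback failed are valid skip candidates, so the reason column should be read before skipping anything.

```bash
# Sketch: show resources in UPDATE_FAILED with their type and failure reason.
# Output is one tab-separated row per resource.
cfn_rollback_failures() {
  aws cloudformation list-stack-resources \
    --stack-name "$1" \
    --query "StackResourceSummaries[?ResourceStatus=='UPDATE_FAILED'].[LogicalResourceId, ResourceType, ResourceStatusReason]" \
    --output text
}
```

Keeping this output alongside the skip command also gives you the list of resources to reconcile afterwards.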
## Root-cause detection
The console's "Detect root cause" button on the Events tab analyzes failures and tags the most likely root-cause event with a "Likely root cause" label, optionally linked to the related CloudTrail events. It does not work for nested stacks. For non-console workflows, parse stack events programmatically and walk dependency relationships from the first `CREATE_FAILED` / `UPDATE_FAILED`.
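
A starting point for the programmatic route is to find the earliest failure event. The sketch below is a heuristic, not the console's algorithm: it assumes AWS CLI v2, assumes timestamps are printed in a consistent ISO 8601 form so they sort as text, and skips events whose reason says the operation was cancelled, on the assumption that those are knock-on failures. It does not walk dependencies, and it scans the stack's whole event history, so in practice you would narrow it to the latest operation. Function names are hypothetical.

```bash
# Pick the earliest *_FAILED event from tab-separated lines on stdin:
#   timestamp <TAB> logical-id <TAB> status <TAB> reason
first_failure() {
  awk -F'\t' '$3 ~ /_FAILED$/ && tolower($4) !~ /cancelled/' \
    | LC_ALL=C sort \
    | head -n 1
}

# Fetch stack events and feed them to first_failure.
cfn_first_failure() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query 'StackEvents[].[Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason]' \
    --output text \
    | first_failure
}
```

`cfn_first_failure MyStack` prints a single row; the second field is the logical ID to start investigating from.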
## Decision: when to enable preserve-on-failure
Enable it by default for:
- Long-running stack creates (>10 minutes)
- Stacks with stateful resources where partial success is valuable
- Iterative template development (you'll re-run many times)
Leave default rollback for:
- Production updates where partial state would be worse than no update
- Small stacks where re-create cost is low
## Related
- [[CFN Update Behaviors and the Replacement Trap]]
- [[CFN Change Sets Preview-Execute]]
- [[CFN Drift Detection Mechanics and Limits]]