A fully Terraform-provisioned multi-region DR platform on AWS demonstrating production-grade disaster recovery architecture. Primary region (us-east-1), warm standby (us-west-2), and a control plane (us-east-2) orchestrating failover via Step Functions. Includes a React operator dashboard for one-click failover and weekly automated DR drills.
My Role
Solutions Architect & Cloud Engineer
Duration
4 weeks · 2024
Context
Personal Lab Project
Outcome
< 15 min RTO · automated weekly DR drills · one-click failover dashboard
Stack
Context
Most AWS architectures treat disaster recovery as an afterthought — bolted on after everything is already in production. This project explores what a properly architected multi-region DR platform looks like from the ground up.
The Pain
No tested DR plan is the same as no DR plan. Manually operated failover under pressure is slow, error-prone, and frequently fails on the details that were never practiced.
Why It Mattered
For any production system, unplanned downtime has direct revenue and reputational consequences. The goal is a system where failover is a boring, practiced, automated process — not a crisis.
Technical Goals
Constraints
Three-region architecture: control plane in us-east-2 (Step Functions + EventBridge), primary in us-east-1 (ECS + RDS + S3), warm standby in us-west-2 (ECS + RDS replica + S3 replication). Route 53 health-check-based DNS failover switches traffic automatically.
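The Route 53 piece can be sketched in Terraform as a pair of failover records bound to a health check. All hostnames, zone IDs, and endpoint values below are illustrative assumptions, not the project's actual identifiers:

```hcl
# Sketch of health-check-based DNS failover (names are placeholders).
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  records         = ["alb-primary.us-east-1.elb.amazonaws.com"]
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "standby" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "standby"
  records        = ["alb-standby.us-west-2.elb.amazonaws.com"]

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

When the primary's health check fails its threshold, Route 53 starts answering queries with the SECONDARY record — no manual DNS change required.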
AWS Step Functions
Orchestrates the failover workflow — promotes RDS replica, updates Route 53, scales standby ECS service to full capacity
EventBridge
Schedules weekly automated DR drills and routes CloudWatch alarms to the failover workflow
Amazon ECS
Runs the reference application in both primary and standby regions
Amazon RDS
Primary database in us-east-1 with a cross-region read replica in us-west-2 promoted on failover
Route 53
Health-check-based DNS failover — automatically shifts traffic between primary and standby regions
API Gateway + Lambda
Backend for the operator dashboard — triggers the failover workflow and queries replication lag metrics
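The cross-region replica in the stack above might look like this in Terraform. Identifiers and the instance class are assumptions; note that cross-region replication references the source instance by ARN rather than by identifier:

```hcl
# Illustrative cross-region read replica in the standby region.
resource "aws_db_instance" "standby_replica" {
  provider            = aws.us_west_2
  identifier          = "app-db-standby"
  replicate_source_db = aws_db_instance.primary.arn  # ARN required across regions
  instance_class      = "db.r6g.large"
  skip_final_snapshot = true
}
```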
→ Warm standby over pilot light
Pilot light has lower running cost but requires scale-up time under pressure — exactly the wrong moment to be waiting for capacity. Warm standby keeps a minimal-capacity ECS service running in the DR region, enabling sub-15-minute RTO without a capacity race during a real incident.
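In Terraform terms, the warm standby is simply the same service deployed at reduced capacity; values below are illustrative:

```hcl
# Standby region runs the same task definition at minimal warm capacity.
resource "aws_ecs_service" "app_standby" {
  provider        = aws.us_west_2
  name            = "app-standby"
  cluster         = aws_ecs_cluster.standby.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 1  # failover workflow raises this to full capacity
}
```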
→ Step Functions for failover orchestration
Failover involves multiple sequential async steps with error handling and rollback. Step Functions makes the workflow visible, auditable, and retryable — far better than a Lambda calling other Lambdas with no execution history.
Seven reusable Terraform modules — networking, compute, database, storage, monitoring, control-plane, and bastion. Each environment stack composes these modules.
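A sketch of how an environment stack might compose the shared modules — module paths, variables, and outputs here are assumptions for illustration:

```hcl
# Example environment stack composing the reusable modules.
module "networking" {
  source = "../../modules/networking"
  cidr   = "10.1.0.0/16"
}

module "database" {
  source     = "../../modules/database"
  subnet_ids = module.networking.private_subnet_ids
  replica_of = var.primary_db_arn  # null in the primary stack, set in standby
}

module "compute" {
  source      = "../../modules/compute"
  subnet_ids  = module.networking.private_subnet_ids
  db_endpoint = module.database.endpoint
}
```

Because primary and standby compose the same modules, regional drift is limited to a handful of input variables.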
Step Functions state machine executes the failover sequence with compensating transactions on failure.
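Such a state machine could be wired in Terraform roughly as follows; every state name and Lambda function here is an illustrative assumption, not the project's actual definition:

```hcl
# Sketch of the failover state machine with a compensating rollback path.
resource "aws_sfn_state_machine" "failover" {
  name     = "dr-failover"
  role_arn = aws_iam_role.sfn_failover.arn

  definition = jsonencode({
    StartAt = "PromoteReplica"
    States = {
      PromoteReplica = {
        Type     = "Task"
        Resource = aws_lambda_function.promote_replica.arn
        Catch    = [{ ErrorEquals = ["States.ALL"], Next = "Rollback" }]
        Next     = "ScaleStandbyEcs"
      }
      ScaleStandbyEcs = {
        Type     = "Task"
        Resource = aws_lambda_function.scale_standby.arn
        Catch    = [{ ErrorEquals = ["States.ALL"], Next = "Rollback" }]
        Next     = "UpdateRoute53"
      }
      UpdateRoute53 = {
        Type     = "Task"
        Resource = aws_lambda_function.update_dns.arn
        End      = true
      }
      Rollback = {
        Type     = "Task"
        Resource = aws_lambda_function.rollback.arn  # compensating transaction
        End      = true
      }
    }
  })
}
```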
The Problem
RDS replica promotion is asynchronous and takes 5–10 minutes. The initial state machine failed because it proceeded to update the application config before the promoted instance was actually available — causing connection errors.
The Fix
Added an explicit polling loop using a Step Functions Wait state plus a Lambda that checks the RDS instance status every 30 seconds. The workflow only advances once the promoted instance reaches the 'available' state. DR drill timing improved from failing outright to a consistent sub-12-minute end-to-end run.
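The Wait + Choice pattern can be expressed directly in the state machine definition. State and function names below are illustrative; the status-check Lambda is an assumed wrapper around `rds:DescribeDBInstances`:

```hcl
# Polling-loop fragment: wait 30 s, check RDS status, loop until "available".
definition = jsonencode({
  StartAt = "PromoteReplica"
  States = {
    PromoteReplica = {
      Type     = "Task"
      Resource = aws_lambda_function.promote_replica.arn
      Next     = "WaitForPromotion"
    }
    WaitForPromotion = { Type = "Wait", Seconds = 30, Next = "CheckDbStatus" }
    CheckDbStatus = {
      Type     = "Task"
      Resource = aws_lambda_function.check_db_status.arn
      Next     = "IsDbAvailable"
    }
    IsDbAvailable = {
      Type = "Choice"
      Choices = [{
        Variable     = "$.db_status"
        StringEquals = "available"
        Next         = "UpdateAppConfig"
      }]
      Default = "WaitForPromotion"  # not ready yet — wait and re-check
    }
    UpdateAppConfig = {
      Type     = "Task"
      Resource = aws_lambda_function.update_app_config.arn
      End      = true
    }
  }
})
```

Pushing the polling into the state machine (rather than a long-running Lambda) keeps each Lambda invocation short and makes every poll visible in the execution history.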
A production-grade multi-region DR architecture fully codified in Terraform with automated weekly testing.
Before → After
Recovery Time Objective
DR Drill Process
Infrastructure
Business Outcome
A reference architecture demonstrating multi-region DR thinking — a reusable template for any AWS project requiring high availability and tested recovery across regions.
Would Do Differently
Key Takeaways