Solutions Architecture

Multi-Region DR Platform

Summary

A multi-region disaster recovery (DR) platform on AWS, provisioned entirely with Terraform, that demonstrates a production-grade DR architecture. It has a primary region (us-east-1), a warm standby (us-west-2), and a control plane (us-east-2) that orchestrates failover via Step Functions. It also includes a React operator dashboard for one-click failover, plus weekly automated DR drills.

Project Snapshot

My Role

Solutions Architect & Cloud Engineer

Duration

4 weeks · 2024

Context

Personal Lab Project

Outcome

< 15 min RTO · automated weekly DR drills · one-click failover dashboard

Stack

AWS ECS · RDS · S3 · Step Functions · EventBridge · Lambda · API Gateway · Route 53 · CloudWatch · Terraform

The Problem

Context

Most AWS architectures treat disaster recovery as an afterthought — bolted on after everything is already in production. This project explores what a properly architected multi-region DR platform looks like from the ground up.

The Pain

An untested DR plan is no better than no DR plan. Manually operated failover under pressure is slow, error-prone, and frequently fails on the details that were never practiced.

Why It Mattered

For any production system, unplanned downtime has direct revenue and reputational consequences. The goal is a system where failover is a boring, practiced, automated process — not a crisis.

Goals & Requirements

Technical Goals

  • RTO under 15 minutes measured end-to-end
  • Automated weekly DR drill with CloudWatch outcome metrics
  • One-click failover via an operator dashboard
  • 100% IaC — zero manual AWS console steps anywhere
  • Cross-region RDS read replica promoting automatically on failover
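
To make "measured end-to-end" concrete, here is a minimal sketch of how one drill run could be turned into CloudWatch metric data. The metric names and the `build_drill_metrics` helper are hypothetical; only the 15-minute target comes from the goals above.

```python
from datetime import datetime, timezone

RTO_TARGET_SECONDS = 15 * 60  # the 15-minute RTO goal


def build_drill_metrics(started_at, finished_at, succeeded):
    """Build CloudWatch MetricData entries for one DR drill run.

    Metric names are illustrative, not taken from the project.
    """
    elapsed = (finished_at - started_at).total_seconds()
    within_rto = succeeded and elapsed < RTO_TARGET_SECONDS
    return [
        {
            "MetricName": "FailoverDurationSeconds",
            "Value": elapsed,
            "Unit": "Seconds",
            "Timestamp": finished_at,
        },
        {
            # 1.0 when the drill succeeded inside the RTO target, else 0.0
            "MetricName": "DrillWithinRTO",
            "Value": 1.0 if within_rto else 0.0,
            "Unit": "Count",
            "Timestamp": finished_at,
        },
    ]


start = datetime(2024, 1, 7, 6, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 7, 6, 12, tzinfo=timezone.utc)
metrics = build_drill_metrics(start, end, succeeded=True)
```

The resulting list has the shape accepted by boto3's `cloudwatch.put_metric_data(Namespace=..., MetricData=metrics)`. The namespace would be chosen by the project. An alarm on `DrillWithinRTO` would then flag any week where the drill misses the target.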

Constraints

  • Lab budget — cost-conscious design, stop/start patterns for non-critical resources
  • All Terraform — reusable modules for each infrastructure layer

Architecture Design

Three-region architecture: control plane in us-east-2 (Step Functions + EventBridge), primary in us-east-1 (ECS + RDS + S3), warm standby in us-west-2 (ECS + RDS replica + S3 replication). Route 53 health-check-based DNS failover switches traffic automatically.

Architecture Diagram

Component Breakdown

AWS Step Functions

Orchestrates the failover workflow — promotes RDS replica, updates Route 53, scales standby ECS service to full capacity

EventBridge

Schedules weekly automated DR drills and routes CloudWatch alarms to the failover workflow

Amazon ECS

Runs the reference application in both primary and standby regions

Amazon RDS

Primary database in us-east-1 with a cross-region read replica in us-west-2 promoted on failover

Route 53

Health-check-based DNS failover — automatically shifts traffic between primary and standby regions
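
For the explicit traffic shift that the failover workflow performs (step 5 of the workflow), one possible shape for the Route 53 change is sketched below. It assumes two weighted CNAME records with set identifiers `primary` and `standby`. The record names and the `build_weight_shift` helper are hypothetical, and a real setup fronted by load balancers would more likely use alias records.

```python
def build_weight_shift(record_name, primary_target, standby_target, ttl=60):
    """Build a Route 53 ChangeBatch that sends all traffic to the standby.

    Assumes weighted CNAME records; a weight of 0 removes the primary from
    rotation as long as the standby keeps a non-zero weight.
    """

    def record(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": "Shift all traffic to the DR region",
        "Changes": [
            record("primary", primary_target, 0),
            record("standby", standby_target, 100),
        ],
    }


batch = build_weight_shift(
    "app.example.com",           # hypothetical record name
    "primary-lb.example.com",    # hypothetical us-east-1 endpoint
    "standby-lb.example.com",    # hypothetical us-west-2 endpoint
)
```

The batch could be passed to boto3's `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. A short TTL keeps the cutover time small, since resolvers may cache the old answer for up to the TTL.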

API Gateway + Lambda

Backend for the operator dashboard — triggers the failover workflow and queries replication lag metrics

Key Design Decisions

Warm standby over pilot light

Pilot light has lower running cost but requires scale-up time under pressure — exactly the wrong moment to be waiting for capacity. Warm standby keeps a minimal-capacity ECS service running in the DR region, enabling sub-15-minute RTO without a capacity race during a real incident.

Step Functions for failover orchestration

Failover involves multiple sequential async steps with error handling and rollback. Step Functions makes the workflow visible, auditable, and retryable — far better than a Lambda calling other Lambdas with no execution history.

Implementation Breakdown

01

Terraform Module Architecture

Seven reusable Terraform modules — networking, compute, database, storage, monitoring, control-plane, and bastion. Each environment stack composes these modules.

  • Remote state backend in S3 + DynamoDB state locking
  • Separate state files per module — limits blast radius of a bad apply
  • Dedicated CI/CD Terraform stack for plan/apply via GitHub Actions
  • All three regions provisioned from the same module definitions with region-specific variables

02

Failover Workflow

A Step Functions state machine executes the failover sequence, with compensating transactions on failure.

  • Step 1: Promote RDS read replica to standalone primary
  • Step 2: Poll until RDS instance status is 'available' (async wait loop)
  • Step 3: Update ECS task definition with new DB endpoint
  • Step 4: Scale standby ECS service to full production capacity
  • Step 5: Update Route 53 weighted routing to shift 100% of traffic to the DR region
  • Step 6: Emit DR outcome metric to CloudWatch
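
The six steps above can be sketched as an Amazon States Language definition, built here as a Python dict so it can be serialized to JSON. This is an illustration under assumptions, not the project's actual state machine: every Lambda function name is hypothetical, each Lambda is assumed to return the fields later steps need (including `status`), and the compensating rollback states are left out, so any error simply routes to a `Fail` state.

```python
import json

POLL_INTERVAL_SECONDS = 30  # matches the 30-second status check interval


def lambda_task(function_name, next_state):
    """Task state that invokes a (hypothetical) Lambda, then moves on."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "OutputPath": "$.Payload",  # pass the Lambda's return value onward
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "FailoverFailed"}],
        "Next": next_state,
    }


def build_failover_definition():
    return {
        "Comment": "DR failover sketch",
        "StartAt": "PromoteReplica",
        "TimeoutSeconds": 900,  # assumed cap aligned with the 15-minute RTO
        "States": {
            "PromoteReplica": lambda_task("dr-promote-replica", "WaitForPromotion"),
            "WaitForPromotion": {
                "Type": "Wait",
                "Seconds": POLL_INTERVAL_SECONDS,
                "Next": "CheckReplicaStatus",
            },
            "CheckReplicaStatus": lambda_task("dr-check-rds-status", "IsReplicaAvailable"),
            "IsReplicaAvailable": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.status",
                        "StringEquals": "available",
                        "Next": "UpdateTaskDefinition",
                    }
                ],
                "Default": "WaitForPromotion",  # not ready yet: wait again
            },
            "UpdateTaskDefinition": lambda_task("dr-update-task-definition", "ScaleStandbyService"),
            "ScaleStandbyService": lambda_task("dr-scale-standby", "ShiftTraffic"),
            "ShiftTraffic": lambda_task("dr-shift-traffic", "EmitOutcomeMetric"),
            "EmitOutcomeMetric": lambda_task("dr-emit-metric", "FailoverSucceeded"),
            "FailoverSucceeded": {"Type": "Succeed"},
            "FailoverFailed": {
                "Type": "Fail",
                "Error": "FailoverFailed",
                "Cause": "A failover step raised an error",
            },
        },
    }


definition_json = json.dumps(build_failover_definition(), indent=2)
```

The Wait, Task, and Choice states form the polling loop for step 2. The top-level `TimeoutSeconds` is one way to keep that loop from running indefinitely if promotion never completes.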

Challenges & Solutions

#1 RDS promotion timing in Step Functions

The Problem

RDS replica promotion is asynchronous and takes 5–10 minutes. The initial state machine failed because it proceeded to update the application config before the promoted instance was actually available — causing connection errors.

The Fix

Added an explicit polling loop using a Step Functions Wait state plus a Lambda that checks the RDS instance status every 30 seconds. The workflow only advances once the promoted instance reaches the 'available' state. DR drills went from failing to completing consistently in under 12 minutes end-to-end.
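
A minimal sketch of what the status-check Lambda in that loop might look like follows. The event fields (`db_instance_id`, `region`) and function names are assumptions. The only AWS call is boto3's `describe_db_instances`, which reports the status under `DBInstances[0]["DBInstanceStatus"]`.

```python
def check_rds_status(event, rds_client):
    """Return the incoming event plus the instance's current status.

    The client is passed in so the function can be exercised with a stub.
    """
    response = rds_client.describe_db_instances(
        DBInstanceIdentifier=event["db_instance_id"]
    )
    instance = response["DBInstances"][0]
    return {**event, "status": instance["DBInstanceStatus"]}


def handler(event, context):
    """Lambda entry point (hypothetical); the state machine reads 'status'."""
    import boto3  # imported here so check_rds_status stays testable offline

    region = event.get("region", "us-west-2")  # standby region by default
    return check_rds_status(event, boto3.client("rds", region_name=region))
```

Because the Lambda only reports status and never sleeps, each invocation stays short, and the waiting happens in the Wait state rather than inside a running function.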

Results & Impact

A production-grade multi-region DR architecture fully codified in Terraform with automated weekly testing.

Before → After

Recovery Time Objective

Unknown → < 15 min
Measurable & tested

DR Drill Process

Manual / never run → Automated weekly
Zero-touch

Infrastructure

Manual console → 100% Terraform
Fully reproducible

Business Outcome

A reference architecture demonstrating multi-region DR thinking — a reusable template for any AWS project requiring high availability and tested recovery across regions.

Reflections

Would Do Differently

  • 01 Use Aurora Global Database instead of an RDS read replica, for faster promotion and lower cross-region replication lag
  • 02 Add cost tagging from day one to get accurate per-component spend tracking across the three regions

Key Takeaways

  • 01 Failover workflows need explicit async wait states: Step Functions polling is the right pattern, not a timer-based sleep
  • 02 Weekly automated drills are more valuable than a perfect runbook: the drill finds the gaps the runbook misses, every time
