
Multi-Region Failover Architecture

When your US-East data center explodes, traffic seamlessly shifts to US-West. Customers don't even know. You're sleeping.

Case Study • April 10, 2025

The Problem

You run everything in US-East. One data center. All your eggs in one basket. It's fine until 3am when a cooling unit fails and the entire region goes dark.

Thousands of customers see 500 errors. Your CEO is freaking out. Your on-call engineer is scrambling to figure out which runbook to follow. Did you test failover? Nobody knows.

The old way takes 30 minutes. You manually log in. Update DNS. Pray it propagates. Your sleep is ruined. Your Slack is full of angry messages.

The better way takes 2 minutes. Health checks notice US-East is down. Automatic failover kicks in. Traffic shifts to US-West. Customers see a blip. You wake up to a nice email saying the system recovered on its own. You go back to sleep.

This case study is about how to build that better way.

Why Not Just Use One Region?

Data center failures happen. Not often, but they happen. AWS had an outage in US-East-1 that lasted 12 hours. Google Cloud had a networking issue that nuked an entire region. These aren't hypotheticals. They're Tuesday.

If you're storing millions of dollars' worth of user data and processing millions more in transactions per day, you can't afford 12 hours of downtime. Your customers will leave. Your business will take a real hit.

Multi-region within the same geography is your insurance policy. It's also not that hard if you do it right.

The Architecture: Single Primary, Synchronous Replicas (Same Geography)

Here's the setup. You have three regions: US-East, US-West, US-Central. They're maybe 50 to 70ms apart. Close enough that synchronous replication actually works.

In normal operation, all traffic hits US-East. Data gets written to the US-East database. That write replicates synchronously to US-West and US-Central. They have to confirm before the write succeeds. Reads can come from any region.
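
If the database is PostgreSQL, for example, that write path is mostly a database setting rather than application code. A minimal sketch of the one-time setup, assuming streaming replication with standbys registered as us_west and us_central (hypothetical application_name values):

// One-time setup, run once against the US-East primary.
// Assumes PostgreSQL streaming replication with standbys registered as
// 'us_west' and 'us_central' (hypothetical application_name values).
const { Client } = require('pg');

async function requireSynchronousReplicas(connectionString) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    // Both standbys must confirm before a commit returns to the app.
    await client.query(
      "ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (us_west, us_central)'"
    );
    // remote_apply: the standbys have replayed the write, so reads from
    // any region see it.
    await client.query("ALTER SYSTEM SET synchronous_commit = 'remote_apply'");
    // Pick up the new settings without a restart.
    await client.query('SELECT pg_reload_conf()');
  } finally {
    await client.end();
  }
}

From the application's point of view nothing changes; a successful commit now just means both replicas already have the data.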

When US-East dies, the health checks notice the missing heartbeat. DNS updates to point traffic at US-West and US-Central. Since data was replicated synchronously, there's zero data loss. Everything already exists in the other two regions.

Users never notice. Your data is safe. Everyone goes back to work.

The Secret: Synchronous Replication Works Here

The reason this works is geography. US-East to US-West has maybe 50ms latency on a good day. Add replication confirmation and you're maybe at 100ms total. That's acceptable. User writes don't feel noticeably slower.

If you tried to replicate synchronously to Europe, every write would pick up another transatlantic round trip or two. Writes would feel slow. Customers would complain. You'd be tempted to drop synchronous replication and live dangerously. Don't do that.

The key advantage: because replication is synchronous, when US-East fails, US-West already has all the data. No data loss. No eventual consistency nightmares. No conflict resolution.

The Math

  • Network latency between US regions is about 50ms
  • Add the replication round trip and you're maybe at 100ms per write
  • The write-time increase feels acceptable
  • Data loss on failover is zero
  • Your life becomes much better

Health Checks: Actually Knowing When Things Break

Your health check needs to be more than just "can I reach port 5432?" It needs to actually check if the region can serve real user requests.

Every 10 seconds, from a control plane (Lambda, Kubernetes job, or a cron that calls an API), you check:

  • Can you complete a transaction (read, write, read again)?
  • What's the database replication lag?
  • What's the error rate in the past minute?
  • Is the connection pool healthy?

If US-East fails 3 out of 5 checks, it's out. DNS updates. Traffic shifts. The whole thing takes 30 to 60 seconds from failure to recovery.

// Health check that actually matters
async function isRegionHealthy(region) {
  try {
    // Try a real transaction
    const start = Date.now();
    await db.query('SELECT 1');
    const latency = Date.now() - start;

    // Check replication lag
    const lag = await checkReplicationLag(region);

    // Check error rate
    const errorRate = await getErrorRate(region);

    // Region is healthy if all these are true:
    // - Latency is normal (less than 100ms)
    // - Replication lag is low (less than 5 seconds)
    // - Error rate is acceptable (less than 0.5%)
    return latency < 100 && lag < 5000 && errorRate < 0.005;
  } catch (e) {
    return false;
  }
}
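
The function above answers "is this region healthy right now?" What turns it into failover is the control-plane loop around it: run the check every 10 seconds, keep the last five results, and trip when three of them are failures. A minimal sketch, assuming a long-running Node process; failoverTo() is a hypothetical helper that performs the DNS update from the next section.

// Control-plane loop: check every 10 seconds, fail over when 3 of the
// last 5 checks have failed. failoverTo() is a hypothetical helper that
// performs the DNS update.
const recentResults = [];
let failedOver = false;  // one-shot: a human resets this after review

async function controlLoop() {
  if (failedOver) return;

  const healthy = await isRegionHealthy('us-east');
  recentResults.push(healthy);
  if (recentResults.length > 5) recentResults.shift();

  const failures = recentResults.filter((ok) => !ok).length;
  if (failures >= 3) {
    console.log('US-East failed 3 of the last 5 checks, failing over');
    // Placeholder IPs standing in for US-West and US-Central endpoints.
    await failoverTo(['203.0.113.10', '203.0.113.20']);
    failedOver = true;
  }
}

setInterval(controlLoop, 10 * 1000);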

DNS Failover: Simple and Elegant

When health checks determine US-East is down, you update your DNS record. Instead of pointing to US-East, it points to US-West and US-Central.

Keep TTL low, like 30 to 60 seconds. Clients refresh their DNS cache frequently. Within a minute, all traffic shifts.
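
What "update your DNS record" actually looks like depends on your provider. A minimal sketch of the failoverTo() helper from the loop above, assuming Route 53 and the AWS SDK v3; the hosted zone ID, record name, and IPs are placeholders:

// Hypothetical failoverTo() helper, assuming Route 53 (AWS SDK v3).
// The hosted zone ID, record name, and IPs are placeholders.
const { Route53Client, ChangeResourceRecordSetsCommand } =
  require('@aws-sdk/client-route-53');

const route53 = new Route53Client({});

async function failoverTo(healthyRegionIps) {
  await route53.send(new ChangeResourceRecordSetsCommand({
    HostedZoneId: 'Z0123456789EXAMPLE',
    ChangeBatch: {
      Comment: 'Automated failover: US-East unhealthy',
      Changes: [{
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: 'api.example.com',
          Type: 'A',
          TTL: 60,  // keep it low so clients pick up the change quickly
          ResourceRecords: healthyRegionIps.map((ip) => ({ Value: ip })),
        },
      }],
    },
  }));
}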

This works better than IP failover or load balancer magic because it's simple. CDNs respect it. Mobile apps respect it. Desktop clients respect it. Everyone shifts together.

What About Requests in Progress?

When you failover, what happens to requests that were in progress when US-East went down?

You have three choices:

  • Let them fail: Requests die. Clients retry on the new region. They go through. This is fine for most things.
  • Connection draining: Stop accepting new requests on US-East but let existing ones finish. Cleaner but harder to implement.
  • Request forwarding: Forward requests in flight to the new region. So much work for so little benefit.

Most teams just let them fail. It's fine. Clients have retry logic anyway.
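
If you go the "let them fail" route, the client retry is doing the heavy lifting, so it's worth being deliberate about it. A minimal sketch of a retry loop with backoff, assuming the requests are idempotent (safe to send twice):

// Minimal client-side retry: back off and try again, so requests that
// died mid-failover land on the new region once DNS has shifted.
// Assumes the requests are idempotent (safe to send twice).
async function fetchWithRetry(url, options = {}, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, options);
      if (res.ok) return res;
      // 5xx during the outage window: fall through and retry
    } catch (err) {
      // network error while traffic shifts: fall through and retry
    }
    if (i < attempts - 1) {
      // back off 1s, 2s, 4s so later attempts land after DNS has moved
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** i));
    }
  }
  throw new Error(`Request to ${url} failed after ${attempts} attempts`);
}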

Testing: Game Days Are Mandatory

You built this failover system. You've never actually tested it. Of course you haven't. Nobody wants to schedule a production failure on purpose.

Do it anyway. Monthly. Kill US-East on purpose and see what happens.

First game day, something breaks. Your DNS updater has a bug. Your health checks miss a failure mode. Your load balancer has stale connections. You find this out on a Tuesday afternoon, not at 3am during a real outage.

By the third game day, it's boring. That's when you know it works.

The Results

When you get this right:

Metric               Reality
Detection Time       30-60 seconds
Failover Time        30-60 seconds (DNS propagation)
Total Downtime       1-2 minutes max
Data Loss            Zero (synchronous replication)
Manual Work Needed   None. Just review logs after.

Compare that to the old way (30+ minutes, possible data loss, angry customers, ruined sleep) and it's not even close.

What Actually Matters

Replication lag is everything. With synchronous replication it should sit near zero, but replication can quietly degrade to async (a standby drops out, someone relaxes a setting), and then lag is real exposure: if US-East crashes while lag is 10 seconds, you lost 10 seconds of data. Monitor this obsessively. Alert if lag goes over 5 seconds. The moment it hits 10 seconds, you're in danger.
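
One way to measure it, and a possible body for the checkReplicationLag() helper the health check calls, assuming PostgreSQL streaming replication (replicaPoolFor() is a hypothetical per-region connection pool):

// Hypothetical checkReplicationLag(), assuming PostgreSQL streaming
// replication. Runs against the replica and returns milliseconds since
// the last replayed transaction.
async function checkReplicationLag(region) {
  const replica = replicaPoolFor(region);  // hypothetical per-region pool
  const { rows } = await replica.query(`
    SELECT COALESCE(
      EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000,
      0
    ) AS lag_ms
  `);
  // Note: on an idle primary this number grows even with nothing to
  // replay, so treat it as an alerting signal, not a precise measurement.
  return Number(rows[0].lag_ms);
}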

Health checks need to be real. Pinging a port isn't good enough. You need to run actual transactions and check actual error rates. If your health check is too loose, you'll failover for no reason. Too strict and you won't failover when you should.

Game days catch edge cases. Your failover works perfectly in theory. In practice there are race conditions. DNS caches. Load balancers holding onto connection pools. You only find these issues by actually doing it.

DNS TTL matters. Set it to 30 to 60 seconds. Some clients cache longer than they should. 30 to 60 seconds is the sweet spot between propagation speed and how clients actually behave.

What Not To Do

Don't failover to a different geographic region. Your failover is US-East to US-West. Not US-East to EU-West. Serving US users from Europe is slow. Latency gets unacceptable. Just don't do it.

Don't rely on manual failover. If you have to call an engineer who has to login and manually update DNS, you've already failed. That engineer will be asleep. Their phone is on silent. It takes 45 minutes to reach them. Customers are already angry. Automate failover.

Don't ignore replication lag. Replication lag is your blind spot. If lag is 30 seconds and you fail over, you lose 30 seconds of data. Monitor it. Alert on it. Fix it.

Don't skip game days. The first time you test failover for real will not go smoothly. You'll find bugs. You'll find assumptions that are wrong. Fix them while it doesn't matter, not during an actual outage.

Getting Started

Week 1: Set up databases in two regions. Get synchronous replication working. Test that data actually replicates. Measure the replication lag.

Week 2: Build health checks from a control plane. Make sure they actually work. Alert when they fail.

Week 3: Implement DNS failover. Start with manual triggering. You press a button, DNS updates. Verify it works.

Week 4: Automate failover. Health check triggers DNS update automatically. Add an approval gate if you're nervous about it.
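
One shape the approval gate can take (an assumption, not the only option): a flag the control loop checks before it acts, so turning full automation on later is a one-line change. notifyOnCall() stands in for whatever paging or Slack hook you already have.

// Sketch of an approval gate in front of automated failover.
// AUTO_FAILOVER and notifyOnCall() are hypothetical; swap in whatever
// flag store and paging hook you actually use.
const AUTO_FAILOVER = process.env.AUTO_FAILOVER === 'true';

async function handleUnhealthyRegion(healthyRegionIps) {
  if (AUTO_FAILOVER) {
    // Week 4 and beyond: act immediately.
    await failoverTo(healthyRegionIps);
  } else {
    // Nervous mode: wake a human, who runs failoverTo() by hand.
    await notifyOnCall('US-East failing health checks. Approve failover.');
  }
}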

Month 2: Run a game day. Kill a region on purpose. See what breaks. Fix it.

Ongoing: Run game days every month. Keep runbooks updated. Stay sharp.

Conclusion

Multi-region failover is boring infrastructure. It's not flashy. It doesn't win awards. But it means when a region fails, your customers never know. They never see an outage. You sleep through the night.

That's the real win. Not being a hero at 3am. Being asleep while your systems handle themselves.

Regional failures aren't a question of if, they're a question of when. Build the right failover and when it happens, nobody notices.