
Zero-Downtime Platform Migration

Migrating 300+ enterprise applications from on-premises Kubernetes to Azure without a single customer-facing incident
Published: Feb 2024
Timeline: 6 weeks
Team: Platform + SRE
Scope: 300+ apps

A B2B SaaS platform with 300+ enterprise applications running on on-premises Kubernetes infrastructure with a SQL database backend needed to move to Azure. The constraint was absolute: zero customer-facing downtime during migration. Enterprise SLAs don't allow for "brief maintenance windows."

We had 6 weeks to do it. We started with a plan that looked good on paper. Then week 3 happened.

Week 3: When the Plan Fell Apart

We migrated the first 15 applications using pure SQL Server replication and a direct cutover strategy. Replication lag turned out to be unpredictable: some data was 500 ms behind, some as much as 8 seconds. Our rollback procedure failed halfway through. For 90 minutes, nobody knew if the data was consistent.

That's when we admitted the approach was fundamentally broken. We had to stop the migration cold, manually roll back all 15 apps, and completely rethink the strategy. No customer ever knew: the failures were all in our test environment. But it was the wake-up call we needed.

The Key Point

Week 3 was a crisis internally. But it was caught before it hit production. This is why we could claim zero customer-facing incidents. We failed fast during testing, pivoted, and then executed flawlessly when it mattered. The crisis taught us exactly what not to do.

  • 0 customer-facing incidents
  • 0 rollbacks after pivot
  • 99.95% SLA maintained
  • 300+ applications migrated

The Pivot: Three-Tool Approach

Liquibase for Schema Versioning

Deploy the schema to Azure from a version-controlled changelog. Every change tracked. Easy to see what changed and when. Rollback capability built in.

Controlled, auditable schema deployment to Azure
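
As a rough illustration, here is how a deployment step like this can be scripted around the Liquibase CLI. This is a minimal sketch, assuming the CLI is installed and on the PATH; the changelog path, JDBC URL, and credentials are placeholders, and exact command syntax varies a little between Liquibase versions.

```python
# Hypothetical pipeline step wrapping the Liquibase CLI. Changelog path,
# JDBC URL, and credentials below are placeholders, not real values.
import subprocess

LIQUIBASE = [
    "liquibase",
    "--changelog-file=db/changelog-master.xml",  # versioned in git
    "--url=jdbc:sqlserver://<azure-sql-host>:1433;databaseName=<db>",
    "--username=<deploy-user>",
    "--password=<from-key-vault>",
]

def apply_schema() -> None:
    """Apply every pending changeset, in order, to the Azure target."""
    subprocess.run(LIQUIBASE + ["update"], check=True)

def rollback_last_changeset() -> None:
    """Undo the most recent changeset using its rollback block."""
    subprocess.run(LIQUIBASE + ["rollback-count", "--count=1"], check=True)

if __name__ == "__main__":
    apply_schema()
```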

Azure DMS for Initial Data Transfer

Azure's Database Migration Service is built for exactly this job. Parallel workers. Optimized for network transfers. Far faster and more reliable than scripting the transfer ourselves or leaning on traditional replication.

Handles bulk transfer and edge cases out-of-the-box

Change Data Capture (CDC) for Continuous Sync

Once the initial data was transferred, CDC kept the databases in sync with sub-second latency. It reads from the transaction log, so it's lightweight and reliable. We built a Python service that consumed the CDC stream and applied changes to Azure.

Lightweight, log-based, resilient to transient failures
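
The sync service itself was a small Python loop. The sketch below shows the core idea for a single table, assuming pyodbc connections to both databases, a table dbo.orders with capture instance dbo_orders, and an id primary key; the real service handled many tables plus batching, checkpoint persistence, and retries, none of which is shown here.

```python
# Stripped-down CDC tail loop for one table. Connection strings, columns,
# and the capture instance name are illustrative, not our actual values.
import time
import pyodbc

SOURCE_DSN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<on-prem>;DATABASE=<db>;Trusted_Connection=yes"
TARGET_DSN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<azure-sql>;DATABASE=<db>;UID=<user>;PWD=<pwd>"

def apply_changes(dst, rows):
    """Replay inserts/updates as upserts and deletes as deletes on the Azure copy."""
    cur = dst.cursor()
    for op, order_id, status, total in rows:
        if op == 1:  # delete
            cur.execute("DELETE FROM dbo.orders WHERE id = ?", order_id)
        else:        # 2 = insert, 4 = update (new values); 3 never appears with N'all'
            cur.execute(
                "MERGE dbo.orders AS t USING (SELECT ? AS id) AS s ON t.id = s.id "
                "WHEN MATCHED THEN UPDATE SET status = ?, total = ? "
                "WHEN NOT MATCHED THEN INSERT (id, status, total) VALUES (s.id, ?, ?);",
                order_id, status, total, status, total)
    dst.commit()

def run():
    src, dst = pyodbc.connect(SOURCE_DSN), pyodbc.connect(TARGET_DSN)
    cur = src.cursor()
    # Start from the oldest change still retained for this capture instance.
    from_lsn = cur.execute(
        "SELECT sys.fn_cdc_get_min_lsn('dbo_orders')").fetchone()[0]
    while True:
        to_lsn = cur.execute("SELECT sys.fn_cdc_get_max_lsn()").fetchone()[0]
        if from_lsn <= to_lsn:  # bytes comparison matches LSN ordering
            rows = cur.execute(
                "SELECT __$operation, id, status, total "
                "FROM cdc.fn_cdc_get_all_changes_dbo_orders(?, ?, N'all')",
                from_lsn, to_lsn).fetchall()
            if rows:
                apply_changes(dst, rows)
            # Advance the checkpoint one LSN past what we just consumed.
            from_lsn = cur.execute(
                "SELECT sys.fn_cdc_increment_lsn(?)", to_lsn).fetchone()[0]
        time.sleep(1)  # polling interval drives replication lag
```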

Execution: The Actual Migration

Phase 1: Infrastructure Setup (Week 1)

Deploy schema to Azure using Liquibase. Enable CDC on critical tables in on-premises database. Set up monitoring and observability before we touch anything important.
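
Enabling CDC is a couple of system procedures on the source database. A minimal sketch, assuming pyodbc and illustrative table names; sys.sp_cdc_enable_db and sys.sp_cdc_enable_table need elevated permissions, and on-premises SQL Server also needs SQL Server Agent running for the capture jobs.

```python
# Sketch of the Phase 1 CDC enablement step against the on-premises SQL Server.
# The connection string and table list are illustrative.
import pyodbc

SOURCE_DSN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<on-prem>;DATABASE=<db>;Trusted_Connection=yes"

conn = pyodbc.connect(SOURCE_DSN, autocommit=True)
cur = conn.cursor()

# Turn CDC on for the database (requires sysadmin).
cur.execute("EXEC sys.sp_cdc_enable_db")

# Turn CDC on for each critical table; @role_name = NULL skips role gating.
for table in ["orders", "invoices", "customers"]:
    cur.execute(
        "EXEC sys.sp_cdc_enable_table "
        "@source_schema = N'dbo', @source_name = ?, @role_name = NULL",
        table,
    )
```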

Phase 2: Initial Data Transfer (Week 1-2)

Azure DMS handled the bulk transfer. It took 24-72 hours depending on data volume and dealt with all the edge cases that would have cost us days to script.

Phase 3: Validation (Week 2-3)

Before we shifted any traffic, we validated data consistency hourly:

  • Row count comparisons on every table
  • Checksum validation on random samples
  • Spot checks on recent writes

If anything drifted, we caught it immediately. We could fix it before it became a production problem.
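
Here is a sketch of what one round of those checks can look like with pyodbc. The table list is illustrative, and the full-table checksum stands in for the sampled checks and recent-write spot checks we actually ran; BINARY_CHECKSUM is approximate, so a mismatch was a trigger to investigate, not proof of corruption.

```python
# Hourly consistency check: row counts plus an aggregate content checksum.
# Connection strings and the table list below are placeholders.
import pyodbc

TABLES = ["dbo.orders", "dbo.invoices", "dbo.customers"]

def scalar(conn, sql):
    """Run a single-value query and return the result."""
    return conn.cursor().execute(sql).fetchone()[0]

def validate(src, dst):
    """Return a list of (table, reason) pairs for anything that drifted."""
    drift = []
    for table in TABLES:
        # 1. Row counts must match exactly.
        if scalar(src, f"SELECT COUNT(*) FROM {table}") != \
           scalar(dst, f"SELECT COUNT(*) FROM {table}"):
            drift.append((table, "row count mismatch"))
            continue
        # 2. Aggregate checksum over all columns as a cheap content comparison.
        checksum_sql = f"SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM {table}"
        if scalar(src, checksum_sql) != scalar(dst, checksum_sql):
            drift.append((table, "checksum mismatch"))
    return drift

if __name__ == "__main__":
    src = pyodbc.connect("<on-prem connection string>")
    dst = pyodbc.connect("<azure sql connection string>")
    for table, reason in validate(src, dst):
        print(f"DRIFT {table}: {reason}")  # in practice this paged the on-call
```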

Phase 4: Progressive Traffic Shifting (Week 3-5)

  • Canary — 1% for 3 days; error rate stable, latency within baseline
  • Early Adopters — 10% for 5 days; customer cohort testing passing
  • Early Majority — 50% for 7 days; chaos engineering tests passing
  • Full Cutover — 100% immediate; on-prem kept as instant fallback

Each stage used feature flags to control traffic routing. If something broke at 1%, we rolled back in seconds. If 10% was solid, we moved to 50%. This gradual approach meant we caught problems early with minimal blast radius.
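
Conceptually, the routing gate is tiny. Below is a sketch of the decision, assuming the rollout percentage comes from whatever feature-flag service you already run; the backend URLs and hashing scheme are illustrative, and our real routing lived at the gateway layer.

```python
# Percentage-based routing gate. Backend URLs and the bucketing scheme are
# illustrative; the rollout percentage would come from a feature-flag service.
import hashlib

AZURE_BACKEND = "https://api.azure.internal"
ONPREM_BACKEND = "https://api.onprem.internal"

def bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket 0-99 so rollout decisions are sticky."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % 100

def backend_for(customer_id: str, rollout_percent: int) -> str:
    """Route customers whose bucket falls under the rollout percentage to Azure."""
    return AZURE_BACKEND if bucket(customer_id) < rollout_percent else ONPREM_BACKEND

# Rolling back is a config change: set the rollout percentage back to 0 and
# every request returns to the on-prem backend on the next evaluation.
assert backend_for("customer-42", 0) == ONPREM_BACKEND
```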

Phase 5: Post-Cutover (Week 6+)

On-premises infrastructure stayed running for 30 more days as a fallback. Database replication continued for 60 days. If something went catastrophically wrong, we could fail back to on-prem in minutes. We didn't actually need to. But having that safety net meant the team could focus on shipping, not panic.

What We Actually Learned

One Tool Can't Do Everything

SQL Server replication was supposed to be our entire migration solution. Asked to do everything at once, it wasn't good at any single job. The moment we split the work into schema versioning, bulk transfer, and continuous sync, everything became simpler and more reliable.

Validation Catches Everything

Hourly row count checks and checksum validation found problems we would have completely missed. Not because we're bad at testing, but because real production data is weird. Edge cases exist. Validation caught them before they mattered.

Feature Flags Are Essential

Progressive traffic shifting with feature flags meant we could ship confidently. If the 1% stage broke, only a small slice of requests was affected. By the time we hit 100%, we'd already validated at every intermediate scale with a contained blast radius.

Test Failures Save Production Incidents

Week 3's failure was painful internally but invaluable. We caught it before production. We learned exactly what was wrong. We fixed it. Then we executed flawlessly because we'd already failed once in a safe place.

Observability First, Always

We deployed monitoring before we deployed apps. When something went wrong during traffic shifting, we saw it in seconds, not hours. The observability infrastructure was the thing that let us move fast safely.

Key Takeaways

Zero-Downtime Migration Isn't About Perfection

It's about building escape routes before you need them. Feature flags, observability, validation, rollback procedures. If any one of those failed, we would have had an incident. Having all of them meant we could move confidently.

Failing in Testing Prevents Failing in Production

Week 3's crisis was the best thing that could have happened. We failed safely. We learned exactly what was wrong. We fixed it. Then we executed flawlessly because we'd already lived through the failure once.

Progressive Delivery is More Important Than Speed

We could have forced all 300 apps over in one big bang. We would have crashed. Instead, we moved slowly: 1%, then 10%, then 50%, then 100%. Each stage validated everything at that scale. By the time we hit 100%, we'd proven it worked at four successive traffic levels.

The Right Tools Matter

Liquibase, DMS, and CDC weren't fancy or novel. But each one was designed for the specific job it had to do. Using them meant we weren't fighting the tools. We were using them correctly.

Observability Enables Confidence

We deployed monitoring before we deployed applications. This is backwards from how most teams work. But it's what let us move fast. We could see problems in seconds. We could make decisions with real data instead of guesses.

For Your Team

If You're Planning a Migration

Don't assume one approach will work. Test your strategy on non-critical workloads first. Let it fail. Learn from the failure. Then execute on critical workloads. The time you spend failing safely will save you catastrophic failures in production.

If You're Running the Migration

Progressive delivery isn't conservative. It's confident. You're proving each stage works at real scale before moving to the next. Feature flags and observability are your safety net. Use them.

If You're Supporting the Migration

Operations needs visibility. Deploy monitoring first. Every critical path needs dashboards. Every service needs alerts. The moment something deviates from baseline, you need to know. This infrastructure is what makes moving fast safe.

Zero-downtime migrations are possible. But only if you're willing to move slowly, observe constantly, and have good fallback plans. Teams that do this don't just succeed. They do it without drama.