A B2B SaaS platform with 300+ enterprise applications running on on-premises Kubernetes infrastructure, backed by SQL Server, needed to move to Azure. The constraint was absolute: zero customer-facing downtime during migration. Enterprise SLAs don't allow for "brief maintenance windows."
We had 6 weeks to do it. We started with a plan that looked good on paper. Then week 3 happened.
Week 3: When the Plan Fell Apart
We migrated the first 15 applications using pure SQL Server replication and a direct cutover strategy. Replication lag was unpredictable: some data was 500 ms behind, some was 8 seconds behind. Our rollback procedure failed halfway through. For 90 minutes, nobody knew if the data was consistent.
That's when we admitted the approach was fundamentally broken. We had to stop the migration cold, manually roll back all 15 apps, and completely rethink the strategy. No customer ever saw any of it; the failures were confined to our test environment. But it was the wake-up call we needed.
The Key Point
Week 3 was a crisis internally. But it was caught before it hit production. This is why we could claim zero customer-facing incidents. We failed fast during testing, pivoted, and then executed flawlessly when it mattered. The crisis taught us exactly what not to do.
The Pivot: Three-Tool Approach
Liquibase for Schema Versioning
Deploy the schema to Azure from version-controlled changelogs. Every change tracked. Easy to see what changed and when. Rollback capability built in.
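This didn't need custom tooling; it was the Liquibase CLI wired into the deployment pipeline. Below is a rough sketch of driving it from Python. The changelog path, JDBC URL, credential handling, and tag name are placeholders, and the exact flag syntax varies a bit between Liquibase versions.

```python
"""Sketch: driving Liquibase from a migration pipeline step.

The changelog path, connection string, credentials, and tag are
illustrative; adjust for your own changelog layout and Azure SQL details.
"""
import subprocess

CHANGELOG = "db/changelog-master.xml"   # hypothetical changelog path
AZURE_JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"


def run_liquibase(*args: str) -> None:
    """Run a Liquibase CLI command and fail loudly if it errors."""
    subprocess.run(["liquibase", *args], check=True)


def deploy_schema(version_tag: str) -> None:
    # Apply every pending changeset from the version-controlled changelog.
    run_liquibase(
        "update",
        f"--changelog-file={CHANGELOG}",
        f"--url={AZURE_JDBC_URL}",
        "--username=migration_user",       # credentials would come from a secret store
        "--password=<from-key-vault>",     # placeholder, never hard-coded
    )
    # Tag the database so a later "liquibase rollback --tag=<tag>" can
    # return the schema to exactly this point.
    run_liquibase(
        "tag",
        f"--tag={version_tag}",
        f"--url={AZURE_JDBC_URL}",
        "--username=migration_user",
        "--password=<from-key-vault>",
    )


if __name__ == "__main__":
    deploy_schema("pre-cutover-week1")
```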
Azure DMS for Initial Data Transfer
Azure's Database Migration Service is purpose-built for cloud migrations. Parallel workers. Optimized for network transfers. Far faster and more reliable than scripting it ourselves or relying on traditional replication.
Change Data Capture (CDC) for Continuous Sync
Once the initial data was transferred, CDC kept the databases in near-real-time sync with sub-second latency. It reads from the transaction logs, so it's lightweight and reliable. We built a Python service that consumed the CDC stream and applied changes to Azure.
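The service itself isn't exotic. Below is a stripped-down sketch of the general shape of such a consumer (not our exact service), assuming SQL Server's built-in CDC table functions and pyodbc; the `dbo_Orders` capture instance, column list, and upsert logic are simplified placeholders, and a production version would persist its LSN watermark, batch writes, and handle retries.

```python
"""Sketch: polling CDC consumer that replays on-prem changes into Azure SQL.

Connection strings, the 'dbo_Orders' capture instance, and the (id, status)
columns are placeholders for illustration.
"""
import time
import pyodbc

SOURCE_DSN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=onprem;DATABASE=app;..."
TARGET_DSN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<srv>.database.windows.net;..."
CAPTURE_INSTANCE = "dbo_Orders"   # hypothetical capture instance


def poll_changes(source, last_lsn):
    """Return rows changed since last_lsn using SQL Server's CDC table functions."""
    cur = source.cursor()
    cur.execute("SELECT sys.fn_cdc_get_max_lsn()")
    max_lsn = cur.fetchone()[0]
    if last_lsn is None:
        cur.execute("SELECT sys.fn_cdc_get_min_lsn(?)", CAPTURE_INSTANCE)
        last_lsn = cur.fetchone()[0]
    # Passing the previous max LSN re-reads the boundary change; the apply below is
    # idempotent, so that is tolerable in a sketch. A real service advances past it
    # with sys.fn_cdc_increment_lsn and persists the watermark.
    cur.execute(
        f"SELECT __$operation, id, status "
        f"FROM cdc.fn_cdc_get_all_changes_{CAPTURE_INSTANCE}(?, ?, N'all')",
        last_lsn, max_lsn)
    return cur.fetchall(), max_lsn


def apply_changes(target, rows):
    """Replay deletes and inserts/updates against the Azure copy (simplified upsert)."""
    cur = target.cursor()
    for op, row_id, status in rows:
        if op == 1:                       # delete
            cur.execute("DELETE FROM dbo.Orders WHERE id = ?", row_id)
        elif op in (2, 4):                # insert, or post-update image
            cur.execute(
                "MERGE dbo.Orders AS t USING (SELECT ? AS id, ? AS status) AS s "
                "ON t.id = s.id "
                "WHEN MATCHED THEN UPDATE SET status = s.status "
                "WHEN NOT MATCHED THEN INSERT (id, status) VALUES (s.id, s.status);",
                row_id, status)
    target.commit()


def run():
    source = pyodbc.connect(SOURCE_DSN)
    target = pyodbc.connect(TARGET_DSN)
    last_lsn = None
    while True:
        rows, last_lsn = poll_changes(source, last_lsn)
        if rows:
            apply_changes(target, rows)
        time.sleep(1)   # poll interval, tuned to the lag you can tolerate


if __name__ == "__main__":
    run()
```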
Execution: The Actual Migration
Phase 1: Infrastructure Setup (Week 1)
Deploy the schema to Azure using Liquibase. Enable CDC on critical tables in the on-premises database. Set up monitoring and observability before touching anything important.
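Enabling CDC is two stored-procedure calls per database and table. A minimal sketch, assuming pyodbc and an illustrative table list (not our real one):

```python
"""Sketch: enabling CDC on the on-prem SQL Server before the bulk copy.

Table names are illustrative; sys.sp_cdc_enable_db and sys.sp_cdc_enable_table
are SQL Server's built-in CDC procedures and need elevated permissions.
"""
import pyodbc

CRITICAL_TABLES = ["Orders", "Invoices", "Tenants"]   # hypothetical list


def enable_cdc(conn_str: str) -> None:
    conn = pyodbc.connect(conn_str, autocommit=True)
    cur = conn.cursor()
    # Enable CDC at the database level, guarded so a re-run doesn't error.
    cur.execute("""
        IF NOT EXISTS (SELECT 1 FROM sys.databases
                       WHERE name = DB_NAME() AND is_cdc_enabled = 1)
            EXEC sys.sp_cdc_enable_db;
    """)
    # Then create a capture instance per critical table.
    for table in CRITICAL_TABLES:
        cur.execute("""
            IF NOT EXISTS (SELECT 1 FROM sys.tables
                           WHERE name = ? AND is_tracked_by_cdc = 1)
                EXEC sys.sp_cdc_enable_table
                     @source_schema = N'dbo',
                     @source_name   = ?,
                     @role_name     = NULL;
        """, table, table)


if __name__ == "__main__":
    enable_cdc("DRIVER={ODBC Driver 18 for SQL Server};"
               "SERVER=onprem;DATABASE=app;Trusted_Connection=yes")
```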
Phase 2: Initial Data Transfer (Week 1-2)
Azure DMS handled the bulk transfer. This took 24-72 hours depending on data volume, and it covered the edge cases that would have cost us days to script ourselves.
Phase 3: Validation (Week 2-3)
Before we shifted any traffic, we validated data consistency hourly:
- Row count comparisons on every table
- Checksum validation on random samples
- Spot checks on recent writes
If anything drifted, we caught it immediately and fixed it before it became a production problem. The checks are sketched below.
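A minimal sketch of those checks, assuming pyodbc, an integer `id` key, and a `modified_at` column on each table (all placeholders for illustration):

```python
"""Sketch: hourly consistency checks between the on-prem and Azure copies.

The table list, key column, sampling predicate, and DSNs are illustrative.
"""
import pyodbc

TABLES = ["dbo.Orders", "dbo.Invoices", "dbo.Tenants"]   # hypothetical


def scalar(conn, sql):
    """Run a query and return its single scalar result."""
    return conn.cursor().execute(sql).fetchone()[0]


def compare(source, target):
    drift = []
    for table in TABLES:
        # 1. Row count comparison on every table.
        count_sql = f"SELECT COUNT_BIG(*) FROM {table}"
        if scalar(source, count_sql) != scalar(target, count_sql):
            drift.append(f"{table}: row count mismatch")

        # 2. Checksum over a deterministic sample, so both sides hash the
        #    same rows (TABLESAMPLE would pick different rows per database).
        sample_sql = (f"SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM {table} "
                      f"WHERE id % 97 = 0")
        if scalar(source, sample_sql) != scalar(target, sample_sql):
            drift.append(f"{table}: sample checksum mismatch")

        # 3. Spot check on recent writes: the newest rows should match.
        recent_sql = (f"SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM "
                      f"(SELECT TOP 100 * FROM {table} ORDER BY modified_at DESC) r")
        if scalar(source, recent_sql) != scalar(target, recent_sql):
            drift.append(f"{table}: recent-write mismatch")
    return drift


if __name__ == "__main__":
    src = pyodbc.connect("DSN=onprem")   # placeholder DSNs
    tgt = pyodbc.connect("DSN=azure")
    for problem in compare(src, tgt):
        print("DRIFT:", problem)         # in practice this would feed an alert, not stdout
```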
Phase 4: Progressive Traffic Shifting (Week 3-5)
- Canary: 1% of traffic for 3 days; error rate stable, latency within baseline
- Early Adopters: 10% for 5 days; customer cohort testing passing
- Early Majority: 50% for 7 days; chaos engineering tests passing
- Full Cutover: 100% immediately; on-prem kept as an instant fallback
Each stage used feature flags to control traffic routing. If something broke at 1%, we rolled back in seconds. If 10% was solid, we moved to 50%. This gradual approach meant we caught problems early with minimal blast radius.
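The routing logic doesn't need to be clever. Here is a minimal sketch of percentage-based routing behind a flag, where `get_rollout_percentage` stands in for whatever flag provider you use (not ours specifically); hashing the tenant ID keeps each customer pinned to the same backend while the percentage ramps through 1%, 10%, 50%, and 100%.

```python
"""Sketch: percentage-based routing behind a feature flag.

`get_rollout_percentage` is a placeholder for the real flag lookup
(LaunchDarkly, Unleash, a config service, etc.).
"""
import hashlib


def get_rollout_percentage(flag_name: str) -> int:
    """Placeholder: would query the flag service for the current rollout."""
    return 10   # e.g., the 'Early Adopters' stage


def bucket(tenant_id: str) -> int:
    """Deterministically map a tenant to a bucket in [0, 100)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(tenant_id: str) -> str:
    """Send the tenant to Azure if its bucket falls under the rollout percentage."""
    if bucket(tenant_id) < get_rollout_percentage("azure-db-cutover"):
        return "azure"
    return "on-prem"


if __name__ == "__main__":
    for t in ("tenant-001", "tenant-042", "tenant-777"):
        print(t, "->", route(t))
```

Rolling back is just setting the percentage back to zero, which is why a broken canary could be pulled in seconds.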
Phase 5: Post-Cutover (Week 6+)
On-premises infrastructure stayed running for 30 more days as a fallback. Database replication continued for 60 days. If something went catastrophically wrong, we could fail back to on-prem in minutes. We didn't actually need to. But having that safety net meant the team could focus on shipping, not panic.
What We Actually Learned
One Tool Can't Do Everything
SQL Server replication was supposed to be our entire migration solution. Asked to do everything, it did none of the jobs well. The moment we split the work into schema versioning, bulk transfer, and continuous sync, everything became simpler and more reliable.
Validation Catches Everything
Hourly row count checks and checksum validation found problems we would have completely missed. Not because we're bad at testing, but because real production data is weird. Edge cases exist. Validation caught them before they mattered.
Feature Flags Are Essential
Progressive traffic shifting with feature flags meant we could ship confidently. If 1% broke, only a few requests were affected. By the time we hit 100%, we'd already validated the system at every intermediate scale.
Test Failures Save Production Incidents
Week 3's failure was painful internally but invaluable. We caught it before production. We learned exactly what was wrong. We fixed it. Then we executed flawlessly because we'd already failed once in a safe place.
Observability First, Always
We deployed monitoring before we deployed apps. When something went wrong during traffic shifting, we saw it in seconds, not hours. The observability infrastructure was the thing that let us move fast safely.
Key Takeaways
Zero-Downtime Migration Isn't About Perfection
It's about building escape routes before you need them. Feature flags, observability, validation, rollback procedures. If any one of those failed, we would have had an incident. Having all of them meant we could move confidently.
Failing in Testing Prevents Failing in Production
Week 3's crisis was the best thing that could have happened to this project. Hitting the failure in a safe environment meant the lessons were already absorbed by the time real customer traffic was on the line.
Progressive Delivery is More Important Than Speed
We could have forced all 300 apps over in one big bang. We would have crashed. Instead, we moved deliberately: 1%, then 10%, then 50%, then 100%. Each stage validated the system at that scale, so by the time we hit 100%, it had already been proven at four successively larger traffic levels.
The Right Tools Matter
Liquibase, DMS, and CDC weren't fancy or novel. But each one was designed for the specific job it had to do. Using them meant we weren't fighting the tools. We were using them correctly.
Observability Enables Confidence
Monitoring went in before the applications did, which is backwards from how most teams work. But it's what let us move fast: we could see problems in seconds and make decisions with real data instead of guesses.
For Your Team
If You're Planning a Migration
Don't assume one approach will work. Test your strategy on non-critical workloads first. Let it fail. Learn from the failure. Then execute on critical workloads. The time you spend failing safely will save you catastrophic failures in production.
If You're Running the Migration
Progressive delivery isn't conservative. It's confident. You're proving each stage works at real scale before moving to the next. Feature flags and observability are your safety net. Use them.
If You're Supporting the Migration
Operations needs visibility. Deploy monitoring first. Every critical path needs dashboards. Every service needs alerts. The moment something deviates from baseline, you need to know. That infrastructure is what makes moving fast safe.
Zero-downtime migrations are possible. But only if you're willing to move slowly, observe constantly, and have good fallback plans. Teams that do this don't just succeed. They do it without drama.
