
Scaling to 10M Daily Requests

What we learned taking a consumer platform from hitting its ceiling to handling millions of concurrent users
Published: Sep 15, 2023
Timeline: 4 months
Team: Platform + SRE
Scale: Consumer tech, millions of DAU
Growth is a good problem to have, until it isn't. Our platform was humming along, users were piling in, and then peak hours started feeling tense. Pages slowed down. Errors crept up during surge times. We weren't in crisis mode yet, but we could see the ceiling. The database was fine. The cache was fine. But together, they weren't scaling the way we needed. So we stopped firefighting and started building.

Cache Churn Cascades

Our Redis instance couldn't hold the working set. During peak hours, we'd evict hot data before it could be reused. Miss rates would spike, the database would get hammered, and suddenly everything got slower. The more users we had, the worse it got. It felt like we were fighting the system.

Hot Partitions and Bottlenecks

Certain user segments were hitting the same database nodes repeatedly. The data wasn't distributed evenly. We had capacity on other nodes but couldn't use it. It's a maddening problem because the metrics look fine until they don't, and then everything degrades at once.

P95 latency: 340ms
Error rate at peak: 8.2%
Cache hit rate: 42%
Cost per request: 1.2x

Layered Caching

Instead of betting everything on Redis, we built three tiers. CDN for the edge, regional caches for warm data, and Redis for the hot working set. Each layer is simple and solves one problem.

Hit distribution: CDN 70%, edge 20%, app layer 8%, remaining ~2% fall through to the database
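
A rough sketch of how a read walks those tiers; the cdn_get, edge_get, redis_get, and db_query helpers and the TTLs are illustrative placeholders, not our real clients:

def get_with_fallback(key):
    # Sketch only: all client helpers and TTLs here are placeholders.
    value = cdn_get(key)                 # tier 1: CDN absorbs the bulk of reads
    if value is not None:
        return value
    value = edge_get(key)                # tier 2: regional cache for warm data
    if value is not None:
        cdn_set(key, value, ttl=60)      # backfill the tier above on the way out
        return value
    value = redis_get(key)               # tier 3: Redis holds the hot working set
    if value is not None:
        edge_set(key, value, ttl=300)
        return value
    value = db_query(key)                # last resort: the database
    redis_set(key, value, ttl=900)
    return value

In this sketch each miss backfills only the layer directly above it, so a database fallback doesn't immediately fan out into writes against every tier.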

Safe Retries Through Idempotency

We made every write safe to retry using idempotency keys. This unlocked read scaling across replicas without fear. Reads distributed, writes safe. No more write bottlenecks.

Idempotency keys on all mutations, async write-behind patterns
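
The server-side check is small. A minimal sketch, assuming a shared key-value store for idempotency records; idempotency_store and apply_write stand in for the real components:

def handle_mutation(idempotency_key, payload):
    # Sketch only: idempotency_store and apply_write are placeholders.
    previous = idempotency_store.get(idempotency_key)
    if previous is not None:
        # The client retried: return the recorded result instead of applying the write again
        return previous
    result = apply_write(payload)
    # Keep the result long enough to cover any realistic retry window
    idempotency_store.set(idempotency_key, result, ttl=24 * 3600)
    return result

Clients generate the key once per logical operation and reuse it on every retry; that is what makes retry-on-any-failure a safe default.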

Graceful Degradation

Stop fighting load. We built backpressure tied to SLO burn rates. When the system starts to degrade, we reject requests instead of letting them queue and cascade.

10% burn rate threshold, token bucket rate limiting
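
The admission side is a plain token bucket; here is a minimal sketch (the capacity and refill numbers are placeholders, and the burn-rate controller described later is what moves capacity up and down):

import time

class TokenBucket:
    # Sketch only: capacity and refill_per_sec are illustrative defaults, not production values.
    def __init__(self, capacity=1000, refill_per_sec=500):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # caller answers 429 Too Many Requests instead of queueing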

How We Actually Built It

Redis Cluster with Consistent Hashing

A single Redis instance was at 85% CPU. We deployed a 12-node cluster where each node owns a slice of the keyspace. Rebalancing is cheap: we could add nodes and let the cluster rebalance itself. CPU dropped to 35%. Suddenly we had breathing room for traffic spikes instead of immediate panic.

GET user_123_cart
hash(key) % 12 -> node_5
L1 cache hit: 1ms
L1 miss, Redis hit: 5ms
Both miss, database: 50-200ms
Each layer buys us capacity
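
The modulo-12 routing above is shorthand. The property that actually matters is that adding a node remaps only a small slice of keys, which is what consistent hashing gives you. Below is a toy hash ring with virtual nodes to illustrate the idea; it is a sketch, not how Redis Cluster assigns keys internally (Redis Cluster uses fixed hash slots), and HashRing and its parameters are ours for illustration.

import bisect
import hashlib

def _point(s):
    # Stable 64-bit position on the ring for a node or key
    return int(hashlib.md5(s.encode()).hexdigest()[:16], 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets many virtual points so keys spread evenly
        self.ring = sorted((_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self.positions = [p for p, _ in self.ring]

    def node_for(self, key):
        # Walk clockwise to the first virtual point at or past the key's position
        i = bisect.bisect(self.positions, _point(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing([f"node_{n}" for n in range(12)])
ring.node_for("user_123_cart")    # maps to one of the 12 nodes
# Adding a 13th node remaps only ~1/13 of the keyspace instead of nearly all of it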

Async Write-Behind for Aggregations

Expensive operations like ranking calculations were hitting the database on every write. We inverted it: write to cache immediately, return to the user, then background workers batch updates to the database. Eventual consistency with versioning handles any conflicts that arise. The write is now 5ms instead of 50ms.

POST /api/order
1. Write to cache (blocking)
2. Return 200 OK to client (5ms total)
3. Background worker flushes batch (50ms)
4. Database eventually consistent
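
A stripped-down version of this flow, using an in-process queue for brevity; cache_set and db_bulk_upsert are placeholders for the real clients, and in production the queue would be durable rather than in-memory:

import queue
import threading
import time

write_queue = queue.Queue()

def handle_order(order_id, order):
    cache_set(order_id, order)          # step 1: blocking cache write (placeholder client)
    write_queue.put((order_id, order))
    return "200 OK"                     # step 2: respond before the database is touched

def flush_worker(batch_size=100, max_wait=0.05):
    # Steps 3-4: drain the queue and batch updates to the database
    while True:
        batch = []
        deadline = time.monotonic() + max_wait
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(write_queue.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            db_bulk_upsert(batch)       # placeholder; versioned rows resolve conflicts

threading.Thread(target=flush_worker, daemon=True).start()

The request path pays only for the cache write and the enqueue; the heavier database work happens off to the side in batches.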

SLO Burn Rate Autoscaling

CPU-based autoscaling was too slow. We needed something that looked ahead instead of behind. We tied concurrency limits to SLO burn rates: if we're burning more than 10% of our error budget per hour, reduce the token bucket. Preemptive instead of reactive. The system scales before users notice problems.

# Error rate over the last 5 minutes, relative to the SLO's allowed error rate
burn_rate = (errors_5m / requests_5m) / slo

if burn_rate > 0.10:
    # Burning too fast: shrink the token bucket by 10%;
    # requests over the new limit get 429 Too Many Requests
    bucket_capacity = int(bucket_capacity * 0.90)
elif burn_rate < 0.05:
    # Comfortably inside budget: recover capacity by 5%
    bucket_capacity = int(bucket_capacity * 1.05)

Four Months of Incremental Progress

Month 1: Instrumentation
We built observability. Distributed tracing showed us where latency lived. We created SLO dashboards and identified hot partitions. This month was about understanding the problem before solving it.
Month 2: Caching Layer
Deployed CDN optimizations and built edge cache clusters. We tuned Redis sharding and implemented cache warming strategies. This was the big visibility win.
Month 3: Database Scaling
Made all write APIs idempotent. Deployed read replicas and built async write-behind patterns for heavy aggregations. Suddenly the database wasn't the bottleneck anymore.
Month 4: Resilience
Implemented adaptive concurrency limits tied to burn rates. Added circuit breakers and client retry logic. The system became graceful under load instead of catastrophic.

What Changed

When We Started

P95 latency: 340ms
Peak error rate: 8.2%
Cache hits: 42%
Cost per request: 1.2x
P1 incidents: ~12/mo

Where We Ended Up

P95 latency: 197ms (42% improvement)
Peak error rate: 2.6% (68% improvement)
Cache hits: 87% (+45 percentage points)
Cost per request: 1.02x (15% savings)
P1 incidents: ~1/mo (90% reduction)

What We Actually Learned

Cold Starts Kill Performance

We were chasing database optimization and missed the real problem. When Redis is empty, or right after a restart, every request hits the database and thundering-herd effects cascade immediately. The lesson: predictive cache warming before traffic peaks matters more than faster queries. Preventing a cold cache in the first place beats everything.
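
Conceptually, predictive warming is just a scheduled job that preloads the hottest keys before a predicted peak and right after a restart. A sketch, with hottest_keys and the cache and database clients as placeholders:

def warm_cache(top_n=10_000):
    # Sketch only: hottest_keys() and the cache/db clients are placeholders.
    # Run before predicted traffic peaks and immediately after restarts.
    for key in hottest_keys(top_n):      # e.g. most-read keys from recent access logs
        if redis_get(key) is None:
            redis_set(key, db_query(key), ttl=3600)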

Idempotency Changed Our Thinking

Making writes safe to retry with idempotency keys seemed like a small thing. It wasn't. Retries became free. Transient failures recovered gracefully. We went from rejecting retries to encouraging them. This pattern alone cut cascading failures dramatically.

SLO Burn Signals Are the Right Lever

CPU at 75% means little without context. But "burn rate exceeding 10% per hour" is unambiguous: keep burning at that rate and you will violate your SLO. We switched everything to burn-rate signals. Alert fatigue disappeared. Scaling became smooth instead of thrashing around.

Simplicity at Each Layer Matters

We could have tried to build the perfect database. Instead, we added simple layers that kept the database from becoming the bottleneck. Three independent cache layers. Each one simple and focused. Together, far more resilient than any single optimization would have been.