
Scaling to 10M Daily Requests

What we learned taking a consumer platform from hitting its ceiling to handling millions of concurrent users
Published: Sep 15, 2023
Timeline: 4 months
Team: Platform + SRE
Scale: Consumer tech, millions of DAU
Growth is a good problem to have, until it isn't. Our platform was humming along, users were piling in, and then peak hours started feeling tense. Pages slowed down. Errors crept up during surge times. We weren't in crisis mode yet, but we could see the ceiling. The database was fine. The cache was fine. But together, they weren't scaling the way we needed. So we stopped firefighting and started building.

Cache Churn Cascades

Our Redis instance couldn't hold the working set. During peak hours, we'd evict hot data before it could be reused. Miss rates would spike, the database would get hammered, and suddenly everything got slower. The more users we had, the worse it got. It felt like we were fighting the system.

Hot Partitions and Bottlenecks

Certain user segments were hitting the same database nodes repeatedly. The data wasn't distributed evenly. We had capacity on other nodes but couldn't use it. It's a maddening problem because the metrics look fine until they don't, and then everything degrades at once.

P95 latency: 340ms
Error rate at peak: 8.2%
Cache hit rate: 42%
Cost per request: 1.2x

Layered Caching

Instead of betting everything on Redis, we built three tiers. CDN for the edge, regional caches for warm data, and Redis for the hot working set. Each layer is simple and solves one problem.

Hit distribution: CDN 70%, edge 20%, app layer 8%, remaining ~2% fall through to the database
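
A rough sketch of how a read walks those tiers; the cdn_get, edge_get, redis_get, and db_query helpers and the TTLs are illustrative placeholders, not our real clients:

def get_with_fallback(key):
    # Sketch only: all client helpers and TTLs here are placeholders.
    value = cdn_get(key)                 # tier 1: CDN absorbs the bulk of reads
    if value is not None:
        return value
    value = edge_get(key)                # tier 2: regional cache for warm data
    if value is not None:
        cdn_set(key, value, ttl=60)      # backfill the tier above on the way out
        return value
    value = redis_get(key)               # tier 3: Redis holds the hot working set
    if value is not None:
        edge_set(key, value, ttl=300)
        return value
    value = db_query(key)                # last resort: the database
    redis_set(key, value, ttl=900)
    return value

In this sketch each miss backfills only the layer directly above it, so a database fallback doesn't immediately fan out into writes against every tier.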

Safe Retries Through Idempotency

We made every write safe to retry using idempotency keys. This unlocked read scaling across replicas without fear. Reads distributed, writes safe. No more write bottlenecks.

Idempotency keys on all mutations, async write-behind patterns
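
The server-side check is small. A minimal sketch, assuming a shared key-value store for idempotency records; idempotency_store and apply_write stand in for the real components:

def handle_mutation(idempotency_key, payload):
    # Sketch only: idempotency_store and apply_write are placeholders.
    previous = idempotency_store.get(idempotency_key)
    if previous is not None:
        # The client retried: return the recorded result instead of applying the write again
        return previous
    result = apply_write(payload)
    # Keep the result long enough to cover any realistic retry window
    idempotency_store.set(idempotency_key, result, ttl=24 * 3600)
    return result

Clients generate the key once per logical operation and reuse it on every retry; that is what makes retry-on-any-failure a safe default.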

Graceful Degradation

Stop fighting load. We built backpressure tied to SLO burn rates. When the system starts to degrade, we reject requests instead of letting them queue and cascade.

10% burn rate threshold, token bucket rate limiting
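
The admission side is a plain token bucket; here is a minimal sketch (the capacity and refill numbers are placeholders, and the burn-rate controller described later is what moves capacity up and down):

import time

class TokenBucket:
    # Sketch only: capacity and refill_per_sec are illustrative defaults, not production values.
    def __init__(self, capacity=1000, refill_per_sec=500):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # caller answers 429 Too Many Requests instead of queueing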

How We Actually Built It

Redis Cluster with Consistent Hashing

A single Redis instance was at 85% CPU. We deployed a 12-node cluster where each node owns a slice of the keyspace. Rebalancing is cheap: we could add nodes and let the cluster rebalance itself. CPU dropped to 35%. Suddenly we had breathing room for traffic spikes instead of immediate panic.

GET user_123_cart
hash(key) % 12 -> node_5
L1 cache hit: 1ms
L1 miss, Redis hit: 5ms
Both miss, database: 50-200ms
Each layer buys us capacity
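
The modulo-12 routing above is shorthand. The property that actually matters is that adding a node remaps only a small slice of keys, which is what consistent hashing gives you. Below is a toy hash ring with virtual nodes to illustrate the idea; it is a sketch, not how Redis Cluster assigns keys internally (Redis Cluster uses fixed hash slots), and HashRing and its parameters are ours for illustration.

import bisect
import hashlib

def _point(s):
    # Stable 64-bit position on the ring for a node or key
    return int(hashlib.md5(s.encode()).hexdigest()[:16], 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets many virtual points so keys spread evenly
        self.ring = sorted((_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self.positions = [p for p, _ in self.ring]

    def node_for(self, key):
        # Walk clockwise to the first virtual point at or past the key's position
        i = bisect.bisect(self.positions, _point(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing([f"node_{n}" for n in range(12)])
ring.node_for("user_123_cart")    # maps to one of the 12 nodes
# Adding a 13th node remaps only ~1/13 of the keyspace instead of nearly all of it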

Async Write-Behind for Aggregations

Expensive operations like ranking calculations were hitting the database on every write. We inverted it: write to cache immediately, return to the user, then background workers batch updates to the database. Eventual consistency with versioning handles any conflicts that arise. The write is now 5ms instead of 50ms.

POST /api/order
1. Write to cache (blocking)
2. Return 200 OK to client (5ms total)
3. Background worker flushes batch (50ms)
4. Database eventually consistent
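
A stripped-down version of this flow, using an in-process queue for brevity; cache_set and db_bulk_upsert are placeholders for the real clients, and in production the queue would be durable rather than in-memory:

import queue
import threading
import time

write_queue = queue.Queue()

def handle_order(order_id, order):
    cache_set(order_id, order)          # step 1: blocking cache write (placeholder client)
    write_queue.put((order_id, order))
    return "200 OK"                     # step 2: respond before the database is touched

def flush_worker(batch_size=100, max_wait=0.05):
    # Steps 3-4: drain the queue and batch updates to the database
    while True:
        batch = []
        deadline = time.monotonic() + max_wait
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(write_queue.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            db_bulk_upsert(batch)       # placeholder; versioned rows resolve conflicts

threading.Thread(target=flush_worker, daemon=True).start()

The request path pays only for the cache write and the enqueue; the heavier database work happens off to the side in batches.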

SLO Burn Rate Autoscaling

CPU-based autoscaling was too slow. We needed something that looked ahead instead of behind. We tied concurrency limits to SLO burn rates: if we're burning more than 10% of our error budget per hour, reduce the token bucket. Preemptive instead of reactive. The system scales before users notice problems.

# Error rate over the last 5 minutes, relative to the SLO's allowed error rate
burn_rate = (errors_5m / requests_5m) / slo

if burn_rate > 0.10:
    # Burning too fast: shrink the token bucket by 10%;
    # requests over the new limit get 429 Too Many Requests
    bucket_capacity = int(bucket_capacity * 0.90)
elif burn_rate < 0.05:
    # Comfortably inside budget: recover capacity by 5%
    bucket_capacity = int(bucket_capacity * 1.05)

Four Months of Incremental Progress

Month 1: Instrumentation
We built observability. Distributed tracing showed us where latency lived. We created SLO dashboards and identified hot partitions. This month was about understanding the problem before solving it.
Month 2: Caching Layer
Deployed CDN optimizations and built edge cache clusters. We tuned Redis sharding and implemented cache warming strategies. This was the big visibility win.
Month 3: Database Scaling
Made all write APIs idempotent. Deployed read replicas and built async write-behind patterns for heavy aggregations. Suddenly the database wasn't the bottleneck anymore.
Month 4: Resilience
Implemented adaptive concurrency limits tied to burn rates. Added circuit breakers and client retry logic. The system became graceful under load instead of catastrophic.

What Changed

When We Started

P95 latency: 340ms
Peak error rate: 8.2%
Cache hits: 42%
Cost per request: 1.2x
P1 incidents: ~12/mo

Where We Ended Up

P95 latency: 197ms (42% improvement)
Peak error rate: 2.6% (68% improvement)
Cache hits: 87% (+45 percentage points)
Cost per request: 1.02x (15% savings)
P1 incidents: ~1/mo (90% reduction)

What We Actually Learned

Cold Starts Kill Performance

We were chasing database optimization and missed the real problem. When Redis is empty, or right after a restart, every request hits the database and thundering-herd effects cascade immediately. The lesson: predictive cache warming before traffic peaks matters more than faster queries. Preventing a cold cache in the first place beats everything.
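
Conceptually, predictive warming is just a scheduled job that preloads the hottest keys before a predicted peak and right after a restart. A sketch, with hottest_keys and the cache and database clients as placeholders:

def warm_cache(top_n=10_000):
    # Sketch only: hottest_keys() and the cache/db clients are placeholders.
    # Run before predicted traffic peaks and immediately after restarts.
    for key in hottest_keys(top_n):      # e.g. most-read keys from recent access logs
        if redis_get(key) is None:
            redis_set(key, db_query(key), ttl=3600)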

Idempotency Changed Our Thinking

Making writes safe to retry with idempotency keys seemed like a small thing. It wasn't. Retries became free. Transient failures recovered gracefully. We went from rejecting retries to encouraging them. This pattern alone cut cascading failures dramatically.

SLO Burn Signals Are the Right Lever

CPU at 75% means little without context. But "burn rate exceeding 10% per hour" is unambiguous: keep burning at that rate and you will violate your SLO. We switched everything to burn-rate signals. Alert fatigue disappeared. Scaling became smooth instead of thrashing around.

Simplicity at Each Layer Matters

We could have tried to build the perfect database. Instead, we added simple layers that kept the database from becoming the bottleneck. Three independent cache layers. Each one simple and focused. Together, far more resilient than any single optimization would have been.