
Resilience

| Principle | One-liner |
| --- | --- |
| Everything fails | Design for failure, not against it |
| Blast radius | Limit how far a failure can spread |
| Steady-state hypothesis | Define “normal” before breaking things |
| Graceful degradation | Serve partial results over complete failure |
| Fail-fast | Detect failure early, propagate it immediately |
| Defense in depth | No single mechanism prevents all failures |
| Recovery > prevention | Mean time to recovery matters more than time to failure |

Systems fail not because individual components are fragile, but because failures compose: a slow database query exhausts a thread pool, the stalled threads cause health checks to time out, and the failed health checks trigger a cascading restart.

| Type | Description | Example |
| --- | --- | --- |
| Crash failure | Process dies | OOM kill, unhandled exception |
| Omission failure | Component fails to send or receive | Dropped packets, full queue |
| Timing failure | Response arrives outside expected window | GC pause, cold cache after deploy |
| Byzantine failure | Component behaves arbitrarily (including lying) | Corrupted memory, compromised node |
| Gray failure | Degraded but still passing health checks | Slow disk, partial network partition |
| Cascading failure | One failure triggers failures in dependent systems | DB slow → thread exhaustion → timeout |
| Correlated failure | Single cause affects multiple components | AZ outage, shared dependency, bad push |

Gray failures are the most dangerous — the system appears healthy to monitoring while users experience degraded service.

```
DB slow (50ms → 2s)
  → App thread pool fills (waiting for DB)
  → Requests queue at load balancer
  → Health checks time out
  → LB removes "unhealthy" servers
  → Remaining servers receive ALL traffic
  → They overload too
  → Total outage
```

Breaking the cascade: set a timeout at each stage, shed load early, and maintain backpressure.
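The "shed load early" step can be sketched as admission control with a bounded queue: reject work at the front door instead of letting it pile up and hold threads upstream (a minimal sketch; the capacity and return values are illustrative).

```python
import queue

# Bounded work queue: when full, reject new requests immediately
# (load shedding) rather than queuing them indefinitely.
# maxsize=100 is an illustrative capacity, not a recommendation.
work = queue.Queue(maxsize=100)

def submit(request):
    try:
        work.put_nowait(request)  # fail fast when at capacity
        return "accepted"
    except queue.Full:
        return "rejected"         # shed early; caller can back off and retry
```

A fast "rejected" response preserves backpressure: callers learn the system is saturated in milliseconds instead of discovering it via timeouts.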

| Pattern | Protects Against | Mechanism | Trade-off |
| --- | --- | --- | --- |
| Circuit breaker | Cascading failure | Stop calling failed dependency | Fail-fast vs missed recovery |
| Bulkhead | Resource exhaustion | Isolate resource pools per service | Utilization vs isolation |
| Timeout | Unbounded waits | Fail after deadline | Responsiveness vs false failure |
| Retry + backoff | Transient failures | Retry with exponential delay + jitter | Recovery vs thundering herd |
| Retry budget | Retry storms | Cap total retries across all callers | Recovery vs amplification |
| Rate limiting | Overload | Reject requests beyond threshold | Stability vs user experience |
| Load shedding | Overload at capacity | Drop low-priority work first | Availability vs completeness |
| Fallback | Dependency failure | Return cached/default response | Freshness vs availability |
| Idempotency | Duplicate requests | Same input → same effect | Storage vs correctness |
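The circuit breaker in the first row can be sketched in a few lines: open after N consecutive failures, fail fast while open, and allow a single probe after a cooldown. This is a minimal sketch with illustrative thresholds; production libraries also track rolling error rates and limit half-open probe counts.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after max_failures consecutive
    failures, fails fast while open, half-opens after reset_after seconds."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

The trade-off from the table is visible in `reset_after`: a long cooldown misses early recovery, a short one hammers a still-broken dependency.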
```
Read timeout    = p99 of dependency + buffer
Connect timeout = much shorter (100-500ms)
Total timeout   = end-user patience minus upstream hops
```

Rule of thumb:

- Internal service call: 1-5s
- Database query: 5-30s
- External API: 5-10s
- User-facing request: 200ms-2s total budget

Set timeouts at every network boundary. A missing timeout is an unbounded wait, and it will surface at the worst possible moment.
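At the socket level, connect and read deadlines are separate settings. A sketch with Python's standard library (the host, port, and timeout values are placeholders; note that `settimeout` bounds each `recv` call, not the request as a whole, so a true total deadline still needs a wall-clock check):

```python
import socket

def fetch(host, port, request: bytes,
          connect_timeout=0.5, read_timeout=2.0) -> bytes:
    # Connect timeout: short, since a TCP connect should be fast.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: roughly p99 of the dependency plus a buffer.
        # This bounds each recv() call, not the whole exchange.
        sock.settimeout(read_timeout)
        sock.sendall(request)
        return sock.recv(4096)
    finally:
        sock.close()
```

Higher-level clients (HTTP libraries, database drivers) expose the same two knobs under different names; the discipline is the same: never leave either unset.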

| Strategy | Formula | Use Case |
| --- | --- | --- |
| Fixed delay | Wait N seconds between retries | Simple, predictable |
| Exponential backoff | Wait 2^attempt seconds | Standard for most services |
| Exponential + jitter | 2^attempt * random(0.5, 1.5) | Prevents thundering herd |
| Retry budget | Max N% of requests can be retries | Prevents retry storms |

Never retry non-idempotent operations without an idempotency key. Never retry 4xx errors (client bug, not transient). Always set a maximum retry count.
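These rules can be sketched as a small retry wrapper. This is a sketch using full jitter; `TransientError` is a stand-in for whatever your client classifies as retryable (timeouts, 5xx responses), and anything else, such as a 4xx, propagates immediately without a retry.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for errors your client marks as retryable (5xx, timeouts)."""

def retry(fn, max_attempts=4, base=0.1, cap=5.0):
    """Retry with exponential backoff and full jitter.
    Only wrap idempotent operations (or send an idempotency key)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # maximum retry count reached: propagate
            delay = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter spreads retries
```

Because only `TransientError` is caught, a non-retryable error (the 4xx case above) escapes on the first attempt.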

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production.

1. Define steady state — what does "working" look like? (Error rate < 0.1%, p99 < 200ms, orders processing)
2. Hypothesize — "If we kill one Redis node, the system continues serving from the replica within 5 seconds"
3. Inject failure — kill the node, inject latency, drop packets
4. Observe — did the system maintain steady state?
5. Learn — if not, fix and re-test. If yes, increase blast radius.
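The loop above can be turned into a runnable harness. Everything concrete here is an assumption for illustration: a service under test exposing a `/health` endpoint at `localhost:8080`, a supervisor that restarts the killed process, and a five-second recovery budget.

```python
import subprocess
import time
import urllib.request

def steady_state(url="http://localhost:8080/health", timeout=1.0) -> bool:
    """Step 1: a concrete, checkable definition of 'working'."""
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:  # connection refused, timeout, HTTP error
        return False

def experiment(pid: int, recovery_budget: float = 5.0) -> bool:
    """Steps 2-4: hypothesize, inject, observe."""
    assert steady_state(), "system must be healthy before injecting failure"
    subprocess.run(["kill", "-9", str(pid)], check=False)  # inject the crash
    deadline = time.monotonic() + recovery_budget
    while time.monotonic() < deadline:
        if steady_state():
            return True   # hypothesis held: recovered within budget
        time.sleep(0.5)
    return False          # step 5: fix, then re-test
```

Real tooling (Chaos Mesh, Litmus, Toxiproxy) generalizes this shape: assert steady state, inject, watch the same metrics you alert on.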

| Technique | What It Tests | Tools |
| --- | --- | --- |
| Kill process/container | Crash recovery, restart policies | `kill -9`, Docker stop |
| Network latency | Timeout handling, circuit breakers | `tc netem`, Toxiproxy |
| Packet loss/corruption | Retry logic, error handling | `tc netem`, Chaos Mesh |
| DNS failure | Fallback resolution, caching | Block DNS, `/etc/hosts` |
| Disk full | Graceful degradation, alerting | `fallocate`, Litmus |
| Clock skew | Lease expiry, token validation | `faketime`, Chaos Mesh |
| AZ/region failure | Multi-AZ failover | Disable AZ routing |
| CPU/memory pressure | Autoscaling, OOM handling | `stress-ng`, Chaos Monkey |
| Dependency slowdown | Timeout + fallback behavior | Toxiproxy, Envoy fault injection |

Start small and expand:

| Level | Scope | Example |
| --- | --- | --- |
| 1. Dev | Local environment | Kill a container in Docker Compose |
| 2. Staging | Non-production | Inject latency into staging load balancer |
| 3. Canary | Small % of prod | Fail 1% of requests to a single service |
| 4. Prod | Full production | Kill an AZ during business hours |

Most teams never need level 4. Levels 1-3 catch the majority of resilience gaps.

A game day is a scheduled chaos experiment with the full team:

  1. Announce — everyone knows it is happening, no surprise outages
  2. Define scope — which systems, what failures, what is off-limits
  3. Run the experiment — inject failure, observe dashboards
  4. Respond as in production — oncall triages, team communicates
  5. Debrief — what broke, what held, what do we fix

Game days test not just the system but the team’s incident response.

Resilience requires visibility. You cannot recover from what you cannot detect.

| Signal | Detects | Latency |
| --- | --- | --- |
| Synthetic probes | Endpoint down, degraded response | Seconds |
| Error rate spike | Bug, dependency failure | 1-5 minutes |
| Latency shift | Gray failure, resource pressure | 1-5 minutes |
| Saturation alert | Approaching capacity limits | Minutes |
| Log anomaly | Unusual patterns, new error types | Varies |
| Customer reports | Anything monitoring missed | Hours (too late) |

Alert on symptoms (error rate, latency), not causes (CPU, disk). Cause-based alerts generate noise; symptom-based alerts catch problems regardless of root cause.

Define an error budget from your SLO, then alert when you are burning through it too fast:

```
SLO: 99.9% availability (43.8 min/month error budget)

Burn rate alert thresholds:
- 14.4x burn rate over 1h  → page   (2% of budget in 1 hour; gone in ~2 days)
- 6x burn rate over 6h     → page   (5% of budget in 6 hours; gone in ~5 days)
- 1x burn rate over 3 days → ticket (on track to miss SLO)
```
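The arithmetic behind those thresholds, as a sketch (assuming a 730-hour window, roughly one month):

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO  # 0.1% of requests may fail
WINDOW_HOURS = 730      # ~30.4 days

def burn_rate(errors: int, requests: int) -> float:
    """How many times faster than 'exactly on budget' we are failing."""
    return (errors / requests) / ERROR_BUDGET

def hours_to_exhaustion(rate: float) -> float:
    """At a sustained burn rate, when the whole month's budget is gone."""
    return WINDOW_HOURS / rate

# e.g. 1.44% of requests failing is a 14.4x burn rate,
# which exhausts the monthly budget in about 50 hours.
```

A burn rate of exactly 1 exhausts the budget precisely at the end of the window, which is why a sustained 1x over days is a ticket, not a page.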

| Anti-pattern | Problem | Fix |
| --- | --- | --- |
| Retry without backoff | Amplifies load during failures | Exponential backoff with jitter |
| Retry without budget | Every caller retries, creating a retry storm | Cap retries as % of total traffic |
| No timeout | Thread/connection held indefinitely | Timeout at every network boundary |
| Health check that lies | Returns 200 while dependencies are down | Deep health checks that test critical paths |
| Chaos without steady state | Breaking things without knowing what “good” is | Define a metrics baseline before injecting chaos |
| Testing only in staging | Staging never matches production topology | Graduate chaos from staging to canary to prod |
| Ignoring gray failures | System “up” but degraded for a subset of users | Synthetic probes, percentile-based alerting |
| Single point of failure | One component takes everything down | Redundancy at every layer, test failover |
| Manual failover | Depends on human speed at 3 a.m. | Automate failover, test it with chaos |
  • System Design — Circuit breaker states, load balancing, caching patterns
  • Debugging — Scientific method applied to failures
  • Specification — Defining what “correct” means before testing resilience
  • Orchestration — Failure modes in multi-agent and distributed coordination
  • Testing — Testing pyramid, integration tests that exercise failure paths
  • AI Safety — Defense in depth and failure modes applied to LLM-powered systems