
System Design

| Concept | When to Use | Key Trade-off |
| --- | --- | --- |
| Load balancing | Multiple servers handle the same workload | Complexity vs throughput |
| Caching | Read-heavy, tolerates some staleness | Freshness vs latency |
| Data partitioning | Single DB can't hold all data | Query flexibility vs horizontal scale |
| Replication | Need fault tolerance or read scale | Consistency vs availability |
| Message queues | Decouple producers from consumers | Latency vs reliability |
| CDN | Static assets, geographically spread | Cache invalidation vs load reduction |
| Circuit breaker | Calling unreliable downstream services | Fail-fast vs retry cost |
| CQRS | Read and write patterns differ greatly | Operational complexity vs query performance |
| Consistent hashing | Distributing data across dynamic nodes | Rebalancing cost vs even distribution |
| Rate limiting | Protect services from overload | User experience vs system stability |

| Algorithm | How It Works | Best For |
| --- | --- | --- |
| Round-robin | Rotate through servers sequentially | Homogeneous servers, stateless apps |
| Weighted round-robin | Rotate with proportional allocation | Servers with different capacity |
| Least connections | Send to the server with fewest active connections | Variable request duration |
| IP hash | Hash client IP to pick a server | Session affinity without cookies |
| Consistent hashing | Hash key to a ring of virtual nodes | Caches, stateful partitioning |
| Random | Pick a server at random | Simple, surprisingly effective |
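
As a sketch of weighted round-robin under its simplest interpretation (server names and weights here are illustrative): expand each server by its weight and cycle. Real balancers like nginx interleave more smoothly, but the proportions come out the same.

```python
import itertools

def weighted_round_robin(servers):
    """Naive weighted round-robin: repeat each server `weight` times,
    then cycle. A weight-3 server gets 3x the traffic of a weight-1 one."""
    expanded = [name for name, weight in servers for _ in range(weight)]
    return itertools.cycle(expanded)

picker = weighted_round_robin([("10.0.0.1", 3), ("10.0.0.2", 1)])
first_cycle = [next(picker) for _ in range(4)]  # 3 hits on .1, then 1 on .2
```
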

| Layer | Operates On | Sees | Examples | Use Case |
| --- | --- | --- | --- | --- |
| L4 | TCP/UDP packets | IP, port | HAProxy, NLB | Raw throughput, TLS passthrough |
| L7 | HTTP requests | URL, headers, body | nginx, ALB, Envoy | Content routing, header injection |

```nginx
upstream backend {
    least_conn;                      # algorithm selection
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=1;
    server 10.0.0.3:8080 backup;     # only used when the others are down
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

| Type | Mechanism | Detects |
| --- | --- | --- |
| Passive | Monitor response codes of live traffic | Crashed servers |
| Active | Periodic probe to a health endpoint | Unhealthy but still responding |
| Deep | Check dependencies (DB) | Cascading failures |

| Algorithm | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Token bucket | Tokens added at a fixed rate, consumed per request | Allows bursts, smooth average | Requires atomic operations |
| Leaky bucket | Requests queue and drain at a fixed rate | Strict output rate | Drops bursts |
| Fixed window | Count requests per time window | Simple to implement | Boundary spike problem |
| Sliding window | Rolling count over the last N seconds | Accurate, no boundary spikes | Higher memory cost |
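
A minimal token-bucket sketch (class name and the injected clock are illustrative). The demo uses a fake clock so the behavior is deterministic; in a shared deployment the bucket state needs the atomic operations noted above.

```python
import time

class TokenBucket:
    """Token-bucket limiter: `capacity` bounds the burst size,
    `refill_rate` (tokens/sec) sets the sustained average rate."""

    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # start full: an initial burst is allowed
        self.now = now
        self.last = now()

    def allow(self, cost=1):
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Demo with an injected clock.
clock = [0.0]
bucket = TokenBucket(capacity=2, refill_rate=1.0, now=lambda: clock[0])
burst = [bucket.allow() for _ in range(3)]   # [True, True, False]
clock[0] = 1.0                               # one second passes: one token refills
after_refill = bucket.allow()                # True
```
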

| Strategy | Write Path | Read Path | Staleness Risk | Use Case |
| --- | --- | --- | --- | --- |
| Cache-aside | App writes to DB only | Check cache, miss reads DB | Moderate | General purpose, read-heavy |
| Write-through | App writes to cache and DB | Always read from cache | None | Strong consistency needed |
| Write-behind | App writes to cache, async to DB | Always read from cache | Low | Write-heavy, can lose data |
| Read-through | Cache fetches from DB on miss | Always read from cache | Moderate | Simplify application code |
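
The cache-aside row can be sketched in a few lines. Plain dicts stand in for Redis and a SQL table, and the key names are illustrative; note the write path invalidates rather than updates the cache, which avoids writing stale values under races.

```python
# Cache-aside: the application manages the cache explicitly.
cache, db = {}, {"user:1": {"name": "Ada"}}

def read(key):
    if key in cache:            # 1. check the cache first
        return cache[key]
    value = db.get(key)         # 2. on miss, read the database
    if value is not None:
        cache[key] = value      # 3. populate the cache for next time
    return value

def write(key, value):
    db[key] = value             # writes go to the DB only...
    cache.pop(key, None)        # ...and invalidate the cache entry
```
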
PolicyEvictsBest For
LRULeast recently usedGeneral workloads
LFULeast frequently usedStable hot-set
FIFOOldest entryTime-series, streaming
TTLExpired entriesData with known shelf life
RandomRandom entryUniform access patterns

| Approach | Mechanism | Consistency | Complexity |
| --- | --- | --- | --- |
| TTL expiry | Set expiration on write | Eventual | Low |
| Event-driven purge | Publish invalidation on write | Near-real-time | Medium |
| Version key | Bump version to bypass stale cache | Strong | Medium |
| Write-through | Update cache on every write | Strong | Low |

```
Client ──> CDN Edge ──> Application Cache (Redis) ──> Database
              ~5ms                ~1ms                ~10-50ms
```

Common latency targets: CDN hit < 10ms, Redis < 2ms, DB read < 50ms.

| Model | Guarantee | Latency | Example System |
| --- | --- | --- | --- |
| Strong (linearizable) | Reads see the latest write | High | Spanner, ZooKeeper |
| Sequential | All see the same order, not necessarily latest | Medium | Distributed locks |
| Causal | Causally related writes are ordered | Medium | MongoDB (sessions) |
| Eventual | All replicas converge given time | Low | DynamoDB, Cassandra |
| Read-your-writes | Client sees its own writes | Low | Session-affine caches |

CAP theorem: during a network partition, a system can preserve consistency or availability, but not both:

| Choice | Sacrifices | Behavior During Partition | Systems |
| --- | --- | --- | --- |
| CP | Availability | Reject writes to maintain consistency | ZooKeeper, etcd, HBase |
| AP | Consistency | Accept writes, reconcile later | Cassandra, DynamoDB, CouchDB |
| CA | (No partition tolerance) | Only possible on a single node | Single-node PostgreSQL |

In practice, partitions happen. The real question: when a partition occurs, do you favor consistency or availability?

Beyond CAP, PACELC adds: Else (when there is no partition), trade Latency vs Consistency:

| Classification | Trade-off | Example |
| --- | --- | --- |
| PA/EL | Available during partition, low latency otherwise | DynamoDB (eventual by default) |
| PC/EC | Consistent always | Spanner (synchronous replication) |
| PA/EC | Available during partition, consistent otherwise | MongoDB (default config) |


| Topology | Write Path | Read Path | Consistency | Failover |
| --- | --- | --- | --- | --- |
| Single leader | One primary | Primary + replicas | Strong possible | Promote a replica |
| Multi-leader | Multiple primaries | Any node | Conflict resolution needed | Automatic |
| Leaderless | Any node (quorum) | Any node (quorum) | Tunable (R+W>N) | No failover needed |

Quorum formula: with N replicas, set W (write) + R (read) > N to guarantee overlap. Common config: N=3, W=2, R=2.
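
A quick check of the overlap arithmetic (the function name is mine): the worst-case intersection of any write set of size W and any read set of size R out of N replicas is W + R − N.

```python
def min_overlap(n, w, r):
    """Worst-case number of replicas shared by any write set of size
    `w` and any read set of size `r` out of `n` replicas."""
    return max(0, w + r - n)

# N=3, W=2, R=2: every read quorum intersects every write quorum in at
# least one node, so a read always contacts a replica with the latest write.
overlap = min_overlap(3, 2, 2)
```
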

| Workload | Good Fit | Why |
| --- | --- | --- |
| OLTP, relational data | PostgreSQL, MySQL | ACID, joins, mature tooling |
| High write throughput | Cassandra, ScyllaDB | LSM-tree, horizontal writes |
| Document/flexible schema | MongoDB, CouchDB | Schema-per-document, easy iteration |
| Key-value, sub-ms reads | Redis, DynamoDB | In-memory or SSD-optimized |
| Graph relationships | Neo4j, Amazon Neptune | Traversal queries, no join explosion |
| Time-series | TimescaleDB, InfluxDB | Compression, time-windowed queries |
| Full-text search | Elasticsearch, Meilisearch | Inverted index, ranking |

| Type | Splits By | Example | Trade-off |
| --- | --- | --- | --- |
| Horizontal | Rows (sharding) | Users 1-1M on shard A, 1M+ on shard B | Cross-shard queries costly |
| Vertical | Columns/features | User profile on DB1, orders on DB2 | Joins require network calls |

| Strategy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Range | Split by key range (A-M, N-Z) | Range queries stay local | Hotspots if distribution is skewed |
| Hash | Hash key mod N partitions | Even distribution | Range queries span all shards |
| Consistent hash | Hash to a ring with virtual nodes | Minimal rebalancing on resize | Complex implementation |
| Directory | Lookup table maps key to partition | Flexible placement | Directory becomes a bottleneck |

| Trigger | Strategy | Risk |
| --- | --- | --- |
| Add node | Move proportional slices from existing nodes | Increased network traffic during migration |
| Remove node | Distribute orphaned data to remaining nodes | Temporary hotspots |
| Hotspot detected | Split the hot partition, redistribute | Application-level awareness needed |

| Operation | Time |
| --- | --- |
| L1 cache reference | 1 ns |
| L2 cache reference | 4 ns |
| Main memory reference | 100 ns |
| SSD random read | 16 µs |
| HDD random read | 4 ms |
| Network round trip (same DC) | 500 µs |
| Network round trip (cross-region) | 50-150 ms |
| Mutex lock/unlock | 100 ns |

| Resource | Throughput |
| --- | --- |
| SSD sequential read | 500 MB/s - 3 GB/s |
| HDD sequential read | 100-200 MB/s |
| 1 Gbps network | ~125 MB/s |
| 10 Gbps network | ~1.25 GB/s |
| Postgres simple query | 10,000-50,000 QPS |
| Redis GET | 100,000+ QPS |
| nginx static file | 50,000+ req/s |

```
Users: 10M total, 1M DAU
Writes per user: 5/day
Write throughput: 5M / 86,400 ≈ 58 writes/sec
Read:write ratio: 10:1 → ~580 reads/sec
Object size: 1 KB average
Daily storage: 5M × 1 KB = 5 GB/day
Annual storage: 5 GB × 365 ≈ 1.8 TB/year
With replication (3x): 1.8 TB × 3 = 5.4 TB/year
```
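
The same estimate as executable arithmetic (constants taken from the numbers above):

```python
# Back-of-envelope capacity estimate.
dau = 1_000_000
writes_per_day = dau * 5                          # 5 writes per user per day
write_qps = writes_per_day / 86_400               # ~58 writes/sec
read_qps = write_qps * 10                         # 10:1 read:write ratio
daily_storage_gb = writes_per_day * 1e3 / 1e9     # 1 KB objects -> ~5 GB/day
annual_storage_tb = daily_storage_gb * 365 / 1e3  # ~1.8 TB/year
replicated_tb = annual_storage_tb * 3             # 3x replication -> ~5.4 TB
```
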

| Power | Value | Approx |
| --- | --- | --- |
| 2^10 | 1,024 | 1 thousand |
| 2^20 | 1M | 1 million |
| 2^30 | 1B | 1 billion |
| 2^40 | 1T | 1 trillion |

Useful shortcut: 2^10 = ~10^3. So 2^30 = ~10^9 (1 billion).

| Period | Seconds |
| --- | --- |
| 1 minute | 60 |
| 1 hour | 3,600 |
| 1 day | 86,400 |
| 1 month | ~2.6M |
| 1 year | ~31.5M |

| Guarantee | Behavior | Use Case | Systems |
| --- | --- | --- | --- |
| At-most-once | Send and forget, may lose messages | Metrics, logging | UDP, fire-and-forget |
| At-least-once | Retry until ack, may duplicate | Order processing, events | SQS, RabbitMQ |
| Exactly-once | Dedup at consumer or broker level | Financial transactions | Kafka (idempotent) |
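
Consumer-side deduplication, the usual way to make at-least-once delivery behave like exactly-once, can be sketched as follows (names are illustrative; the in-memory set stands in for a durable store such as a DB table or a Redis set with TTL):

```python
processed = set()  # in production: a durable store, not process memory

def handle_once(message_id, payload, apply_effect):
    """Skip messages whose ID was already processed, turning
    at-least-once delivery into effectively-once processing."""
    if message_id in processed:
        return False               # duplicate delivery: drop it
    apply_effect(payload)
    processed.add(message_id)      # record only after the effect succeeds
    return True

ledger = []
handle_once("msg-1", 100, ledger.append)
handle_once("msg-1", 100, ledger.append)   # redelivery: ignored
```

Note the remaining gap: a crash between applying the effect and recording the ID still reprocesses the message, so the effect itself must be idempotent or the two steps must commit atomically.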

| Model | Consumer Behavior | Pros | Cons |
| --- | --- | --- | --- |
| Push | Broker sends to consumer | Low latency | Consumer can be overwhelmed |
| Pull | Consumer polls for messages | Consumer controls pace | Polling overhead, latency |

| Type | Message Lifetime | Consumers | Example |
| --- | --- | --- | --- |
| Queue | Deleted after consumption | One consumer per message | SQS, RabbitMQ |
| Stream | Retained for a configurable period | Multiple consumers can replay | Kafka, Kinesis |

| Pattern | Problem It Solves | Trade-off |
| --- | --- | --- |
| Circuit breaker | Cascading failures from a failed service | Fail-fast vs potential false positives |
| Bulkhead | One component consuming all resources | Resource isolation vs utilization |
| Saga | Distributed transactions across services | Complexity vs data consistency |
| CQRS | Read/write models need different schemas | Operational cost vs query performance |
| Event sourcing | Audit trail, temporal queries | Storage cost vs complete history |
| Sidecar | Cross-cutting concerns (logging, auth) | Extra process vs code reuse |
| Strangler fig | Incremental migration from a monolith | Dual maintenance vs big-bang risk |
| Backpressure | Producer faster than consumer | Throttled throughput vs data loss |
| Retry with backoff | Transient failures | Recovery time vs thundering herd |
| Idempotency key | Duplicate requests cause double writes | Storage overhead vs data correctness |
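
Retry with backoff and full jitter can be sketched in a few lines (function and parameter names are mine; the jitter draws the delay uniformly from [0, backoff] so synchronized clients do not retry in lockstep):

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry `op` on exception, doubling the delay each attempt (capped
    at `cap` seconds) with full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the failure
            delay = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter

# Demo: an operation that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

delays = []
result = retry_with_backoff(flaky, sleep=delays.append)
```
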

```
   ┌───────────── success ─────────────┐
   v                                   │
CLOSED ──fail──> OPEN ──timeout──> HALF-OPEN
                  ^                    │
                  └────────fail────────┘
```
  • Closed: Requests pass through. Track failure count.
  • Open: Requests fail immediately. Wait for timeout.
  • Half-open: Allow one test request. Success closes; failure reopens.
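
The state machine above can be sketched as a small class (names, threshold, and timeout are illustrative; real implementations add thread safety and rolling failure windows):

```python
import time

class CircuitBreaker:
    """Three-state breaker: closed -> open on repeated failures,
    open -> half-open after a timeout, half-open -> closed on success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.now = now
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, op):
        if self.state == "open":
            if self.now() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"        # timeout elapsed: allow one probe
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"         # probe failed, or threshold reached
                self.opened_at = self.now()
            raise
        self.state = "closed"               # any success closes the breaker
        self.failures = 0
        return result
```
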

Saga Pattern (Choreography vs Orchestration)


| Style | Coordination | Pros | Cons |
| --- | --- | --- | --- |
| Choreography | Services emit events, peers react | Loose coupling, simple | Hard to trace and debug |
| Orchestration | Central coordinator directs steps | Clear flow, easy to reason about | Single point of failure |

```
Command ──> Write Model ──> Event Store
                                 │
                                 └──> Projection ──> Read Model ──> Query
```

Separate the write path (optimized for validation, business rules) from the read path (optimized for queries, denormalized views).

RED Method (Requests: rate, errors, duration)

| Signal | Measures | Example Metric |
| --- | --- | --- |
| Rate | Requests per second | http_requests_total |
| Errors | Failed requests per second | http_errors_total |
| Duration | Time per request | http_request_duration_seconds |

USE Method (Resources: CPU, memory, disk, network)


| Signal | Measures | Example Metric |
| --- | --- | --- |
| Utilization | Percentage of the resource in use | cpu_usage_percent |
| Saturation | Queue depth, waiting work | disk_io_queue_length |
| Errors | Resource-level error events | network_receive_errors_total |

Four Golden Signals

| Signal | What to Measure | Alert When |
| --- | --- | --- |
| Latency | Time to serve requests | p99 exceeds SLO |
| Traffic | Requests per second | Sudden drop or spike |
| Errors | Rate of failed requests | Error rate exceeds threshold |
| Saturation | How full the service is | Approaching resource limits |

| Term | Definition | Example |
| --- | --- | --- |
| SLI | Service Level Indicator (measured metric) | 99.2% of requests < 200ms |
| SLO | Service Level Objective (internal target) | 99.9% of requests < 200ms |
| SLA | Service Level Agreement (customer contract) | Refund if uptime < 99.95% in a month |

| Uptime | Nines | Downtime/Year | Downtime/Month |
| --- | --- | --- | --- |
| 99% | Two nines | 3.65 days | 7.3 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 min |
| 99.99% | Four nines | 52.6 min | 4.38 min |
| 99.999% | Five nines | 5.26 min | 26.3 sec |
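
The downtime column is simple arithmetic on the availability target (function name is mine):

```python
def downtime_per_year_minutes(availability):
    """Minutes of allowed downtime per year for an availability target."""
    return 365 * 24 * 60 * (1 - availability)

three_nines = downtime_per_year_minutes(0.999)   # ~525.6 min (~8.76 hours)
four_nines = downtime_per_year_minutes(0.9999)   # ~52.6 min
```
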

```
Client ─── Gateway ─── Service A ─── Service B ─── Database
   │          │            │             │            │
   └──────────┴────────────┴─────────────┴────────────┘
                     Trace ID: abc-123
               (propagated through all hops)
```

Each service adds a span with timing, metadata, and parent span ID. Tools: Jaeger, Zipkin, OpenTelemetry.

| Anti-pattern | Problem | Fix |
| --- | --- | --- |
| Premature distribution | Added complexity before measuring need | Measure, then scale |
| Shared mutable state | Distributed locking kills throughput | Partition data, use message passing |
| Distributed monolith | Microservices with tight coupling | Define clear contracts, deploy independently |
| No backpressure | Fast producers crash slow consumers | Add queues, rate limit upstream |
| Synchronous chains | A calls B calls C; latency compounds | Use async messaging where possible |
| Ignoring cold start | Empty caches cause thundering herd at deploy | Cache warming, gradual traffic shift |
| Single point of failure | One component takes down the whole system | Redundancy at every layer |
| Optimistic capacity plan | "We'll scale later" with no headroom | Plan for 3-5x peak, test at 2x |