Decision ledger

A one-page table for architecture trade-offs.

When the meeting needs a clean answer, use this ledger: pattern, pressure, toll.

| # | Pattern | Pressure | Toll |
| --- | --- | --- | --- |
| 01 | Primary-Replica (Leader-Follower) | Scaling read traffic without sending every query to the write leader | Replication lag can make replica reads stale; synchronous replication reduces lag but slows writes |
| 02 | Sharding (Horizontal Partitioning) | A single database growing beyond one machine's write or storage limits | Cross-shard queries, hot keys, and resharding are operationally painful |
| 03 | Consistent Hashing | Redistributing too much data when nodes join or leave a cluster | More complex than modulo hashing and still needs virtual nodes to balance load |
| 04 | Write-Ahead Log (WAL) | Recovering durable writes after a crash without corrupting storage | Extra write amplification and log compaction/retention work |
| 05 | Event Sourcing | Needing a perfect audit trail and the ability to reconstruct old states | Schema evolution, replay cost, and unbounded event growth require discipline |
| 06 | CQRS (Command Query Responsibility Segregation) | Read needs and write invariants fighting over the same model | Two models, projection lag, and more moving pieces |
| 07 | Cache-Aside (Lazy Loading) | Avoiding repeated database reads for frequently requested data | First request is slow and cache invalidation must be handled carefully |
| 08 | Write-Through | Keeping cache and storage fresh immediately after writes | Writes are slower and unused values may occupy cache |
| 09 | Write-Behind (Write-Back) | Absorbing high write volume with low perceived latency | A crash before flush can lose data and the database intentionally lags |
| 10 | Read-Through | Keeping application code free of cache-loading logic | Cache infrastructure becomes coupled to the data source and query semantics |
| 11 | Cache Stampede Prevention | Preventing one expired hot key from stampeding the database | Locks, jitter, and stale-while-revalidate logic add operational complexity |
| 12 | Request-Response (Synchronous) | Asking another component for an immediate answer | Latency and failures propagate directly through synchronous call chains |
| 13 | Message Queue (Asynchronous) | Decoupling producers from slower or unreliable consumers | Results are delayed and delivery semantics/ordering must be designed |
| 14 | Publish-Subscribe (Pub/Sub) | Letting many services react to the same business event | Duplicates and ordering differences require idempotent subscribers |
| 15 | Event-Driven Architecture | Reducing direct service coupling across a workflow | Debugging and latency become distributed across logs, queues, and consumers |
| 16 | Webhooks | Notifying an external system exactly when something happens | The receiver must be reachable, verify signatures, retry safely, and handle duplicates |
| 17 | Server-Sent Events (SSE) | Pushing server updates to browsers without full-duplex complexity | Server-to-client only, and browser connection limits apply |
| 18 | Bidirectional Streaming (WebSockets / gRPC Streaming) | Supporting continuous two-way real-time interaction | Millions of open connections need specialized routing, backpressure, and reconnect logic |
| 19 | Circuit Breaker | Stopping a failing dependency from exhausting callers | Fallbacks may be degraded and thresholds must fit real failure modes |
| 20 | Retry with Exponential Backoff | Handling transient failures without giving up immediately | Poorly bounded retries amplify outages and increase tail latency |
| 21 | Bulkhead | Keeping one workload or tenant from sinking the whole service | Reserved pools can reduce utilization when traffic is uneven |
| 22 | Timeout | Preventing slow dependencies from tying up resources forever | Too short causes false failures; too long delays recovery |
| 23 | Idempotency | Making retries safe when clients cannot know if a write succeeded | Requires storing keys/results and checking duplicates on the write path |
| 24 | Dead Letter Queue (DLQ) | Stopping poison messages from blocking the main queue forever | DLQs need ownership, alerts, replay tooling, and cleanup |
| 25 | Graceful Degradation | Serving something useful when a noncritical subsystem breaks | Degraded modes must be designed and tested before outages happen |
| 26 | Horizontal Scaling | Handling more stateless request volume by adding machines | Needs load balancing and externalized session/state storage |
| 27 | Vertical Scaling | Getting more capacity quickly from a single-node component | Hard physical ceiling, larger blast radius, and diminishing returns |
| 28 | Load Balancing | Spreading incoming requests across healthy backends | Health checks, uneven workloads, stickiness, and overload handling matter |
| 29 | Auto-Scaling | Matching capacity to variable traffic without manual intervention | Scaling reacts with delay and can hide inefficient code or create cost surprises |
| 30 | Database Connection Pooling | Avoiding expensive database connection setup per request | Too few connections queue requests; too many overload the database |
| 31 | MapReduce | Processing huge datasets that cannot fit on one machine | High latency and operational overhead compared with streaming for real-time needs |
| 32 | Stream Processing | Reacting to data continuously instead of waiting for batch jobs | Ordering, replay, watermarks, and exactly-once semantics are hard |
| 33 | Lambda Architecture | Combining accurate batch views with low-latency real-time views | Duplicated logic and reconciliation complexity |
| 34 | Change Data Capture (CDC) | Letting other systems react to database changes reliably | Schema changes, ordering, replay, and backfills need care |
| 35 | API Gateway | Giving clients one stable doorway into many backend services | A misconfigured gateway can become a bottleneck or single point of failure |
| 36 | Backend for Frontend (BFF) | Serving different client experiences without one bloated API | More API surfaces and potentially duplicated business logic |
| 37 | Rate Limiting | Protecting services from abusive or accidental request floods | Legitimate bursts can be throttled if limits are too blunt |
| 38 | Pagination (Cursor-Based) | Returning large, changing lists without skips or duplicate surprises | Harder than offset paging and requires stable ordering |
| 39 | API Versioning | Evolving an API without breaking existing clients | Old versions create maintenance burden and demand migration planning |
| 40 | CDN (Content Delivery Network) | Serving static content with low latency worldwide | Cache invalidation, stale content, and dynamic personalization boundaries |
| 41 | Reverse Proxy | Putting common web concerns in front of application servers | Incorrect headers/routing can hide client identity or create difficult bugs |
| 42 | Service Mesh | Standardizing service-to-service behavior across many teams | Operational complexity and another layer to debug |
| 43 | Sidecar Pattern | Adding cross-cutting behavior without modifying the main app | Resource overhead and lifecycle coupling with the main service |
| 44 | Two-Phase Commit (2PC) | Committing one transaction atomically across multiple participants | Blocking behavior, coordinator failure modes, and poor fit for long workflows |
| 45 | Saga Pattern | Coordinating distributed work without one global transaction | Compensation is business-specific and the final state is eventually consistent |
| 46 | Quorum | Tuning consistency and availability in replicated systems | Higher quorum counts increase latency and reduce availability during failures |
| 47 | Vector Clocks | Detecting causal ordering without a global clock | Metadata grows with node count and conflicts still need a resolution policy |
| 48 | Health Check Endpoint | Letting infrastructure know whether a service should receive traffic | Shallow checks miss real failures; deep checks can overload dependencies |
| 49 | Distributed Tracing | Seeing where time and failures go across a distributed request | Sampling, context propagation, and cardinality must be managed |
| 50 | Canary Deployment | Reducing deployment risk by exposing new code gradually | Requires traffic splitting, compatible versions, and strong metrics |
| 51 | Outbox Pattern | Reliably publishing events after a database write | Requires a relay, dedupe, monitoring, and cleanup of old outbox rows |
| 52 | Inbox Pattern | Processing incoming messages exactly once from the consumer's perspective | Consumer storage and idempotency logic become part of the contract |
| 53 | Transactional Messaging | Coordinating local state changes with external messages | Eventual consistency and operational repair paths must be explicit |
| 54 | Compensating Transaction | Undoing a multi-step workflow when one step fails | Compensation may be partial, delayed, or business-specific rather than a true undo |
| 55 | Materialized View | Serving expensive read shapes without recomputing them on every request | Views lag source-of-truth data and need rebuild/replay procedures |
| 56 | CQRS Projection | Keeping write models clean while serving many read models | Projection drift, replay cost, and schema evolution need careful operations |
| 57 | Read Repair | Healing stale replicas during normal reads | Reads become slightly more complex and stale data can still leak briefly |
| 58 | Hinted Handoff | Handling writes when a replica is temporarily unavailable | Hint buildup can create recovery storms and needs retention limits |
| 59 | Anti-Entropy Repair | Converging replicas after missed writes or partitions | Repair jobs consume I/O and must be paced to avoid user impact |
| 60 | Leader Election | Choosing one active coordinator without split brain | Clock assumptions, lease expiry, and failover behavior must be designed carefully |
| 61 | Distributed Lock | Serializing access to shared work across nodes | Locks can expire mid-work; correctness needs fencing tokens or idempotency |
| 62 | Lease | Granting temporary ownership without permanent locks | Clock skew and renewal pauses can cause overlapping owners |
| 63 | Fencing Token | Preventing an old owner from writing after a newer owner appears | Every protected resource must validate tokens for the guarantee to hold |
| 64 | Work Queue | Distributing background work across many workers | Requires backpressure, poison-message handling, and idempotent jobs |
| 65 | Priority Queue | Letting urgent work bypass the routine backlog | Low-priority starvation and priority inflation need controls |
| 66 | Fan-Out / Fan-In | Parallelizing many independent subtasks and aggregating results | Tail latency, partial failure, and result ordering become explicit concerns |
| 67 | Scatter-Gather | Querying multiple providers or shards at once | Slow or failed branches need deadlines, fallbacks, and partial-response semantics |
| 68 | Hedged Requests | Reducing tail latency from straggler instances | Extra load can amplify incidents if hedging is not capped |
| 69 | Request Coalescing | Preventing many identical requests from doing duplicate work | The coalescer becomes a hot path and must isolate failures |
| 70 | Singleflight | Collapsing identical work inside one process | Only helps per process unless paired with distributed coordination |
| 71 | Token Bucket | Allowing bursts while enforcing an average rate | Burst size and refill rate must match real capacity |
| 72 | Leaky Bucket | Smoothing bursty input into steady output | Queues add latency and the overflow policy matters |
| 73 | Adaptive Concurrency Limit | Finding safe concurrency without static limits | Bad feedback signals can oscillate or over-throttle |
| 74 | Backpressure | Preventing fast producers from overwhelming slow consumers | Requires a policy for what gets delayed, dropped, or degraded |
| 75 | Load Shedding | Protecting the core service during overload | User-visible errors are intentional; prioritization must be defensible |
| 76 | Brownout | Reducing optional work before the whole service fails | Requires knowing which work is optional and testing degraded paths |
| 77 | Fail-Fast | Avoiding wasted work when success is unlikely | Can be too aggressive without good health signals |
| 78 | Fallback Cache | Serving acceptable stale data when live dependencies fail | Staleness must be visible and bounded |
| 79 | Multi-Region Active-Active | Serving writes and reads from more than one region | Conflict resolution, data residency, and operational complexity rise sharply |
| 80 | Active-Passive Failover | Recovering from a primary-site failure | RPO/RTO depend on replication and rehearsed runbooks |
| 81 | Cell-Based Architecture | Containing blast radius as a platform grows | Capacity balancing and cross-cell operations get harder |
| 82 | Shuffle Sharding | Reducing how many tenants share the same failure domain | Routing and capacity math are more complex |
| 83 | Static Stability | Surviving dependency failure without immediate scaling or coordination | Costs more up front and requires discipline not to rely on emergency scaling |
| 84 | Strangler Fig | Replacing legacy systems incrementally | Routing, data synchronization, and cutover criteria must be explicit |
| 85 | Branch by Abstraction | Changing implementations without long-lived feature branches | The abstraction can leak or become permanent if not retired |
| 86 | Parallel Run | Validating a new system against an old one | Double-running increases cost and comparison logic must handle legitimate differences |
| 87 | Shadow Traffic | Testing a new service with production-shaped input safely | Privacy, side effects, and amplified load must be controlled |
| 88 | Feature Flag | Changing behavior without redeploying | Flag debt and inconsistent states need lifecycle management |
| 89 | Blue-Green Deployment | Cutting over between two complete environments | Requires duplicate capacity and careful database compatibility |
| 90 | Rolling Deployment | Updating a fleet gradually without full downtime | Mixed versions must be compatible during the rollout |
| 91 | Schema Versioning | Changing data contracts without breaking old readers | Old versions and migration states must be actively retired |
| 92 | Expand-Contract Migration | Changing schemas while old and new code overlap | Takes multiple deployments and careful observability |
| 93 | Tombstone | Representing deletes safely in replicated/evented systems | Tombstones consume storage and retention must exceed replication lag |
| 94 | Soft Delete | Allowing recovery and audit after deletion | Queries must consistently filter deleted data; privacy rules may require hard delete |
| 95 | Data Retention Window | Bounding storage, privacy, and replay obligations | Retention must reconcile legal, product, and operational needs |
| 96 | Audit Log | Explaining who changed what and when | Audit data is sensitive and must be protected from tampering |
| 97 | Policy Decision Point / Policy Enforcement Point | Separating authorization decisions from enforcement locations | Latency and availability of policy checks become critical |
| 98 | Secret Rotation | Changing credentials without downtime | Every consumer must be discoverable and rotation must be rehearsed |
| 99 | Envelope Encryption | Protecting data with manageable key rotation | Key hierarchy, access control, and recovery procedures add complexity |
| 100 | Control Plane / Data Plane Split | Keeping management operations separate from request serving | Control-plane outages must not immediately stop stable data-plane traffic |
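A few of the rows above compress well into code. Row 03's toll notes that consistent hashing "still needs virtual nodes to balance load"; this minimal sketch (class and method names are illustrative, MD5 chosen only for a stable hash) shows both the ring and why removal remaps only the departed node's keys:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: each physical node appears `vnodes`
    times on the ring, so adding or removing one node only remaps roughly
    1/N of the keys instead of nearly all of them (as modulo hashing would)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # Any stable, well-distributed hash works; MD5 is used for determinism.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node leaves every other virtual-node position intact, so only keys that mapped to the removed node move.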
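Row 19's toll says thresholds must fit real failure modes; this sketch makes the knobs concrete (all names are illustrative, and the injectable `now` clock exists only so the behavior can be tested without waiting):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, rejects calls while open so the dependency can recover,
    and half-opens after `reset_after` seconds to let one probe through."""

    def __init__(self, threshold=3, reset_after=30.0, now=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.now = now
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if self.now() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")  # fail fast, spare the dependency
            self.opened_at = None  # half-open: allow a single probe call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.now()  # trip (or re-trip after a failed probe)
            raise
        self.failures = 0  # success fully closes the circuit
        return result
```

A real deployment would also emit metrics on state transitions, since the ledger's toll column flags thresholds and fallbacks as the hard part.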
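Row 20 warns that poorly bounded retries amplify outages. One common mitigation is capped, full-jitter exponential backoff; a sketch under those assumptions (the `sleep` parameter is injectable purely for testing, and every name is illustrative):

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry an operation that may fail transiently. Delays grow
    exponentially but are capped, and full jitter (uniform in
    [0, ceiling]) spreads retries out so callers do not synchronize
    into retry storms against a recovering dependency."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```

The `max_attempts` bound is the point of the toll column: without it, retries turn a brief blip into sustained extra load.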
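Row 23's toll, "storing keys/results and checking duplicates on the write path," fits in a few lines. This in-memory sketch stands in for what would be a durable store in practice; the class and key names are hypothetical:

```python
class IdempotentWriter:
    """Idempotency sketch: the first write under a key executes the
    operation and records its result; a retry under the same key replays
    the stored result instead of re-running the side effect."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored result

    def execute(self, idempotency_key, op):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # duplicate: replay, don't re-run
        result = op()
        self._results[idempotency_key] = result
        return result
```

In a real system the key/result store must be durable and shared across instances, and entries need an expiry policy.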
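Row 63's guarantee, that every protected resource must validate tokens, is easiest to see from the resource's side. A sketch assuming a lock service hands out monotonically increasing tokens (names are illustrative):

```python
class FencedStore:
    """Fencing-token sketch: the store accepts a write only if the
    caller's token is at least the highest token it has seen, so a
    paused ex-owner resuming with a stale token cannot clobber the
    newer owner's writes."""

    def __init__(self):
        self.highest = -1  # highest fencing token observed so far
        self.value = None

    def write(self, token, value):
        if token < self.highest:
            return False  # stale owner: reject the write
        self.highest = token
        self.value = value
        return True
```

The check lives in the resource, not the lock service, which is exactly why the toll column says every protected resource must participate.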
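Row 71's toll is that burst size and refill rate must match real capacity; both appear explicitly as `capacity` and `rate` in this sketch (the injectable `now` clock is for testability, and all names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket: tokens refill continuously at `rate` per second up
    to `capacity`. A request is admitted only if enough tokens remain,
    so short bursts up to `capacity` pass while the long-run average
    admission rate is bounded by `rate`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now()

    def allow(self, cost=1.0):
        t = self.now()
        # Lazily refill based on elapsed time, clamped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Setting `capacity` larger than the backend can absorb in one burst, or `rate` above its sustained throughput, defeats the limiter, which is the calibration the toll column warns about.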