Decision ledger

A one-page table for architecture trade-offs.

When the meeting needs a clean answer, use this ledger: pattern, pressure, toll.

| # | Pattern | Pressure | Toll |
| --- | --- | --- | --- |
| 01 | Primary-Replica (Leader-Follower) | Scaling read traffic without sending every query to the write leader | Replication lag can make replica reads stale; synchronous replication reduces lag but slows writes |
| 02 | Sharding (Horizontal Partitioning) | A single database growing beyond one machine's write or storage limits | Cross-shard queries, hot keys, and resharding are operationally painful |
| 03 | Consistent Hashing | Redistributing too much data when nodes join or leave a cluster | More complex than modulo hashing and still needs virtual nodes to balance load |
| 04 | Write-Ahead Log (WAL) | Recovering durable writes after a crash without corrupting storage | Extra write amplification and log compaction/retention work |
| 05 | Event Sourcing | Needing a perfect audit trail and the ability to reconstruct old states | Schema evolution, replay cost, and unbounded event growth require discipline |
| 06 | CQRS (Command Query Responsibility Segregation) | Read needs and write invariants fighting over the same model | Two models, projection lag, and more moving pieces |
| 07 | Cache-Aside (Lazy Loading) | Avoiding repeated database reads for frequently requested data | First request is slow and cache invalidation must be handled carefully |
| 08 | Write-Through | Keeping cache and storage fresh immediately after writes | Writes are slower and unused values may occupy cache |
| 09 | Write-Behind (Write-Back) | Absorbing high write volume with low perceived latency | A crash before flush can lose data and the database intentionally lags |
| 10 | Read-Through | Keeping application code free of cache-loading logic | Cache infrastructure becomes coupled to the data source and query semantics |
| 11 | Cache Stampede Prevention | Preventing one expired hot key from stampeding the database | Locks, jitter, and stale-while-revalidate logic add operational complexity |
| 12 | Request-Response (Synchronous) | Asking another component for an immediate answer | Latency and failures propagate directly through synchronous call chains |
| 13 | Message Queue (Asynchronous) | Decoupling producers from slower or unreliable consumers | Results are delayed and delivery semantics/ordering must be designed |
| 14 | Publish-Subscribe (Pub/Sub) | Letting many services react to the same business event | Duplicates and ordering differences require idempotent subscribers |
| 15 | Event-Driven Architecture | Reducing direct service coupling across a workflow | Debugging and latency become distributed across logs, queues, and consumers |
| 16 | Webhooks | Notifying an external system exactly when something happens | The receiver must be reachable, verify signatures, retry safely, and handle duplicates |
| 17 | Server-Sent Events (SSE) | Pushing server updates to browsers without full-duplex complexity | Server-to-client only, and browser connection limits apply |
| 18 | Bidirectional Streaming (WebSockets / gRPC Streaming) | Supporting continuous two-way real-time interaction | Millions of open connections need specialized routing, backpressure, and reconnect logic |
| 19 | Circuit Breaker | Stopping a failing dependency from exhausting callers | Fallbacks may be degraded and thresholds must fit real failure modes |
| 20 | Retry with Exponential Backoff | Handling transient failures without giving up immediately | Poorly bounded retries amplify outages and increase tail latency |
| 21 | Bulkhead | Keeping one workload or tenant from sinking the whole service | Reserved pools can reduce utilization when traffic is uneven |
| 22 | Timeout | Preventing slow dependencies from tying up resources forever | Too short causes false failures; too long delays recovery |
| 23 | Idempotency | Making retries safe when clients cannot know if a write succeeded | Requires storing keys/results and checking duplicates on the write path |
| 24 | Dead Letter Queue (DLQ) | Stopping poison messages from blocking the main queue forever | DLQs need ownership, alerts, replay tooling, and cleanup |
| 25 | Graceful Degradation | Serving something useful when a noncritical subsystem breaks | Degraded modes must be designed and tested before outages happen |
| 26 | Horizontal Scaling | Handling more stateless request volume by adding machines | Needs load balancing and externalized session/state storage |
| 27 | Vertical Scaling | Getting more capacity quickly from a single-node component | Hard physical ceiling, larger blast radius, and diminishing returns |
| 28 | Load Balancing | Spreading incoming requests across healthy backends | Health checks, uneven workloads, stickiness, and overload handling matter |
| 29 | Auto-Scaling | Matching capacity to variable traffic without manual intervention | Scaling reacts with delay and can hide inefficient code or create cost surprises |
| 30 | Database Connection Pooling | Avoiding expensive database connection setup per request | Too few connections queue requests; too many overload the database |
| 31 | MapReduce | Processing huge datasets that cannot fit on one machine | High latency and operational overhead compared with streaming for real-time needs |
| 32 | Stream Processing | Reacting to data continuously instead of waiting for batch jobs | Ordering, replay, watermarks, and exactly-once semantics are hard |
| 33 | Lambda Architecture | Combining accurate batch views with low-latency real-time views | Duplicated logic and reconciliation complexity |
| 34 | Change Data Capture (CDC) | Letting other systems react to database changes reliably | Schema changes, ordering, replay, and backfills need care |
| 35 | API Gateway | Giving clients one stable doorway into many backend services | A misconfigured gateway can become a bottleneck or single point of failure |
| 36 | Backend for Frontend (BFF) | Serving different client experiences without one bloated API | More API surfaces and potentially duplicated business logic |
| 37 | Rate Limiting | Protecting services from abusive or accidental request floods | Legitimate bursts can be throttled if limits are too blunt |
| 38 | Pagination (Cursor-Based) | Returning large, changing lists without skips or duplicate surprises | Harder than offset paging and requires stable ordering |
| 39 | API Versioning | Evolving an API without breaking existing clients | Old versions create maintenance burden and demand migration planning |
| 40 | CDN (Content Delivery Network) | Serving static content with low latency worldwide | Cache invalidation, stale content, and dynamic personalization boundaries |
| 41 | Reverse Proxy | Putting common web concerns in front of application servers | Incorrect headers/routing can hide client identity or create difficult bugs |
| 42 | Service Mesh | Standardizing service-to-service behavior across many teams | Operational complexity and another layer to debug |
| 43 | Sidecar Pattern | Adding cross-cutting behavior without modifying the main app | Resource overhead and lifecycle coupling with the main service |
| 44 | Two-Phase Commit (2PC) | Committing one transaction atomically across multiple participants | Blocking behavior, coordinator failure modes, and poor fit for long workflows |
| 45 | Saga Pattern | Coordinating distributed work without one global transaction | Compensation is business-specific and the final state is eventually consistent |
| 46 | Quorum | Tuning consistency and availability in replicated systems | Higher quorum counts increase latency and reduce availability during failures |
| 47 | Vector Clocks | Detecting causal ordering without a global clock | Metadata grows with node count and conflicts still need a resolution policy |
| 48 | Health Check Endpoint | Letting infrastructure know whether a service should receive traffic | Shallow checks miss real failures; deep checks can overload dependencies |
| 49 | Distributed Tracing | Seeing where time and failures go across a distributed request | Sampling, context propagation, and cardinality must be managed |
| 50 | Canary Deployment | Reducing deployment risk by exposing new code gradually | Requires traffic splitting, compatible versions, and strong metrics |
| 51 | Outbox Pattern | Reliably publishing events after a database write | Requires a relay, dedupe, monitoring, and cleanup of old outbox rows |
| 52 | Inbox Pattern | Processing incoming messages exactly once from the consumer's perspective | Consumer storage and idempotency logic become part of the contract |
| 53 | Transactional Messaging | Coordinating local state changes with external messages | Eventual consistency and operational repair paths must be explicit |
| 54 | Compensating Transaction | Undoing a multi-step workflow when one step fails | Compensation may be partial, delayed, or business-specific rather than a true undo |
| 55 | Materialized View | Serving expensive read shapes without recomputing them on every request | Views lag source-of-truth data and need rebuild/replay procedures |
| 56 | CQRS Projection | Keeping write models clean while serving many read models | Projection drift, replay cost, and schema evolution need careful operations |
| 57 | Read Repair | Healing stale replicas during normal reads | Reads become slightly more complex and stale data can still leak briefly |
| 58 | Hinted Handoff | Handling writes when a replica is temporarily unavailable | Hint buildup can create recovery storms and needs retention limits |
| 59 | Anti-Entropy Repair | Converging replicas after missed writes or partitions | Repair jobs consume I/O and must be paced to avoid user impact |
| 60 | Leader Election | Choosing one active coordinator without split brain | Clock assumptions, lease expiry, and failover behavior must be designed carefully |
| 61 | Distributed Lock | Serializing access to shared work across nodes | Locks can expire mid-work; correctness needs fencing tokens or idempotency |
| 62 | Lease | Granting temporary ownership without permanent locks | Clock skew and renewal pauses can cause overlapping owners |
| 63 | Fencing Token | Preventing an old owner from writing after a newer owner appears | Every protected resource must validate tokens for the guarantee to hold |
| 64 | Work Queue | Distributing background work across many workers | Requires backpressure, poison-message handling, and idempotent jobs |
| 65 | Priority Queue | Letting urgent work bypass the routine backlog | Low-priority starvation and priority inflation need controls |
| 66 | Fan-Out / Fan-In | Parallelizing many independent subtasks and aggregating results | Tail latency, partial failure, and result ordering become explicit concerns |
| 67 | Scatter-Gather | Querying multiple providers or shards at once | Slow or failed branches need deadlines, fallbacks, and partial-response semantics |
| 68 | Hedged Requests | Reducing tail latency from straggler instances | Extra load can amplify incidents if hedging is not capped |
| 69 | Request Coalescing | Preventing many identical requests from doing duplicate work | The coalescer becomes a hot path and must isolate failures |
| 70 | Singleflight | Collapsing identical work inside one process | Only helps per process unless paired with distributed coordination |
| 71 | Token Bucket | Allowing bursts while enforcing an average rate | Burst size and refill rate must match real capacity |
| 72 | Leaky Bucket | Smoothing bursty input into steady output | Queues add latency and the overflow policy matters |
| 73 | Adaptive Concurrency Limit | Finding safe concurrency without static limits | Bad feedback signals can oscillate or over-throttle |
| 74 | Backpressure | Preventing fast producers from overwhelming slow consumers | Requires a policy for what gets delayed, dropped, or degraded |
| 75 | Load Shedding | Protecting the core service during overload | User-visible errors are intentional; prioritization must be defensible |
| 76 | Brownout | Reducing optional work before the whole service fails | Requires knowing which work is optional and testing degraded paths |
| 77 | Fail-Fast | Avoiding wasted work when success is unlikely | Can be too aggressive without good health signals |
| 78 | Fallback Cache | Serving acceptable stale data when live dependencies fail | Staleness must be visible and bounded |
| 79 | Multi-Region Active-Active | Serving writes and reads from more than one region | Conflict resolution, data residency, and operational complexity rise sharply |
| 80 | Active-Passive Failover | Recovering from a primary-site failure | RPO/RTO depend on replication and rehearsed runbooks |
| 81 | Cell-Based Architecture | Containing blast radius as a platform grows | Capacity balancing and cross-cell operations get harder |
| 82 | Shuffle Sharding | Reducing how many tenants share the same failure domain | Routing and capacity math are more complex |
| 83 | Static Stability | Surviving dependency failure without immediate scaling or coordination | Costs more up front and requires discipline not to rely on emergency scaling |
| 84 | Strangler Fig | Replacing legacy systems incrementally | Routing, data synchronization, and cutover criteria must be explicit |
| 85 | Branch by Abstraction | Changing implementations without long-lived feature branches | The abstraction can leak or become permanent if not retired |
| 86 | Parallel Run | Validating a new system against an old one | Double-running increases cost and comparison logic must handle legitimate differences |
| 87 | Shadow Traffic | Testing a new service with production-shaped input safely | Privacy, side effects, and amplified load must be controlled |
| 88 | Feature Flag | Changing behavior without redeploying | Flag debt and inconsistent states need lifecycle management |
| 89 | Blue-Green Deployment | Cutting over between two complete environments | Requires duplicate capacity and careful database compatibility |
| 90 | Rolling Deployment | Updating a fleet gradually without full downtime | Mixed versions must be compatible during the rollout |
| 91 | Schema Versioning | Changing data contracts without breaking old readers | Old versions and migration states must be actively retired |
| 92 | Expand-Contract Migration | Changing schemas while old and new code overlap | Takes multiple deployments and careful observability |
| 93 | Tombstone | Representing deletes safely in replicated/evented systems | Tombstones consume storage and retention must exceed replication lag |
| 94 | Soft Delete | Allowing recovery and audit after deletion | Queries must consistently filter deleted data; privacy rules may require hard delete |
| 95 | Data Retention Window | Bounding storage, privacy, and replay obligations | Retention must reconcile legal, product, and operational needs |
| 96 | Audit Log | Explaining who changed what and when | Audit data is sensitive and must be protected from tampering |
| 97 | Policy Decision Point / Policy Enforcement Point | Separating authorization decisions from enforcement locations | Latency and availability of policy checks become critical |
| 98 | Secret Rotation | Changing credentials without downtime | Every consumer must be discoverable and rotation must be rehearsed |
| 99 | Envelope Encryption | Protecting data with manageable key rotation | Key hierarchy, access control, and recovery procedures add complexity |
| 100 | Control Plane / Data Plane Split | Keeping management operations separate from request serving | Control-plane outages must not immediately stop stable data-plane traffic |
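A few of the rows above compress well into code. Row 03's toll notes that consistent hashing "still needs virtual nodes to balance load"; this minimal sketch (class and method names are illustrative, MD5 chosen only for a stable hash) shows both the ring and why removal remaps only the departed node's keys:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: each physical node appears `vnodes`
    times on the ring, so adding or removing one node only remaps roughly
    1/N of the keys instead of nearly all of them (as modulo hashing would)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # Any stable, well-distributed hash works; MD5 is used for determinism.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node leaves every other virtual-node position intact, so only keys that mapped to the removed node move.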
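Row 19's toll says thresholds must fit real failure modes; this sketch makes the knobs concrete (all names are illustrative, and the injectable `now` clock exists only so the behavior can be tested without waiting):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, rejects calls while open so the dependency can recover,
    and half-opens after `reset_after` seconds to let one probe through."""

    def __init__(self, threshold=3, reset_after=30.0, now=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.now = now
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if self.now() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")  # fail fast, spare the dependency
            self.opened_at = None  # half-open: allow a single probe call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.now()  # trip (or re-trip after a failed probe)
            raise
        self.failures = 0  # success fully closes the circuit
        return result
```

A real deployment would also emit metrics on state transitions, since the ledger's toll column flags thresholds and fallbacks as the hard part.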
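Row 20 warns that poorly bounded retries amplify outages. One common mitigation is capped, full-jitter exponential backoff; a sketch under those assumptions (the `sleep` parameter is injectable purely for testing, and every name is illustrative):

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry an operation that may fail transiently. Delays grow
    exponentially but are capped, and full jitter (uniform in
    [0, ceiling]) spreads retries out so callers do not synchronize
    into retry storms against a recovering dependency."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```

The `max_attempts` bound is the point of the toll column: without it, retries turn a brief blip into sustained extra load.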
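Row 23's toll, "storing keys/results and checking duplicates on the write path," fits in a few lines. This in-memory sketch stands in for what would be a durable store in practice; the class and key names are hypothetical:

```python
class IdempotentWriter:
    """Idempotency sketch: the first write under a key executes the
    operation and records its result; a retry under the same key replays
    the stored result instead of re-running the side effect."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored result

    def execute(self, idempotency_key, op):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # duplicate: replay, don't re-run
        result = op()
        self._results[idempotency_key] = result
        return result
```

In a real system the key/result store must be durable and shared across instances, and entries need an expiry policy.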
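Row 63's guarantee, that every protected resource must validate tokens, is easiest to see from the resource's side. A sketch assuming a lock service hands out monotonically increasing tokens (names are illustrative):

```python
class FencedStore:
    """Fencing-token sketch: the store accepts a write only if the
    caller's token is at least the highest token it has seen, so a
    paused ex-owner resuming with a stale token cannot clobber the
    newer owner's writes."""

    def __init__(self):
        self.highest = -1  # highest fencing token observed so far
        self.value = None

    def write(self, token, value):
        if token < self.highest:
            return False  # stale owner: reject the write
        self.highest = token
        self.value = value
        return True
```

The check lives in the resource, not the lock service, which is exactly why the toll column says every protected resource must participate.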
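Row 71's toll is that burst size and refill rate must match real capacity; both appear explicitly as `capacity` and `rate` in this sketch (the injectable `now` clock is for testability, and all names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket: tokens refill continuously at `rate` per second up
    to `capacity`. A request is admitted only if enough tokens remain,
    so short bursts up to `capacity` pass while the long-run average
    admission rate is bounded by `rate`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now()

    def allow(self, cost=1.0):
        t = self.now()
        # Lazily refill based on elapsed time, clamped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Setting `capacity` larger than the backend can absorb in one burst, or `rate` above its sustained throughput, defeats the limiter, which is the calibration the toll column warns about.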