CONCEPT · LEVEL 11

Tame the Spike

When you can't scale fast enough, throttle the firehose.

01 The problem with uncontrolled traffic

Imagine a nightclub bouncer with a clicker at the door. Once capacity is reached, no one else gets in: not because the bouncer is being cruel, but because letting in more people than the venue can safely hold would make it miserable or dangerous for everyone inside. A rate limiter plays the same role: it caps arrivals at a sustainable rate so the downstream service stays healthy.

Without a rate limiter, a sudden spike (a bot, a retry storm, or viral traffic) floods your server with more requests than it can handle. In-flight requests pile up, new arrivals are dropped indiscriminately, and latency balloons for everyone, even well-behaved users. A rate limiter makes the dropping intentional and controlled.

02 What a rate limiter does

A common implementation is the token bucket. Tokens refill at a fixed rate (your sustained throughput limit), and each incoming request consumes one token. An arrival that finds the bucket empty is rejected immediately and never reaches the downstream service. The bucket also has a maximum size, which sets the burst tolerance: a full bucket lets a short spike through before enforcement kicks in.

tokensPerTick
The sustained throughput rate: how many requests per tick your downstream can reliably handle. This is your steady-state limit.
bucketSize
The burst tolerance: how many tokens can accumulate when traffic is below the limit. A larger bucket lets short spikes through without dropping.
Intentional drops
Drops at the rate limiter are by design. They protect what's behind them from overload. A rate limiter that never drops isn't doing anything.
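The definitions above can be sketched as a small tick-based token bucket. This is a minimal illustration, not the exercise's actual implementation: the class name TokenBucket, the methods tick and try_acquire, and the choice to start with a full bucket are all assumptions made for the example.

```python
class TokenBucket:
    """Tick-based token bucket: refill once per tick, spend one token per request."""

    def __init__(self, tokens_per_tick: int, bucket_size: int) -> None:
        if tokens_per_tick < 0 or bucket_size < 1:
            raise ValueError("tokens_per_tick must be >= 0 and bucket_size >= 1")
        self.tokens_per_tick = tokens_per_tick  # sustained throughput limit
        self.bucket_size = bucket_size          # burst tolerance
        self.tokens = bucket_size               # assumption: the bucket starts full
        self.dropped = 0                        # count of intentional drops

    def tick(self) -> None:
        """Refill at the fixed rate, never exceeding the bucket size."""
        self.tokens = min(self.bucket_size, self.tokens + self.tokens_per_tick)

    def try_acquire(self) -> bool:
        """Admit the request if a token is available, otherwise drop it."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1
        return False


# A burst of 8 arrivals hits a full bucket of 5: the first 5 pass, 3 are dropped.
bucket = TokenBucket(tokens_per_tick=2, bucket_size=5)
burst = [bucket.try_acquire() for _ in range(8)]

# One tick later only 2 tokens have refilled, so only 2 of 3 arrivals pass.
bucket.tick()
after_refill = [bucket.try_acquire() for _ in range(3)]
```

The burst shows bucketSize at work (five requests pass at once), while the tick afterwards shows tokensPerTick capping the sustained rate.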
⚠️ Match the limit to the bottleneck
Set tokensPerTick slightly under what the protected service can actually handle. Too high and the limiter is decorative: traffic still overwhelms the backend. Too low and you're rejecting requests that could have succeeded.
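To make the trade-off concrete, here is a rough simulation under invented numbers: a server that can process 10 requests per tick and a three-tick spike of 40 arrivals per tick. The function simulate, the parameter server_capacity, and the traffic shape are hypothetical, chosen only to illustrate the three settings described above.

```python
def simulate(arrivals, tokens_per_tick, bucket_size, server_capacity):
    """Return (rejected_at_limiter, peak_server_backlog) for a traffic trace."""
    tokens = bucket_size  # assumption: the bucket starts full
    backlog = 0           # admitted requests the server has not finished yet
    rejected = 0
    peak_backlog = 0
    for n in arrivals:
        # Refill first, capped at the bucket size.
        tokens = min(bucket_size, tokens + tokens_per_tick)
        admitted = min(n, tokens)
        tokens -= admitted
        rejected += n - admitted
        # The server clears at most server_capacity requests per tick.
        backlog = max(0, backlog + admitted - server_capacity)
        peak_backlog = max(peak_backlog, backlog)
    return rejected, peak_backlog


# Hypothetical trace: calm, then a three-tick spike, then calm again.
traffic = [5, 5, 40, 40, 40, 5, 5]

too_high = simulate(traffic, tokens_per_tick=30, bucket_size=30, server_capacity=10)
just_under = simulate(traffic, tokens_per_tick=9, bucket_size=9, server_capacity=10)
too_low = simulate(traffic, tokens_per_tick=3, bucket_size=3, server_capacity=10)
```

With the limit at 30, the limiter still rejects 30 requests but the server backlog climbs to 60: decorative. At 9, the backlog stays at zero and only spike traffic is rejected. At 3, the backlog is also zero, but the limiter rejects requests even during the calm ticks, when the server had room for them.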
📋 CHEATSHEET · QUICK REFERENCE
  • Token bucket: refill rate = sustained limit, bucket size = burst tolerance.
  • Place in front of a service that can't scale: third-party APIs, fragile downstreams, expensive ops.
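As one possible way to put a limiter in front of a fragile downstream, the sketch below wraps a call in a time-based token bucket, so rejected requests never reach the protected service. RateLimitedClient, call_downstream, and the injectable clock are hypothetical names for illustration; returning None on rejection is a simplification (an HTTP service would more typically answer 429 Too Many Requests).

```python
import time
from typing import Any, Callable, Optional


class RateLimitedClient:
    """Guards a downstream call with a token bucket refilled by elapsed time."""

    def __init__(
        self,
        call_downstream: Callable[[Any], Any],
        tokens_per_second: float,
        bucket_size: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._call = call_downstream
        self._rate = tokens_per_second
        self._size = bucket_size
        self._clock = clock                 # injectable so tests can fake time
        self._tokens = float(bucket_size)   # assumption: the bucket starts full
        self._last = clock()

    def request(self, payload: Any) -> Optional[Any]:
        # Refill in proportion to the time elapsed since the last request.
        now = self._clock()
        self._tokens = min(self._size, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens < 1:
            return None  # rejected here; the downstream is never called
        self._tokens -= 1
        return self._call(payload)


# Usage with a fake clock and a fake downstream that records what reaches it.
calls = []

def fake_downstream(payload):
    calls.append(payload)
    return f"ok:{payload}"

now = [0.0]
client = RateLimitedClient(fake_downstream, tokens_per_second=1.0,
                           bucket_size=2, clock=lambda: now[0])

results = [client.request(i) for i in range(4)]  # two pass, two are rejected
now[0] += 1.0                                    # one second later: one token back
results.append(client.request(4))
```

Only payloads 0, 1, and 4 ever reach fake_downstream; the two rejected requests cost the protected service nothing.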
READY TO BUILD

Bursty traffic overwhelms the downstream server. Insert a rate limiter to throttle arrivals at a sustainable rate; the limiter drops the excess so the server stays healthy.

📖 Glossary

Components

Cache
A sticky-note pad that remembers recent answers so you don't look them up again.
CDN (Content Delivery Network)
Copies of your content cached at servers close to users around the world.
Circuit Breaker
A switch that trips open when a downstream service fails too much, stopping the bleeding.
Dead-Letter Queue (DLQ)
A separate queue where failed or unprocessable messages get parked for inspection.
Durable Queue
A queue that writes messages to disk so they survive crashes and restarts.
Kafka
A durable, ordered message log that many consumers can read independently.
Load Balancer
A traffic cop that splits incoming requests evenly across multiple servers.
Message Broker
Middleware that routes messages between services so they don't talk directly.
Primary
The main database that accepts all write operations and leads replication.
Queue
A waiting line where tasks sit until a worker is free to handle them.
Read Replica
A database copy that only handles reads, taking pressure off the write primary.
Redis
An in-memory data store that is extremely fast for caches, counters, and sessions.
Replica
A copy of a database that can serve reads and take over if the primary fails.
Service Mesh
A sidecar layer that handles networking between services: routing, retries, and TLS.
Shard
One slice of a split database: a node that holds its own piece of the data.
SQS (Simple Queue Service)
Amazon's managed queue service: produce messages, let workers pull and process them.

Concepts

Acknowledge (ACK)
A reply that says 'I got your message': the signal that work was accepted.
At-Least-Once Delivery
Every message arrives at least once, but sometimes twice, so handle that.
Availability
What percentage of time your system is up and serving requests correctly.
Backpressure
A slow consumer signalling producers to slow down before the queue explodes.
Bottleneck
The one slow part of your system that makes everything else wait.
Cache Hit
You asked the cache and it had the answer: fast, cheap, and instant.
Cache Miss
The cache didn't have the answer, so you had to ask the slower database.
CAP Theorem
Of consistency, availability, and partition tolerance, you can guarantee only two at once; since partitions do happen, the real choice is between consistency and availability.
Capacity
The maximum load a component can handle before it starts dropping requests.
Cascading Failure
One failing service causes its callers to fail, which causes their callers to fail: dominos.
Cold Start
The slow startup when a service or cache has no warm state yet.
Connection Pool
A pre-made set of database connections ready to reuse instead of making new ones.
Consumer
The worker that pulls tasks from a queue and actually processes them.
Decoupling
Components that don't depend on each other's availability to do their jobs.
DNS (Domain Name System)
The internet's phone book: turns 'api.myapp.com' into an IP address.
Durability
Once data is saved, it stays saved: even if the server crashes a second later.
Eventual Consistency
All copies of your data will agree eventually: just not necessarily right now.
Eventual Durability
Your write is acknowledged before it hits disk: it will be saved soon, just not yet.
Exactly-Once Delivery
Every message arrives exactly one time: no duplicates, no losses.
Garbage Collection (GC)
Automatic cleanup of unused memory, but it can pause your app while running.
Head-of-Line Blocking
One stuck request at the front of the line holds up everyone waiting behind it.
Health Check
An endpoint that tells load balancers whether a server is ready to take traffic.
Hit Rate
What percentage of requests your cache answers without touching the database.
Horizontal Scaling
Add more machines instead of making one machine bigger.
Hot Key
One cache or shard key that gets hammered with requests while all others sit idle.
Idempotency
The property that makes doing something twice just as safe as doing it once.
Idempotent
Doing something once or a hundred times gives you exactly the same result.
In-Flight
Requests that have been sent but are still waiting for a reply.
Latency
How long it takes from clicking a button to seeing the result on screen.
Load Shedding
Deliberately dropping some requests during overload so the rest can succeed.
Microservice
A small, independently deployable service that owns one business function.
Middleware
Software that sits between your app and the world, handling common plumbing tasks.
Monolith
One big application that does everything: one codebase, one deploy.
Network Partition
A network split where some nodes can't talk to others for a period of time.
Observability
Being able to look inside your running system and understand what is happening.
p95 Latency
95% of requests finish at least this fast: only the slowest 5% take longer.
p99
Only 1% of requests are slower than this: your worst typical performance.
Partition
A segment of data or infrastructure separated from others: by design or failure.
Poison Pill
A bad message that crashes every consumer that tries to process it.
Producer
The part of the system that generates work and drops it into a queue.
Rate Limiting
Putting a cap on how many requests a client can make in a given time window.
Read-Heavy
Your system handles way more reads than writes: think 99% read, 1% write.
Retry
Try again if the first attempt fails: transient issues often resolve themselves.
Round Trip
The time for a message to travel to a server and for the reply to come back.
Service Discovery
A registry that lets services find each other's addresses as they scale up and down.
Shard Key
The field that decides which database slice stores each piece of data.
Single Point of Failure
One component whose failure brings the entire system to its knees.
SLA (Service Level Agreement)
A written promise about how reliable and fast your service will be.
Stateful
A service that remembers past interactions with a user or session.
Stateless
A service with no memory: every request carries all the info it needs.
Strong Consistency
After you write data, every reader instantly sees your change: no exceptions.
TCP (Transmission Control Protocol)
The reliable internet transport protocol: it retransmits lost packets and delivers data to the application in order.
Throughput
How many requests your system can handle per second at full capacity.
Thundering Herd
Thousands of clients all retry at exactly the same moment, creating a stampede.
Timeout
Give up waiting for a slow response after a fixed amount of time.
TLS (Transport Layer Security)
Encryption for data in transit so nobody can eavesdrop on your network calls.
Token Bucket
Tokens refill at a fixed rate; each request spends one: run out and get rejected.
TTL (Time to Live)
An expiry timer on cached data: after this many seconds, throw it out and refetch.
Vertical Scaling
Upgrade one machine to be bigger and faster instead of adding more machines.
Write-Ahead Log (WAL)
Write changes to a log before applying them so crashes can be recovered from.
Write-Heavy
Your system writes data constantly: think logging, metrics, or IoT sensor streams.

Patterns

Blue-Green Deployment
Deploy to a second identical environment, then flip all traffic to it instantly.
Cache-Aside
Check cache first; on miss, fetch from DB and store the result for next time.
Canary Deployment
Send 5% of traffic to the new version first and watch for problems before going all-in.
Choreography
Services react to events on their own: no conductor, just musicians watching each other.
Consistent Hashing
A way to map keys to nodes so adding or removing a node only moves a little data.
Event Sourcing
Store every change as an event log, then rebuild state by replaying those events.
Exponential Backoff
Wait longer after each failed retry: 1 s, then 2 s, then 4 s, then 8 s.
Failover
Automatically switching to a backup when the main component breaks.
Fan-Out
One event triggers writes or notifications to many different services at once.
Fire-and-Forget
Send a message and don't wait: assume it will be processed eventually.
Orchestration
A central controller tells each service what to do and in what order.
Pub/Sub (Publish-Subscribe)
Publishers broadcast events; subscribers listen for only the events they care about.
Replication
Make exact copies of your database on multiple machines so nothing is lost if one breaks.
Round-Robin
Take turns: send request 1 to Server A, request 2 to Server B, and repeat.
Sharding
Split a huge database into smaller pieces, each holding a different slice of data.
Write-Around
Skip the cache on writes: go straight to the database, let reads fill the cache.
Write-Behind
Write to cache instantly, then sync to database later in the background.
Write-Through
Write to both cache and database at the same time: both always agree.