Retry logic sounds simple: if something fails, try again. In practice, naive retry implementations make failures worse, not better.
This post explains why, and covers the backoff pattern that makes retry safe and effective.
## Why Naive Retry Causes Outages
Imagine you have a destination service that becomes temporarily unavailable (database overloaded, memory spike, deployment restart). You have 10,000 queued webhook events.
With naive immediate retry:
```
t=0:00 — Destination goes down
t=0:01 — 10,000 retries fire simultaneously
t=0:02 — 10,000 more retries fire
t=0:03 — Destination comes back online...
t=0:04 — ...but is immediately overwhelmed by 30,000+ pending retries
t=0:05 — Destination goes down again
```
You've turned a 30-second outage into a 5-minute cascading failure. This is the thundering herd problem.
The fix: space out your retries, and stagger them so they don't all arrive at the same moment.
## The Components of Good Retry Logic

### 1. Exponential Backoff
Each retry waits longer than the previous one:
```
Attempt 1: 0 seconds (immediate)
Attempt 2: 30 seconds
Attempt 3: 2 minutes
Attempt 4: 10 minutes
Attempt 5: 1 hour
→ Dead letter
```
The wait time grows roughly exponentially (4–6× per attempt). This gives the destination:
- Time to recover from transient failures
- Progressively more time for longer-duration issues
- A natural limit so you don't retry forever
### 2. Jitter
Without jitter, all events from the same failure window retry at the same moment (the "thundering herd"). Jitter adds a random offset:
```go
import (
	"math/rand"
	"time"
)

func nextRetryDelay(attempt int) time.Duration {
	base := []time.Duration{
		0,
		30 * time.Second,
		2 * time.Minute,
		10 * time.Minute,
		1 * time.Hour,
	}
	if attempt >= len(base) {
		return -1 // out of attempts: dead-letter the event
	}
	delay := base[attempt]
	if delay == 0 {
		return 0 // first attempt is immediate, no jitter needed
	}
	// Add ±20% jitter so events that failed together don't retry together
	jitterRange := int64(float64(delay) * 0.2)
	jitter := time.Duration(rand.Int63n(jitterRange*2) - jitterRange)
	return delay + jitter
}
```
With jitter, events that were all received at the same moment (say, during a burst) will retry at slightly different times, spreading load across the retry window.
### 3. Dead-Letter Queue
Not all failures are transient. If you've made 5 attempts over more than an hour and the destination is still failing, it's likely that:
- The destination URL changed
- The service was decommissioned
- There's a permanent error in your handler
These events go to a dead-letter queue for manual inspection and replay.
```go
func (w *Worker) handleDeliveryFailure(ctx context.Context, event Event, err error) {
	if event.AttemptNumber >= event.MaxAttempts {
		// Out of attempts: move to the dead-letter queue for manual review
		w.store.UpdateStatus(ctx, event.ID, StatusDeadLetter)
		w.metrics.IncrDeadLetter()
		return
	}
	// Schedule the next attempt with exponential backoff and jitter
	delay := nextRetryDelay(event.AttemptNumber)
	nextAttemptAt := time.Now().Add(delay)
	w.store.ScheduleRetry(ctx, event.ID, nextAttemptAt, event.AttemptNumber+1)
}
```
## GetHook's Retry Schedule
GetHook uses this schedule for all delivery attempts:
| Attempt | Delay | Cumulative time | Notes |
|---|---|---|---|
| 1st | Immediate | 0 | Fast path |
| 2nd | 30s ± 6s | ~30 seconds | Short transient |
| 3rd | 2m ± 24s | ~2.5 minutes | Brief outage |
| 4th | 10m ± 2m | ~13 minutes | Medium outage |
| 5th | 1h ± 12m | ~1.2 hours | Long outage |
| Dead letter | — | — | Manual replay |
The jitter ranges in the table are ±20% of each base delay.
This schedule recovers from:
- Deployments (30-second gap) → caught by attempt 2
- Brief database issues (2–5 minutes) → caught by attempt 3
- Full service restarts (10–15 minutes) → caught by attempt 4
- Extended outages (30–90 minutes) → caught by attempt 5
## Retry on Which Failures?
Not every failure should be retried. Some failures indicate permanent errors:
| HTTP Status | Retry? | Reason |
|---|---|---|
| 200–299 | No (success) | Delivered |
| 301/302 | No | Provider misconfigured — don't follow |
| 400 | No | Permanent error (bad event format) |
| 401 | No | Unauthorized — destination misconfigured |
| 403 | No | Forbidden — same as 401 |
| 404 | No | Destination removed |
| 408 | Yes | Request timeout |
| 429 | Yes (with longer delay) | Rate limited |
| 500 | Yes | Server error |
| 502 | Yes | Bad gateway |
| 503 | Yes | Service unavailable |
| 504 | Yes | Gateway timeout |
| Connection refused | Yes | Service temporarily down |
| DNS error | Yes (limited) | Possible transient DNS issue |
Key decision: Treat 4xx (except 408 and 429) as permanent failures. Many providers stop retrying to an endpoint that consistently returns 4xx, treating it as "the endpoint is broken or doesn't want these events."
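The status-code rows of the table collapse into a small predicate. A sketch of that policy (connection-level and DNS failures arrive as transport errors, not status codes, so they would be handled separately):

```go
package main

import "fmt"

// shouldRetry reports whether a delivery attempt that got the given
// HTTP status code is worth retrying: 408/429 and 5xx are treated as
// transient, everything else (2xx success, 3xx/4xx misconfiguration)
// as final.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode >= 200 && statusCode < 300:
		return false // delivered, nothing to retry
	case statusCode == 408 || statusCode == 429:
		return true // timeout / rate limit: transient by definition
	case statusCode >= 500:
		return true // server-side error: assume transient
	default:
		return false // 3xx and remaining 4xx: permanent
	}
}

func main() {
	for _, code := range []int{200, 301, 404, 408, 429, 500, 503} {
		fmt.Printf("%d retry=%v\n", code, shouldRetry(code))
	}
}
```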
## The Rate Limiting Problem
When a destination returns 429 Too Many Requests, it's often accompanied by a Retry-After header:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json

{"error": "rate limit exceeded", "retry_after": 60}
```
You should respect this:
```go
func retryDelayForStatus(statusCode int, headers http.Header, attempt int) time.Duration {
	if statusCode == 429 {
		if retryAfter := headers.Get("Retry-After"); retryAfter != "" {
			if seconds, err := strconv.Atoi(retryAfter); err == nil {
				return time.Duration(seconds) * time.Second
			}
		}
		// No usable header: use a longer backoff for rate limits
		return nextRetryDelay(attempt) * 2
	}
	return nextRetryDelay(attempt)
}
```
Ignoring Retry-After can get your delivery IP blocked by some destinations.
## Persistence: Retry State Must Survive Worker Restarts
Retry state (the attempt count and next retry time) must be written to durable storage before the worker acknowledges the failed attempt. If the worker crashes after scheduling a retry in memory but before persisting it, the event is silently lost.
GetHook stores retry state in Postgres as part of the event row:
```sql
-- After a failed delivery attempt
UPDATE events
SET status = 'retry_scheduled',
    attempt_number = attempt_number + 1,
    next_attempt_at = NOW() + INTERVAL '30 seconds',
    last_error = $2
WHERE id = $1;
```
The FOR UPDATE SKIP LOCKED pattern in the worker poll ensures two workers never pick up the same event simultaneously. The state is always consistent.
## Monitoring Retry Behavior
Key metrics to track:
| Metric | Healthy range | Alert if |
|---|---|---|
| Retry rate (% of events retried) | < 5% | > 10% |
| Dead-letter rate (% reaching DLQ) | < 0.5% | > 1% |
| Avg attempts to success | 1.0–1.3 | > 2.0 |
| Events in retry > 24 hours | < 10 | > 100 |
| 2nd-attempt success rate | > 80% | < 60% |
A spike in the retry rate is often the first signal of a problem with a specific destination or a provider format change.
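These ratios are straightforward to derive from raw delivery counters. A minimal sketch (the struct and counter names are illustrative, not GetHook's actual metrics):

```go
package main

import "fmt"

// deliveryStats holds raw counters over some reporting window.
type deliveryStats struct {
	Delivered  int // events that eventually succeeded
	Attempts   int // total delivery attempts across those events
	Retried    int // events that needed more than one attempt
	DeadLetter int // events that exhausted all attempts
}

// retryRatePct is the share of terminal events that needed a retry.
func (s deliveryStats) retryRatePct() float64 {
	return 100 * float64(s.Retried) / float64(s.Delivered+s.DeadLetter)
}

// deadLetterRatePct is the share of terminal events that hit the DLQ.
func (s deliveryStats) deadLetterRatePct() float64 {
	return 100 * float64(s.DeadLetter) / float64(s.Delivered+s.DeadLetter)
}

// avgAttemptsToSuccess is the mean attempts per delivered event.
func (s deliveryStats) avgAttemptsToSuccess() float64 {
	return float64(s.Attempts) / float64(s.Delivered)
}

func main() {
	s := deliveryStats{Delivered: 9950, Attempts: 11500, Retried: 400, DeadLetter: 50}
	fmt.Printf("retry rate %.1f%%, DLQ rate %.1f%%, avg attempts %.2f\n",
		s.retryRatePct(), s.deadLetterRatePct(), s.avgAttemptsToSuccess())
}
```

With these example counters, all three values land inside the healthy ranges in the table.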
## Implementation Checklist
When implementing retry logic from scratch:
- Exponential delays, not linear or constant
- Jitter on all delays (not just the last one)
- Dead-letter after max attempts
- Persist retry state to durable storage (not in-memory)
- Respect `Retry-After` headers for 429 responses
- Don't retry on 4xx (except 408/429)
- Track retry metrics (retry rate, DLQ rate, avg attempts)
- Alert on DLQ growth: someone should review these
- Provide replay from DLQ: operators need this to recover
## The Compounding Value of Good Retry Logic
The difference between 95% and 99.5% delivery success rates looks small. At scale, it's enormous:
| Volume | 95% success | 99.5% success | Difference |
|---|---|---|---|
| 100K events/month | 5,000 lost | 500 lost | 4,500 saved |
| 1M events/month | 50,000 lost | 5,000 lost | 45,000 saved |
| 10M events/month | 500,000 lost | 50,000 lost | 450,000 saved |
Most of the gap between 95% and 99.5% is closed by good retry logic — not by making the system more reliable in the first place. First-attempt failures happen; retry recovers them.
GetHook's retry logic, combined with per-destination dead-letter queues and manual replay, achieves > 99.5% event delivery in production. See how it works →