
Exponential Backoff and Retry: The Pattern Behind Reliable Event Delivery

Why naive retry loops cause outages, how exponential backoff with jitter prevents thundering herds, and the exact retry schedule GetHook uses for webhook delivery.

Dmitri Volkov
Distributed Systems Engineer
October 5, 2025
8 min read

Retry logic sounds simple: if something fails, try again. In practice, naive retry implementations make failures worse, not better.

This post explains why, and covers the backoff pattern that makes retry safe and effective.


Why Naive Retry Causes Outages

Imagine you have a destination service that becomes temporarily unavailable (database overloaded, memory spike, deployment restart). You have 10,000 queued webhook events.

With naive immediate retry:

t=0:00 — Destination goes down
t=0:01 — 10,000 retries fire simultaneously
t=0:02 — 10,000 more retries fire
t=0:03 — Destination comes back online...
t=0:04 — ...but is immediately overwhelmed by 30,000+ pending retries
t=0:05 — Destination goes down again

You've turned a 30-second outage into a 5-minute cascading failure. This is the thundering herd problem.

The fix: space out your retries, and stagger them so they don't all arrive at the same moment.


The Components of Good Retry Logic

1. Exponential Backoff

Each retry waits longer than the previous one:

Attempt 1: 0 seconds (immediate)
Attempt 2: 30 seconds
Attempt 3: 2 minutes
Attempt 4: 10 minutes
Attempt 5: 1 hour
→ Dead letter

The wait time grows exponentially (roughly 4× per attempt). This gives the destination:

  • Time to recover from transient failures
  • Progressively more time for longer-duration issues
  • A natural limit so you don't retry forever

2. Jitter

Without jitter, all events from the same failure window retry at the same moment (the "thundering herd"). Jitter adds a random offset:

```go
func nextRetryDelay(attempt int) time.Duration {
    base := []time.Duration{
        0,
        30 * time.Second,
        2 * time.Minute,
        10 * time.Minute,
        1 * time.Hour,
    }

    if attempt >= len(base) {
        return -1 // dead letter
    }

    delay := base[attempt]
    if delay == 0 {
        return 0 // first attempt is immediate
    }

    // Add ±20% jitter
    jitterRange := int64(float64(delay) * 0.2)
    jitter := time.Duration(rand.Int63n(jitterRange*2) - jitterRange)
    return delay + jitter
}
```

With jitter, events that were all received at the same moment (say, during a burst) will retry at slightly different times, spreading load across the retry window.

3. Dead-Letter Queue

Not all failures are transient. If you've retried 5 times over 1+ hours and the destination is still failing, it's likely:

  • The destination URL changed
  • The service was decommissioned
  • There's a permanent error in your handler

These events go to a dead-letter queue for manual inspection and replay.

```go
func (w *Worker) handleDeliveryFailure(ctx context.Context, event Event, err error) {
    if event.AttemptNumber >= event.MaxAttempts {
        // Move to dead letter
        w.store.UpdateStatus(ctx, event.ID, StatusDeadLetter)
        w.metrics.IncrDeadLetter()
        return
    }

    // Schedule retry
    delay := nextRetryDelay(event.AttemptNumber)
    nextAttemptAt := time.Now().Add(delay)
    w.store.ScheduleRetry(ctx, event.ID, nextAttemptAt, event.AttemptNumber+1)
}
```

GetHook's Retry Schedule

GetHook uses this schedule for all delivery attempts:

| Attempt | Delay | Cumulative time | Notes |
|---|---|---|---|
| 1st | Immediate | 0 | Fast path |
| 2nd | 30s ± 6s | ~30 seconds | Short transient |
| 3rd | 2m ± 24s | ~2.5 minutes | Brief outage |
| 4th | 10m ± 2m | ~13 minutes | Medium outage |
| 5th | 1h ± 12m | ~1.5 hours | Long outage |
| Dead letter | | | Manual replay |

The ±jitter ranges are ±20% of the base delay.

This schedule recovers from:

  • Deployments (30-second gap) → caught by attempt 2
  • Brief database issues (2–5 minutes) → caught by attempt 3
  • Full service restarts (10–15 minutes) → caught by attempt 4
  • Extended outages (30–90 minutes) → caught by attempt 5

Retry on Which Failures?

Not every failure should be retried. Some failures indicate permanent errors:

| HTTP Status | Retry? | Reason |
|---|---|---|
| 200–299 | No (success) | Delivered |
| 301/302 | No | Provider misconfigured — don't follow |
| 400 | No | Permanent error (bad event format) |
| 401 | No | Unauthorized — destination misconfigured |
| 403 | No | Forbidden — same as 401 |
| 404 | No | Destination removed |
| 408 | Yes | Request timeout |
| 429 | Yes (with longer delay) | Rate limited |
| 500 | Yes | Server error |
| 502 | Yes | Bad gateway |
| 503 | Yes | Service unavailable |
| 504 | Yes | Gateway timeout |
| Connection refused | Yes | Service temporarily down |
| DNS error | Yes (limited) | Possible transient DNS issue |

Key decision: Treat 4xx (except 408 and 429) as permanent failures. Many providers stop retrying to an endpoint that consistently returns 4xx, treating it as "the endpoint is broken or doesn't want these events."
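
The table above reduces to a small classifier. A minimal sketch (function name and shape are illustrative, not GetHook's actual API):

```go
package main

import "fmt"

// isRetryable applies the policy from the table: retry on 5xx,
// on the two transient 4xx codes (408, 429), and on nothing else.
func isRetryable(status int) bool {
	switch {
	case status == 408 || status == 429:
		return true // timeout and rate limit are transient 4xx
	case status >= 500 && status <= 599:
		return true // server-side errors are presumed transient
	default:
		return false // 2xx delivered, other 4xx permanent, 3xx misconfigured
	}
}

func main() {
	for _, s := range []int{200, 301, 404, 408, 429, 503} {
		fmt.Printf("%d retry=%v\n", s, isRetryable(s))
	}
}
```

Network-level failures (connection refused, DNS errors) never produce a status code, so they need a separate check on the transport error before this classifier runs.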


The Rate Limiting Problem

When a destination returns 429 Too Many Requests, it's often accompanied by a Retry-After header:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json

{"error": "rate limit exceeded", "retry_after": 60}
```

You should respect this:

```go
func retryDelayForStatus(statusCode int, headers http.Header, attempt int) time.Duration {
    if statusCode == 429 {
        retryAfter := headers.Get("Retry-After")
        if retryAfter != "" {
            if seconds, err := strconv.Atoi(retryAfter); err == nil {
                return time.Duration(seconds) * time.Second
            }
        }
        // No header — use longer backoff for rate limits
        return nextRetryDelay(attempt) * 2
    }
    return nextRetryDelay(attempt)
}
```

Ignoring Retry-After will get your delivery IP blocked by some destinations.


Persistence: Retry State Must Survive Worker Restarts

Retry state (how many attempts, next retry time) must be persisted to durable storage before the worker acknowledges it. If the worker crashes between scheduling a retry and persisting the state, the event is lost.

GetHook stores retry state in Postgres as part of the event row:

```sql
-- After a failed delivery attempt
UPDATE events
SET status = 'retry_scheduled',
    attempt_number = attempt_number + 1,
    next_attempt_at = NOW() + INTERVAL '30 seconds',
    last_error = $2
WHERE id = $1;
```

The FOR UPDATE SKIP LOCKED pattern in the worker poll ensures two workers never pick up the same event simultaneously. The state is always consistent.
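
The worker poll might look like the following sketch — the table and status names mirror the UPDATE above, but the exact query is an assumption:

```sql
-- Claim a batch of due events. SKIP LOCKED makes concurrent
-- workers pass over rows another worker has already locked,
-- so no event is claimed twice.
SELECT id, attempt_number
FROM events
WHERE status = 'retry_scheduled'
  AND next_attempt_at <= NOW()
ORDER BY next_attempt_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
```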


Monitoring Retry Behavior

Key metrics to track:

| Metric | Healthy range | Alert if |
|---|---|---|
| Retry rate (% of events retried) | < 5% | > 10% |
| Dead-letter rate (% reaching DLQ) | < 0.5% | > 1% |
| Avg attempts to success | 1.0–1.3 | > 2.0 |
| Events in retry > 24 hours | < 10 | > 100 |
| 2nd-attempt success rate | > 80% | < 60% |

A spike in the retry rate is often the first signal of a problem with a specific destination or a provider format change.


Implementation Checklist

When implementing retry logic from scratch:

  • Exponential delays, not linear or constant
  • Jitter on all delays (not just the last one)
  • Dead-letter after max attempts
  • Persist retry state to durable storage (not in-memory)
  • Respect Retry-After headers for 429 responses
  • Don't retry on 4xx (except 408/429)
  • Track retry metrics (retry rate, DLQ rate, avg attempts)
  • Alert on DLQ growth — someone should review these
  • Provide replay from DLQ — operators need this to recover

The Compounding Value of Good Retry Logic

The difference between 95% and 99.5% delivery success rates looks small. At scale, it's enormous:

| Volume | 95% success | 99.5% success | Difference |
|---|---|---|---|
| 100K events/month | 5,000 lost | 500 lost | 4,500 saved |
| 1M events/month | 50,000 lost | 5,000 lost | 45,000 saved |
| 10M events/month | 500,000 lost | 50,000 lost | 450,000 saved |

Most of the gap between 95% and 99.5% is closed by good retry logic — not by making the system more reliable in the first place. First-attempt failures happen; retry recovers them.

GetHook's retry logic, combined with per-destination dead-letter queues and manual replay, achieves > 99.5% event delivery in production.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.