Retry logic sounds simple: if something fails, try again. In practice, naive retry implementations make failures worse, not better.
This post explains why, and covers the backoff pattern that makes retry safe and effective.
## Why Naive Retry Causes Outages
Imagine you have a destination service that becomes temporarily unavailable (database overloaded, memory spike, deployment restart). You have 10,000 queued webhook events.
With naive immediate retry:
```
t=0:00 — Destination goes down
t=0:01 — 10,000 retries fire simultaneously
t=0:02 — 10,000 more retries fire
t=0:03 — Destination comes back online...
t=0:04 — ...but is immediately overwhelmed by 30,000+ pending retries
t=0:05 — Destination goes down again
```
You've turned a 30-second outage into a 5-minute cascading failure. This is the thundering herd problem.
The fix: space out your retries, and stagger them so they don't all arrive at the same moment.
## The Components of Good Retry Logic

### 1. Exponential Backoff
Each retry waits longer than the previous one:
```
Attempt 1: 0 seconds (immediate)
Attempt 2: 30 seconds
Attempt 3: 2 minutes
Attempt 4: 10 minutes
Attempt 5: 1 hour
→ Dead letter
```
The wait time grows roughly exponentially (4–6× per attempt). This gives the destination:
- Time to recover from transient failures
- Progressively more time for longer-duration issues
- A natural limit so you don't retry forever
### 2. Jitter
Without jitter, all events from the same failure window retry at the same moment (the "thundering herd"). Jitter adds a random offset:
```go
import (
	"math/rand"
	"time"
)

func nextRetryDelay(attempt int) time.Duration {
	base := []time.Duration{
		0,
		30 * time.Second,
		2 * time.Minute,
		10 * time.Minute,
		1 * time.Hour,
	}
	if attempt >= len(base) {
		return -1 // out of attempts: dead-letter the event
	}
	delay := base[attempt]
	if delay == 0 {
		return 0 // first attempt is immediate, no jitter needed
	}
	// Add ±20% jitter so events that failed together don't retry together
	jitterRange := int64(float64(delay) * 0.2)
	jitter := time.Duration(rand.Int63n(jitterRange*2) - jitterRange)
	return delay + jitter
}
```
With jitter, events that were all received at the same moment (say, during a burst) will retry at slightly different times, spreading load across the retry window.
### 3. Dead-Letter Queue
Not all failures are transient. If you've made 5 attempts over more than an hour and the destination is still failing, it's likely that:
- The destination URL changed
- The service was decommissioned
- There's a permanent error in your handler
These events go to a dead-letter queue for manual inspection and replay.
```go
func (w *Worker) handleDeliveryFailure(ctx context.Context, event Event, err error) {
	if event.AttemptNumber >= event.MaxAttempts {
		// Out of attempts: move to the dead-letter queue for manual review
		w.store.UpdateStatus(ctx, event.ID, StatusDeadLetter)
		w.metrics.IncrDeadLetter()
		return
	}
	// Schedule the next attempt with exponential backoff and jitter
	delay := nextRetryDelay(event.AttemptNumber)
	nextAttemptAt := time.Now().Add(delay)
	w.store.ScheduleRetry(ctx, event.ID, nextAttemptAt, event.AttemptNumber+1)
}
```
## GetHook's Retry Schedule
GetHook uses this schedule for all delivery attempts:
| Attempt | Delay | Cumulative time | Notes |
|---|---|---|---|
| 1st | Immediate | 0 | Fast path |
| 2nd | 30s ± 6s | ~30 seconds | Short transient |
| 3rd | 2m ± 24s | ~2.5 minutes | Brief outage |
| 4th | 10m ± 2m | ~13 minutes | Medium outage |
| 5th | 1h ± 12m | ~1.2 hours | Long outage |
| Dead letter | — | — | Manual replay |
The jitter ranges in the table are ±20% of each base delay.
This schedule recovers from:
- Deployments (30-second gap) → caught by attempt 2
- Brief database issues (2–5 minutes) → caught by attempt 3
- Full service restarts (10–15 minutes) → caught by attempt 4
- Extended outages (30–90 minutes) → caught by attempt 5
## Retry on Which Failures?
Not every failure should be retried. Some failures indicate permanent errors:
| HTTP Status | Retry? | Reason |
|---|---|---|
| 200–299 | No (success) | Delivered |
| 301/302 | No | Provider misconfigured — don't follow |
| 400 | No | Permanent error (bad event format) |
| 401 | No | Unauthorized — destination misconfigured |
| 403 | No | Forbidden — same as 401 |
| 404 | No | Destination removed |
| 408 | Yes | Request timeout |
| 429 | Yes (with longer delay) | Rate limited |
| 500 | Yes | Server error |
| 502 | Yes | Bad gateway |
| 503 | Yes | Service unavailable |
| 504 | Yes | Gateway timeout |
| Connection refused | Yes | Service temporarily down |
| DNS error | Yes (limited) | Possible transient DNS issue |
Key decision: Treat 4xx (except 408 and 429) as permanent failures. Many providers stop retrying to an endpoint that consistently returns 4xx, treating it as "the endpoint is broken or doesn't want these events."
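The status-code rows of the table collapse into a small predicate. A sketch of that policy (connection-level and DNS failures arrive as transport errors, not status codes, so they would be handled separately):

```go
package main

import "fmt"

// shouldRetry reports whether a delivery attempt that got the given
// HTTP status code is worth retrying: 408/429 and 5xx are treated as
// transient, everything else (2xx success, 3xx/4xx misconfiguration)
// as final.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode >= 200 && statusCode < 300:
		return false // delivered, nothing to retry
	case statusCode == 408 || statusCode == 429:
		return true // timeout / rate limit: transient by definition
	case statusCode >= 500:
		return true // server-side error: assume transient
	default:
		return false // 3xx and remaining 4xx: permanent
	}
}

func main() {
	for _, code := range []int{200, 301, 404, 408, 429, 500, 503} {
		fmt.Printf("%d retry=%v\n", code, shouldRetry(code))
	}
}
```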
## The Rate Limiting Problem
When a destination returns 429 Too Many Requests, it's often accompanied by a Retry-After header:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json

{"error": "rate limit exceeded", "retry_after": 60}
```
You should respect this:
```go
func retryDelayForStatus(statusCode int, headers http.Header, attempt int) time.Duration {
	if statusCode == 429 {
		if retryAfter := headers.Get("Retry-After"); retryAfter != "" {
			if seconds, err := strconv.Atoi(retryAfter); err == nil {
				return time.Duration(seconds) * time.Second
			}
		}
		// No usable header: use a longer backoff for rate limits
		return nextRetryDelay(attempt) * 2
	}
	return nextRetryDelay(attempt)
}
```
Ignoring Retry-After can get your delivery IP blocked by some destinations.
## Persistence: Retry State Must Survive Worker Restarts
Retry state (the attempt count and next retry time) must be written to durable storage before the worker acknowledges the failed attempt. If the worker crashes after scheduling a retry in memory but before persisting it, the event is silently lost.
GetHook stores retry state in Postgres as part of the event row:
```sql
-- After a failed delivery attempt
UPDATE events
SET status = 'retry_scheduled',
    attempt_number = attempt_number + 1,
    next_attempt_at = NOW() + INTERVAL '30 seconds',
    last_error = $2
WHERE id = $1;
```
The FOR UPDATE SKIP LOCKED pattern in the worker poll ensures two workers never pick up the same event simultaneously. The state is always consistent.
## Monitoring Retry Behavior
Key metrics to track:
| Metric | Healthy range | Alert if |
|---|---|---|
| Retry rate (% of events retried) | < 5% | > 10% |
| Dead-letter rate (% reaching DLQ) | < 0.5% | > 1% |
| Avg attempts to success | 1.0–1.3 | > 2.0 |
| Events in retry > 24 hours | < 10 | > 100 |
| 2nd-attempt success rate | > 80% | < 60% |
A spike in the retry rate is often the first signal of a problem with a specific destination or a provider format change.
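These ratios are straightforward to derive from raw delivery counters. A minimal sketch (the struct and counter names are illustrative, not GetHook's actual metrics):

```go
package main

import "fmt"

// deliveryStats holds raw counters over some reporting window.
type deliveryStats struct {
	Delivered  int // events that eventually succeeded
	Attempts   int // total delivery attempts across those events
	Retried    int // events that needed more than one attempt
	DeadLetter int // events that exhausted all attempts
}

// retryRatePct is the share of terminal events that needed a retry.
func (s deliveryStats) retryRatePct() float64 {
	return 100 * float64(s.Retried) / float64(s.Delivered+s.DeadLetter)
}

// deadLetterRatePct is the share of terminal events that hit the DLQ.
func (s deliveryStats) deadLetterRatePct() float64 {
	return 100 * float64(s.DeadLetter) / float64(s.Delivered+s.DeadLetter)
}

// avgAttemptsToSuccess is the mean attempts per delivered event.
func (s deliveryStats) avgAttemptsToSuccess() float64 {
	return float64(s.Attempts) / float64(s.Delivered)
}

func main() {
	s := deliveryStats{Delivered: 9950, Attempts: 11500, Retried: 400, DeadLetter: 50}
	fmt.Printf("retry rate %.1f%%, DLQ rate %.1f%%, avg attempts %.2f\n",
		s.retryRatePct(), s.deadLetterRatePct(), s.avgAttemptsToSuccess())
}
```

With these example counters, all three values land inside the healthy ranges in the table.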
## Implementation Checklist
When implementing retry logic from scratch:
- Exponential delays, not linear or constant
- Jitter on all delays (not just the last one)
- Dead-letter after max attempts
- Persist retry state to durable storage (not in-memory)
- Respect `Retry-After` headers for 429 responses
- Don't retry on 4xx (except 408/429)
- Track retry metrics (retry rate, DLQ rate, avg attempts)
- Alert on DLQ growth: someone should review these
- Provide replay from DLQ: operators need this to recover
## The Compounding Value of Good Retry Logic
The difference between 95% and 99.5% delivery success rates looks small. At scale, it's enormous:
| Volume | 95% success | 99.5% success | Difference |
|---|---|---|---|
| 100K events/month | 5,000 lost | 500 lost | 4,500 saved |
| 1M events/month | 50,000 lost | 5,000 lost | 45,000 saved |
| 10M events/month | 500,000 lost | 50,000 lost | 450,000 saved |
Most of the gap between 95% and 99.5% is closed by good retry logic — not by making the system more reliable in the first place. First-attempt failures happen; retry recovers them.
GetHook's retry logic, combined with per-destination dead-letter queues and manual replay, achieves > 99.5% event delivery in production. See how it works →