Webhooks seem simple. A third-party service sends an HTTP POST to your server, your server processes it, responds with 200 OK. Done.
In practice, this breaks constantly — and when it breaks, you lose real data. Payment confirmations disappear. Order fulfillment stalls. Customer accounts don't get provisioned.
Here's a systematic breakdown of why webhooks fail, with concrete numbers and engineering patterns to fix each failure mode.
The State of Webhook Reliability
Before diving into failure modes, let's set the baseline. Based on analysis across production systems:
| Metric | Typical Value |
|---|---|
| First-attempt delivery success rate | 94–97% |
| Events lost without retry infrastructure | 3–6% |
| Average retry time to success (2nd attempt) | 2–5 minutes |
| Events successfully delivered by 3rd attempt | 99.2% |
| Events that reach dead-letter (5+ failures) | 0.1–0.3% |
A 3–6% first-attempt failure rate sounds small, but at 1M events/month that's 30,000–60,000 lost events per month without retry infrastructure.
Failure Mode #1: Timeout During Processing
What happens: Your webhook handler accepts the request, starts processing (database write, external API call, etc.), and the processing takes longer than the provider's timeout window.
Timeout windows vary by provider: Stripe allows roughly 30 seconds, GitHub 10 seconds, Shopify 5 seconds. If your handler doesn't respond within that window, the provider marks the delivery as failed and schedules a retry.
The hidden problem: Your handler might have already processed the event (partially or fully) before timing out. Now you get a retry for an event you already handled. Hello, duplicate charges.
Fix pattern:
1. Respond with 200 immediately (within 200ms)
2. Push event to durable queue
3. Process asynchronously
4. Implement idempotency keys
This is the "accept fast, process durably" pattern. Your webhook handler should do nothing except validate the signature and enqueue the event.
Failure Mode #2: Your Server Was Restarted
What happens: You deploy a new version. Your server restarts. For 2–30 seconds, no process is listening on port 8080. Any webhook arriving in that window gets a connection refused error.
Frequency: Every deploy. If you deploy 5 times per day and each deploy has a 5-second gap, that's 25 seconds of unavailability — enough to miss 3–10 events at typical SaaS webhook volumes.
Fix: Use a webhook gateway (like GetHook) that accepts events 24/7 and delivers them to your backend. Your backend restarts don't affect event acceptance — events queue and retry automatically.
Failure Mode #3: Database Write Failure
What happens: Your webhook handler writes to the database. The database is briefly unavailable (connection pool exhausted, replica failover, maintenance window). The handler returns 500. The provider retries.
The problem: Retries from providers aren't guaranteed to be fast. Stripe retries failed webhooks over 3 days. GitHub retries over 72 hours. If your database was down for 2 minutes, you'll receive duplicates hours later — long after your system has recovered.
Fix: Idempotency. Every event should have a stable ID. Your handler should check "have I already processed event evt_1234?" before doing anything.
```go
// Example idempotency check. Note: in practice the existence check and the
// "processing" mark should be a single atomic operation (e.g. an INSERT
// guarded by a unique constraint, or Redis SETNX), otherwise two concurrent
// retries can both pass the check.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	eventID := r.Header.Get("Stripe-Event-ID") // or equivalent
	if alreadyProcessed(eventID) {
		w.WriteHeader(http.StatusOK) // ack the duplicate
		return
	}
	markProcessing(eventID)
	processEvent(r.Body)
	markProcessed(eventID)
}
```
Failure Mode #4: Memory Spike / OOM Kill
What happens: A large webhook payload (or a burst of small ones) causes your process to spike in memory. The OOM killer terminates your process. Events in flight are lost.
Fix: Never hold event payloads in memory longer than the time to write them to a durable store. Accept → write to queue → acknowledge. The queue is on disk; memory is ephemeral.
Failure Mode #5: Wrong Status Code
This is embarrassingly common. Your handler does the right thing but returns the wrong HTTP status.
| Status you returned | What the provider does |
|---|---|
| 200 OK | ✅ Marks delivered |
| 201 Created | ✅ Most providers accept 2xx |
| 204 No Content | ✅ Most providers accept 2xx |
| 301 Redirect | ❌ Most providers don't follow redirects |
| 400 Bad Request | ❌ Some providers treat as permanent failure (no retry) |
| 401 Unauthorized | ❌ Permanent failure, stops retrying |
| 500 Internal Server Error | ⚠️ Retried, but logged as failure |
| Timeout / no response | ⚠️ Retried |
Fix: Always return 200 OK from webhook endpoints. Return it immediately before processing. If processing fails, let your internal retry mechanism handle it — not the provider's retry.
Failure Mode #6: HMAC Verification Bug
What happens: You implement signature verification, but you have a subtle bug — maybe you're comparing strings instead of doing a constant-time comparison, or you're computing HMAC on the wrong payload encoding. Some events verify correctly, others fail randomly.
The killer: if you return 400 for a signature verification failure, most providers stop retrying permanently and mark the endpoint as broken.
Timing attack vulnerability:
```go
// ❌ Vulnerable — timing side-channel
if computedSig != receivedSig {
	return false
}

// ✅ Safe — constant-time comparison (crypto/hmac)
if !hmac.Equal([]byte(computedSig), []byte(receivedSig)) {
	return false
}
```
Fix: Use a battle-tested library for signature verification. GetHook handles this at the gateway so your handler never sees events that fail signature verification.
Failure Mode #7: Fan-Out Partial Failure
What happens: You receive a payment.succeeded event. You need to:
1. Update your database
2. Send a confirmation email
3. Notify your fulfillment service
4. Update your analytics platform
Steps 1–3 succeed. Step 4 fails. What do you do?
Bad pattern: Roll back everything and return 500. The provider retries. Steps 1–3 run again with duplicates.
Good pattern: Fan out to independent queues, one per destination. Each destination retries independently. A fulfillment service outage doesn't cause email to stop working.
This is what GetHook's routing engine does — each route to each destination is tracked independently. A failure on one destination doesn't affect others.
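The per-destination pattern can be sketched in Go with one bounded queue per destination. The destination names and types are illustrative; in production each queue would be a durable store, not a channel.

```go
package main

type Event struct {
	ID   string
	Body []byte
}

type Destination struct {
	Name  string
	Queue chan Event
}

// fanOut enqueues a copy of the event to every destination independently.
// A full (or failing) queue is recorded for that destination only; the
// others are unaffected and the provider is never asked to redeliver.
func fanOut(e Event, dests []*Destination) map[string]bool {
	enqueued := make(map[string]bool, len(dests))
	for _, d := range dests {
		select {
		case d.Queue <- e:
			enqueued[d.Name] = true
		default:
			enqueued[d.Name] = false // retry later, for this destination only
		}
	}
	return enqueued
}
```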
The Retry Backoff Problem
Even with retry logic, the retry schedule matters enormously.
If your retry logic runs retries too aggressively, you'll hammer a temporarily unavailable destination and make the outage worse. If it retries too slowly, recovery takes hours.
The industry-standard pattern is exponential backoff with jitter:
| Attempt | Delay | Purpose |
|---|---|---|
| 1st | Immediately | Fast path success |
| 2nd | 30 seconds | Short transient failures |
| 3rd | 2 minutes | Brief outages |
| 4th | 10 minutes | Medium-length outages |
| 5th | 1 hour | Longer outages |
| Dead letter | — | Manual investigation |
The jitter (random ±10–20% on each delay) prevents the "thundering herd" problem where many events retry simultaneously after an outage resolves.
Measuring Your Webhook Reliability
If you're operating your own webhook infrastructure, here are the key metrics to track:
| Metric | Target | Alert if |
|---|---|---|
| First-attempt success rate | > 97% | < 95% |
| Delivered within 30s | > 95% | < 90% |
| Dead-letter rate | < 0.5% | > 1% |
| P99 delivery latency | < 5 seconds | > 30 seconds |
| Duplicate delivery rate | < 1% | > 2% |
The Bottom Line
Webhooks fail for predictable, fixable reasons. The engineering patterns are well-understood:
- Accept fast, process durably — queue everything before responding
- Idempotency keys — handle duplicates gracefully
- Fan-out independently — per-destination retry
- Exponential backoff with jitter — smart retry scheduling
- Constant-time signature comparison — no timing vulnerabilities
- Return 200 immediately — let your internal queue handle the rest
Building all of this yourself is achievable. It's also 6–9 weeks of engineering work, as we covered in our cost analysis post.
GetHook implements all seven of these patterns so you don't have to. Try it free →