Webhooks seem simple. A third-party service sends an HTTP POST to your server, your server processes it, responds with 200 OK. Done.
In practice, this breaks constantly — and when it breaks, you lose real data. Payment confirmations disappear. Order fulfillment stalls. Customer accounts don't get provisioned.
Here's a systematic breakdown of why webhooks fail, with concrete numbers and engineering patterns to fix each failure mode.
The State of Webhook Reliability
Before diving into failure modes, let's set the baseline. Based on analysis across production systems:
| Metric | Typical Value |
|---|---|
| First-attempt delivery success rate | 94–97% |
| Events lost without retry infrastructure | 3–6% |
| Average retry time to success (2nd attempt) | 2–5 minutes |
| Events successfully delivered by 3rd attempt | 99.2% |
| Events that reach dead-letter (5+ failures) | 0.1–0.3% |
A 3–6% first-attempt failure rate sounds small, but at 1M events/month that's 30,000–60,000 lost events per month without retry infrastructure.
Failure Mode #1: Timeout During Processing
What happens: Your webhook handler accepts the request, starts processing (database write, external API call, etc.), and the processing takes longer than the provider's timeout window.
Timeout windows vary by provider: Stripe allows roughly 30 seconds, GitHub 10 seconds, Shopify 5 seconds. If your handler doesn't respond within that window, the provider marks the delivery as failed and schedules a retry.
The hidden problem: Your handler might have already processed the event (partially or fully) before timing out. Now you get a retry for an event you already handled. Hello, duplicate charges.
Fix pattern:
1. Respond with 200 immediately (within 200ms)
2. Push event to durable queue
3. Process asynchronously
4. Implement idempotency keys
This is the "accept fast, process durably" pattern. Your webhook handler should do nothing except validate the signature and enqueue the event.
Failure Mode #2: Your Server Was Restarted
What happens: You deploy a new version. Your server restarts. For 2–30 seconds, no process is listening on port 8080. Any webhook arriving in that window gets a connection refused error.
Frequency: Every deploy. If you deploy 5 times per day and each deploy has a 5-second gap, that's 25 seconds of unavailability — enough to miss 3–10 events at typical SaaS webhook volumes.
Fix: Use a webhook gateway (like GetHook) that accepts events 24/7 and delivers them to your backend. Your backend restarts don't affect event acceptance — events queue and retry automatically.
Failure Mode #3: Database Write Failure
What happens: Your webhook handler writes to the database. The database is briefly unavailable (connection pool exhausted, replica failover, maintenance window). The handler returns 500. The provider retries.
The problem: Retries from providers aren't guaranteed to be fast. Stripe retries failed webhooks over 3 days. GitHub retries over 72 hours. If your database was down for 2 minutes, you'll receive duplicates hours later — long after your system has recovered.
Fix: Idempotency. Every event should have a stable ID. Your handler should check "have I already processed event evt_1234?" before doing anything.
```go
// Example idempotency check. Note: in practice the existence check and the
// "processing" mark should be a single atomic operation (e.g. an INSERT
// guarded by a unique constraint, or Redis SETNX), otherwise two concurrent
// retries can both pass the check.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	eventID := r.Header.Get("Stripe-Event-ID") // or equivalent
	if alreadyProcessed(eventID) {
		w.WriteHeader(http.StatusOK) // ack the duplicate
		return
	}
	markProcessing(eventID)
	processEvent(r.Body)
	markProcessed(eventID)
}
```
Failure Mode #4: Memory Spike / OOM Kill
What happens: A large webhook payload (or a burst of small ones) causes your process to spike in memory. The OOM killer terminates your process. Events in flight are lost.
Fix: Never hold event payloads in memory longer than the time to write them to a durable store. Accept → write to queue → acknowledge. The queue is on disk; memory is ephemeral.
Failure Mode #5: Wrong Status Code
This is embarrassingly common. Your handler does the right thing but returns the wrong HTTP status.
| Status you returned | What the provider does |
|---|---|
| 200 OK | ✅ Marks delivered |
| 201 Created | ✅ Most providers accept 2xx |
| 204 No Content | ✅ Most providers accept 2xx |
| 301 Redirect | ❌ Most providers don't follow redirects |
| 400 Bad Request | ❌ Some providers treat as permanent failure (no retry) |
| 401 Unauthorized | ❌ Permanent failure, stops retrying |
| 500 Internal Server Error | ⚠️ Retried, but logged as failure |
| Timeout / no response | ⚠️ Retried |
Fix: Always return 200 OK from webhook endpoints. Return it immediately before processing. If processing fails, let your internal retry mechanism handle it — not the provider's retry.
Failure Mode #6: HMAC Verification Bug
What happens: You implement signature verification, but you have a subtle bug — maybe you're comparing strings instead of doing a constant-time comparison, or you're computing HMAC on the wrong payload encoding. Some events verify correctly, others fail randomly.
The killer: if you return 400 for a signature verification failure, most providers stop retrying permanently and mark the endpoint as broken.
Timing attack vulnerability:
```go
// ❌ Vulnerable — timing side-channel
if computedSig != receivedSig {
	return false
}

// ✅ Safe — constant-time comparison (crypto/hmac)
if !hmac.Equal([]byte(computedSig), []byte(receivedSig)) {
	return false
}
```
Fix: Use a battle-tested library for signature verification. GetHook handles this at the gateway so your handler never sees events that fail signature verification.
Failure Mode #7: Fan-Out Partial Failure
What happens: You receive a payment.succeeded event. You need to:
1. Update your database
2. Send a confirmation email
3. Notify your fulfillment service
4. Update your analytics platform
Steps 1–3 succeed. Step 4 fails. What do you do?
Bad pattern: Roll back everything and return 500. The provider retries. Steps 1–3 run again with duplicates.
Good pattern: Fan out to independent queues, one per destination. Each destination retries independently. A fulfillment service outage doesn't cause email to stop working.
This is what GetHook's routing engine does — each route to each destination is tracked independently. A failure on one destination doesn't affect others.
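The per-destination pattern can be sketched in Go with one bounded queue per destination. The destination names and types are illustrative; in production each queue would be a durable store, not a channel.

```go
package main

type Event struct {
	ID   string
	Body []byte
}

type Destination struct {
	Name  string
	Queue chan Event
}

// fanOut enqueues a copy of the event to every destination independently.
// A full (or failing) queue is recorded for that destination only; the
// others are unaffected and the provider is never asked to redeliver.
func fanOut(e Event, dests []*Destination) map[string]bool {
	enqueued := make(map[string]bool, len(dests))
	for _, d := range dests {
		select {
		case d.Queue <- e:
			enqueued[d.Name] = true
		default:
			enqueued[d.Name] = false // retry later, for this destination only
		}
	}
	return enqueued
}
```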
The Retry Backoff Problem
Even with retry logic, the retry schedule matters enormously.
If your retry logic runs retries too aggressively, you'll hammer a temporarily unavailable destination and make the outage worse. If it retries too slowly, recovery takes hours.
The industry-standard pattern is exponential backoff with jitter:
| Attempt | Delay | Purpose |
|---|---|---|
| 1st | Immediately | Fast path success |
| 2nd | 30 seconds | Short transient failures |
| 3rd | 2 minutes | Brief outages |
| 4th | 10 minutes | Medium-length outages |
| 5th | 1 hour | Longer outages |
| Dead letter | — | Manual investigation |
The jitter (random ±10–20% on each delay) prevents the "thundering herd" problem where many events retry simultaneously after an outage resolves.
Measuring Your Webhook Reliability
If you're operating your own webhook infrastructure, here are the key metrics to track:
| Metric | Target | Alert if |
|---|---|---|
| First-attempt success rate | > 97% | < 95% |
| Delivered within 30s | > 95% | < 90% |
| Dead-letter rate | < 0.5% | > 1% |
| P99 delivery latency | < 5 seconds | > 30 seconds |
| Duplicate delivery rate | < 1% | > 2% |
The Bottom Line
Webhooks fail for predictable, fixable reasons. The engineering patterns are well-understood:
- Accept fast, process durably — queue everything before responding
- Idempotency keys — handle duplicates gracefully
- Fan-out independently — per-destination retry
- Exponential backoff with jitter — smart retry scheduling
- Constant-time signature comparison — no timing vulnerabilities
- Return 200 immediately — let your internal queue handle the rest
Building all of this yourself is achievable. It's also 6–9 weeks of engineering work, as we covered in our cost analysis post.
GetHook implements all seven of these patterns so you don't have to. Try it free →