
Scaling Webhook Infrastructure to 1 Million Events Per Day

Capacity planning, database indexes, worker concurrency, connection pooling, and the exact architectural decisions we made to handle 1M+ daily events on a Postgres-based stack.

Tomasz Brzezinski
Staff Infrastructure Engineer
December 2, 2025
11 min read

One million events per day sounds like a lot. It's approximately 11.6 events per second sustained. For comparison, a medium-sized SaaS company might process 100K–500K webhook events per day during business hours, with 5–10× peak-to-average ratios.

This is entirely achievable on a well-tuned Postgres stack. Here's how we planned for it and what we learned.


Capacity Model: Working Backwards From the Target

Let's start with math.

| Metric | Value |
|---|---|
| Target volume | 1,000,000 events / day |
| Average events / second (sustained) | 11.6 |
| Peak multiplier (business hours vs. off-hours) | 4–5× |
| Peak events / second | ~58 |
| Delivery attempts per event (avg, including retries) | 1.2 |
| Delivery HTTP calls / second at peak | ~70 |
| Average delivery response time | 300 ms |
| Concurrent delivery slots needed at peak | ~70 × 0.3 s ≈ 21 |

So at 1M events/day with typical retry rates, ~21 delivery slots are busy at peak; provisioning ~25 concurrent workers keeps you on top of the queue with headroom.
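The capacity model reduces to a few lines of arithmetic. A quick sketch; every constant here (peak multiplier, retry rate, response time) is an assumption from the model, not a measurement:

```go
package main

import (
	"fmt"
	"math"
)

// deliverySlots estimates concurrent delivery slots needed at peak
// (Little's law: concurrency = arrival rate × service time).
func deliverySlots(eventsPerDay, peakMultiplier, attemptsPerEvent, responseSeconds float64) float64 {
	sustained := eventsPerDay / 86400         // events/second, averaged over the day
	peak := sustained * peakMultiplier        // events/second at peak
	callsPerSecond := peak * attemptsPerEvent // HTTP calls/second, retries included
	return callsPerSecond * responseSeconds   // slots busy at any instant
}

func main() {
	slots := deliverySlots(1_000_000, 5, 1.2, 0.3)
	fmt.Printf("concurrent delivery slots at peak: %.0f\n", math.Ceil(slots)) // → 21
}
```

The useful property of this model: halving delivery response time halves the worker count you need, which is one more argument for short timeouts later in this post.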


The Database Tier

Storage Sizing

At 1M events/day:

| Data | Estimate | Storage |
|---|---|---|
| Event rows (1M/day × 365 days × 1 KB avg) | 365M rows | ~365 GB / year |
| Delivery attempt rows (1.2× events, ~500 B avg) | 438M rows | ~219 GB / year |
| Indexes | ~30% overhead | ~175 GB / year |
| Total per year | | ~760 GB |

You won't retain all events forever. GetHook's default retention is 90 days, which caps storage at:

  • 1M events/day × 90 days × 1 KB ≈ 90 GB (event rows)
  • 1.2M attempts/day × 90 days × 500 B ≈ 54 GB (attempt rows)
  • + ~30% index overhead ≈ 43 GB
  • ≈ 187 GB total

A 250 GB Postgres instance handles this comfortably.
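The retention math above can be checked in a few lines of Go; the row sizes (1 KB per event, 500 B per attempt) are the same assumptions as in the sizing table:

```go
package main

import "fmt"

// retentionGB estimates steady-state storage for a rolling retention window.
// Row sizes and index overhead are assumed averages, not measurements.
func retentionGB(eventsPerDay, retentionDays, eventBytes, attemptBytes, attemptsPerEvent, indexOverhead float64) float64 {
	const gb = 1e9
	events := eventsPerDay * retentionDays * eventBytes / gb
	attempts := eventsPerDay * attemptsPerEvent * retentionDays * attemptBytes / gb
	return (events + attempts) * (1 + indexOverhead)
}

func main() {
	// 90-day retention at 1M events/day, 1 KB events, 500 B attempts, 30% indexes.
	fmt.Printf("~%.0f GB at 90-day retention\n", retentionGB(1_000_000, 90, 1000, 500, 1.2, 0.30))
}
```

Because retention caps the table at a rolling window, storage stops growing once you pass day 90; only event volume growth changes the number after that.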

Query Performance

The two hot queries are:

1. Worker poll (runs continuously, every 1–5 seconds per worker):

sql
SELECT id, payload, destination_id, attempt_number
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at <= NOW()
ORDER BY next_attempt_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED

2. Event list (dashboard, runs per user request):

sql
SELECT * FROM events
WHERE account_id = $1
ORDER BY created_at DESC
LIMIT 50

Both are fast with the right indexes. Here's what you need:

sql
-- Worker poll index — partial, covers only undelivered events
CREATE INDEX idx_events_queue
  ON events (next_attempt_at, status)
  WHERE status IN ('queued', 'retry_scheduled');

-- Dashboard list index
CREATE INDEX idx_events_account_time
  ON events (account_id, created_at DESC);

-- Status filter (for filtering by status in dashboard)
CREATE INDEX idx_events_account_status
  ON events (account_id, status, created_at DESC);

Without the partial index on the worker poll, the query scans all events. At 1M rows, this becomes visibly slow. The partial index only includes undelivered events (a small fraction), so it stays small and fast regardless of total table size.

Write Throughput

At 1M events/day, you're inserting ~11.6 rows/second. Postgres handles tens of thousands of writes per second on modest hardware. Inserts are not a bottleneck at this scale.

What can become a bottleneck: status updates. Every delivery attempt updates the event row. At 14 updates/second (11.6 events × 1.2 attempts), this is still well within Postgres limits.


Connection Pooling

Each worker process maintains a database connection pool. With 25 workers and 10 connections each, that's 250 total connections to Postgres.

Default Postgres max connections is 100. You need either:

  • Increase max_connections (up to 1000 is reasonable with sufficient memory)
  • Use a connection pooler (PgBouncer in transaction mode)

GetHook uses PgBouncer in production:

ini
[databases]
gethook = host=127.0.0.1 port=5432 dbname=gethook

[pgbouncer]
pool_mode = transaction    ; transaction-level pooling
max_client_conn = 500      ; max total connections from app
default_pool_size = 20     ; actual connections to Postgres per database

Transaction-mode pooling allows 500 application connections to share 20 Postgres connections. This is possible because each worker holds the connection for less than 100ms per transaction.
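The arithmetic behind that claim is Little's law: the number of server connections busy at any instant equals the transaction rate times the transaction duration, independent of how many clients are connected. A quick sketch with this post's numbers (~70 transactions/second at peak, under 100 ms each, both taken from earlier sections):

```go
package main

import "fmt"

// busyServerConns applies Little's law: concurrency = rate × duration.
// With transaction pooling, this is the number of real Postgres
// connections PgBouncer needs, regardless of client connection count.
func busyServerConns(txPerSecond, txSeconds float64) float64 {
	return txPerSecond * txSeconds
}

func main() {
	// ~70 transactions/second at peak, each holding a connection <100ms.
	fmt.Printf("Postgres connections busy at peak: ~%.0f\n", busyServerConns(70, 0.1))
	// default_pool_size = 20 therefore leaves roughly 3× headroom.
}
```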


Worker Architecture

Single Process vs. Multiple Workers

You have two choices for worker parallelism:

Option A: Multiple goroutines in one process

go
// One process, many goroutines
for i := 0; i < workerCount; i++ {
    go worker.Run(ctx)
}

Option B: Multiple processes (horizontally scaled)

worker-1: process running on server A
worker-2: process running on server A
worker-3: process running on server B
worker-4: process running on server B

GetHook uses Option A for simplicity — one worker process with configurable concurrency. For horizontal scaling, you run multiple instances of the worker binary, and FOR UPDATE SKIP LOCKED handles coordination automatically.

Worker Concurrency Model

go
// Internal worker concurrency
type Worker struct {
    db           *sql.DB
    concurrency  int           // configurable: default 10
    pollInterval time.Duration
}

func (w *Worker) Run(ctx context.Context) error {
    sem := make(chan struct{}, w.concurrency)

    for {
        select {
        case <-ctx.Done():
            return ctx.Err() // stop polling on shutdown
        default:
        }

        events := w.pollEvents(ctx)
        for _, event := range events {
            sem <- struct{}{}  // acquire slot
            go func(e Event) {
                defer func() { <-sem }()  // release slot
                w.deliver(ctx, e)
            }(event)
        }

        if len(events) == 0 {
            time.Sleep(w.pollInterval)
        }
    }
}

With concurrency = 10 and an average delivery time of 300ms, a single worker process handles ~33 deliveries/second. Two workers handle ~66/second — more than enough for 1M events/day.


Outbound HTTP Performance

The delivery step — actually making the HTTP call to the destination — dominates worker latency.

Timeout Strategy

| Destination timeout | Use case |
|---|---|
| 5 seconds | Default, fast destinations |
| 15 seconds | Slower destinations (LLM APIs, batch processors) |
| 30 seconds | Custom configuration, never higher |

Long timeouts tie up worker goroutines. With 10 workers each capable of 10 concurrent deliveries, and an average destination timeout of 30 seconds, you could theoretically block all 100 concurrent delivery slots waiting for timeouts.

Fix: Keep default timeouts short (5s). Let users increase per-destination if needed.

Connection Reuse

GetHook uses a single shared http.Client with connection reuse enabled:

go
httpClient = &http.Client{
    Timeout: 10 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 20,
        IdleConnTimeout:     90 * time.Second,
    },
}

Without connection reuse, each delivery requires a full TCP handshake + TLS handshake (~100–300ms). With a keep-alive pool, subsequent deliveries to the same host reuse the existing connection (~5ms).

At 1M events/day to a handful of destinations, connection reuse can save 100–300ms per delivery for the majority of events.


Observability at Scale

With 1M events/day, you need metrics to understand what's happening:

| Metric | How to track |
|---|---|
| Events queued | Gauge: SELECT COUNT(*) FROM events WHERE status = 'queued' |
| Worker throughput | Counter: increment on each delivery attempt |
| Dead-letter rate | Gauge: SELECT COUNT(*) FROM events WHERE status = 'dead_letter' AND created_at > NOW() - INTERVAL '1 hour' |
| Delivery latency | Histogram: time from created_at to delivered_at |
| Destination success rate | Gauge per destination |

The events-in-queue gauge should be sampled roughly every 30 seconds; thanks to the partial index, the COUNT stays fast regardless of total table size.


Load Testing Results

Here's what GetHook's Postgres-backed stack benchmarks at on a db.t3.medium RDS instance (2 vCPU, 4GB RAM):

| Scenario | Events/sec | p50 latency | p99 latency | Worker CPU |
|---|---|---|---|---|
| Single worker, 10 concurrent | 32 | 280 ms | 1.2 s | 15% |
| 2 workers, 10 concurrent each | 58 | 310 ms | 1.5 s | 28% |
| 5 workers, 10 concurrent each | 120 | 320 ms | 1.8 s | 65% |
| 10 workers, 10 concurrent each | 180 | 380 ms | 2.4 s | 88% |

1M events/day ≈ 11.6 events/second sustained, with ~58 peak. A single worker with 10 concurrent deliveries handles this load with room to spare.


The Right Infrastructure Size

For 1M events/day (90-day retention):

| Component | Recommended | Monthly cost (AWS) |
|---|---|---|
| API server | t3.small (1 vCPU, 2 GB) | ~$15 |
| Worker process | t3.small (1 vCPU, 2 GB) | ~$15 |
| PostgreSQL (RDS) | db.t3.medium (2 vCPU, 4 GB, 250 GB SSD) | ~$120 |
| PgBouncer | Runs on worker instance | $0 |
| Load balancer | ALB | ~$20 |
| Total | | ~$170/month |

Compare this to the cost of building and maintaining this infrastructure yourself (from our cost analysis post).


Summary

1M events/day is a comfortable mid-range scale for a Postgres-backed webhook system. Key decisions that make it work:

  1. Partial indexes on the worker poll query — critical for performance at scale
  2. FOR UPDATE SKIP LOCKED — clean concurrent worker coordination without a broker
  3. PgBouncer — allows many worker goroutines without exhausting Postgres connections
  4. Short HTTP timeouts — prevents stuck workers from blocking the queue
  5. Connection reuse — eliminates redundant TLS handshakes to repeated destinations

For most SaaS companies, this architecture handles their full growth curve without ever needing to add Redis or Kafka.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.