Every time I tell another engineer that GetHook runs its delivery queue on Postgres without Redis or Kafka, I get some variation of: "Really? Is that production-grade?"
Yes. Here's why it works, how it works, and when you should use a different approach.
## The Postgres Job Queue Pattern

The core idea is simple. You have an `events` table with a `status` column and a `next_attempt_at` timestamp. Your worker polls for events that are ready to process:
```sql
-- Worker claims a batch of events atomically
SELECT id, payload, destination_id, attempt_number
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at <= NOW()
ORDER BY next_attempt_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
```
`FOR UPDATE SKIP LOCKED` is the magic. It:

- Locks the selected rows for the duration of the current transaction
- Skips any rows already locked by another transaction
This means you can run 10 worker processes on the same table, and they'll each pick up different events. No coordination required, no message broker, no Redis.
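The claim-and-skip behavior is easy to picture without a database. In this toy Python sketch (illustrative only, not GetHook's code), each "row" carries a try-lock, and a worker that can't acquire a row's lock simply moves on — exactly what `SKIP LOCKED` does:

```python
import threading

class Row:
    """A stand-in for a table row; the lock plays the role of a row lock."""
    def __init__(self, rid):
        self.id = rid
        self.lock = threading.Lock()

def claim_batch(rows, limit):
    """Claim up to `limit` rows, skipping any row another worker has locked."""
    claimed = []
    for row in rows:
        if len(claimed) >= limit:
            break
        # acquire(blocking=False) returns False instead of waiting: "skip locked"
        if row.lock.acquire(blocking=False):
            claimed.append(row)
    return claimed

rows = [Row(i) for i in range(6)]
batch_a = claim_batch(rows, 3)  # first worker claims rows 0-2
batch_b = claim_batch(rows, 3)  # second worker skips those, claims 3-5
```

The two batches never overlap, which is the whole point: workers coordinate through row locks alone.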
After claiming a batch, the worker marks each event `'delivering'`, attempts delivery, and on success updates the status:
```sql
UPDATE events
SET status = 'delivered',
    delivered_at = NOW()
WHERE id = $1;
```
If delivery fails, schedule a retry:
```sql
UPDATE events
SET status = 'retry_scheduled',
    next_attempt_at = NOW() + INTERVAL '30 seconds',
    attempt_number = attempt_number + 1
WHERE id = $1;
```
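The SQL above uses a flat 30-second delay. A common refinement — not shown in GetHook's queries, so treat this as a sketch — is exponential backoff, where each attempt waits roughly twice as long as the last, capped at a maximum:

```python
def retry_delay_seconds(attempt_number: int, base: int = 30, cap: int = 3600) -> int:
    """Delay before the next attempt: 30s, 60s, 120s, ... capped at 1 hour."""
    return min(base * (2 ** attempt_number), cap)
```

The computed value would replace the fixed `INTERVAL '30 seconds'` when scheduling the retry; adding random jitter on top helps avoid thundering-herd retries against a recovering destination.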
## The Complete Schema
Here's the schema GetHook uses for events and delivery tracking:
```sql
CREATE TYPE event_status AS ENUM (
    'received',
    'queued',
    'delivering',
    'delivered',
    'retry_scheduled',
    'dead_letter',
    'replayed'
);

CREATE TABLE events (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id        UUID NOT NULL REFERENCES accounts(id),
    source_id         UUID REFERENCES sources(id),
    destination_id    UUID REFERENCES destinations(id),
    original_event_id UUID REFERENCES events(id),  -- set when an event is replayed
    event_type        TEXT,
    payload           JSONB NOT NULL,
    status            event_status NOT NULL DEFAULT 'received',
    attempt_number    INTEGER NOT NULL DEFAULT 0,
    max_attempts      INTEGER NOT NULL DEFAULT 5,
    next_attempt_at   TIMESTAMPTZ DEFAULT NOW(),
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),  -- used by the crash-cleanup job
    delivered_at      TIMESTAMPTZ
);

-- Critical indexes for the worker poll query
CREATE INDEX idx_events_queue
    ON events (status, next_attempt_at)
    WHERE status IN ('queued', 'retry_scheduled');

CREATE INDEX idx_events_account
    ON events (account_id, created_at DESC);
```
The partial index `WHERE status IN ('queued', 'retry_scheduled')` is critical. Without it, the worker scans the entire `events` table on every poll. With it, the index only covers undelivered events — typically a small fraction of the total.
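The `event_status` enum implies a state machine. The exact set of legal transitions is my inference from this post, not GetHook's documented rules, but writing it down makes the lifecycle explicit and cheap to enforce in application code:

```python
# Legal status transitions, inferred from the lifecycle described in the post.
TRANSITIONS = {
    "received":        {"queued"},
    "queued":          {"delivering"},
    "delivering":      {"delivered", "retry_scheduled", "dead_letter", "queued"},
    # 'delivering' -> 'queued' happens via the crash-cleanup job
    "retry_scheduled": {"delivering"},
    "delivered":       set(),        # terminal
    "dead_letter":     {"replayed"}, # terminal until a manual replay
    "replayed":        set(),        # terminal; the replay is a new row
}

def can_transition(old: str, new: str) -> bool:
    """Return True if moving an event from `old` to `new` status is allowed."""
    return new in TRANSITIONS.get(old, set())
```

Guarding status updates with a check like this catches worker bugs (say, delivering an already-delivered event) before they reach the database.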
## Performance Benchmarks
Here's what to expect from a Postgres job queue on typical cloud hardware:
| Events / Second | Workers | DB CPU | Latency (p50) | Notes |
|---|---|---|---|---|
| 10 | 1 | < 5% | < 100ms | Single process |
| 100 | 5 | ~15% | < 200ms | Comfortable range |
| 500 | 20 | ~40% | < 500ms | Works well |
| 1,000 | 50 | ~65% | ~1s | Approaching limits |
| 2,000 | 100 | ~85% | ~3s | Consider partitioning |
| 5,000+ | — | — | — | Upgrade to Kafka |
For most SaaS companies (< 1,000 events/second ≈ 86M events/day), Postgres is more than sufficient and dramatically simpler than Kafka.
## Why Not Redis?
Redis is often the first thing people reach for when they need a queue. `LPUSH` and `BRPOP` are fast, simple, and well-understood.
| Concern | PostgreSQL | Redis |
|---|---|---|
| Durability | WAL-based, fully durable | Depends on appendonly config |
| Atomic claim semantics | Built in via `FOR UPDATE SKIP LOCKED` | Needs `LMOVE` + Lua scripting |
| Dead-letter queue | Just another row status | Separate list, manual management |
| Event history / replay | Rows persist, query anytime | Keys expire, no history |
| Visibility into queue | SQL query | Separate tooling (RedisInsight) |
| Additional infrastructure | Not needed (you have Postgres) | Another service to run + monitor |
| Cost | $0 additional | $50–$500/month for managed Redis |
The key insight: you already have Postgres. Webhook events need to be stored anyway (for replay, delivery logs, audit trails). Storing them in the same Postgres instance that serves as the queue means you don't need a separate system.
## Why Not Kafka?
Kafka is excellent for very high-throughput event streaming (millions of events/second) with multiple consumer groups. It's also:
- Complex to operate (ZooKeeper or KRaft, brokers, partitions, consumer groups)
- Expensive in managed form ($200–$2,000/month on Confluent Cloud)
- Overkill for < 10K events/second
For GetHook's target use case (SMB to mid-market SaaS, up to ~1M events/month), Postgres is the right tool. We can always add Kafka later if a customer's scale demands it. Starting with Kafka means paying Kafka's operational complexity from day one.
## Handling Worker Crashes: The Heartbeat Problem
One subtlety: if a worker crashes while holding a lock (between claiming an event and updating its status), the lock releases automatically when the database connection closes.
But here's the edge case: what if the worker successfully delivers the event to the destination but crashes before updating the database? The event is stuck in `'delivering'` status forever.
GetHook's fix: A cleanup job runs every 5 minutes:
```sql
-- Reset events stuck in 'delivering' for more than 2 minutes
UPDATE events
SET status = 'queued',
    next_attempt_at = NOW()
WHERE status = 'delivering'
  AND updated_at < NOW() - INTERVAL '2 minutes';
```
This might cause a duplicate delivery (the destination gets the same event twice), but that's preferable to losing the event. Combined with idempotency keys, this is safe.
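On the receiving end, idempotency is straightforward if each delivery carries the event's unique ID and the consumer records the IDs it has already processed. A minimal sketch (the function and storage here are illustrative, not prescribed by GetHook):

```python
# In production this would be a database table or a cache with a TTL,
# not an in-process set.
processed_ids = set()

def handle_webhook(event_id: str, payload: dict) -> str:
    """Process a delivery at most once, even if the sender retries it."""
    if event_id in processed_ids:
        return "duplicate-ignored"  # retry of an event we already handled
    processed_ids.add(event_id)
    # ... real business logic goes here ...
    return "processed"

first = handle_webhook("evt-123", {"type": "order.paid"})
second = handle_webhook("evt-123", {"type": "order.paid"})  # duplicate delivery
```

With this in place, the occasional duplicate from the cleanup job is harmless: the consumer simply ignores it.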
## Dead-Letter Queue Management

When an event reaches `max_attempts`, it moves to `'dead_letter'` status:
```sql
UPDATE events
SET status = 'dead_letter'
WHERE id = $1
  AND attempt_number >= max_attempts;
```
Dead-letter events are visible in the GetHook dashboard. You can inspect the delivery attempts, see the error responses, fix the underlying issue, and then replay the event:
```sql
-- Replay a dead-letter event (both statements in one transaction,
-- so the copy and the status flip succeed or fail together)
BEGIN;

INSERT INTO events (
    account_id, source_id, destination_id, event_type,
    payload, status, original_event_id
)
SELECT account_id, source_id, destination_id, event_type,
       payload, 'queued', id
FROM events
WHERE id = $1;

UPDATE events SET status = 'replayed' WHERE id = $1;

COMMIT;
```
## Scaling Beyond Postgres
If you outgrow the Postgres queue (typically > 5,000 events/second sustained), the migration path is:
- Partition the `events` table by `account_id` or `created_at` — extends Postgres throughput significantly
- Upgrade to Amazon SQS — managed, scales to any throughput, but no replay/history
- Upgrade to Kafka or Redpanda — best for multi-consumer scenarios at extreme scale
For GetHook, we've designed the delivery layer with an abstraction so we can swap the queue backend without changing the rest of the system. But for the vast majority of customers, Postgres is the right choice and we don't plan to change it.
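The post doesn't show that abstraction, but its shape is roughly this: a small interface the delivery loop codes against, with the Postgres implementation hidden behind it. All names here are illustrative, not GetHook's actual code:

```python
from typing import Any, List, Protocol

class QueueBackend(Protocol):
    """The minimal surface the delivery loop depends on."""
    def claim(self, limit: int) -> List[Any]: ...
    def mark_delivered(self, event_id: Any) -> None: ...
    def schedule_retry(self, event_id: Any, delay_seconds: int) -> None: ...

class InMemoryBackend:
    """Toy implementation for tests; a Postgres backend would issue the
    SKIP LOCKED / UPDATE statements shown earlier, an SQS backend would
    call the SQS API, and the delivery loop would never know the difference."""
    def __init__(self, events):
        self.pending = list(events)
        self.delivered = []

    def claim(self, limit):
        batch, self.pending = self.pending[:limit], self.pending[limit:]
        return batch

    def mark_delivered(self, event_id):
        self.delivered.append(event_id)

    def schedule_retry(self, event_id, delay_seconds):
        self.pending.append(event_id)

backend = InMemoryBackend(["e1", "e2", "e3"])
for ev in backend.claim(2):
    backend.mark_delivered(ev)
```

Swapping the queue backend then means writing one new class, not rewriting the delivery pipeline.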
## The Underrated Advantage: Observability
With Postgres as your queue, every event is a row in a table. You can:
```sql
-- Events in dead-letter for an account
SELECT * FROM events
WHERE account_id = $1
  AND status = 'dead_letter'
ORDER BY created_at DESC;

-- Delivery success rate over the last 24 hours
SELECT
    destination_id,
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE status = 'delivered') AS delivered,
    ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'delivered') / COUNT(*), 2) AS success_rate
FROM events
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY destination_id;

-- Average retry count for dead-lettered events
SELECT AVG(attempt_number) FROM events WHERE status = 'dead_letter';
```
With Kafka or Redis, you'd need a separate data store for this kind of historical analysis. With Postgres, it's just SQL.
## Conclusion

The Postgres job queue pattern is underrated. For event volumes into the tens of millions per day, it provides:
- Full durability (WAL-backed)
- Concurrent worker support (`FOR UPDATE SKIP LOCKED`)
- Event history and replay (just rows in a table)
- Zero additional infrastructure
- SQL-based observability
Don't reach for Redis or Kafka until your volume genuinely demands it. Start simple, instrument well, and scale when you have data that says you need to.
This is the pattern GetHook uses in production. If you want to see the full implementation, the code is on GitHub.