Building an Event-Driven Payment System That Survived ~1 Million Requests
When a simple payment API becomes a distributed systems problem

Most payment APIs work fine at low traffic. Then you throw a few thousand concurrent requests at them and everything gets interesting.
Client retries turn into duplicate payments. Postgres starts contending on row locks. Kafka redelivers events you already handled. What looked like a simple REST API starts behaving like a distributed systems problem set.
I built SwiftPay to run into those issues on purpose.
The system splits request acceptance from payment settlement. Kafka, Redis, PostgreSQL, Spring Boot. The gateway returns HTTP 202 Accepted immediately. A separate ledger service settles transfers asynchronously, inside real transactions.
I wanted to see where the design would break, so I load-tested the gateway at about 250 requests/second. That run processed close to 1 million payment requests with a 0.01% HTTP failure rate and roughly 111 ms p95 latency on the accept path.
I also hit a PgBouncer transaction-pooling issue that made Hibernate's server-side prepared statements fail under load. That forced me to think about transaction boundaries, idempotency, consumer retries, and concurrency, not just payment CRUD.
What follows is what broke, why it broke, and what I changed.
| Metric | Result |
|---|---|
| Requests processed | ~999,809 |
| Sustained rate | 250 req/s |
| p95 latency | ~111 ms |
| HTTP failures | 0.01% |
| Stack | Kafka, Redis, PostgreSQL, Spring Boot |
GitHub: github.com/Raghav-byte/Swift-pay
Architecture
Three Spring Boot services share one PostgreSQL database in this repo. I wouldn't ship it that way in production, but it kept the repo simple and let me focus on the accept / settle / event boundaries.
| Service | Responsibility |
|---|---|
| payment-gateway (:8080) | Idempotency, cached balance hint, PENDING insert, Kafka after commit |
| ledger-service (:8081) | Consumes payment-initiated, ordered row locks, settlement |
| analytics-worker (:8082) | Append-only metrics on payment-completed (best-effort) |
Figure 1: SwiftPay separates request acceptance from settlement, so HTTP latency doesn't have to wait on database lock contention.
Gateway validates idempotency and balance, writes PENDING, publishes payment-initiated. Ledger locks accounts, debits/credits, emits payment-completed. Analytics worker aggregates events.
Infra: Supabase pooler :6543, Redis on gateway only, Kafka + ledger DLQ.
Key engineering lessons
| Topic | Takeaway |
|---|---|
| Accept vs settle | Keep the client-facing path short; move contended DB work behind a durable boundary. |
| PgBouncer + Hibernate | Transaction pooling breaks server-side prepared statements unless JDBC/Hibernate are configured for it. |
| Kafka parallelism | Consumer parallelism is capped by the topic's partition count, not by thread count alone. |
| Redis | Safe for idempotency hints and cache-aside reads; not for authoritative balances. |
| Idempotency | Required on the accept path (client retries) and implied on consumers (at-least-once delivery). |
| afterCommit | Never publish to Kafka or mark Redis "processed" until the DB transaction commits. |
Why not process payments synchronously?
A naïve payment API:
Client → API → DB updates → Response
At low traffic that's fine. Under concurrency, synchronous settlement ties up request threads, makes duplicate transfers worse when clients retry, couples API latency to DB lock time, and makes partial failures harder to isolate.
SwiftPay uses two phases:
Accept: validate, insert PENDING, return 202
Settle: debit/credit under SELECT … FOR UPDATE in the ledger service
The gateway records intent. The ledger enforces invariants. That split is what makes idempotency keys, consumer retries, DLQs, and async analytics workable without blocking HTTP threads.
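The accept/settle split can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not SwiftPay's code: the in-memory queue stands in for the payment-initiated topic, the map stands in for the transactions table, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the two phases; no Kafka, no database. */
final class AcceptSettleSketch {
    enum Status { PENDING, COMPLETED, FAILED }

    // Stand-in for the transactions table.
    private final Map<String, Status> payments = new HashMap<>();
    // Stand-in for the payment-initiated topic.
    private final Queue<String> initiated = new ArrayDeque<>();

    /** Accept phase: record intent, hand off, return an HTTP-like status code. */
    int accept(String transactionId) {
        payments.put(transactionId, Status.PENDING);
        initiated.add(transactionId);
        return 202;
    }

    /** Settle phase: runs later, off the request thread, and sets the terminal status. */
    void settleNext(boolean fundsAvailable) {
        String transactionId = initiated.poll();
        if (transactionId == null) {
            return; // nothing waiting to be settled
        }
        payments.put(transactionId, fundsAvailable ? Status.COMPLETED : Status.FAILED);
    }

    Status status(String transactionId) {
        return payments.get(transactionId);
    }

    public static void main(String[] args) {
        AcceptSettleSketch sketch = new AcceptSettleSketch();
        int code = sketch.accept("txn-1");
        System.out.println(code + " " + sketch.status("txn-1"));
        sketch.settleNext(true);
        System.out.println(sketch.status("txn-1"));
    }
}
```

The point of the shape is that accept never touches balances, so its latency does not depend on how long settlement holds locks.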
Why Kafka instead of @Async?
In-process @Async would have been simpler for a demo. I picked Kafka for durable buffering during spikes and independent scaling of settlement consumers. Retries, manual ack, and a DLQ map cleanly to infrastructure vs business failures. The tradeoff is topics, lag, and consumer groups.
What broke at scale (and how I fixed it)
At first, smoke tests at 10–50 RPS looked healthy. Dependency health endpoints returned 200. Business tests passed. When I pushed toward 250 RPS on the gateway, I hit a failure mode my unit tests never touched.
The symptom
Under sustained load the gateway started returning HTTP 500 with:
ERROR: prepared statement "S_82" does not exist
insert into sp_transactions (...)
Roughly 40% of requests failed in the 10–50 RPS range before tuning. CPU wasn't saturated. Connection churn and pooling were.
Isolating the bottleneck
While testing those failure runs, k6 reported ~40% http_req_failed on the gateway. payment-initiated publish volume dropped with the failed accepts. Ledger consumer lag stayed near zero, but not because consumers were keeping up. Fewer events were getting published.
That pointed at database connection handling (pooler + prepared statements), not Kafka throughput.
After the JDBC fix, a sustained 250 RPS run (946k requests, single partition) still showed growing lag on payment-initiated and gateway p95 around **1.75 s**. Settlement throughput, not the pooler. When I increased to 6 partitions and concurrency 3, p95 dropped to ~111 ms on a ~999k run.
Why the error was misleading
PostgreSQL wasn't rejecting bad SQL. It was rejecting a named prepared statement. The failure needed all of these at once: Supabase's transaction pooler (:6543), transaction pooling mode, and Hibernate's default server-side prepared statements (S_82, and so on, tied to a specific backend session).
Root cause (mechanism)
In transaction pooling, PgBouncer assigns a backend connection per transaction, then returns it to the pool. The JDBC driver prepares once and reuses the name S_82. The next transaction may land on a different backend that never registered S_82. Pooling mode plus prepared statements. Not a Hibernate bug.
Figure 2: Under transaction pooling, a prepared statement created on one backend session isn't visible on the next assigned session. You get intermittent S_N errors under load.
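The mechanism can be reproduced in a toy model with no database at all. The sketch below is a simulation, not PgBouncer or pgjdbc code: `Backend`, `TransactionPooler`, and the round-robin assignment are hypothetical stand-ins chosen to make the failure deterministic.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of transaction pooling; not real PgBouncer or pgjdbc behavior. */
final class PoolingSimulation {

    /** A backend session that remembers which statement names were prepared on it. */
    static final class Backend {
        private final Set<String> prepared = new HashSet<>();

        void prepare(String name) {
            prepared.add(name);
        }

        void execute(String name) {
            if (!prepared.contains(name)) {
                throw new IllegalStateException(
                        "prepared statement \"" + name + "\" does not exist");
            }
        }
    }

    /** Hands out one backend per transaction, round-robin (an assumption of this model). */
    static final class TransactionPooler {
        private final List<Backend> backends = new ArrayList<>();
        private int next = 0;

        TransactionPooler(int size) {
            for (int i = 0; i < size; i++) {
                backends.add(new Backend());
            }
        }

        Backend beginTransaction() {
            Backend backend = backends.get(next);
            next = (next + 1) % backends.size();
            return backend;
        }
    }

    /** Returns true if reusing a named statement in the next transaction fails. */
    static boolean namedStatementFails(int poolSize) {
        TransactionPooler pooler = new TransactionPooler(poolSize);
        pooler.beginTransaction().prepare("S_82");     // transaction 1: prepare by name
        try {
            pooler.beginTransaction().execute("S_82"); // transaction 2: reuse the name
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("pool of 2 fails: " + namedStatementFails(2));
        System.out.println("pool of 1 fails: " + namedStatementFails(1));
    }
}
```

With a single backend the name is always found, which is one way to read why low-traffic runs can look healthy while larger pools under load fail intermittently.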
Fix
# application.yml (both services)
prepareThreshold: 0
hibernate.jdbc.use_server_prepared_stmts: false
Keep both while using Supabase's transaction pooler.
Failures often sit in the glue: poolers, statement caching, partition counts. Not in domain logic.
Tuning after the pooler fix
| Change | Why it mattered |
|---|---|
| Kafka after commit | No orphan payment-initiated events on rollback |
| Hikari maximum-pool-size: 30 | Less pool wait under 250 RPS |
| 6 partitions + ledger concurrency: 3 | Parallel settlement (min(partitions, consumers)) |
| Producer linger.ms: 5 | Batched produces under sustained load |
Same Java transfer logic throughout. With the pooler flags already in place, partition tuning took gateway p95 from ~1.75 s (946k requests) to **111 ms** (~999k requests).
Load test results
Tool: k6 (Docker profile). Target: POST /v1/payments on gateway :8080, the latency-sensitive client-facing path.
| Run | Rate | HTTP requests | Error rate | p95 |
|---|---|---|---|---|
| Smoke | 10/s | ~600 | 0.00% | ~100 ms |
| Stress | 50/s | 3,001 | 0.00% | ~61 ms |
| Full | 250/s | 999,809 | 0.01% | ~111 ms |
Thresholds: http_req_failed < 1%, p(95) < 500ms. Details: PERFORMANCE.md.
Figure 3: k6 summary from the full 250 RPS run. ~999k requests, sub-1% failure rate, p95 inside the 500 ms threshold.
Figure 4: Gateway p95 dropped from ~1.75 s to ~111 ms after fixing PgBouncer prepared-statement handling and increasing Kafka partition parallelism.
Gateway: accept path under load
The gateway is what k6 hammers. Validate, dedupe, cached balance hint, persist PENDING, then hand off.
Currency: INR only; else 422
Idempotency: swiftpay:idempotency:{transactionId}; duplicate → 409
Balance: cache-aside swiftpay:balance:{ownerId}, 30s TTL; insufficient → 422
Persist: PENDING in Postgres
After commit: payment-initiated + idempotency key
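// Runs only after the surrounding DB transaction commits.
@Override
public void afterCommit() {
    kafkaProducerService.emitPaymentInitiated(saved);   // publish payment-initiated
    idempotencyService.store(dto.getTransactionId());   // then record the idempotency key
}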
Why afterCommit: publishing to Kafka before the commit can leave orphan settlements if the transaction rolls back. Storing the Redis idempotency key before the commit produces false duplicates after a DB rollback.
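A minimal sketch of that ordering rule, using an in-memory stand-in for a transaction rather than Spring's transaction machinery. The `Tx` class and the `accept` helper are hypothetical; the only behavior modeled is that hooks run on commit and are discarded on rollback.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for a transaction with after-commit hooks. */
final class AfterCommitSketch {

    static final class Tx {
        private final List<Runnable> afterCommitHooks = new ArrayList<>();

        void registerAfterCommit(Runnable hook) {
            afterCommitHooks.add(hook);
        }

        /** Commit: side effects run only now. */
        void commit() {
            afterCommitHooks.forEach(Runnable::run);
        }

        /** Rollback: pending side effects are dropped, never executed. */
        void rollback() {
            afterCommitHooks.clear();
        }
    }

    /** Returns the events that were published for one accept attempt. */
    static List<String> accept(String transactionId, boolean dbWriteSucceeds) {
        List<String> published = new ArrayList<>();
        Tx tx = new Tx();
        tx.registerAfterCommit(() -> published.add("payment-initiated:" + transactionId));
        if (dbWriteSucceeds) {
            tx.commit();
        } else {
            tx.rollback();
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println("committed:   " + accept("txn-1", true));
        System.out.println("rolled back: " + accept("txn-2", false));
    }
}
```

A rolled-back accept publishes nothing, so the ledger never sees an event for a row that does not exist.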
Ledger: settlement and locking
The ledger transaction boundary owns balance correctness. The gateway only owns request acceptance.
The consumer processes payment-initiated with manual ack. Business failures go to payment-failed + ack. Transient failures get 3× retry, then payment-failed-dlq.
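The routing described above can be sketched as a small decision loop. This is an illustration, not the ledger's listener code: the exception types and the `route` method are hypothetical, and "3× retry" is read here as three attempts in total, which is an assumption.

```java
/** Toy routing of one event: success, business failure, or exhausted retries. */
final class RetryThenDlqSketch {

    /** Hypothetical marker for failures that retrying cannot fix (e.g. insufficient funds). */
    static final class BusinessFailure extends RuntimeException {
        BusinessFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical marker for failures that may succeed on retry (e.g. a timeout). */
    static final class TransientFailure extends RuntimeException {
        TransientFailure(String message) {
            super(message);
        }
    }

    // Assumption: three attempts in total before giving up.
    static final int MAX_ATTEMPTS = 3;

    /** Returns the topic the outcome is published to. */
    static String route(Runnable settlement) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                settlement.run();
                return "payment-completed";
            } catch (BusinessFailure e) {
                return "payment-failed";   // no retry: the outcome will not change
            } catch (TransientFailure e) {
                // fall through to the next attempt
            }
        }
        return "payment-failed-dlq";       // retries exhausted
    }

    public static void main(String[] args) {
        System.out.println(route(() -> { }));
        System.out.println(route(() -> { throw new BusinessFailure("insufficient funds"); }));
        System.out.println(route(() -> { throw new TransientFailure("db timeout"); }));
    }
}
```

Keeping the two failure types separate is what stops a business rejection from burning retries or landing in the DLQ.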
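Ordered locking: sort the two account IDs and take SELECT … FOR UPDATE on the lower UUID first, then on the other. Every transfer then acquires locks in the same order, so two opposing transfers can't deadlock each other.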
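// Lock accounts in a consistent order (lower UUID first).
UUID first = event.senderId().compareTo(event.receiverId()) < 0
        ? event.senderId() : event.receiverId();
UUID second = first.equals(event.senderId()) ? event.receiverId() : event.senderId();
// findByOwnerIdForUpdate(first), then second; re-check balance; debit/credit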
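The same ordering idea can be tried outside the database. The sketch below swaps row locks for in-process `ReentrantLock`s, so it illustrates the technique rather than the ledger's actual SQL path; `Account`, `transfer`, and `runOpposingTransfers` are hypothetical names.

```java
import java.util.UUID;
import java.util.concurrent.locks.ReentrantLock;

/** Ordered lock acquisition with in-process locks standing in for row locks. */
final class OrderedLockingSketch {

    static final class Account {
        final UUID ownerId;
        final ReentrantLock lock = new ReentrantLock();
        long balance;

        Account(UUID ownerId, long balance) {
            this.ownerId = ownerId;
            this.balance = balance;
        }
    }

    /** Locks the lower UUID first, re-checks the balance, then debits and credits. */
    static boolean transfer(Account sender, Account receiver, long amount) {
        boolean senderFirst = sender.ownerId.compareTo(receiver.ownerId) < 0;
        Account first = senderFirst ? sender : receiver;
        Account second = senderFirst ? receiver : sender;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                if (sender.balance < amount) {
                    return false; // insufficient funds once the locks are held
                }
                sender.balance -= amount;
                receiver.balance += amount;
                return true;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    /** Runs A->B and B->A transfers on two threads and returns the combined balance. */
    static long runOpposingTransfers(int rounds) {
        Account a = new Account(UUID.randomUUID(), 1_000);
        Account b = new Account(UUID.randomUUID(), 1_000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < rounds; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < rounds; i++) transfer(b, a, 1); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for transfers", e);
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println("combined balance: " + runOpposingTransfers(10_000));
    }
}
```

If each thread instead locked its own sender first, the two threads could each hold one lock and wait forever on the other.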
A gateway 202 means accepted for processing. Settlement sets COMPLETED or FAILED.
Kafka: partitions bound parallelism
| Topic | Flow |
|---|---|
| payment-initiated | gateway → ledger |
| payment-completed | ledger → analytics |
| payment-failed-dlq | exhausted retries |
6 partitions on payment-initiated, listener concurrency: 3. Useful parallelism = min(partitions, active consumers). Extra threads without partitions just add idle workers. The ~1.75 s p95 run improved after I bumped partition count, not after JVM tuning.
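A small worked example of the min(partitions, consumers) rule. The round-robin assignment below is a simplification for illustration and assumes at least one consumer; Kafka's real assignors differ in detail, but the cap on useful parallelism is the same.

```java
import java.util.ArrayList;
import java.util.List;

/** Worked example: how many consumers can actually do work. */
final class PartitionParallelism {

    static int usefulParallelism(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    /** Simplified round-robin assignment; assumes consumers >= 1. */
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            assignment.get(p % consumers).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One partition, three consumers: two consumers get nothing.
        System.out.println(assign(1, 3) + " useful=" + usefulParallelism(1, 3));
        // Six partitions, three consumers: every consumer owns two partitions.
        System.out.println(assign(6, 3) + " useful=" + usefulParallelism(6, 3));
    }
}
```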
Idempotency
Accept path: EXISTS → DB commit → SET idempotency key (24h). Postgres UNIQUE(idempotency_key) as backstop. EXISTS + SET is not atomic. Production would need SET NX or insert-only idempotency.
Consumers assume at-least-once delivery. Ledger keys off transactionId / DB state.
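The accept-path race goes away once check and set become a single atomic step. The sketch below uses `ConcurrentHashMap.putIfAbsent` as an in-process analogue of that idea; it is not Redis code, it does not model the 24h TTL, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Atomic "claim" of an idempotency key: exactly one caller wins. */
final class AtomicIdempotencySketch {
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true only for the first caller that claims this transactionId. */
    boolean claim(String transactionId) {
        return seen.putIfAbsent(transactionId, Boolean.TRUE) == null;
    }

    /** Starts N threads that all claim the same key and counts the winners. */
    static int winners(int callers) {
        AtomicIdempotencySketch store = new AtomicIdempotencySketch();
        AtomicInteger wins = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < callers; i++) {
            Thread t = new Thread(() -> {
                if (store.claim("txn-1")) {
                    wins.incrementAndGet();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for callers", e);
            }
        }
        return wins.get();
    }

    public static void main(String[] args) {
        System.out.println("winners among 8 concurrent callers: " + winners(8));
    }
}
```

With a separate exists-then-set sequence, two callers can both see "absent" before either writes, and both proceed.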
Observability (what exists today)
| Signal | Where |
|---|---|
| Dependency health | /v1/health, /health/db, /redis, /kafka |
| Load metrics | k6 JSON in load-test/results/ |
| Failures | payment-failed-dlq |
| API | Springdoc OpenAPI per service |
Production would add consumer lag on payment-initiated, settlement duration histograms, trace propagation via transactionId, and alerts on DLQ growth and Hikari pool wait.
Known limitations
| Area | Behavior |
|---|---|
| Auth | Audit columns; no principal |
| Idempotency | Check-then-set race |
| Balance cache | 30s TTL vs settlement truth |
| Database | Shared Postgres |
| Analytics | Duplicate rows on redelivery OK for dashboards |
What I would change in production
Transactional outbox instead of Kafka-only afterCommit
Atomic idempotency (SET NX or insert-only table)
Invalidate balance cache on terminal settlement status
Auth + audit on events
Split analytics storage from OLTP
Tracing + SLOs on accept p95 vs settlement lag
Final Thoughts
Building SwiftPay changed how I think about backend work.
I expected the hard parts to be payment rules and Kafka wiring. The real pain was infrastructure under load. Transaction pooling. Partition counts. Commit ordering. Retries on clients and brokers. Lock contention. Those had more impact on correctness than the business logic.
A lot of backend systems look reliable in development because nobody stress-tests them. Load testing surfaced bugs my unit tests and local setup never would have caught. I had to reason about the whole system, not individual components.
This wasn't about shipping another payment API. I wanted to understand what it takes to keep one correct when thousands of requests hit at once.
If you've run into similar issues, I'd like to hear about them.



