Building an Event-Driven Payment System That Survived ~1 Million Requests

When a simple payment API becomes a distributed systems problem


Most payment APIs work fine at low traffic. Then you throw a few thousand concurrent requests at them and everything gets interesting.

Client retries turn into duplicate payments. Postgres starts contending on row locks. Kafka redelivers events you already handled. What looked like a simple REST API starts behaving like a distributed systems problem set.

I built SwiftPay to run into those issues on purpose.

The system splits request acceptance from payment settlement. Kafka, Redis, PostgreSQL, Spring Boot. The gateway returns HTTP 202 Accepted immediately. A separate ledger service settles transfers asynchronously, inside real transactions.

I wanted to see where the design would break, so I load-tested the gateway at about 250 requests/second. That run processed close to 1 million payment requests with a 0.01% HTTP failure rate and roughly 111 ms p95 latency on the accept path.

I also hit an incompatibility between PgBouncer's transaction pooling and server-side prepared statements that made Hibernate's inserts fail under load. That forced me to think about transaction boundaries, idempotency, consumer retries, and concurrency. Not just payment CRUD.

What follows is what broke, why it broke, and what I changed.

| Metric | Result |
| --- | --- |
| Requests processed | ~999,809 |
| Sustained rate | 250 req/s |
| p95 latency | ~111 ms |
| HTTP failures | 0.01% |
| Stack | Kafka, Redis, PostgreSQL, Spring Boot |

GitHub: github.com/Raghav-byte/Swift-pay


Architecture

Three Spring Boot services share one PostgreSQL database in this repo. I wouldn't ship it that way in production, but it kept the repo simple and let me focus on the accept / settle / event boundaries.

| Service | Responsibility |
| --- | --- |
| payment-gateway (:8080) | Idempotency, cached balance hint, PENDING insert, Kafka after commit |
| ledger-service (:8081) | Consumes payment-initiated, ordered row locks, settlement |
| analytics-worker (:8082) | Append-only metrics on payment-completed (best-effort) |

Figure 1: SwiftPay separates request acceptance from settlement, so HTTP latency doesn't have to wait on database lock contention.

Gateway validates idempotency and balance, writes PENDING, publishes payment-initiated. Ledger locks accounts, debits/credits, emits payment-completed. Analytics worker aggregates events.

Infra: Supabase pooler :6543, Redis on gateway only, Kafka + ledger DLQ.


Key engineering lessons

| Topic | Takeaway |
| --- | --- |
| Accept vs settle | Keep the client-facing path short; move contended DB work behind a durable boundary. |
| PgBouncer + Hibernate | Transaction pooling breaks server-side prepared statements unless JDBC/Hibernate are configured for it. |
| Kafka parallelism | Consumer parallelism is capped by the topic's partition count; adding threads alone doesn't raise throughput. |
| Redis | Safe for idempotency hints and cache-aside reads; not for authoritative balances. |
| Idempotency | Required on the accept path (client retries) and on consumers (at-least-once delivery). |
| afterCommit | Never publish to Kafka or mark Redis "processed" until the DB transaction commits. |

Why not process payments synchronously?

A naïve payment API:

Client → API → DB updates → Response

At low traffic that's fine. Under concurrency, synchronous settlement ties up request threads, makes duplicate transfers worse when clients retry, couples API latency to DB lock time, and makes partial failures harder to isolate.

SwiftPay uses two phases:

  1. Accept: validate, insert PENDING, return 202

  2. Settle: debit/credit under SELECT … FOR UPDATE in the ledger service

The gateway records intent. The ledger enforces invariants. That split is what makes idempotency keys, consumer retries, DLQs, and async analytics workable without blocking HTTP threads.
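To make the split concrete, here is a minimal in-memory sketch of the two phases. It is not SwiftPay's code: `AcceptSettleSketch`, `statuses`, and `initiated` are hypothetical stand-ins for the gateway, the transactions table, and the payment-initiated topic, and the `fundsAvailable` flag replaces the real balance check.

```java
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-ins: `statuses` plays the transactions table,
// `initiated` plays the payment-initiated topic.
final class AcceptSettleSketch {
    enum Status { PENDING, COMPLETED, FAILED }

    final Map<UUID, Status> statuses = new ConcurrentHashMap<>();
    final Queue<UUID> initiated = new ConcurrentLinkedQueue<>();

    // Accept phase: record intent, hand off, answer immediately.
    int accept(UUID transactionId) {
        statuses.put(transactionId, Status.PENDING);
        initiated.add(transactionId);
        return 202; // accepted for processing, not settled
    }

    // Settle phase: runs later, off the request thread.
    // Returns false when there is nothing left to settle.
    boolean settleNext(boolean fundsAvailable) {
        UUID transactionId = initiated.poll();
        if (transactionId == null) {
            return false;
        }
        statuses.put(transactionId, fundsAvailable ? Status.COMPLETED : Status.FAILED);
        return true;
    }
}
```

The caller of `accept` never waits on `settleNext`, which is the property that keeps HTTP latency independent of lock contention in the settle phase.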

Why Kafka instead of @Async?

In-process @Async would have been simpler for a demo. I picked Kafka for durable buffering during spikes and independent scaling of settlement consumers. Retries, manual ack, and a DLQ map cleanly to infrastructure vs business failures. The tradeoff is topics, lag, and consumer groups.


What broke at scale (and how I fixed it)

At first, smoke tests at 10–50 RPS looked healthy. Dependency health endpoints returned 200. Business tests passed. When I pushed toward 250 RPS on the gateway, I hit a failure mode my unit tests never touched.

The symptom

Under sustained load the gateway started returning HTTP 500 with:

ERROR: prepared statement "S_82" does not exist
insert into sp_transactions (...)

Roughly 40% of requests failed in the 10–50 RPS range before tuning. CPU wasn't saturated. The trouble was in connection churn and pooling.

Isolating the bottleneck

During those failure runs, k6 reported ~40% http_req_failed on the gateway. payment-initiated publish volume dropped along with the failed accepts. Ledger consumer lag stayed near zero, but not because consumers were keeping up: fewer events were being published.

That pointed at database connection handling (pooler + prepared statements), not Kafka throughput.

After the JDBC fix, a sustained 250 RPS run (946k requests, single partition) still showed growing lag on payment-initiated and gateway p95 around **1.75 s**. Settlement throughput, not the pooler. When I increased to 6 partitions and concurrency 3, p95 dropped to ~111 ms on a ~999k run.

Why the error was misleading

PostgreSQL wasn't rejecting bad SQL. It was rejecting a named prepared statement. The failure needed all of these at once: Supabase's transaction pooler (:6543), transaction pooling mode, and server-side prepared statements (S_82, and so on). The PostgreSQL JDBC driver under Hibernate switches to those by default once a statement has been executed a few times, and each one is tied to a specific backend session.

Root cause (mechanism)

In transaction pooling, PgBouncer assigns a backend connection per transaction, then returns it to the pool. The JDBC driver prepares once and reuses the name S_82. The next transaction may land on a different backend that never registered S_82. Pooling mode plus prepared statements. Not a Hibernate bug.

Figure 2: Under transaction pooling, a prepared statement created on one backend session isn't visible on the next assigned session. You get intermittent S_N errors under load.
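The mechanism can be reproduced with a toy model. This is a sketch under simplifying assumptions, not PgBouncer or pgJDBC code: `TransactionPoolingSketch` and `Backend` are hypothetical names, and backends are handed out round-robin, whereas a real pooler picks whichever backend is free.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of transaction pooling. Backend and the pool are hypothetical
// stand-ins, not PgBouncer or pgJDBC APIs.
final class TransactionPoolingSketch {
    // One PostgreSQL backend session with its own named prepared statements.
    static final class Backend {
        final Set<String> prepared = new HashSet<>();
    }

    private final List<Backend> backends = new ArrayList<>();
    private int next = 0;

    TransactionPoolingSketch(int size) {
        for (int i = 0; i < size; i++) {
            backends.add(new Backend());
        }
    }

    // Transaction pooling: each transaction may be given a different backend.
    Backend borrowForTransaction() {
        Backend backend = backends.get(next);
        next = (next + 1) % backends.size();
        return backend;
    }

    // The driver prepares a statement under a name on the current backend.
    static void prepare(Backend backend, String name) {
        backend.prepared.add(name);
    }

    // Executing a name the backend never saw fails, as in the gateway logs.
    static String execute(Backend backend, String name) {
        if (!backend.prepared.contains(name)) {
            return "ERROR: prepared statement \"" + name + "\" does not exist";
        }
        return "OK";
    }
}
```

With a pool of one backend the error never shows up, which matches how a local setup without a transaction pooler hides the problem.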

Fix

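# application.yml (both services), shorthand for the two settings
# prepareThreshold is a PostgreSQL JDBC driver parameter, typically set on the JDBC URL (...?prepareThreshold=0)
prepareThreshold: 0
hibernate.jdbc.use_server_prepared_stmts: false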

Keep both while using Supabase's transaction pooler.


Failures often sit in the glue: poolers, statement caching, partition counts. Not in domain logic.


Tuning after the pooler fix

| Change | Why it mattered |
| --- | --- |
| Kafka after commit | No orphan payment-initiated events on rollback |
| Hikari maximum-pool-size: 30 | Less pool wait under 250 RPS |
| 6 partitions + ledger concurrency: 3 | Parallel settlement (min(partitions, consumers)) |
| Producer linger.ms: 5 | Batched produces under sustained load |

Same Java transfer logic. With the pooler flags already in place, the tuning above took p95 from ~1.75 s (946k requests) to **111 ms** (~999k requests).


Load test results

Tool: k6 (Docker profile). Target: POST /v1/payments on gateway :8080, the latency-sensitive client-facing path.

| Run | Rate | HTTP requests | Error rate | p95 |
| --- | --- | --- | --- | --- |
| Smoke | 10/s | ~600 | 0.00% | ~100 ms |
| Stress | 50/s | 3,001 | 0.00% | ~61 ms |
| Full | 250/s | 999,809 | 0.01% | ~111 ms |

Thresholds: http_req_failed < 1%, p(95) < 500ms. Details: PERFORMANCE.md.

Figure 3: k6 summary from the full 250 RPS run. ~999k requests, 0.01% failure rate, p95 inside the 500 ms threshold.

Figure 4: Gateway p95 dropped from ~1.75 s to ~111 ms after fixing PgBouncer prepared-statement handling and increasing Kafka partition parallelism.


Gateway: accept path under load

The gateway is what k6 hammers. Validate, dedupe, cached balance hint, persist PENDING, then hand off.

  1. Currency: INR only; else 422

  2. Idempotency: swiftpay:idempotency:{transactionId}; duplicate → 409

  3. Balance: cache-aside swiftpay:balance:{ownerId}, 30s TTL; insufficient → 422

  4. Persist: PENDING in Postgres

  5. After commit: payment-initiated + idempotency key

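@Override
public void afterCommit() {
    // Runs only once the transaction that inserted the PENDING row has committed.
    kafkaProducerService.emitPaymentInitiated(saved);
    // Mark the request as seen only after the row is durable.
    idempotencyService.store(dto.getTransactionId());
}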

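Why afterCommit: publishing to Kafka before the commit can trigger settlement for a payment row that later rolls back. Storing the Redis idempotency key before the commit makes a legitimate retry look like a duplicate after a DB rollback.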
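The ordering rule can be shown without Spring. The sketch below is a hypothetical mini transaction (`AfterCommitSketch` is not part of SwiftPay) that imitates the role Spring's `TransactionSynchronization.afterCommit()` plays: side effects are queued during the unit of work and run only if the commit is reached.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini transaction: callbacks registered during the unit of work
// run only if commit() is reached, never after rollback().
final class AfterCommitSketch {
    private final List<Runnable> afterCommit = new ArrayList<>();
    // Records what would have been sent to Kafka and Redis.
    final List<String> sideEffects = new ArrayList<>();

    void registerAfterCommit(Runnable callback) {
        afterCommit.add(callback);
    }

    void commit() {
        // The database commit would happen here, before any callback runs.
        afterCommit.forEach(Runnable::run);
        afterCommit.clear();
    }

    void rollback() {
        // Nothing is published and nothing is marked as processed.
        afterCommit.clear();
    }

    // Accept path: side effects are deferred instead of executed inline.
    void accept(String transactionId) {
        registerAfterCommit(() -> sideEffects.add("kafka:payment-initiated:" + transactionId));
        registerAfterCommit(() -> sideEffects.add("redis:idempotency:" + transactionId));
    }
}
```

A rolled-back `accept` leaves `sideEffects` empty, so there is no event for the ledger to settle and no idempotency key to block the client's retry.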


Ledger: settlement and locking

The ledger transaction boundary owns balance correctness. The gateway only owns request acceptance.

The consumer processes payment-initiated with manual ack. Business failures go to payment-failed + ack. Transient failures get 3× retry, then payment-failed-dlq.
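A rough sketch of that routing, with assumptions labeled: the two exception types are hypothetical, "3× retry" is read here as three attempts in total, and the returned string is the topic the outcome would go to before the message is acked.

```java
// Sketch of the routing rule; exception types and names are hypothetical.
final class SettlementRoutingSketch {
    // Business failure: the payment itself is invalid (e.g. insufficient funds).
    static final class BusinessException extends RuntimeException {
        BusinessException(String message) { super(message); }
    }

    // Transient failure: infrastructure hiccup that may succeed on retry.
    static final class TransientException extends RuntimeException {
        TransientException(String message) { super(message); }
    }

    // Assumption: "3x retry" means three attempts in total.
    static final int MAX_ATTEMPTS = 3;

    // Returns the topic the outcome is routed to; the caller acks afterwards.
    static String process(Runnable settle) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                settle.run();
                return "payment-completed";
            } catch (BusinessException e) {
                return "payment-failed"; // retrying cannot change the outcome
            } catch (TransientException e) {
                // Try again; a real consumer would back off between attempts.
            }
        }
        return "payment-failed-dlq"; // retries exhausted
    }
}
```

Keeping the two failure classes apart means a bad payment never burns retries, and an infrastructure outage never gets reported to the client as a business rejection.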

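Ordered locking: sort the two account IDs, then take SELECT … FOR UPDATE on the lower UUID first and the higher one second. Both directions of a transfer then lock in the same order, so they cannot deadlock each other.

UUID first = event.senderId().compareTo(event.receiverId()) < 0
        ? event.senderId() : event.receiverId();
UUID second = first.equals(event.senderId()) ? event.receiverId() : event.senderId();
// findByOwnerIdForUpdate(first), then findByOwnerIdForUpdate(second); re-check balance; debit/credit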
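The same idea can be tried in memory. In this sketch a `ReentrantLock` per account stands in for the row lock taken by SELECT … FOR UPDATE; `OrderedLockingSketch` and `Account` are hypothetical names, and both accounts are assumed to exist.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// In-memory analogue of ordered row locking. ReentrantLock stands in for
// SELECT ... FOR UPDATE; all names here are hypothetical.
final class OrderedLockingSketch {
    static final class Account {
        final ReentrantLock lock = new ReentrantLock();
        long balance;
        Account(long balance) { this.balance = balance; }
    }

    // Assumption: both accounts of every transfer are present in this map.
    final Map<UUID, Account> accounts = new ConcurrentHashMap<>();

    boolean transfer(UUID senderId, UUID receiverId, long amount) {
        if (senderId.equals(receiverId)) {
            return false; // a self-transfer has nothing to move
        }
        // Same lock order for A->B and B->A, so opposite transfers cannot deadlock.
        boolean senderFirst = senderId.compareTo(receiverId) < 0;
        Account first = accounts.get(senderFirst ? senderId : receiverId);
        Account second = accounts.get(senderFirst ? receiverId : senderId);
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                Account sender = accounts.get(senderId);
                Account receiver = accounts.get(receiverId);
                if (sender.balance < amount) {
                    return false; // balance re-checked while holding both locks
                }
                sender.balance -= amount;
                receiver.balance += amount;
                return true;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```

Whatever the interleaving, the sum of the two balances stays constant, which is the invariant the ledger transaction is there to protect.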

A gateway 202 means accepted for processing. Settlement sets COMPLETED or FAILED.


Kafka: partitions bound parallelism

| Topic | Flow |
| --- | --- |
| payment-initiated | gateway → ledger |
| payment-completed | ledger → analytics |
| payment-failed-dlq | exhausted retries |

6 partitions on payment-initiated, listener concurrency: 3. Useful parallelism = min(partitions, active consumers). Extra threads without partitions just add idle workers. The ~1.75 s p95 run improved after I bumped partition count, not after JVM tuning.
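The min(partitions, consumers) rule falls out of how partitions are handed to the consumers of a group. The sketch below uses a plain round-robin assignment as a simplification: real Kafka assignors differ in detail, but each partition still goes to at most one consumer in the group. `PartitionParallelismSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionParallelismSketch {
    // Round-robin assignment of partitions to the consumers of one group.
    // Simplified on purpose; requires consumers > 0.
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            assignment.get(p % consumers).add(p);
        }
        return assignment;
    }

    // Consumers that received at least one partition, i.e. useful parallelism.
    static long activeConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream()
                .filter(assigned -> !assigned.isEmpty())
                .count();
    }
}
```

With one partition and three listener threads, `activeConsumers(1, 3)` is 1: two threads sit idle. With six partitions and three threads it is 3, and each thread owns two partitions.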


Idempotency

Accept path: EXISTS → DB commit → SET idempotency key (24h). Postgres UNIQUE(idempotency_key) as backstop. EXISTS + SET is not atomic. Production would need SET NX or insert-only idempotency.
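The difference between check-then-set and an atomic claim can be sketched with a `ConcurrentHashMap`, where `putIfAbsent` plays the role of Redis `SET key value NX`. `IdempotencySketch` is a hypothetical class, the 24h expiry is not modeled, and the sketch does not address when the claim should happen relative to the DB commit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory key store; expiry (24h in the text) is not modeled.
final class IdempotencySketch {
    private final ConcurrentMap<String, Boolean> keys = new ConcurrentHashMap<>();

    // Check-then-set: two concurrent callers can both pass the check
    // before either one writes, so both believe they are first.
    boolean claimRacy(String transactionId) {
        if (keys.containsKey(transactionId)) {
            return false;
        }
        keys.put(transactionId, Boolean.TRUE);
        return true;
    }

    // Atomic claim, analogous to SET NX: exactly one caller gets true.
    boolean claimAtomic(String transactionId) {
        return keys.putIfAbsent(transactionId, Boolean.TRUE) == null;
    }
}
```

Under concurrent retries with the same transactionId, `claimAtomic` returns true for exactly one caller, while `claimRacy` gives no such guarantee.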

Consumers assume at-least-once delivery. Ledger keys off transactionId / DB state.


Observability (what exists today)

| Signal | Where |
| --- | --- |
| Dependency health | /v1/health, /health/db, /redis, /kafka |
| Load metrics | k6 JSON in load-test/results/ |
| Failures | payment-failed-dlq |
| API | Springdoc OpenAPI per service |

Production would add consumer lag on payment-initiated, settlement duration histograms, trace propagation via transactionId, and alerts on DLQ growth and Hikari pool wait.


Known limitations

| Area | Behavior |
| --- | --- |
| Auth | Audit columns; no principal |
| Idempotency | Check-then-set race |
| Balance cache | 30s TTL vs settlement truth |
| Database | Shared Postgres |
| Analytics | Duplicate rows on redelivery OK for dashboards |

What I would change in production

  1. Transactional outbox instead of Kafka-only afterCommit

  2. Atomic idempotency (SET NX or insert-only table)

  3. Invalidate balance cache on terminal settlement status

  4. Auth + audit on events

  5. Split analytics storage from OLTP

  6. Tracing + SLOs on accept p95 vs settlement lag
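Item 3 is the easiest of these to picture in code. Below is a minimal cache-aside sketch with a TTL and an explicit invalidation hook, assuming a hypothetical `BalanceCacheSketch` class, an injected clock, and a `loadFromDb` function that stands in for the authoritative Postgres read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.ToLongFunction;

// Hypothetical cache-aside balance hint with TTL and explicit invalidation.
final class BalanceCacheSketch {
    private static final class Entry {
        final long balance;
        final long expiresAtMillis;
        Entry(long balance, long expiresAtMillis) {
            this.balance = balance;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final ToLongFunction<String> loadFromDb; // authoritative read
    private final LongSupplier clock;                // injectable for tests
    private final long ttlMillis;

    BalanceCacheSketch(ToLongFunction<String> loadFromDb, LongSupplier clock, long ttlMillis) {
        this.loadFromDb = loadFromDb;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    // Cache-aside read: serve a fresh entry, otherwise reload and cache it.
    long balanceHint(String ownerId) {
        long now = clock.getAsLong();
        Entry entry = cache.get(ownerId);
        if (entry == null || entry.expiresAtMillis <= now) {
            entry = new Entry(loadFromDb.applyAsLong(ownerId), now + ttlMillis);
            cache.put(ownerId, entry);
        }
        return entry.balance;
    }

    // Called when settlement reaches COMPLETED or FAILED for this owner.
    void invalidate(String ownerId) {
        cache.remove(ownerId);
    }
}
```

Without `invalidate`, a settled debit stays invisible to the gateway for up to the TTL. With it, the next accept for that owner reads the settled balance. Either way the value remains a hint, and the ledger's re-check under lock is what enforces the balance.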


Final Thoughts

Building SwiftPay changed how I think about backend work.

I expected the hard parts to be payment rules and Kafka wiring. The real pain was infrastructure under load. Transaction pooling. Partition counts. Commit ordering. Retries on clients and brokers. Lock contention. Those had more impact on correctness than the business logic.

A lot of backend systems look reliable in development because nobody stress-tests them. Load testing surfaced bugs my unit tests and local setup never would have caught. I had to reason about the whole system, not individual components.

This wasn't about shipping another payment API. I wanted to understand what it takes to keep one correct when thousands of requests hit at once.

If you've run into similar issues, I'd like to hear about them.

github.com/Raghav-byte/Swift-pay