Event-Driven Payment System: 1M Requests Load Test

Most payment APIs work fine at low traffic. Then you throw a few thousand concurrent requests at them and everything gets interesting.

Client retries turn into duplicate payments. Postgres starts contending on row locks. Kafka redelivers events you already handled. What looked like a simple REST API starts behaving like a distributed systems problem set.

I built SwiftPay to run into those issues on purpose.

The system splits request acceptance from payment settlement. Kafka, Redis, PostgreSQL, Spring Boot. The gateway returns HTTP 202 Accepted immediately. A separate ledger service settles transfers asynchronously, inside real transactions.

I wanted to see where the design would break, so I load-tested the gateway at about 250 requests/second. That run processed close to 1 million payment requests with a 0.01% HTTP failure rate and roughly 111 ms p95 latency on the accept path.

I also hit a PgBouncer transaction pooling bug that made Hibernate's server-side prepared statements fail under load. That forced me to think about transaction boundaries, idempotency, consumer retries, and concurrency. Not just payment CRUD.

What follows is what broke, why it broke, and what I changed.

Metric	Result
Requests processed	~999,809
Sustained rate	250 req/s
p95 latency	~111 ms
HTTP failures	0.01%
Stack	Kafka, Redis, PostgreSQL, Spring Boot

GitHub: github.com/Raghav-byte/Swift-pay

Architecture

Three Spring Boot services share one PostgreSQL database in this repo. I wouldn't ship it that way in production, but it kept the repo simple and let me focus on the accept / settle / event boundaries.

Service	Responsibility
payment-gateway (:8080)	Idempotency, cached balance hint, `PENDING` insert, Kafka after commit
ledger-service (:8081)	Consumes `payment-initiated`, ordered row locks, settlement
analytics-worker (:8082)	Append-only metrics on `payment-completed` (best-effort)

Figure 1: SwiftPay separates request acceptance from settlement, so HTTP latency doesn't have to wait on database lock contention.

Gateway validates idempotency and balance, writes PENDING, publishes payment-initiated. Ledger locks accounts, debits/credits, emits payment-completed. Analytics worker aggregates events.

Infra: Supabase pooler :6543, Redis on gateway only, Kafka + ledger DLQ.

Key engineering lessons

Topic	Takeaway
Accept vs settle	Keep the client-facing path short; move contended DB work behind a durable boundary.
PgBouncer + Hibernate	Transaction pooling breaks server-side prepared statements unless JDBC/Hibernate are configured for it.
Kafka parallelism	Consumer parallelism is capped by the topic's partition count; adding listener threads beyond that count adds no throughput.
Redis	Safe for idempotency hints and cache-aside reads; not for authoritative balances.
Idempotency	Required on the accept path (client retries) and implied on consumers (at-least-once delivery).
`afterCommit`	Never publish to Kafka or mark Redis "processed" until the DB transaction commits.

Why not process payments synchronously?

A naïve payment API:

Client → API → DB updates → Response

At low traffic that's fine. Under concurrency, synchronous settlement ties up request threads, makes duplicate transfers worse when clients retry, couples API latency to DB lock time, and makes partial failures harder to isolate.

SwiftPay uses two phases:

Accept: validate, insert PENDING, return 202
Settle: debit/credit under SELECT … FOR UPDATE in the ledger service

The gateway records intent. The ledger enforces invariants. That split is what makes idempotency keys, consumer retries, DLQs, and async analytics workable without blocking HTTP threads.
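The split can be modelled in a few lines. This is a minimal in-memory sketch, not SwiftPay's code: the map stands in for the transactions table, the queue for the payment-initiated topic, and every class and method name here is made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// In-memory model of the accept/settle split (illustrative names only).
class AcceptSettleSketch {
    enum Status { PENDING, COMPLETED, FAILED }

    record Initiated(String txId, String sender, String receiver, long amount) {}

    private final Map<String, Status> transactions = new HashMap<>();
    private final Map<String, Long> balances = new HashMap<>();
    private final Queue<Initiated> initiated = new ArrayDeque<>();

    void openAccount(String owner, long balance) {
        balances.put(owner, balance);
    }

    // Accept phase: record intent and hand off. No balance is touched here.
    int accept(String txId, String sender, String receiver, long amount) {
        transactions.put(txId, Status.PENDING);
        initiated.add(new Initiated(txId, sender, receiver, amount));
        return 202;
    }

    // Settle phase: enforce the balance invariant and set a terminal status.
    // Returns false when there is nothing left to settle.
    boolean settleNext() {
        Initiated event = initiated.poll();
        if (event == null) {
            return false;
        }
        long senderBalance = balances.getOrDefault(event.sender(), 0L);
        if (senderBalance < event.amount()) {
            transactions.put(event.txId(), Status.FAILED);
        } else {
            balances.put(event.sender(), senderBalance - event.amount());
            balances.merge(event.receiver(), event.amount(), Long::sum);
            transactions.put(event.txId(), Status.COMPLETED);
        }
        return true;
    }

    Status status(String txId) {
        return transactions.get(txId);
    }

    long balance(String owner) {
        return balances.getOrDefault(owner, 0L);
    }
}
```

Calling accept twice for more money than the sender holds returns 202 both times. Only settleNext decides which transfer completes and which fails.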

Why Kafka instead of `@Async`?

In-process @Async would have been simpler for a demo. I picked Kafka for durable buffering during spikes and independent scaling of settlement consumers. Retries, manual ack, and a DLQ map cleanly to infrastructure vs business failures. The tradeoff is topics, lag, and consumer groups.

What broke at scale (and how I fixed it)

At first, smoke tests at 10–50 RPS looked healthy. Dependency health endpoints returned 200. Business tests passed. When I pushed toward 250 RPS on the gateway, I hit a failure mode my unit tests never touched.

The symptom

Under sustained load the gateway started returning HTTP 500 with:

ERROR: prepared statement "S_82" does not exist
insert into sp_transactions (...)

Roughly 40% of requests failed in the 10–50 RPS range before tuning. CPU wasn't saturated. Connection churn and pooling were.

Isolating the bottleneck

While testing those failure runs, k6 reported ~40% http_req_failed on the gateway. payment-initiated publish volume dropped with the failed accepts. Ledger consumer lag stayed near zero, but not because consumers were keeping up. Fewer events were getting published.

That pointed at database connection handling (pooler + prepared statements), not Kafka throughput.

After the JDBC fix, a sustained 250 RPS run (946k requests, single partition) still showed growing lag on payment-initiated and gateway p95 around **1.75 s**. That pointed to settlement throughput, not the pooler, as the remaining bottleneck. When I increased to 6 partitions and listener concurrency 3, p95 dropped to ~111 ms on a ~999k run.

Why the error was misleading

PostgreSQL wasn't rejecting bad SQL. It was rejecting a named prepared statement. The failure needed all of these at once: Supabase's transaction pooler (:6543), transaction pooling mode, and the server-side prepared statements that the PostgreSQL JDBC driver creates for Hibernate's repeated queries (S_82, and so on, each tied to a specific backend session).

Root cause (mechanism)

In transaction pooling, PgBouncer assigns a backend connection for the duration of one transaction, then returns it to the pool. The JDBC driver prepares a statement once, on whichever backend it holds at that moment, and keeps reusing the name S_82. The next transaction may land on a different backend that never registered S_82. The cause is the combination of pooling mode and server-side prepared statements, not a Hibernate bug.

PgBouncer prepared-statement failure timeline

Figure 2: Under transaction pooling, a prepared statement created on one backend session isn't visible on the next assigned session. You get intermittent S_N errors under load.
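The mechanism is easy to reproduce in a toy model. The sketch below is an assumption-heavy simulation, not PgBouncer or pgjdbc code: backends are plain objects holding a set of statement names, and the pooler hands them out round-robin, whereas a real pooler picks whichever backend is free.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of transaction pooling with named prepared statements.
class TransactionPoolingSketch {
    // One backend session with its own private set of prepared statements.
    static class Backend {
        private final Set<String> prepared = new HashSet<>();

        void prepare(String name) {
            prepared.add(name);
        }

        void executePrepared(String name) {
            if (!prepared.contains(name)) {
                throw new IllegalStateException(
                        "prepared statement \"" + name + "\" does not exist");
            }
        }
    }

    // Round-robin stand-in for the pooler's per-transaction backend choice.
    static class Pooler {
        private final List<Backend> backends = new ArrayList<>();
        private int next = 0;

        Pooler(int size) {
            for (int i = 0; i < size; i++) {
                backends.add(new Backend());
            }
        }

        Backend backendForNextTransaction() {
            Backend backend = backends.get(next);
            next = (next + 1) % backends.size();
            return backend;
        }
    }

    // Client that prepares once and then reuses the statement name.
    static class CachingClient {
        private final Pooler pooler;
        private boolean preparedOnce = false;

        CachingClient(Pooler pooler) {
            this.pooler = pooler;
        }

        void runInsert() {
            Backend backend = pooler.backendForNextTransaction();
            if (!preparedOnce) {
                backend.prepare("S_82");
                preparedOnce = true;
            }
            backend.executePrepared("S_82");
        }
    }
}
```

With a pool of one backend every call succeeds, which is why low-traffic runs can look healthy. With two or more, the second runInsert lands on a backend that never saw S_82 and throws.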

Fix

# application.yml (both services)
prepareThreshold: 0
hibernate.jdbc.use_server_prepared_stmts: false

Keep both while using Supabase's transaction pooler.
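prepareThreshold is a connection property of the PostgreSQL JDBC driver, and a value of 0 tells the driver never to switch to server-side prepare. One common way to set it is on the JDBC URL. The helper below is a hypothetical illustration of that, and the host name in the usage note is a placeholder, not SwiftPay's real endpoint.

```java
// Appends the pgjdbc prepareThreshold connection property to a JDBC URL.
// Assumes the URL does not already carry a prepareThreshold parameter.
class PoolerSafeJdbcUrl {
    static String withPrepareThreshold(String jdbcUrl, int threshold) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "prepareThreshold=" + threshold;
    }
}
```

For example, withPrepareThreshold("jdbc:postgresql://db.example.invalid:6543/postgres", 0) yields a URL ending in ?prepareThreshold=0, which can then go into the datasource URL of each service.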

Failures often sit in the glue: poolers, statement caching, partition counts. Not in domain logic.

Tuning after the pooler fix

Change	Why it mattered
Kafka after commit	No orphan `payment-initiated` events on rollback
Hikari `maximum-pool-size: 30`	Less pool wait under 250 RPS
6 partitions + ledger `concurrency: 3`	Parallel settlement (`min(partitions, consumers)`)
Producer `linger.ms: 5`	Batched produces under sustained load

Same Java transfer logic. With the pooler flags already in place, p95 went from ~1.75 s (946k requests, single partition) to **111 ms** (~999k requests) after the tuning above, chiefly the move to 6 partitions and concurrency 3.

Load test results

Tool: k6 (Docker profile). Target: POST /v1/payments on gateway :8080, the latency-sensitive client-facing path.

Run	Rate	HTTP requests	Error rate	p95
Smoke	10/s	~600	0.00%	~100 ms
Stress	50/s	3,001	0.00%	~61 ms
Full	250/s	999,809	0.01%	~111 ms

Thresholds: http_req_failed < 1%, p(95) < 500ms. Details: PERFORMANCE.md.

Figure 3: k6 summary from the full 250 RPS run. ~999k requests, sub-1% failure rate, p95 inside the 500 ms threshold.

p95 latency before vs after pooler fix and Kafka tuning

Figure 4: Gateway p95 dropped from ~1.75 s to ~111 ms after fixing PgBouncer prepared-statement handling and increasing Kafka partition parallelism.

Gateway: accept path under load

The gateway is what k6 hammers. Validate, dedupe, cached balance hint, persist PENDING, then hand off.

Currency: INR only; else 422
Idempotency: swiftpay:idempotency:{transactionId}; duplicate → 409
Balance: cache-aside swiftpay:balance:{ownerId}, 30s TTL; insufficient → 422
Persist: PENDING in Postgres
After commit: payment-initiated + idempotency key

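@Override
public void afterCommit() {
    // Runs only after the PENDING insert has committed.
    // Publish the event first, then record the idempotency key in Redis.
    kafkaProducerService.emitPaymentInitiated(saved);
    idempotencyService.store(dto.getTransactionId());
}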

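Why afterCommit: if the gateway published to Kafka before the commit, a rollback would leave a payment-initiated event with no PENDING row behind it (an orphan settlement). If it stored the Redis idempotency key before the commit, a rollback would make the client's retry look like a duplicate even though nothing was persisted.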
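The ordering rule can be shown without Spring. This is a minimal sketch of a unit of work that defers side effects until commit. It is a stand-in for the idea, not for Spring's transaction synchronization API, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal unit of work that defers side effects until commit.
class AfterCommitSketch {
    private final List<Runnable> afterCommit = new ArrayList<>();
    private boolean finished = false;

    void registerAfterCommit(Runnable action) {
        if (finished) {
            throw new IllegalStateException("transaction already finished");
        }
        afterCommit.add(action);
    }

    // Commit first, then run the deferred side effects in registration order.
    void commit() {
        finished = true;
        for (Runnable action : afterCommit) {
            action.run();
        }
    }

    // Rollback drops the deferred side effects without running them.
    void rollback() {
        finished = true;
        afterCommit.clear();
    }
}
```

Registering the Kafka publish and the Redis write as deferred actions means a rolled-back accept leaves no trace in either system. What this does not cover is a publish that fails after the commit, which is the gap a transactional outbox closes.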

Ledger: settlement and locking

The ledger transaction boundary owns balance correctness. The gateway only owns request acceptance.

The consumer processes payment-initiated with manual ack. Business failures go to payment-failed + ack. Transient failures get 3× retry, then payment-failed-dlq.
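A sketch of that routing decision, under two assumptions of mine: "3× retry" means one initial attempt plus three retries, and business failures are signalled by a dedicated exception type. BusinessException and the Outcome values are hypothetical names, not SwiftPay's.

```java
import java.util.function.Consumer;

// Classifies a settlement attempt as settled, business failure, or dead-lettered.
class RetryPolicySketch {
    enum Outcome { SETTLED, PAYMENT_FAILED, DEAD_LETTERED }

    // Hypothetical marker for failures that retrying cannot fix.
    static class BusinessException extends RuntimeException {
        BusinessException(String message) {
            super(message);
        }
    }

    static final int MAX_RETRIES = 3;

    static <E> Outcome handle(E event, Consumer<E> settle) {
        // One initial attempt plus up to MAX_RETRIES retries.
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                settle.accept(event);
                return Outcome.SETTLED;
            } catch (BusinessException e) {
                // Retrying will not help: route to payment-failed, then ack.
                return Outcome.PAYMENT_FAILED;
            } catch (RuntimeException e) {
                // Treated as transient: loop around and try again.
            }
        }
        // Retries exhausted: route to payment-failed-dlq, then ack.
        return Outcome.DEAD_LETTERED;
    }
}
```

In every branch the message ends up acknowledged. The difference is only which topic records the outcome, so a poison message cannot block the partition.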

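Ordered locking: the ledger takes SELECT … FOR UPDATE on both accounts, always locking the lower UUID first. Two transfers in opposite directions then request the locks in the same order, so they cannot deadlock each other.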

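// Lock order depends only on the two IDs, never on transfer direction.
UUID first = event.senderId().compareTo(event.receiverId()) < 0
        ? event.senderId() : event.receiverId();
UUID second = first.equals(event.senderId()) ? event.receiverId() : event.senderId();
// findByOwnerIdForUpdate(first), then findByOwnerIdForUpdate(second); re-check balance; debit/credit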
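The same ordering rule can be exercised in memory. This sketch swaps the database row locks for ReentrantLock, so it shows the lock-ordering idea only. It says nothing about how PostgreSQL schedules FOR UPDATE, and the class is hypothetical.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// In-memory analogue of ordered row locking for transfers.
class OrderedLockingSketch {
    private final Map<UUID, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<UUID, Long> balances = new ConcurrentHashMap<>();

    void open(UUID owner, long balance) {
        locks.put(owner, new ReentrantLock());
        balances.put(owner, balance);
    }

    long balance(UUID owner) {
        return balances.get(owner);
    }

    // Returns true when the transfer was applied, false on insufficient funds.
    boolean transfer(UUID sender, UUID receiver, long amount) {
        // Lower UUID first, whatever the direction of the transfer.
        UUID first = sender.compareTo(receiver) < 0 ? sender : receiver;
        UUID second = first.equals(sender) ? receiver : sender;
        ReentrantLock firstLock = locks.get(first);
        ReentrantLock secondLock = locks.get(second);
        firstLock.lock();
        try {
            secondLock.lock();
            try {
                // Re-check the balance while holding both locks.
                long available = balances.get(sender);
                if (available < amount) {
                    return false;
                }
                balances.put(sender, available - amount);
                balances.merge(receiver, amount, Long::sum);
                return true;
            } finally {
                secondLock.unlock();
            }
        } finally {
            firstLock.unlock();
        }
    }
}
```

Two threads can run A→B and B→A transfers at the same time. Both ask for the same lock first, so one simply waits for the other instead of each holding the lock the other needs.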

A gateway 202 means accepted for processing. Settlement sets COMPLETED or FAILED.

Kafka: partitions bound parallelism

Topic	Flow
`payment-initiated`	gateway → ledger
`payment-completed`	ledger → analytics
`payment-failed-dlq`	exhausted retries

6 partitions on payment-initiated, listener concurrency: 3. Useful parallelism = min(partitions, active consumers). Extra threads without partitions just add idle workers. The ~1.75 s p95 run improved after I bumped partition count, not after JVM tuning.
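A back-of-envelope version of that rule, as a sketch. It assumes events spread evenly across partitions and that each active consumer settles one event at a time. The method names are mine.

```java
// Arithmetic behind "partitions bound parallelism".
class PartitionParallelismSketch {
    // Useful parallelism is capped by whichever count is smaller.
    static int usefulParallelism(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    // Consumers that hold no partition and therefore sit idle.
    static int idleConsumers(int partitions, int consumers) {
        return Math.max(0, consumers - partitions);
    }

    // Largest average settlement time (ms) that still keeps up with the
    // arrival rate, assuming work spreads evenly over the active consumers.
    static double maxSettleMillis(int partitions, int consumers, double eventsPerSecond) {
        return usefulParallelism(partitions, consumers) * 1000.0 / eventsPerSecond;
    }
}
```

At 250 events/s, a single partition leaves a 4 ms budget per settlement no matter how many threads are configured, and anything slower makes lag grow. With 6 partitions and 3 consumers the budget is 12 ms.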

Idempotency

Accept path: EXISTS → DB commit → SET idempotency key (24h). Postgres UNIQUE(idempotency_key) as backstop. EXISTS + SET is not atomic. Production would need SET NX or insert-only idempotency.
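To show what "atomic" buys, here is an in-memory analogue of a set-if-absent claim using ConcurrentHashMap.putIfAbsent. It is not Redis code and it omits the 24h TTL. The release method is my assumption about how a rolled-back accept would give the key back.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Check-and-set in one atomic step, instead of EXISTS followed by SET.
class AtomicIdempotencySketch {
    private final ConcurrentMap<String, Long> claimedAt = new ConcurrentHashMap<>();

    // Returns true only for the first caller that claims the transaction id.
    boolean tryClaim(String transactionId, long nowMillis) {
        return claimedAt.putIfAbsent(transactionId, nowMillis) == null;
    }

    // Gives the claim back, for example when the DB transaction rolls back.
    void release(String transactionId) {
        claimedAt.remove(transactionId);
    }
}
```

With check-then-set, two concurrent retries can both pass the check before either writes. With a single atomic claim, exactly one of them wins and the other is rejected as a duplicate.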

Consumers assume at-least-once delivery. Ledger keys off transactionId / DB state.
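One way to read "keys off DB state" is a status guard: settle only while the row is still PENDING. The sketch below models that with a map. In a real ledger the check would have to happen inside the same transaction as the debit/credit, and the names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Redelivery-safe settlement guarded by the transaction's current status.
class IdempotentSettlementSketch {
    enum Status { PENDING, COMPLETED, FAILED }

    private final Map<String, Status> statusByTransactionId = new HashMap<>();
    private int settlementsApplied = 0;

    void recordPending(String transactionId) {
        statusByTransactionId.put(transactionId, Status.PENDING);
    }

    // Applies the settlement only while the row is still PENDING. A redelivered
    // event for a terminal or unknown transaction changes nothing.
    boolean onPaymentInitiated(String transactionId) {
        if (statusByTransactionId.get(transactionId) != Status.PENDING) {
            return false;
        }
        settlementsApplied++; // stands in for the debit/credit
        statusByTransactionId.put(transactionId, Status.COMPLETED);
        return true;
    }

    int settlementsApplied() {
        return settlementsApplied;
    }

    Status status(String transactionId) {
        return statusByTransactionId.get(transactionId);
    }
}
```

Delivering the same event twice moves money once. The second delivery sees COMPLETED and is acknowledged without touching balances.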

Observability (what exists today)

Signal	Where
Dependency health	`/v1/health`, `/health/db`, `/redis`, `/kafka`
Load metrics	k6 JSON in `load-test/results/`
Failures	`payment-failed-dlq`
API	Springdoc OpenAPI per service

Production would add consumer lag on payment-initiated, settlement duration histograms, trace propagation via transactionId, and alerts on DLQ growth and Hikari pool wait.

Known limitations

Area	Behavior
Auth	Audit columns; no principal
Idempotency	Check-then-set race
Balance cache	30s TTL vs settlement truth
Database	Shared Postgres
Analytics	Duplicate rows on redelivery OK for dashboards

What I would change in production

Transactional outbox instead of Kafka-only afterCommit
Atomic idempotency (SET NX or insert-only table)
Invalidate balance cache on terminal settlement status
Auth + audit on events
Split analytics storage from OLTP
Tracing + SLOs on accept p95 vs settlement lag
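As an illustration of the cache item in that list, here is a sketch of a cache-aside balance hint with a TTL plus invalidation on terminal settlement. It uses an in-memory map instead of Redis and takes the clock as a parameter so it stays deterministic. The loader function and all names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Cache-aside balance hint with TTL and explicit invalidation.
class BalanceCacheSketch {
    private record Entry(long balance, long expiresAtMillis) {}

    private final Map<String, Entry> cache = new HashMap<>();
    private final ToLongFunction<String> loadFromDatabase;
    private final long ttlMillis;

    BalanceCacheSketch(ToLongFunction<String> loadFromDatabase, long ttlMillis) {
        this.loadFromDatabase = loadFromDatabase;
        this.ttlMillis = ttlMillis;
    }

    // Cache-aside read: serve a fresh entry, otherwise load and repopulate.
    long balanceHint(String ownerId, long nowMillis) {
        Entry entry = cache.get(ownerId);
        if (entry != null && nowMillis < entry.expiresAtMillis()) {
            return entry.balance();
        }
        long balance = loadFromDatabase.applyAsLong(ownerId);
        cache.put(ownerId, new Entry(balance, nowMillis + ttlMillis));
        return balance;
    }

    // Called when a settlement reaches COMPLETED or FAILED.
    void onTerminalSettlement(String senderId, String receiverId) {
        cache.remove(senderId);
        cache.remove(receiverId);
    }
}
```

Without the invalidation hook, a settled transfer stays invisible to the gateway for up to the full TTL. With it, the next read after settlement goes back to the database. Either way the value is a hint, and the ledger's re-check under lock remains the authority.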

Final Thoughts

Building SwiftPay changed how I think about backend work.

I expected the hard parts to be payment rules and Kafka wiring. The real pain was infrastructure under load. Transaction pooling. Partition counts. Commit ordering. Retries on clients and brokers. Lock contention. Those had more impact on correctness than the business logic.

A lot of backend systems look reliable in development because nobody stress-tests them. Load testing surfaced bugs my unit tests and local setup never would have caught. I had to reason about the whole system, not individual components.

This wasn't about shipping another payment API. I wanted to understand what it takes to keep one correct when thousands of requests hit at once.

If you've run into similar issues, I'd like to hear about them.

github.com/Raghav-byte/Swift-pay
