# Building an Event-Driven Payment System That Survived ~1 Million Requests

Most payment APIs work fine at low traffic. Then you throw a few thousand concurrent requests at them and everything gets interesting.

Client retries turn into duplicate payments. Postgres starts contending on row locks. Kafka redelivers events you already handled. What looked like a simple REST API starts behaving like a distributed systems problem set.

I built **SwiftPay** to run into those issues on purpose.

The system splits **request acceptance** from **payment settlement**, built on Kafka, Redis, PostgreSQL, and Spring Boot. The gateway returns **HTTP 202 Accepted** immediately. A separate ledger service settles transfers asynchronously, inside real transactions.

I wanted to see where the design would break, so I load-tested the gateway at about **250 requests/second**. That run processed close to **1 million payment requests** with a **0.01% HTTP failure rate** and roughly **111 ms** p95 latency on the accept path.

I also hit a PgBouncer transaction-pooling problem that made server-side prepared statements fail under load in the Hibernate-based services. That forced me to think about transaction boundaries, idempotency, consumer retries, and concurrency, not just payment CRUD.

What follows is what broke, why it broke, and what I changed.

| Metric | Result |
| --- | --- |
| Requests processed | 999,809 |
| Sustained rate | 250 req/s |
| p95 latency | ~111 ms |
| HTTP failures | 0.01% |
| Stack | Kafka, Redis, PostgreSQL, Spring Boot |

**GitHub:** [github.com/Raghav-byte/Swift-pay](https://github.com/Raghav-byte/Swift-pay)

* * *

## Architecture

Three Spring Boot services share one PostgreSQL database in this repo. I wouldn't ship it that way in production, but it kept the repo simple and let me focus on the **accept / settle / event** boundaries.

| Service | Responsibility |
| --- | --- |
| **payment-gateway** (:8080) | Idempotency, cached balance hint, `PENDING` insert, Kafka after commit |
| **ledger-service** (:8081) | Consumes `payment-initiated`, ordered row locks, settlement |
| **analytics-worker** (:8082) | Append-only metrics on `payment-completed` (best-effort) |

![](https://cdn.hashnode.com/uploads/covers/654e7fc6ddd252a0f42990d0/7b8386c6-611a-40d6-88f8-d85245715371.png align="center")

> **Figure 1:** SwiftPay separates request acceptance from settlement, so HTTP latency doesn't have to wait on database lock contention.

The gateway validates idempotency and balance, writes a `PENDING` row, and publishes `payment-initiated`. The ledger locks both accounts, debits and credits, then emits `payment-completed`. The analytics worker aggregates those events.

**Infra:** Supabase pooler **:6543**, Redis on gateway only, Kafka + ledger **DLQ**.

* * *

### Key engineering lessons

| Topic | Takeaway |
| --- | --- |
| **Accept vs settle** | Keep the client-facing path short; move contended DB work behind a durable boundary. |
| **PgBouncer + Hibernate** | Transaction pooling breaks server-side prepared statements unless JDBC/Hibernate are configured for it. |
| **Kafka parallelism** | Consumer throughput is capped by the topic's **partition count**, not by thread count alone. |
| **Redis** | Safe for idempotency hints and cache-aside reads; not for authoritative balances. |
| **Idempotency** | Required on the accept path (client retries) and implied on consumers (at-least-once delivery). |
| `afterCommit` | Never publish to Kafka or mark Redis "processed" until the DB transaction commits. |

* * *

## Why not process payments synchronously?

A naïve payment API:

```text
Client → API → DB updates → Response
```

At low traffic that's fine. Under concurrency, synchronous settlement ties up request threads, makes duplicate transfers worse when clients retry, couples API latency to DB lock time, and makes partial failures harder to isolate.

SwiftPay uses two phases:

1.  **Accept**: validate, insert `PENDING`, return **202**
    
2.  **Settle**: debit/credit under `SELECT … FOR UPDATE` in the ledger service
    

The gateway records intent. The ledger enforces invariants. That split is what makes idempotency keys, consumer retries, DLQs, and async analytics workable without blocking HTTP threads.
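
To make the split concrete, here is a minimal in-memory sketch of the two phases. It assumes nothing about SwiftPay's real classes: `TwoPhaseSketch`, `Payment`, and the queue standing in for the `payment-initiated` topic are all hypothetical, and there is no HTTP, Kafka, or Postgres involved.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** In-memory stand-in for the accept/settle split (hypothetical names throughout). */
class TwoPhaseSketch {
    static final class Payment {
        final String id, sender, receiver;
        final long amount;
        Payment(String id, String sender, String receiver, long amount) {
            this.id = id; this.sender = sender; this.receiver = receiver; this.amount = amount;
        }
    }

    final Map<String, String> status = new HashMap<>();
    final Map<String, Long> balances = new HashMap<>();
    private final Queue<Payment> initiated = new ArrayDeque<>(); // plays the payment-initiated topic

    /** Phase 1: record intent and return quickly, without touching balances. */
    int accept(Payment p, String currency) {
        if (!"INR".equals(currency)) return 422;
        status.put(p.id, "PENDING");
        initiated.add(p);
        return 202;
    }

    /** Phase 2: enforce the balance invariant away from the request path. */
    void settleNext() {
        Payment p = initiated.poll();
        if (p == null) return;
        long available = balances.getOrDefault(p.sender, 0L);
        if (available < p.amount) {
            status.put(p.id, "FAILED");
            return;
        }
        balances.put(p.sender, available - p.amount);
        balances.merge(p.receiver, p.amount, Long::sum);
        status.put(p.id, "COMPLETED");
    }

    public static void main(String[] args) {
        TwoPhaseSketch s = new TwoPhaseSketch();
        s.balances.put("alice", 100L);
        System.out.println(s.accept(new Payment("t1", "alice", "bob", 60), "INR")); // 202
        s.settleNext();
        System.out.println(s.status.get("t1")); // COMPLETED
    }
}
```

Because `accept` never reads or writes balances, its latency doesn't depend on whatever contention `settleNext` runs into.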

### Why Kafka instead of `@Async`?

In-process `@Async` would have been simpler for a demo. I picked Kafka for **durable buffering** during spikes and **independent scaling** of settlement consumers. Retries, manual ack, and a DLQ map cleanly to infrastructure vs business failures. The tradeoff is more to operate: topics, lag, and consumer groups.

* * *

## What broke at scale (and how I fixed it)

At first, smoke tests at 10–50 RPS looked healthy. Dependency health endpoints returned **200**. Business tests passed. When I pushed toward **250 RPS** on the gateway, I hit a failure mode my unit tests never touched.

### The symptom

Under sustained load the gateway started returning **HTTP 500** with:

```text
ERROR: prepared statement "S_82" does not exist
insert into sp_transactions (...)
```

Roughly **40%** of requests failed under that load before tuning. CPU wasn't saturated. The pressure was in connection churn and pooling.

### Isolating the bottleneck

While testing those failure runs, k6 reported **~40%** `http_req_failed` on the gateway. `payment-initiated` **publish volume dropped** with the failed accepts. Ledger **consumer lag stayed near zero**, but not because consumers were keeping up. Fewer events were getting published.

That pointed at **database connection handling** (pooler + prepared statements), not Kafka throughput.

After the JDBC fix, a sustained **250 RPS** run (~946k requests, **single partition**) still showed **growing lag** on `payment-initiated` and a gateway p95 of **~1.75 s**. The bottleneck was now settlement throughput, not the pooler. When I increased to **6 partitions** and **concurrency 3**, p95 dropped to **~111 ms** on a **~999k** run.

### Why the error was misleading

PostgreSQL wasn't rejecting bad SQL. It was rejecting a named prepared statement. The failure needed all of these at once: **Supabase's transaction pooler** (`:6543`), **transaction pooling** mode, and **server-side prepared statements**, which the PostgreSQL JDBC driver creates by default underneath Hibernate (`S_82`, and so on, each tied to a specific backend session).

### Root cause (mechanism)

In **transaction pooling**, PgBouncer assigns a backend connection per transaction, then returns it to the pool. The JDBC driver prepares once and reuses the name `S_82`. The next transaction may land on a **different** backend that never registered `S_82`. The cause is the combination of pooling mode and prepared statements, not a Hibernate bug.

![PgBouncer prepared-statement failure timeline](https://raw.githubusercontent.com/Raghav-byte/Swift-pay/main/docs/images/pgbouncer-failure-timeline.svg align="center")

> **Figure 2:** Under transaction pooling, a prepared statement created on one backend session isn't visible on the next assigned session. You get intermittent `S_N` errors under load.
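
The mechanism is easy to reproduce in a toy model. The sketch below is not PgBouncer or pgjdbc code: `PoolerSim`, `Backend`, and `Driver` are hypothetical names, and the round-robin backend choice is an assumption made only to get a deterministic failure.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of transaction pooling (hypothetical, not real pooler or driver code). */
class PoolerSim {
    /** One backend session with its own set of named prepared statements. */
    static final class Backend {
        final Set<String> prepared = new HashSet<>();
    }

    private final List<Backend> backends = new ArrayList<>();
    private int next = 0;

    PoolerSim(int backendCount) {
        for (int i = 0; i < backendCount; i++) backends.add(new Backend());
    }

    /** Each transaction may get a different backend; round-robin here for determinism. */
    Backend beginTransaction() {
        Backend b = backends.get(next);
        next = (next + 1) % backends.size();
        return b;
    }

    /** Client side: remembers which statement names it believes are already prepared. */
    static final class Driver {
        private final Set<String> known = new HashSet<>();
        private final boolean serverSidePrepare;

        Driver(boolean serverSidePrepare) { this.serverSidePrepare = serverSidePrepare; }

        /** Returns true if the statement runs, false for "prepared statement does not exist". */
        boolean execute(Backend backend, String name) {
            if (!serverSidePrepare) return true;              // unnamed statement, nothing to look up
            if (known.add(name)) backend.prepared.add(name);  // first use: prepare on this backend
            return backend.prepared.contains(name);           // later use: execute by name
        }
    }

    public static void main(String[] args) {
        PoolerSim pool = new PoolerSim(2);
        Driver driver = new Driver(true);
        System.out.println(driver.execute(pool.beginTransaction(), "S_82")); // true: prepared on backend 0
        System.out.println(driver.execute(pool.beginTransaction(), "S_82")); // false: backend 1 never saw it
    }
}
```

With `new Driver(false)` every call succeeds, which is the effect the fix is after: no statement name outlives the transaction that created it.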

### Fix

```yaml
# application.yml (both services)
# pgjdbc connection parameter: 0 stops the driver from switching to named server-side statements
prepareThreshold: 0
hibernate.jdbc.use_server_prepared_stmts: false
```

Keep both while using Supabase's transaction pooler.

* * *

> **Failures often sit in the glue:** poolers, statement caching, partition counts. Not in domain logic.

* * *

### Tuning after the pooler fix

| Change | Why it mattered |
| --- | --- |
| Kafka **after commit** | No orphan `payment-initiated` events on rollback |
| Hikari `maximum-pool-size: 30` | Less pool wait under 250 RPS |
| **6 partitions** + ledger `concurrency: 3` | Parallel settlement (`min(partitions, consumers)`) |
| Producer `linger.ms: 5` | Batched produces under sustained load |

The Java transfer logic stayed the same. With the pooler flags already in place, p95 went from **~1.75 s** (~946k requests) to **~111 ms** (~999k requests) after this tuning.

* * *

## Load test results

**Tool:** k6 (Docker profile). **Target:** `POST /v1/payments` on gateway **:8080**, the latency-sensitive client-facing path.

| Run | Rate | HTTP requests | Error rate | p95 |
| --- | --- | --- | --- | --- |
| Smoke | 10/s | ~600 | 0.00% | ~100 ms |
| Stress | 50/s | 3,001 | 0.00% | ~61 ms |
| **Full** | **250/s** | **999,809** | **0.01%** | **~111 ms** |

Thresholds: `http_req_failed < 1%`, `p(95) < 500ms`. Details: [PERFORMANCE.md](https://github.com/Raghav-byte/Swift-pay/blob/main/docs/PERFORMANCE.md).

![k6 summary, full run at 250 RPS](https://raw.githubusercontent.com/Raghav-byte/Swift-pay/main/docs/images/k6-summary-250rps.svg align="center")

> **Figure 3:** k6 summary from the full 250 RPS run. ~999k requests, sub-1% failure rate, p95 inside the 500 ms threshold.

![p95 latency before vs after pooler fix and Kafka tuning](https://raw.githubusercontent.com/Raghav-byte/Swift-pay/main/docs/images/load-test-p95-comparison.svg align="center")

> **Figure 4:** Gateway p95 dropped from ~1.75 s to ~111 ms after fixing PgBouncer prepared-statement handling and increasing Kafka partition parallelism.

* * *

## Gateway: accept path under load

The gateway is what k6 hammers. Validate, dedupe, **cached balance hint**, persist `PENDING`, then hand off.

1.  **Currency**: INR only; else **422**
    
2.  **Idempotency**: `swiftpay:idempotency:{transactionId}`; duplicate → **409**
    
3.  **Balance**: cache-aside `swiftpay:balance:{ownerId}`, 30s TTL; insufficient → **422**
    
4.  **Persist**: `PENDING` in Postgres
    
5.  **After commit**: `payment-initiated` + idempotency key
    

```java
@Override
public void afterCommit() {
    kafkaProducerService.emitPaymentInitiated(saved);
    idempotencyService.store(dto.getTransactionId());
}
```

* * *

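> Why `afterCommit`: publishing to Kafka before the commit can trigger orphan settlements for rows that later roll back. Writing the Redis idempotency key before the commit makes a legitimate retry look like a duplicate after a DB rollback.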
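
A small simulation shows why the ordering matters. `Tx` below is a hypothetical stand-in for a transaction with after-commit hooks, not Spring's API, and the two strings stand in for the Kafka publish and the Redis write.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy transaction that runs registered hooks only after a successful commit. */
class AfterCommitSketch {
    static final class Tx {
        private final List<Runnable> afterCommit = new ArrayList<>();
        void registerAfterCommit(Runnable hook) { afterCommit.add(hook); }
        void commit() { afterCommit.forEach(Runnable::run); }
        void rollback() { afterCommit.clear(); } // hooks are dropped, never run
    }

    /** Returns the side effects that became visible outside the transaction. */
    static List<String> accept(String transactionId, boolean insertFails) {
        List<String> sideEffects = new ArrayList<>();
        Tx tx = new Tx();
        tx.registerAfterCommit(() -> sideEffects.add("kafka:payment-initiated:" + transactionId));
        tx.registerAfterCommit(() -> sideEffects.add("redis:idempotency:" + transactionId));
        if (insertFails) {
            tx.rollback();
        } else {
            tx.commit();
        }
        return sideEffects;
    }

    public static void main(String[] args) {
        System.out.println(accept("tx-1", false)); // [kafka:payment-initiated:tx-1, redis:idempotency:tx-1]
        System.out.println(accept("tx-2", true));  // []
    }
}
```

On rollback nothing leaks: no event for the ledger to settle, and no idempotency key to reject the client's retry.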

* * *

## Ledger: settlement and locking

**The ledger transaction boundary owns balance correctness. The gateway only owns request acceptance.**

The consumer processes `payment-initiated` with **manual ack**. Business failures go to `payment-failed` + ack. Transient failures get 3× retry, then `payment-failed-dlq`.
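
The routing rule can be sketched as a plain function. This is an illustration under assumptions, not the ledger's listener code: `SettlementRouting` and `BusinessException` are hypothetical names, and "3× retry" is read here as three attempts in total.

```java
import java.util.function.Consumer;

/** Sketch of business-vs-transient failure routing (hypothetical names). */
class SettlementRouting {
    /** Hypothetical marker for failures a retry cannot fix, e.g. insufficient funds. */
    static final class BusinessException extends RuntimeException {
        BusinessException(String message) { super(message); }
    }

    static final int MAX_ATTEMPTS = 3; // assumption: three attempts in total

    /** Returns where the event ends up: "acked", "payment-failed", or "payment-failed-dlq". */
    static String handle(String event, Consumer<String> settle) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                settle.accept(event);
                return "acked";
            } catch (BusinessException e) {
                return "payment-failed";      // no retry: the outcome will not change
            } catch (RuntimeException e) {
                // treated as transient: fall through and try again
            }
        }
        return "payment-failed-dlq";          // attempts exhausted
    }

    public static void main(String[] args) {
        System.out.println(handle("tx-1", e -> { throw new BusinessException("insufficient funds"); }));
        System.out.println(handle("tx-2", e -> { throw new IllegalStateException("connection reset"); }));
    }
}
```

The first call prints `payment-failed` after one attempt. The second prints `payment-failed-dlq` after three.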

**Ordered locking:** take `SELECT … FOR UPDATE` on both accounts, always starting with the **lower UUID**, so two opposite transfers between the same accounts can't deadlock.

```java
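// Always lock the lower UUID first so opposite transfers acquire locks in the same order.
UUID first = event.senderId().compareTo(event.receiverId()) < 0
        ? event.senderId() : event.receiverId();
UUID second = first.equals(event.senderId())
        ? event.receiverId() : event.senderId();
// findByOwnerIdForUpdate(first), then second; re-check balance; debit/credit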
```
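
The same ordering idea can be tried without a database. This sketch swaps row locks for `ReentrantLock`s, which is an assumption for illustration only (the ledger itself relies on `SELECT … FOR UPDATE`). `OrderedLockSketch` and its methods are hypothetical.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** In-memory analogue of ordered row locking (hypothetical, no database involved). */
class OrderedLockSketch {
    private final Map<UUID, Long> balances = new ConcurrentHashMap<>();
    private final Map<UUID, ReentrantLock> locks = new ConcurrentHashMap<>();

    void open(UUID owner, long balance) {
        balances.put(owner, balance);
        locks.put(owner, new ReentrantLock());
    }

    long balanceOf(UUID owner) { return balances.get(owner); }

    boolean transfer(UUID sender, UUID receiver, long amount) {
        // Same rule as the ledger: lower UUID first, whatever the transfer direction.
        UUID first = sender.compareTo(receiver) < 0 ? sender : receiver;
        UUID second = first.equals(sender) ? receiver : sender;
        ReentrantLock l1 = locks.get(first);
        ReentrantLock l2 = locks.get(second);
        l1.lock();
        try {
            l2.lock();
            try {
                if (balances.get(sender) < amount) return false; // re-check under both locks
                balances.merge(sender, -amount, Long::sum);
                balances.merge(receiver, amount, Long::sum);
                return true;
            } finally { l2.unlock(); }
        } finally { l1.unlock(); }
    }

    /** Runs A-to-B and B-to-A transfers concurrently; returns the combined final balance. */
    static long demo(int rounds) {
        OrderedLockSketch ledger = new OrderedLockSketch();
        UUID a = UUID.randomUUID(), b = UUID.randomUUID();
        ledger.open(a, 1_000);
        ledger.open(b, 1_000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < rounds; i++) ledger.transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < rounds; i++) ledger.transfer(b, a, 1); });
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return ledger.balanceOf(a) + ledger.balanceOf(b);
    }
}
```

If each thread locked its own sender first instead, the two threads in `demo` could each hold one lock and wait forever on the other. With a single global order, one of them always gets both.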

A gateway **202** means accepted for processing. Settlement sets `COMPLETED` or `FAILED`.

* * *

## Kafka: partitions bound parallelism

| Topic | Flow |
| --- | --- |
| `payment-initiated` | gateway → ledger |
| `payment-completed` | ledger → analytics |
| `payment-failed-dlq` | exhausted retries |

**6 partitions** on `payment-initiated`, listener `concurrency: 3`. Useful parallelism = `min(partitions, active consumers)`. Extra threads without partitions just add idle workers. The **~1.75 s** p95 run improved after I bumped partition count, not after JVM tuning.
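
A few lines of arithmetic show why threads alone didn't help. The round-robin assignment below is purely illustrative and is not one of Kafka's real assignors. `PartitionParallelism` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrates min(partitions, consumers); not Kafka's actual assignment logic. */
class PartitionParallelism {
    static int usefulParallelism(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    /** Hands partition ids to consumer threads round-robin, for illustration only. */
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) owned.add(new ArrayList<>());
        for (int p = 0; p < partitions; p++) owned.get(p % consumers).add(p);
        return owned;
    }

    public static void main(String[] args) {
        System.out.println(assign(1, 3)); // [[0], [], []]
        System.out.println(assign(6, 3)); // [[0, 3], [1, 4], [2, 5]]
    }
}
```

With one partition, two of the three consumer threads own nothing and sit idle. With six, every thread has work.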

* * *

## Idempotency

Accept path: `EXISTS` → DB commit → `SET` idempotency key (24h), with Postgres `UNIQUE(idempotency_key)` as the backstop. **`EXISTS` + `SET` is not atomic**, so two concurrent requests with the same key can both pass the check. Production would need `SET NX` or insert-only idempotency.
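
In-process, the difference between check-then-set and an atomic claim looks like this. `IdempotencyStore` is a hypothetical in-memory stand-in for Redis: `putIfAbsent` plays the role of `SET NX`, and the 24h expiry is left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for the idempotency key store (hypothetical, no Redis involved). */
class IdempotencyStore {
    private final ConcurrentMap<String, Boolean> keys = new ConcurrentHashMap<>();

    /** Racy: two threads can both see "absent" before either one writes. */
    boolean claimCheckThenSet(String transactionId) {
        if (keys.containsKey(transactionId)) return false;
        keys.put(transactionId, Boolean.TRUE);
        return true;
    }

    /** Atomic: exactly one caller wins for a given key. */
    boolean claimAtomic(String transactionId) {
        return keys.putIfAbsent(transactionId, Boolean.TRUE) == null;
    }
}
```

Both methods behave the same on a single thread. Only `claimAtomic` guarantees a single winner when many requests with the same `transactionId` arrive at once.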

Consumers assume **at-least-once** delivery. Ledger keys off `transactionId` / DB state.
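
Keying off DB state can be sketched as a status check before any money moves. This is a minimal illustration, assuming the check and the update happen atomically (in the ledger that would be the same database transaction as the row locks). `IdempotentSettlement` is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a consumer that tolerates redelivery by checking transaction status. */
class IdempotentSettlement {
    enum Status { PENDING, COMPLETED, FAILED }

    private final Map<String, Status> transactions = new ConcurrentHashMap<>();
    private int settlements = 0;

    void accept(String transactionId) { transactions.put(transactionId, Status.PENDING); }

    /** Handles a possibly redelivered event; returns true only if this call settled it. */
    synchronized boolean onPaymentInitiated(String transactionId) {
        if (transactions.get(transactionId) != Status.PENDING) return false; // unknown or already terminal
        settlements++;                                     // stands in for debit/credit
        transactions.put(transactionId, Status.COMPLETED);
        return true;
    }

    synchronized int settlements() { return settlements; }
}
```

Delivering the same event twice settles once: the second call finds `COMPLETED` and returns without touching balances.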

* * *

## Observability (what exists today)

| Signal | Where |
| --- | --- |
| Dependency health | `/v1/health`, `/health/db`, `/redis`, `/kafka` |
| Load metrics | k6 JSON in `load-test/results/` |
| Failures | `payment-failed-dlq` |
| API | Springdoc OpenAPI per service |

Production would add consumer lag on `payment-initiated`, settlement duration histograms, trace propagation via `transactionId`, and alerts on DLQ growth and Hikari pool wait.

* * *

## Known limitations

| Area | Behavior |
| --- | --- |
| Auth | Audit columns; no principal |
| Idempotency | Check-then-set race |
| Balance cache | 30s TTL vs settlement truth |
| Database | Shared Postgres |
| Analytics | Duplicate rows on redelivery OK for dashboards |

* * *

## What I would change in production

1.  **Transactional outbox** instead of Kafka-only `afterCommit`
    
2.  **Atomic idempotency** (`SET NX` or insert-only table)
    
3.  **Invalidate balance cache** on terminal settlement status
    
4.  **Auth + audit** on events
    
5.  **Split analytics storage** from OLTP
    
6.  **Tracing + SLOs** on accept p95 vs settlement lag
    

* * *

## Final thoughts

Building SwiftPay changed how I think about backend work.

I expected the hard parts to be payment rules and Kafka wiring. The real pain was infrastructure under load. Transaction pooling. Partition counts. Commit ordering. Retries on clients and brokers. Lock contention. Those had more impact on correctness than the business logic.

A lot of backend systems look reliable in development because nobody stress-tests them. Load testing surfaced bugs my unit tests and local setup never would have caught. I had to reason about the whole system, not individual components.

This wasn't about shipping another payment API. I wanted to understand what it takes to keep one correct when thousands of requests hit at once.

If you've run into similar issues, I'd like to hear about them.

[**github.com/Raghav-byte/Swift-pay**](https://github.com/Raghav-byte/Swift-pay)
