09. Design Payment System¶

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: You are on the stage, pitching the city council with a blueprint. Follow the choreography, show reasoning aloud, and admit every honest gap.

Step 1: Requirements & Constraints¶

Start wide. Then narrow. See. Scope first, technology later.

Functional requirements¶

Accept payment intents, authorisations, captures, refunds, and status queries.
Guarantee idempotent request handling for all externally retried actions.
Record every money movement in a double-entry ledger.
Integrate with card processors, banks, and internal order systems.
Run reconciliation against processor reports and bank settlements.
Store only PCI-minimised data and tokenised payment instruments.

Non-functional requirements¶

Never create or lose money because of duplicates or race conditions.
Keep ledger correctness above convenience or fancy product features.
Return user-visible status within a few seconds for common paths.
Survive partner timeouts, duplicate webhooks, and partial downstream outages.
Maintain audit trails suitable for finance and compliance teams.
Keep sensitive card data outside the main application boundary.

Clarifying questions to ask¶

Do we support cards only, or cards plus bank transfer and wallets?
Is capture immediate, delayed, or both depending on the merchant?
What settlement delay should merchants expect before payout?
Are we the merchant of record or a platform for other merchants?
What exact PCI scope do we want to avoid in this design?
How should failed retries surface to the customer and merchant?

What to say on the whiteboard¶

State the user action, core data, and critical latency target.
Split must-have features from nice-to-have features immediately.
Name one honest gap before locking assumptions. Simple, no?
Ask what failure hurts most: money, freshness, or user trust.
Confirm whether single-region launch is acceptable for round one.
Summarise the scope before you move to numbers. Now watch.

Step 2: Scale Estimation¶

Do rough math. Clean math beats fancy math. So what to do? Pick clear assumptions and keep them verbal.

Assumptions¶

Assume 50 million payment attempts per day.
Assume peak load is 12 thousand payment requests per second.
Assume average request payload is 4 KB.
Assume each successful payment creates 6 ledger entries and events.
Assume webhook retries can multiply downstream messages by 3x.
Assume reconciliation files arrive hourly from processors.

Back-of-envelope math¶

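Average load is 50M / 86,400 s, about 580 requests per second, so peak is roughly 20x average.
12k TPS sustained for a full day would be about 1 billion requests. That is the worst-case ceiling, not the expected volume.
At 4 KB per request, peak ingress bandwidth is about 48 MB per second.
Ledger events at 6 per payment mean up to 72 thousand ledger writes per second at peak, assuming every request succeeds.
If each ledger row is 300 bytes, raw ledger growth at peak is about 22 MB per second.
Sustained for a day, that is roughly 1.9 TB raw before indexing and replicas; at the 50M-per-day average it is closer to 90 GB.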
Webhook amplification means idempotency storage must outlive the first response.
With 50M payments per day, each hourly reconciliation file holds about 2 million rows on average.
Keep storage partitioned by merchant and event date for fast audits.
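
The arithmetic above can be re-derived in a few lines of Python. This is a rough sketch that uses only the stated assumptions; the variable names are illustrative, units are decimal (1 KB = 1000 bytes), and the "at peak" figures assume the peak rate is sustained, which is a deliberate worst case.

```python
SECONDS_PER_DAY = 86_400

attempts_per_day = 50_000_000    # assumed daily payment attempts
peak_tps = 12_000                # assumed peak requests per second
payload_bytes = 4_000            # 4 KB per request, decimal units
ledger_rows_per_payment = 6      # ledger entries and events per payment
ledger_row_bytes = 300           # assumed size of one ledger row

average_tps = attempts_per_day / SECONDS_PER_DAY
peak_to_average = peak_tps / average_tps
worst_case_requests_per_day = peak_tps * SECONDS_PER_DAY

ingress_mb_per_sec = peak_tps * payload_bytes / 1e6
ledger_writes_per_sec = peak_tps * ledger_rows_per_payment
ledger_mb_per_sec = ledger_writes_per_sec * ledger_row_bytes / 1e6

# Ceiling (peak all day) versus the figure implied by the daily average.
ledger_tb_per_day_at_peak = ledger_mb_per_sec * SECONDS_PER_DAY / 1e6
ledger_gb_per_day_at_average = (
    attempts_per_day * ledger_rows_per_payment * ledger_row_bytes / 1e9
)

recon_rows_per_hourly_file = attempts_per_day / 24

print(f"average TPS            : {average_tps:,.0f}")
print(f"peak / average         : {peak_to_average:.1f}x")
print(f"worst-case requests/day: {worst_case_requests_per_day:,}")
print(f"peak ingress           : {ingress_mb_per_sec:.0f} MB/s")
print(f"peak ledger writes     : {ledger_writes_per_sec:,} /s")
print(f"peak ledger growth     : {ledger_mb_per_sec:.1f} MB/s")
print(f"ledger/day at peak     : {ledger_tb_per_day_at_peak:.2f} TB")
print(f"ledger/day at average  : {ledger_gb_per_day_at_average:.0f} GB")
print(f"rows per hourly file   : {recon_rows_per_hourly_file:,.0f}")
```

The gap between the peak-sustained ceiling and the average-based figure is the point of the exercise: capacity is sized for peak, storage growth is budgeted closer to the average.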

Interview cue¶

Say the biggest number first, then derive storage and bandwidth.
Round aggressively. Nobody wants calculator theatre on the board.
Mention peak-to-average ratio and why it changes capacity planning.
Keep one reserve factor for retries, bursts, and replays.
Remember the stage is interactive, so sanity-check assumptions aloud.
End with the two numbers that drive architecture choice.

Step 3: High-Level Design¶

Now place the big boxes. Your blueprint should fit in one glance.

+--------+   +----------+   +-------------------+   +-------------+
| Client |-->| API Gate |-->| Payment Service   |-->| PSP Adapter |
+--------+   +----------+   +-------------------+   +-------------+
                             |          |                    |
                             v          v                    v
                       +-----------+ +-----------+    +-------------+
                       | Idem Store| | Ledger DB |    | Webhook In  |
                       +-----------+ +-----------+    +-------------+
                             |          |                    |
                             v          v                    v
                       +-----------+ +-----------+    +-------------+
                       | Event Bus | | Orders    |    | Recon Jobs  |
                       +-----------+ +-----------+    +-------------+

Main components¶

API gateway authenticates merchant calls and rate-limits abuse.
Payment service owns payment intent state machine and orchestration.
Idempotency store maps merchant ID, operation, and client key to the canonical outcome.
Ledger database records immutable debit and credit entries.
PSP adapter isolates partner-specific APIs, retries, and response mapping.
Webhook ingest handles asynchronous partner updates with dedupe.
Event bus fans out payment state changes to orders, risk, and notifications.
Reconciliation jobs compare internal truth against external processor reports.
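
To make the last component concrete, here is a minimal sketch of the comparison a reconciliation job performs. It assumes both the internal records and the processor report can be reduced to a mapping from transaction ID to amount in minor units; `reconcile`, `ReconResult`, and the field names are hypothetical, not part of any real processor format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReconResult:
    missing_internally: set = field(default_factory=set)    # processor has it, we do not
    missing_at_processor: set = field(default_factory=set)  # we have it, processor does not
    amount_mismatches: dict = field(default_factory=dict)   # txn_id -> (ours, theirs)

    @property
    def clean(self) -> bool:
        return not (
            self.missing_internally
            or self.missing_at_processor
            or self.amount_mismatches
        )


def reconcile(internal: dict[str, int], processor: dict[str, int]) -> ReconResult:
    """Compare internal truth against one processor report."""
    result = ReconResult()
    result.missing_internally = set(processor.keys() - internal.keys())
    result.missing_at_processor = set(internal.keys() - processor.keys())
    for txn_id in internal.keys() & processor.keys():
        if internal[txn_id] != processor[txn_id]:
            result.amount_mismatches[txn_id] = (internal[txn_id], processor[txn_id])
    return result


# Example: one matching row, one amount mismatch, one row missing on each side.
report = reconcile(
    internal={"t1": 1000, "t2": 500, "t3": 250},
    processor={"t1": 1000, "t2": 400, "t4": 75},
)
print(report.clean, report.missing_internally, report.amount_mismatches)
```

Anything that lands in one of the three buckets is a candidate for quarantine and review rather than automatic correction.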

Request path¶

Client creates a payment with an idempotency key.
Gateway forwards authenticated request to payment service.
Service checks idempotency store before doing any external side effect.
If new, service creates intent state and calls the processor adapter.
On processor success, service posts balanced ledger entries.
Event bus notifies order service and downstream reporting systems.
Later, webhooks update capture, settlement, or failure status.
Reconciliation jobs close the loop using processor and bank reports.
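
The request path implies a payment intent state machine. The sketch below is one possible shape, not a prescribed one: the state names and the transition table are assumptions chosen to match the steps above, and a real system would likely add refund and dispute states.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    PENDING_CONFIRMATION = "pending_confirmation"  # processor outcome unknown
    AUTHORISED = "authorised"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"


# Allowed moves; terminal states map to an empty set.
ALLOWED = {
    State.CREATED: {State.AUTHORISED, State.PENDING_CONFIRMATION, State.FAILED},
    State.PENDING_CONFIRMATION: {State.AUTHORISED, State.FAILED},
    State.AUTHORISED: {State.CAPTURED, State.FAILED},
    State.CAPTURED: {State.SETTLED},
    State.SETTLED: set(),
    State.FAILED: set(),
}


class IllegalTransition(Exception):
    """Raised when an update is not listed in the transition table."""


def advance(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target


# Happy path: created -> authorised -> captured -> settled.
state = State.CREATED
for nxt in (State.AUTHORISED, State.CAPTURED, State.SETTLED):
    state = advance(state, nxt)
print(state)
```

Keeping the table explicit means both the API path and the webhook path go through the same check, so neither can move an intent into a state the other would not recognise.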

Design narration¶

Start with ingress, then routing, then state, then async work.
Separate control plane decisions from data plane traffic early.
Show where metadata lives and where heavy payloads travel.
Mark caches, queues, and databases with their exact job.
Point out one synchronous dependency you may later relax.
Pause and let the interviewer choose the next zoom-in area.

Step 4: Deep Dive¶

Pick two parts that actually matter. Depth without structure becomes noise. See.

Component A — Idempotency and retry semantics¶

Use merchant ID plus operation type plus client idempotency key as the lookup key.
Persist the first in-flight state before calling any processor.
Return the same final response for exact duplicate retries.
Distinguish retriable timeouts from already-completed processor actions.
Store request hash so a reused key with different payload is rejected.
Give idempotency records a retention window that covers client retry behaviour.
Make webhook processing idempotent too, not just API requests.
Avoid distributed transactions with processors; use state machines and replay.
Design retries with backoff so partner outages do not become self-inflicted DDoS.
Expose operation status endpoints so clients stop blind retrying.
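
The rules above can be sketched as a small in-memory store. This is an illustration under stated assumptions, not a production design: `IdempotencyStore`, `execute`, and the exception names are hypothetical, the dict stands in for a database with an atomic insert-if-absent (for example a unique constraint), and the sketch is not safe for concurrent callers.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


class KeyReusedWithDifferentPayload(Exception):
    pass


class RequestInFlight(Exception):
    pass


@dataclass
class Record:
    request_hash: str
    status: str = "in_flight"
    response: Optional[dict] = None


class IdempotencyStore:
    def __init__(self):
        self._records = {}  # (merchant_id, operation, key) -> Record

    def execute(self, merchant_id, operation, key, payload, action):
        lookup = (merchant_id, operation, key)
        request_hash = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()

        existing = self._records.get(lookup)
        if existing is not None:
            if existing.request_hash != request_hash:
                raise KeyReusedWithDifferentPayload(key)
            if existing.status == "in_flight":
                # Outcome unknown: the caller should poll status, not re-run.
                raise RequestInFlight(key)
            return existing.response  # exact duplicate: same final response

        # Persist the in-flight marker BEFORE any external side effect.
        record = Record(request_hash)
        self._records[lookup] = record
        # If action raises, the record stays in_flight on purpose.
        record.response = action(payload)
        record.status = "completed"
        return record.response


calls = []

def fake_authorise(payload):  # stand-in for the processor adapter
    calls.append(payload)
    return {"status": "authorised", "amount": payload["amount"]}

store = IdempotencyStore()
first = store.execute("m1", "authorise", "k1", {"amount": 1000}, fake_authorise)
retry = store.execute("m1", "authorise", "k1", {"amount": 1000}, fake_authorise)
print(first == retry, len(calls))
```

Leaving a failed call in the in-flight state is a deliberate choice in this sketch: after a timeout the processor may already have acted, so the record is resolved by a status check or reconciliation rather than by running the action again.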

Component B — Ledger and reconciliation¶

Model every movement as immutable debit and credit entries that sum to zero.
Keep ledger append-only so audits are straightforward and trustworthy.
Use account types for customer funds, fees, reserve, settlement, and refunds.
Post ledger entries only after the processor state reaches the required checkpoint.
Separate ledger truth from payment API convenience fields.
Build reconciliation as a repeatable pipeline, not a spreadsheet ritual.
Compare counts, sums, and individual transaction IDs against partner reports.
Quarantine mismatches for finance review and replay missing webhooks safely.
Never patch balances directly; create compensating entries instead.
Record who triggered manual actions and why. Auditability matters.
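
A minimal sketch of the ledger rules above: entries are immutable, every posting must sum to zero, and corrections are new compensating entries. The class, the sign convention (positive = debit, negative = credit), and the account names are assumptions for illustration; amounts are integers in minor units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an entry cannot be edited after creation
class Entry:
    txn_id: str
    account: str
    amount: int  # minor units; positive = debit, negative = credit


class Ledger:
    def __init__(self):
        self._entries = []  # append-only list

    def post(self, txn_id, lines):
        """Append one balanced set of entries; lines maps account -> amount."""
        if sum(lines.values()) != 0:
            raise ValueError(f"unbalanced posting for {txn_id}")
        self._entries.extend(Entry(txn_id, acct, amt) for acct, amt in lines.items())

    def reverse(self, txn_id, reversal_id):
        """Compensate a posting with mirrored entries instead of editing it."""
        original = [e for e in self._entries if e.txn_id == txn_id]
        if not original:
            raise KeyError(txn_id)
        self.post(reversal_id, {e.account: -e.amount for e in original})

    def balance(self, account):
        """Balances are derived from entries, never stored as the truth."""
        return sum(e.amount for e in self._entries if e.account == account)


ledger = Ledger()
# A 10.00 card payment with a 0.30 fee (hypothetical account names).
ledger.post("pay_1", {"settlement": 1000, "customer_funds": -970, "fees": -30})
print(ledger.balance("settlement"), ledger.balance("fees"))

ledger.reverse("pay_1", "rev_pay_1")
print(ledger.balance("settlement"), len(ledger._entries))
```

After the reversal the balances return to zero, but all six entries remain in the history, which is what makes the audit trail trustworthy.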

Deep-dive cue¶

Keep reasoning aloud clean while you zoom in.
Explain data model, hot path, and one ugly edge case.
Tie each deep dive back to a requirement you already named.
If numbers change the design, say that directly.
If one choice is uncertain, park it as research, not panic.
Return to the overall system before you get lost in detail.

Step 5: Tradeoffs & Failure Modes¶

Now show judgment. Interviewers hire the tradeoff thinker, not the diagram artist.

Key tradeoffs¶

Strict consistency protects money, but can increase latency and coupling.
Asynchronous confirmation improves resilience, but user messaging becomes harder.
Longer idempotency retention is safer, but storage and lookup cost rise.
Processor abstraction improves portability, but a lowest-common-denominator API hides partner-specific features, and partner quirks still leak through.
Tokenisation reduces PCI scope, but external vault dependencies become critical.
Reconciliation cannot be synchronous because processor and bank reports arrive after the fact, so delayed truth must be accepted.
One ledger database is simple, but partitioning becomes necessary at very high TPS.
Aggressive retries recover faster, but can duplicate partner-side operations.
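
The retry tradeoff can be sketched as capped exponential backoff with full jitter. This is an illustrative helper, not a library API: `call_with_backoff`, `RetriableError`, and the default parameters are assumptions. It only makes sense for calls that are safe to repeat, for example because the same idempotency key is sent to the partner on every attempt.

```python
import random
import time


class RetriableError(Exception):
    """A failure where repeating the call is believed to be safe."""


def call_with_backoff(action, max_attempts=5, base=0.2, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call action(); on RetriableError wait a jittered, growing delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except RetriableError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: a random fraction of the capped exponential step,
            # so many clients do not retry in lockstep.
            sleep(rng() * min(cap, base * 2 ** attempt))


attempts = {"n": 0}

def flaky_partner_call():  # stand-in for a PSP adapter call
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetriableError("temporary partner error")
    return "ok"

delays = []
result = call_with_backoff(flaky_partner_call, sleep=delays.append)
print(result, len(delays))
```

Errors that are not known to be retriable, such as a timeout after the request was sent, are left to propagate so the caller can mark the payment as pending confirmation instead of repeating the operation.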

Failure modes to discuss¶

Processor timeout after charge success can leave internal state unknown.
Duplicate webhooks can race with client polling and cause inconsistent views.
Missing ledger entries can make balances wrong even when payments succeeded.
Out-of-order webhook delivery can rewind state if handlers are naive.
Bad reconciliation mapping can mark good transactions as missing.
Vault outage can block token resolution even when business logic is healthy.
Key reuse with modified payload can corrupt idempotency semantics.
Manual ops without audit logs can create irrecoverable finance confusion.
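
Two of the failure modes above, duplicate webhooks and out-of-order delivery, can be contained with a small guard. The sketch assumes each webhook carries a unique event ID and that the happy-path statuses can be ranked; `WebhookHandler`, the status names, and the rank table are hypothetical, and failure or refund statuses would need their own rules.

```python
# Happy-path ordering only; higher rank means later in the payment lifecycle.
RANK = {"created": 0, "authorised": 1, "captured": 2, "settled": 3}


class WebhookHandler:
    def __init__(self):
        self.seen_event_ids = set()  # stand-in for a durable dedupe table
        self.payment_status = {}     # payment_id -> latest applied status

    def handle(self, event_id, payment_id, status):
        # 1. Dedupe: a redelivered event must not be applied twice.
        if event_id in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event_id)

        # 2. Ordering: never let a late event rewind a newer state.
        current = self.payment_status.get(payment_id, "created")
        if RANK[status] <= RANK[current]:
            return "stale"

        self.payment_status[payment_id] = status
        return "applied"


handler = WebhookHandler()
print(handler.handle("evt_2", "pay_1", "captured"))    # arrives first
print(handler.handle("evt_2", "pay_1", "captured"))    # redelivery
print(handler.handle("evt_1", "pay_1", "authorised"))  # late, older event
print(handler.payment_status["pay_1"])
```

A naive handler that simply overwrites the status would end this sequence at "authorised", which is exactly the rewind described in the list above.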

Close the answer strongly¶

Say what breaks first under sudden load and how you contain it.
Compare the chosen design against one simpler alternative.
Mention operational metrics, not only code-level correctness.
Admit where future scale may require redesign. Honest and sharp.
Offer a phased rollout plan if the company is early-stage.
Finish with latency, reliability, and cost in one sentence.

Interview Q&A¶

Q1. Why is idempotency not enough by itself?¶

A. Because money systems also need immutable accounting and external reconciliation. Idempotency stops duplicates, but it does not prove balances are right.
Common wrong answer to avoid: If requests are idempotent, the payment system is solved.

Q2. Why use double-entry ledger instead of updating one balance row?¶

A. A ledger preserves history, supports audits, and prevents hidden money creation. Balance rows are outputs, not the primary truth in finance systems.
Common wrong answer to avoid: Balance rows are simpler, so they are always better.

Q3. How do you handle processor timeout after capture?¶

A. Mark the state as pending external confirmation and reconcile later. Do not blindly retry capture if the processor may already have executed it.
Common wrong answer to avoid: Retry the capture immediately until it responds.

Q4. What is the clean way to minimise PCI scope?¶

A. Tokenise card data with a compliant vault and keep raw PAN away from core services. Then isolate the small boundary that touches sensitive data.
Common wrong answer to avoid: Store encrypted card numbers in the main database and call it done.

Apply now (5 min)¶

Run the full choreography with a two-minute timer per step.
Redesign this system for marketplace payouts with delayed settlement.
List the minimum ledger accounts you need for one successful card payment.
Explain when you would authorise first and capture later.
Name three events that must be idempotent besides the initial API call.
Define one reconciliation report for finance and one for engineering.
Say one simplification for a startup that processes only cards in one country.

Bridge. Payments safe. Now AI-specific — an ML inference platform. → 10