03. Architecture Choices — Single Pipeline vs RAG vs Agent vs Multi-Agent¶

~12 min read. Choosing the wrong architecture wastes months; choosing correctly happens in one meeting with the blueprint.

Built on the ELI5 in 00-eli5.md. The foundation — infrastructure choices that everything else rests on — is decided here. A wrong foundation cannot be patched. It must be replaced.

Four architectures, four use cases¶

See. There is no universally best architecture. There is only the right architecture for a given blueprint. Let us draw all four and then learn when each applies.

┌─────────────────────────────────────────────────────────────┐
│  Architecture 1: Single Prompt Pipeline                     │
│  User Query ──▶ LLM ──▶ Response                           │
│  Best for: self-contained tasks, closed-domain, fast        │
├─────────────────────────────────────────────────────────────┤
│  Architecture 2: RAG Pipeline                               │
│  User Query ──▶ Retriever ──▶ Context ──▶ LLM ──▶ Response │
│  Best for: knowledge-grounded Q&A, fresh data, citations    │
├─────────────────────────────────────────────────────────────┤
│  Architecture 3: Single Agent                               │
│  User Query ──▶ LLM ──▶ Tool Call ──▶ Result ──▶ LLM ──▶ …│
│  Best for: multi-step tasks with uncertain tool sequences   │
├─────────────────────────────────────────────────────────────┤
│  Architecture 4: Multi-Agent                                │
│  Orchestrator ──▶ Agent A ──▶ Agent B ──▶ Aggregator       │
│  Best for: parallel subtasks, specialised roles, complexity │
└─────────────────────────────────────────────────────────────┘

Notice: complexity increases downward. Complexity adds latency, cost, and failure surface. Choose the simplest architecture that satisfies your constraints.

Architecture 1: Single Prompt Pipeline¶

This is the baseline. One call to one model. Input: user query + system prompt. Output: response.

When to use: - The task is self-contained (no external data needed at inference time). - The model's training data covers the domain well. - Latency SLA is tight (< 500 ms).

When not to use: - The domain changes faster than model training cycles. - You need citations or source attribution. - Accuracy on specific facts is critical.

The foundation for this architecture is minimal: an API key and a rate limit budget. No vector database. No orchestration framework. Keep it small until constraints force you upward.

Architecture 2: RAG Pipeline¶

RAG adds a retrieval step before generation. This is the plumbing that connects knowledge stores to the model.

                  ┌──────────────────────┐
                  │  Knowledge Base      │
                  │  (Vector DB + Index) │
                  └──────────┬───────────┘
                             │ retrieve top-k chunks
User Query ──▶ Embed ────────┘
                  │
                  ▼
             Assemble Context (chunks + query)
                  │
                  ▼
                 LLM ──▶ Grounded Response

When to use: - Your knowledge base updates frequently (weekly or more). - Users need citations to trust the answer. - Model hallucination on domain facts is unacceptable. - Your context is larger than what fits in a prompt.

When not to use: - The task is purely generative (creative writing, code generation from scratch). - Retrieval latency would break your SLA. - Your knowledge base is small enough to fit in the system prompt entirely.

Simple, no? If the answer lives in a document, use RAG. If it lives in the model, use prompting.

Architecture 3: Single Agent¶

An agent adds a loop. The model can call tools and use results before finishing.

User Query
    │
    ▼
  LLM ──▶  Thought: "I need the current price"
    │
    ▼
  Tool Call: get_price(product="Widget A")
    │
    ▼
  Tool Result: { "price": 14.99 }
    │
    ▼
  LLM ──▶  "The Widget A costs $14.99 as of today."

The key property: the model decides which tools to call and in what order. This is powerful. It is also unpredictable.

When to use: - The task involves multiple steps whose sequence is not known in advance. - External data sources must be queried at inference time. - The user's intent is ambiguous and clarification might be needed.

When not to use: - The task is deterministic and the tool sequence is always the same. - Latency SLA is < 1 000 ms. - You cannot tolerate unpredictable behaviour.

Architecture 4: Multi-Agent¶

Multi-agent splits work across specialised models. One orchestrator, many workers.

          ┌─────────────────────────────────────┐
          │          Orchestrator LLM           │
          └──┬──────────────┬──────────────┬───┘
             │              │              │
             ▼              ▼              ▼
       Agent A         Agent B         Agent C
    (Research)       (Drafting)      (Fact-check)
             │              │              │
             └──────────────┴──────────────┘
                            │
                       Final Output

When to use: - Subtasks are independent and can run in parallel. - Each subtask requires a different skill or tool set. - The overall task is too complex for one context window.

When not to use: - Almost all other situations. Multi-agent adds orchestration complexity, inter-agent error propagation, and cost multiplication. Do not use it until simpler architectures provably fail.

Worked example: choosing under constraints¶

Blueprint says: customer support ticket triage. Constraints: latency ≤ 800 ms, cost ≤ $0.002 per call, privacy: no data leaves VPC.

Apply the filter:

Multi-agent?  No. Orchestration latency alone > 800 ms. Eliminated.
Single agent? Maybe. But latency risk from tool loops. Risky.
RAG?          Yes. One retrieval + one LLM call. 120 + 600 = 720 ms. Fits.
Single prompt? Too risky. KB has 12,000 articles. Model cannot recall them all.

Decision: RAG Pipeline.

Look. The decision took four lines once you had the blueprint. That is the value of the blueprint — it makes architecture obvious. The foundation is now clear: vector DB + embedding model, both on-premise.

Where this lives in the wild¶

Cursor IDE — single agent loop: model decides which file to read, edit, or search next.
Perplexity.ai — RAG pipeline: search → retrieve → grounded answer with citations.
Jasper.ai — single prompt pipeline: creative copy generation from templates.
Harvey (legal AI) — multi-agent: research agent + drafting agent + citation-check agent for complex filings.
Klarna AI — RAG for customer-facing FAQ; constraints (latency, compliance) made agent loop non-viable.

Pause and recall¶

Name all four architectures in order of complexity.
What two conditions make a single prompt pipeline the wrong choice?
In the support ticket example, why was multi-agent eliminated first?
What is the key property of an agent that distinguishes it from a RAG pipeline?

Interview Q&A¶

Q: "When would you choose RAG over a single prompt pipeline?"

A: When the knowledge changes faster than model retraining, when citations are required, or when the domain knowledge is too large for the system prompt. If the model already knows the domain well and facts do not change, a single prompt pipeline is simpler and faster.

Common wrong answer to avoid: "RAG is always better because it uses real data." RAG adds retrieval latency and a new failure mode. It should be chosen, not defaulted to.

Q: "What are the risks of using a multi-agent architecture?"

A: Orchestration latency, error propagation between agents (one agent's bad output cascades to the next), debugging difficulty, and cost multiplication (each agent makes model calls). Use multi-agent only when parallelism provides clear value.

Common wrong answer to avoid: "Multi-agent is more powerful so it gives better results." More complex is not the same as more accurate.

Q: "How do you decide between a single agent and a RAG pipeline for a task?"

A: If the retrieval pattern is always the same — same query, same index, one call — use RAG. If the model must decide what to retrieve, when to retrieve, and whether to query multiple sources, use an agent. When in doubt, start with RAG.

Common wrong answer to avoid: "Use an agent for everything because they're flexible." Flexibility means unpredictability. For production, predictability is more valuable.

Q: "A product manager asks you to add an 'AI assistant' to the app. What is your first question?"

A: What specific job does the user need done, and what is the latency and accuracy requirement? Without that, I cannot choose an architecture. The model choice comes last.

Common wrong answer to avoid: "I'd start by integrating GPT-4 and see what it can do." Demo-first design produces systems that look impressive but fail on real user tasks.

Apply now (5 min)¶

Take the blueprint you wrote in file 02. Apply the constraint filter: which architectures are eliminated immediately? Write one sentence of justification for each elimination. Commit to the simplest architecture that survives.

Sketch from memory: Draw the four architecture diagrams — boxes and arrows only, no labels. Then add the labels from memory. Check against this file.

Bridge. Architecture chosen. Now we go inside the plumbing — how data flows through the pipeline from raw source to assembled context. → 04-data-pipeline-design.md