04. Data Pipeline Design: Retrieval, Context Assembly, Freshness, Chunking

~11 min read. The best model cannot save a bad retrieval pipeline; the data flowing in determines the answer flowing out.

Builds on the ELI5 in 00-eli5.md. The plumbing (the data pipelines connecting all components) carries information from raw sources to the model's context window. Bad plumbing produces correct-sounding wrong answers.

Picture the pipeline before touching code

See. Before writing any code, draw the data flow end-to-end. Here is the canonical RAG data pipeline.

Raw Source          Ingestion           Index              Query-Time
(PDFs, DB,    ──▶  (chunk + embed) ──▶ (vector store) ──▶ (retrieve + rerank + assemble)
 HTML, API)
     │                   │                   │                    │
     │                   │                   │                    ▼
     │                   │                   │             Context Window
     │                   │                   │             [System Prompt]
     │                   │                   │             [Retrieved Chunks]
     │                   │                   │             [User Query]
     │                   │                   │                    │
     └───────────────────┴───────────────────┘                    ▼
                                                               LLM Response

Every box is a decision. Every arrow is a contract that can break (see file 01). The plumbing must be designed, not improvised.

Step 1: Chunking strategy

Chunking is the first decision and the one most teams get wrong. A chunk is the unit of retrieval. Too large: the relevant sentence is buried in irrelevant text that also eats the token budget. Too small: the model gets the relevant words but loses the surrounding meaning.

The right chunk size depends on your task.

Task                        Recommended chunk size  Overlap
──────────────────────────────────────────────────────────
FAQ answer retrieval        200–400 tokens          50 tokens
Legal clause extraction     500–800 tokens          100 tokens
Code function retrieval     Full function           0 tokens
Paragraph-level summaries   300–500 tokens          75 tokens

Overlap prevents meaning from being cut at chunk boundaries. An overlap of 50 tokens means consecutive chunks share 50 tokens of context.
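
The overlap rule above can be sketched as a sliding window over tokens. This is a minimal sketch, not a production chunker: `chunk_tokens` is a hypothetical helper, and a plain list of word strings stands in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: ten placeholder "tokens" instead of output from a real tokenizer.
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the tail of one chunk is the head of the next. With a real tokenizer, only the input list changes.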

Look. Chunking is the plumbing at the source end. A bad chunking strategy cannot be fixed downstream. Test chunking with retrieval recall before building anything else.
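
One way to test chunking with retrieval recall is recall@k over a small labelled set. This is a sketch under assumptions: the chunk ids and the evaluation pairs are hypothetical, and the retriever that produced the ranked ids is left out.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(eval_set, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in eval_set]
    return sum(scores) / len(scores)


# Hypothetical evaluation data: ranked ids from the retriever, then the labelled answers.
eval_set = [
    (["c1", "c7", "c3"], ["c1", "c3"]),  # both relevant chunks are in the top 3
    (["c9", "c2", "c5"], ["c4"]),        # the relevant chunk was missed
]
score = mean_recall_at_k(eval_set, k=3)
print(f"mean recall@3 = {score:.2f}")
```

Run the same evaluation set against each candidate chunk size and overlap, and keep the setting with the best recall.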

Step 2: Embedding and indexing

After chunking, each chunk is converted to an embedding vector. The vector is stored in a vector database with the chunk text as metadata.

Here is the maths, briefly.

Chunk text:  "Refunds are processed within 5 business days."
Embed:        text-embedding-3-small  →  [0.12, -0.87, 0.34, ..., 0.09]  (1536 dims)
Store:        vector_db.upsert(id="chunk_42", vector=..., metadata={text: ..., source: ...})

At query time:

Query text:  "How long does a refund take?"
Embed query: [0.11, -0.84, 0.36, ..., 0.08]
Similarity:  cosine_sim(query_vec, chunk_42_vec) = 0.94  ←  top result

Cosine similarity is the cosine of the angle between two vectors, so it ranges from -1 to 1. Values near 1.0 mean very similar meaning. Values near 0 mean unrelated.

Simple, no? The embedding model is the translator. It converts text into geometry. Similar text lives close together.
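
The similarity step can be sketched in plain Python. The vectors below are toy 3-dimensional values chosen for illustration, not real embeddings, and `rank_chunks` is a hypothetical helper; a vector database would do this search for you.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Return (chunk_id, score) pairs sorted from most to least similar."""
    scored = [(chunk_id, cosine_sim(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): same direction, partly aligned, and orthogonal to the query.
query_vec = [1.0, 0.0, 0.0]
chunk_vecs = {
    "same_direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "partial": [1.0, 1.0, 0.0],
}
ranked = rank_chunks(query_vec, chunk_vecs)
for chunk_id, score in ranked:
    print(chunk_id, round(score, 3))
```

Note that vector length does not matter: `[2.0, 0.0, 0.0]` scores the same as `[1.0, 0.0, 0.0]` because only direction counts.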

Step 3: Freshness and staleness

Every data pipeline needs a freshness policy. Freshness is how often the index is updated. Staleness is the lag between a source update and the matching index update.

Source update cadence    →   Freshness target   →  Ingestion strategy
───────────────────────────────────────────────────────────────────────
Hourly (live prices)    →   < 30 min lag       →  Streaming ingestion
Daily (KB articles)     →   < 4 hours lag      →  Delta sync every few hours + nightly batch
Weekly (policy docs)    →   < 24 hours lag     →  Daily delta sync + weekly full re-index
Rarely (static docs)    →   < 1 week lag       →  On-change trigger

Staleness is a business risk, not a technical nuance. If your KB says "30-day return policy" but the company changed it to "14 days" last week, your system will confidently give wrong answers.

Always write the freshness SLA in the blueprint. Then design the ingestion pipeline to meet it.
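
A freshness SLA is easier to enforce when it is a value in code. The sketch below flags chunks whose last sync is older than the SLA. It assumes each chunk record carries a `last_indexed_at` timestamp; that field name and both helpers are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_indexed_at, sla, now=None):
    """True when the time since the last sync exceeds the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - last_indexed_at > sla


def stale_chunk_ids(chunks, sla, now=None):
    """Ids of all chunks that violate the SLA."""
    return [c["id"] for c in chunks if is_stale(c["last_indexed_at"], sla, now)]


# Fixed clock and hypothetical records so the example is reproducible.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
chunks = [
    {"id": "kb-1", "last_indexed_at": now - timedelta(hours=1)},
    {"id": "kb-2", "last_indexed_at": now - timedelta(hours=6)},
]
stale = stale_chunk_ids(chunks, sla=timedelta(hours=4), now=now)
print(stale)
```

Time since last sync is an upper bound on staleness, which makes it a conservative check: a chunk synced six hours ago may still match its source, but the pipeline can no longer prove it.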

Worked example: calculating freshness cost

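You have 12 000 chunks. Assume each chunk re-embed costs $0.0001. This is an illustrative round figure: embedding APIs such as text-embedding-3-small bill per token, so the real per-chunk cost depends on chunk length and current pricing. Full nightly re-index: 12 000 × $0.0001 = $1.20 per night = $36 per month (30 nights).

Delta sync (only changed chunks): If 5% of articles change daily and chunks are spread evenly across articles, that is 600 chunks × $0.0001 = $0.06 per night = $1.80 per month.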

Strategy        Monthly cost   Staleness risk
──────────────────────────────────────────────────────────────────────────────
Full re-index   $36.00         Low (all fresh every night)
Delta sync      $1.80          Medium (unchanged chunks may be stale in metadata)

Delta sync is 20× cheaper. Full re-index is simpler and more reliable. The right choice depends on your accuracy and budget constraints from the blueprint.
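
The arithmetic above fits in a few lines, which makes it easy to rerun with your own numbers. The per-chunk cost, change rate, and 30-night month are the assumptions from the worked example; `monthly_embedding_cost` is a hypothetical helper.

```python
def monthly_embedding_cost(chunks_per_run, cost_per_chunk, runs_per_month=30):
    """Embedding spend per month for a job that runs once per night."""
    return chunks_per_run * cost_per_chunk * runs_per_month


TOTAL_CHUNKS = 12_000
COST_PER_CHUNK = 0.0001    # illustrative per-chunk figure from the example
DAILY_CHANGE_RATE = 0.05   # share of chunks that change each day

full_cost = monthly_embedding_cost(TOTAL_CHUNKS, COST_PER_CHUNK)
changed_chunks = round(TOTAL_CHUNKS * DAILY_CHANGE_RATE)
delta_cost = monthly_embedding_cost(changed_chunks, COST_PER_CHUNK)

print(f"Full re-index: ${full_cost:.2f} per month")
print(f"Delta sync:    ${delta_cost:.2f} per month")
print(f"Ratio:         {full_cost / delta_cost:.0f}x")
```

The ratio is simply 1 / change rate: at a 5% daily change rate, delta sync embeds one twentieth of the chunks.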

Step 4: Context assembly

Context assembly is the last step of the plumbing before the LLM call. It combines retrieved chunks with the query into the final prompt.

Context window layout:
┌────────────────────────────────────────────────┐
│  SYSTEM PROMPT (role + instructions)  ~200 tok │
│  ─────────────────────────────────────────     │
│  CONTEXT BLOCK                        ~800 tok │
│    [Chunk 1 — score 0.94]                      │
│    [Chunk 2 — score 0.91]                      │
│    [Chunk 3 — score 0.88]                      │
│  ─────────────────────────────────────────     │
│  USER QUERY                            ~50 tok │
└────────────────────────────────────────────────┘
Total: ~1050 tokens. Budget: 1024. OVER.

See. The token budget must be enforced during assembly, not discovered at runtime. Write an assembly function that measures token count and truncates chunks, not the query or the system prompt.

Always keep the query and system prompt intact. Truncate from the bottom of the chunk list (lowest-scored chunks first).
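
A budget-enforcing assembly function might look like the sketch below. It is a sketch under assumptions: `count_tokens` counts whitespace-separated words as a stand-in for the model's real tokenizer, and every chunk text after the first is invented for illustration.

```python
def count_tokens(text):
    """Whitespace word count, standing in for a real tokenizer (assumption)."""
    return len(text.split())


def assemble_context(system_prompt, query, scored_chunks, budget):
    """Build the prompt, dropping lowest-scored chunks until it fits the budget."""
    fixed = count_tokens(system_prompt) + count_tokens(query)
    if fixed > budget:
        raise ValueError("system prompt and query alone exceed the token budget")
    remaining = budget - fixed
    kept = []
    # Walk chunks from highest to lowest score; stop at the first that does not fit.
    for score, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if cost > remaining:
            break  # this chunk and every lower-scored one are dropped
        kept.append(text)
        remaining -= cost
    prompt = "\n\n".join([system_prompt, *kept, query])
    return prompt, kept


system_prompt = "You answer from context only."
query = "How long does a refund take?"
scored_chunks = [
    (0.88, "Gift cards cannot be refunded after activation."),
    (0.94, "Refunds are processed within 5 business days."),
    (0.91, "Refunds go back to the original payment method."),
]
prompt, kept = assemble_context(system_prompt, query, scored_chunks, budget=27)
print(prompt)
```

With a budget of 27 the two highest-scored chunks fit and the 0.88 chunk is dropped. The system prompt and query are never touched; if they alone exceed the budget, the function fails loudly instead of sending a broken prompt.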

Where this lives in the wild

Notion AI — chunks Notion pages at block boundaries, not fixed token counts; block = natural semantic unit.
Glean — enterprise search indexes documents with role-based ACLs embedded in metadata; access control enforced at retrieval time.
Salesforce Einstein — CRM records chunked by field type; customer history chunked separately from product catalogue.
Intercom Fin — freshness SLA of 4 hours; delta ingestion syncs only updated help articles.
Stack Overflow AI — full re-index nightly; accuracy on stable answers matters more than freshness cost.

Pause and recall

What is the risk of chunks that are too large? Too small?
At what stage does staleness become a business risk rather than a technical issue?
In the freshness cost example, why is delta sync 20× cheaper?
Which parts of the context window should never be truncated?

Interview Q&A

Q: "How do you choose chunk size for a RAG system?"

A: I start with the task. FAQ retrieval works well at 200–400 tokens. Legal or technical documents need 500–800 tokens to preserve clause context. I always add overlap (50–100 tokens) to avoid cutting meaning at boundaries. I then measure retrieval recall on a held-out set and adjust.

Common wrong answer to avoid: "I use the default chunk size from the framework." Default chunk sizes are guesses, not task-specific tuning.

Q: "How do you handle stale data in a production RAG system?"

A: I define a freshness SLA in the blueprint: how old is too old for this use case? Then I build a delta ingestion pipeline that syncs changed documents. I also add a staleness check at retrieval time — if a chunk's last-updated timestamp is older than the SLA, I flag it in the response or re-fetch.

Common wrong answer to avoid: "I do a full re-index every hour." That is expensive and often unnecessary; delta sync is cheaper and scales better.
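
One common way to find what a delta sync must touch is to compare content hashes. The sketch below assumes the index stores one hash per document; `plan_delta_sync` and the sample documents are hypothetical, and a real pipeline might rely on source timestamps or change events instead.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta_sync(source_docs, indexed_hashes):
    """Return (ids to re-embed and upsert, ids to delete from the index)."""
    to_upsert = [
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return to_upsert, to_delete


# Hypothetical state: what the index saw last night vs. the source today.
indexed_hashes = {
    "returns": content_hash("30-day return policy"),
    "shipping": content_hash("Ships in 2 days"),
    "legacy": content_hash("Old page"),
}
source_docs = {
    "returns": "14-day return policy",  # changed
    "shipping": "Ships in 2 days",      # unchanged
    "pricing": "New pricing page",      # new
}
to_upsert, to_delete = plan_delta_sync(source_docs, indexed_hashes)
print("re-embed:", to_upsert)
print("delete:", to_delete)
```

Only changed and new documents are re-embedded. Deletions matter as much as updates: a removed page left in the index is stale data too.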

Q: "What happens if the context window overflows during assembly?"

A: I truncate lowest-scoring chunks first, never the system prompt or user query. I enforce the budget inside the assembly function, not at the API call. If even the top chunk exceeds the budget, I chunk more aggressively.

Common wrong answer to avoid: "The API will truncate it automatically." Automatic API truncation may cut the system prompt or the query, breaking the pipeline silently.

Q: "Why does embedding model choice matter for retrieval quality?"

A: The embedding model determines the geometry of the vector space. If query and document embeddings are generated by different models or model versions, cosine similarity becomes unreliable. Always embed queries and documents with the same model. Never mix embedding model versions in the same index.

Common wrong answer to avoid: "Any embedding model works fine." Semantic alignment between query and chunk embedding is the entire basis of retrieval quality.
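
The "same model for queries and documents" rule can be enforced with a guard at query time. This is a minimal sketch: it assumes the index records which model and dimension count built it, and `EmbeddingSpec`, `IndexModelMismatch`, and `check_query_spec` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Which model produced a set of vectors, and how wide they are."""
    model: str
    dimensions: int


class IndexModelMismatch(Exception):
    """Raised when a query would be embedded differently from the index."""


def check_query_spec(index_spec, query_spec):
    """Refuse to search when query and index embeddings are not comparable."""
    if index_spec != query_spec:
        raise IndexModelMismatch(
            f"index built with {index_spec}, query embedded with {query_spec}"
        )


index_spec = EmbeddingSpec(model="text-embedding-3-small", dimensions=1536)

# Same model and dimensions: the check passes silently.
check_query_spec(index_spec, EmbeddingSpec("text-embedding-3-small", 1536))

# Different model: the check fails before any similarity is computed.
try:
    check_query_spec(index_spec, EmbeddingSpec("text-embedding-3-large", 3072))
except IndexModelMismatch as err:
    print("blocked:", err)
```

Failing loudly here is the point. Without the guard, mismatched vectors of the same width would still return scores, just meaningless ones.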

Apply now (5 min)

Take your capstone project blueprint. Design the data pipeline: source format, chunking strategy, chunk size, overlap, freshness SLA. Calculate the monthly embedding cost for a full nightly re-index at your estimated chunk count. Compare with delta sync cost at 5% daily change rate.

Sketch from memory: Draw the end-to-end pipeline from raw source to context window assembly. Label each box. Include the token budget at the assembly stage.

Bridge. Pipeline designed. Now we decide what order to build things in — because build order determines what you learn and when. → 05-implementation-strategy.md