
13. AI Component LLD — room-by-room design for model products

~16 min read. AI products look small from the outside, but inside they need specialised rooms.

Built on the ELI5 in 00-eli5.md. A room, that is, a class or module with a single responsibility, needs a clean hallway and careful wiring when AI parts collaborate.


1) Prompt management: treat prompts like product code, not loose strings

See. A prompt is not just text. It encodes policy, tone, instructions, and placeholders. If you scatter prompt strings across files, drift starts immediately. One engineer edits the safety clause. Another forgets it in a fallback path. Soon your outputs disagree.

So what to do? Create a prompt manager component. It stores template ids, versions, variables, and rollout state. Simple picture.

┌──────────────┐
│ PromptClient │
└──────┬───────┘
       │ asks for
       ▼
┌──────────────┐
│ TemplateRepo │
├──────────────┤
│ id: summary  │
│ version: v3  │
│ locale: en   │
└──────┬───────┘
       │ renders with vars
       ▼
┌──────────────┐
│ RenderEngine │
└──────────────┘

Worked example. Template id is customer_reply. Version v7 adds one refusal sentence. Variables are user_name, issue_type, and refund_eta_days. For ticket 551, refund_eta_days equals 3. Prompt manager should render the same version for all retries in that request.

Concrete code-level sketch.

class PromptTemplate:
    def __init__(self, template_id, version, body):
        self.template_id = template_id
        self.version = version
        self.body = body


class PromptManager:
    def __init__(self, repo, engine):
        # Injected collaborators: the TemplateRepo and RenderEngine from the picture.
        self.repo = repo
        self.engine = engine

    def render(self, template_id, version, variables):
        # Look up the exact version requested, then fill in the variables.
        template = self.repo.get(template_id, version)
        return self.engine.render(template.body, variables)

Simple, no? Store the prompt version with every model response. Otherwise you cannot debug regressions properly. Also keep review workflows. Prompt edits deserve code review, staging checks, and rollback.
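To make the "same version for all retries" and "store the version with every response" advice concrete, here is a minimal Python sketch. It assumes an in-memory repository and uses the standard library's string.Template as a stand-in render engine. InMemoryTemplateRepo, RenderedPrompt, the template body, the user name, and the response_log fields are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    template_id: str
    version: str
    body: str


@dataclass(frozen=True)
class RenderedPrompt:
    # Hypothetical result type: the text plus the metadata needed for debugging.
    template_id: str
    version: str
    text: str


class InMemoryTemplateRepo:
    """Hypothetical repo keyed by (template_id, version)."""

    def __init__(self):
        self._templates = {}

    def add(self, template):
        self._templates[(template.template_id, template.version)] = template

    def get(self, template_id, version):
        return self._templates[(template_id, version)]


class PromptManager:
    def __init__(self, repo):
        self.repo = repo

    def render(self, template_id, version, variables):
        template = self.repo.get(template_id, version)
        # substitute() raises KeyError when a placeholder has no value,
        # so a missing variable fails loudly instead of rendering a gap.
        text = Template(template.body).substitute(variables)
        return RenderedPrompt(template_id, version, text)


repo = InMemoryTemplateRepo()
repo.add(PromptTemplate(
    "customer_reply",
    "v7",
    # Made-up body; the real v7 text is not given in this chapter.
    "Hi $user_name, your $issue_type refund should arrive in $refund_eta_days days.",
))
manager = PromptManager(repo)

variables = {"user_name": "Asha", "issue_type": "order", "refund_eta_days": 3}

# The version is chosen once per request and reused for every retry.
pinned_version = "v7"
attempts = [
    manager.render("customer_reply", pinned_version, variables) for _ in range(3)
]

# Hypothetical log record: the prompt version travels with the model response.
response_log = {
    "ticket_id": 551,
    "prompt_id": attempts[-1].template_id,
    "prompt_version": attempts[-1].version,
}
print(attempts[0].text)
print(response_log)
```

Because the version is pinned before the retry loop, all three attempts render identical text, and the log record is enough to reproduce the exact prompt later.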

2) Model serving internals: registry, batching, and predictable invocation

The serving layer should hide vendor and deployment chaos. Upstream code asks for inference. Serving code decides which model instance, batch policy, and endpoint to use. That is one specialised component, not random helper functions.

A model registry maps logical names to concrete deployments. Example: support-summary maps to claude-sonnet-us-east-v4. Batching groups compatible requests to improve throughput. But batching also adds queue wait. So design both knobs together. Diagram.

┌──────────────┐
│ InferenceAPI │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ ModelRegistry│
└──────┬───────┘
       │ resolve
       ▼
┌──────────────┐
│ BatchQueue   │
└──────┬───────┘
       │ flush
       ▼
┌──────────────┐
│ ModelWorker  │
└──────────────┘

Worked example with numbers. Batch size max is 8. Flush timeout is 20 ms. Average single request latency is 280 ms. With light batching, throughput rises from 25 to 70 requests per second. But if flush timeout becomes 120 ms, a request can sit in the queue for up to 120 ms before its batch is sent, and p95 may cross SLA. So what to do? Tune by workload, not by optimism.

Concrete code-level sketch.

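// Maps a logical use case and region to a concrete deployment.
// DeploymentRef, InferenceJob, and InferenceResult are assumed to be defined elsewhere.
interface ModelRegistry {
  resolve(useCase: string, region: string): DeploymentRef
}

// Queues a job and resolves once the batch it joined has run.
interface Batcher {
  enqueue(job: InferenceJob): Promise<InferenceResult>
}

Also expose idempotency keys and request ids. Serving bugs are nasty when duplicate retries produce duplicate charges.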
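The two batching knobs, max batch size and flush timeout, can be sketched in a few lines of Python. This is a simplified, single-threaded illustration and not a production batcher: the clock is passed in explicitly so the behaviour is deterministic, and BatchQueue.tick and worst_case_latency_ms are hypothetical helpers. The latency estimate assumes a batch takes roughly as long as a single request (280 ms in the worked example), which will not hold for every model.

```python
class BatchQueue:
    """Flushes when the batch is full or the oldest job has waited too long."""

    def __init__(self, max_batch_size=8, flush_timeout_ms=20):
        self.max_batch_size = max_batch_size
        self.flush_timeout_ms = flush_timeout_ms
        self._pending = []
        self._first_enqueued_at_ms = None

    def enqueue(self, job_id, now_ms):
        """Add a job; return the full batch if the size limit is hit, else None."""
        if not self._pending:
            self._first_enqueued_at_ms = now_ms
        self._pending.append(job_id)
        if len(self._pending) >= self.max_batch_size:
            return self._flush()
        return None

    def tick(self, now_ms):
        """Return a partial batch once the oldest job has waited the full timeout."""
        if self._pending and now_ms - self._first_enqueued_at_ms >= self.flush_timeout_ms:
            return self._flush()
        return None

    def _flush(self):
        batch, self._pending = self._pending, []
        self._first_enqueued_at_ms = None
        return batch


def worst_case_latency_ms(flush_timeout_ms, model_latency_ms):
    # Assumption: batch execution time is about the single-request latency.
    return flush_timeout_ms + model_latency_ms


queue = BatchQueue(max_batch_size=8, flush_timeout_ms=20)

# Eight jobs arriving 1 ms apart fill the batch, so it flushes on size.
full_batch = None
for i in range(8):
    full_batch = queue.enqueue(f"job-{i}", now_ms=i)

# A lone job flushes only after the 20 ms timeout has elapsed.
queue.enqueue("job-8", now_ms=100)
too_early = queue.tick(now_ms=110)
on_timeout = queue.tick(now_ms=120)

print(len(full_batch), too_early, on_timeout)
print(worst_case_latency_ms(20, 280), worst_case_latency_ms(120, 280))
```

Under the stated assumption, a 20 ms timeout puts the worst case near 300 ms, while a 120 ms timeout pushes it toward 400 ms. That is the queue-wait cost that has to be weighed against the throughput gain.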

3) Feature transformation pipelines: make inputs model-ready, stage by stage

Most production AI features depend on transformed inputs. Raw events become embeddings, normalized fields, or scored signals. If transformation code is mixed into serving or prompting, ownership becomes muddy. Keep feature transformation as a separate pipeline component. Each stage should declare input, output, and failure mode. See this flow.

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ raw text │ → │ cleanup  │ → │ language │ → │ embed    │
└──────────┘   └──────────┘   └──────────┘   └────┬─────┘
                                                  │
                                                  ▼
                                             ┌──────────┐
                                             │ feature  │
                                             │ vector   │
                                             └──────────┘

Worked example. Review text has 600 characters. Cleanup removes HTML and drops the length to 540. Language detector tags it as hi-en mixed. Embedder outputs a 1536-length vector. A downstream ranker then combines vector similarity with price score 0.82. That is not one method. That is a pipeline with typed stages.

Concrete code-level sketch.

// Context handed from stage to stage.
record FeatureContext(String text, String lang, float[] vector) {}

// A stage takes the current context and returns the context for the next stage.
interface FeatureStage {
  FeatureContext apply(FeatureContext ctx);
}

Why stage this explicitly? Because broken features poison outputs silently. You want stage-wise logs, retry policies, and validation checks.
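The staged flow above, with stage-wise logging and validation, might look like this in Python. Treat it as a toy under loud assumptions: the language detector is a character-range heuristic, and embed is a deterministic placeholder that returns 4 numbers, not a real 1536-dimension embedding model. run_pipeline and the sample review text are hypothetical.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class FeatureContext:
    text: str
    lang: Optional[str] = None
    vector: Optional[Tuple[float, ...]] = None


def cleanup(ctx):
    # Strip anything that looks like an HTML tag, then collapse whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", ctx.text)
    return replace(ctx, text=" ".join(no_tags.split()))


def detect_language(ctx):
    # Placeholder heuristic: Devanagari plus ASCII letters is tagged "hi-en".
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in ctx.text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in ctx.text)
    if has_devanagari and has_latin:
        lang = "hi-en"
    elif has_devanagari:
        lang = "hi"
    else:
        lang = "en"
    return replace(ctx, lang=lang)


def embed(ctx, dim=4):
    # Toy embedding: NOT a real model, just a deterministic fixed-length vector.
    vec = [0.0] * dim
    for i, ch in enumerate(ctx.text):
        vec[i % dim] += ord(ch) / 1000.0
    return replace(ctx, vector=tuple(vec))


def run_pipeline(ctx, stages):
    for stage in stages:
        ctx = stage(ctx)
        # Stage-wise validation: fail fast instead of passing bad features on.
        if not ctx.text:
            raise ValueError(f"{stage.__name__} produced empty text")
        # Stage-wise log line, so a silent feature break is visible per stage.
        print(f"{stage.__name__}: len={len(ctx.text)} lang={ctx.lang}")
    return ctx


raw = FeatureContext(text="<p>Phone अच्छा hai, <b>battery</b> ok</p>")
result = run_pipeline(raw, [cleanup, detect_language, embed])
```

Each stage returns a new context rather than mutating the old one, which keeps the per-stage logs honest about what changed where.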

4) Guardrail layer design: safety should be a component, not a side thought

Many teams bolt safety on at the end. Then they discover it must inspect input, retrieval context, and output. So guardrails deserve their own component boundary. Usually this layer performs policy checks before and after model generation. It may redact PII, block unsafe intents, enforce tool rules, and scan output. Think in passes.

┌──────────────┐
│ InputChecks  │
├──────────────┤
│ ContextRules │
├──────────────┤
│ OutputChecks │
└──────────────┘

Worked example. Input contains a phone number and a self-harm phrase. InputChecks masks the phone number. Policy classifier gives risk score 0.91. Request is escalated to a safe response template. If generation still emits disallowed medical advice, OutputChecks blocks it. That two-pass design matters.

Concrete code-level sketch.

class GuardrailDecision:
    def __init__(self, allowed, reason, transformed_text):
        self.allowed = allowed
        self.reason = reason
        self.transformed_text = transformed_text


class GuardrailLayer:
    def before_model(self, ctx):
        # Input and context checks, for example PII redaction and intent blocking.
        ...

    def after_model(self, ctx, output):
        # Output checks, run on what the model actually generated.
        ...

Simple, no? A safe AI system does not trust one check at one point. It layers checks across the path.
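A two-pass guardrail can be sketched in Python as follows. Everything policy-related here is a stand-in: PHONE_PATTERN only catches plain 10-digit numbers, RISK_THRESHOLD is an arbitrary cut-off, toy_risk_score replaces a real policy classifier with a keyword check on the placeholder token RISKY_PHRASE, and the dosage regex is a deliberately crude output rule.

```python
import re
from dataclasses import dataclass

PHONE_PATTERN = re.compile(r"\b\d{10}\b")  # assumption: plain 10-digit numbers only
RISK_THRESHOLD = 0.8                       # assumption: hypothetical cut-off
DOSAGE_PATTERN = re.compile(r"\b\d+\s?mg\b")


@dataclass
class GuardrailDecision:
    allowed: bool
    reason: str
    transformed_text: str


def toy_risk_score(text):
    # Stand-in for a policy classifier; a real one would be a trained model.
    return 0.91 if "risky_phrase" in text.lower() else 0.05


class GuardrailLayer:
    def before_model(self, text):
        # Pass 1: redact PII first, then score the redacted input.
        masked = PHONE_PATTERN.sub("[PHONE]", text)
        if toy_risk_score(masked) >= RISK_THRESHOLD:
            return GuardrailDecision(False, "escalate_to_safe_template", masked)
        return GuardrailDecision(True, "ok", masked)

    def after_model(self, output):
        # Pass 2: inspect what the model actually produced.
        if DOSAGE_PATTERN.search(output.lower()):
            return GuardrailDecision(False, "disallowed_medical_advice", "")
        return GuardrailDecision(True, "ok", output)


layer = GuardrailLayer()

risky = layer.before_model("Call me on 9876543210, RISKY_PHRASE")
normal = layer.before_model("Where is my refund? Call 9876543210")
blocked_output = layer.after_model("Take 500 mg twice a day")
clean_output = layer.after_model("Your refund arrives in 3 days.")

print(risky)
print(normal)
print(blocked_output.reason, clean_output.allowed)
```

The phone number is masked on both inputs, but only the risky one is escalated. The output pass catches a problem that no input check could have seen.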

5) Evaluation harness: make quality measurable before users discover regressions

Prompt tweaks and model swaps need repeatable evaluation. Otherwise teams judge quality from two screenshots and one loud opinion. Build an evaluation harness as a first-class testing component. It should load datasets, execute candidates, score outputs, and publish comparisons. The harness must record prompt version, model version, feature flags, and seed when possible. Diagram.

┌──────────────┐
│ TestDataset  │
└──────┬───────┘
       │ feeds
       ▼
┌──────────────┐
│ CandidateRun │
└──────┬───────┘
       │ scored by
       ▼
┌──────────────┐
│ Evaluators   │
├──────────────┤
│ exact match  │
│ rubric score │
│ cost/latency │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Report       │
└──────────────┘

Worked example. Dataset has 200 support tickets. Candidate A answers 176 acceptably (88 percent). Candidate B answers 184 acceptably (92 percent) but costs 22 percent more. Average latency of B is 1.9 seconds versus 1.4 seconds. Now the decision becomes explicit.

Concrete code-level sketch.

// Scores one candidate output against its expected case.
interface Evaluator {
  score(output: CandidateOutput, expected: ExpectedCase): EvalScore
}

// Runs a candidate over a whole dataset and returns a comparable report.
interface Harness {
  run(datasetId: string, candidate: CandidateSpec): EvalReport
}

See the LLD lesson. Evaluation is not a spreadsheet ritual. It is a module with clear contracts and repeatable outputs.
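Here is a minimal Python sketch of a harness with one evaluator, plus the arithmetic from the worked example. The field names on CandidateSpec and EvalReport, the tiny intent-label dataset, the fake_model lookup, and the version strings are all hypothetical. A real harness would also track cost, latency, feature flags, and seed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSpec:
    name: str
    prompt_version: str
    model_version: str


@dataclass(frozen=True)
class EvalReport:
    candidate: CandidateSpec
    total: int
    acceptable: int

    @property
    def pass_rate(self):
        return self.acceptable / self.total


def exact_match(output, expected):
    # Case-insensitive, whitespace-tolerant exact match.
    return output.strip().lower() == expected.strip().lower()


def run(dataset, candidate, generate, evaluator=exact_match):
    """dataset: list of (input, expected); generate: callable standing in for the model."""
    acceptable = sum(1 for inp, exp in dataset if evaluator(generate(inp), exp))
    return EvalReport(candidate, len(dataset), acceptable)


dataset = [
    ("Where is my refund?", "refund_status"),
    ("Cancel my order", "cancel_order"),
    ("App keeps crashing", "bug_report"),
]
# Fake model: a lookup table with one formatting quirk and one wrong answer.
fake_model = {
    "Where is my refund?": "refund_status",
    "Cancel my order": " Cancel_Order ",
    "App keeps crashing": "refund_status",
}.get

candidate = CandidateSpec(name="A", prompt_version="v7", model_version="model-a")
report = run(dataset, candidate, fake_model)
print(report.acceptable, "of", report.total)

# The numbers from the worked example, expressed as reports.
report_a = EvalReport(CandidateSpec("A", "v6", "model-a"), total=200, acceptable=176)
report_b = EvalReport(CandidateSpec("B", "v7", "model-b"), total=200, acceptable=184)
gain_points = round((report_b.pass_rate - report_a.pass_rate) * 100, 1)
print(gain_points)
```

Because every report carries its candidate spec, the comparison reads as "B gains 4 points for 22 percent more cost and 0.5 seconds more latency", which is a decision a team can argue about with facts.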


Where this lives in the wild

At OpenAI, a product engineer can rely on prompt versioning and eval harnesses before changing system prompts for a user-facing feature. At Netflix, an ML platform engineer may design model registries and batch queues for recommendation inference services. At Duolingo, an applied AI engineer can isolate guardrails and prompt templates for lesson feedback generation. At Meesho, a search relevance engineer may build feature transformation pipelines for multilingual product ranking. At AWS, a SageMaker platform engineer might separate registry, serving, and evaluation components to support many model teams.


Pause and recall

Why should prompt templates have versions instead of plain string constants?
What tradeoff appears when batching improves throughput?
Why keep feature transformation outside the model serving layer?
What facts must an evaluation harness record to make comparisons trustworthy?


Interview Q&A

Why PromptManager not plain strings in service code?

PromptManager centralizes versioning, rendering, rollout, and debugging metadata. Common wrong answer to avoid: Prompts are just text, so normal constants are enough forever.

Why model registry not direct endpoint URLs in code?

A registry decouples use case intent from deployment details and region changes. Common wrong answer to avoid: Hardcoded endpoints are simpler because everyone knows the current model.

Why separate guardrail layer not one final moderation call?

Guardrails often need pre-model and post-model checks with different policies and actions. Common wrong answer to avoid: One moderation API after generation covers the whole safety problem.

Why evaluation harness not ad-hoc manual testing?

A harness preserves repeatability, comparability, and release confidence across prompt and model changes. Common wrong answer to avoid: If outputs look good in a few samples, shipping is fine.


Apply now (5 min)

Pick one AI feature such as support reply generation. Name five components: prompt manager, serving layer, feature pipeline, guardrail layer, and eval harness. For each, write one responsibility and one contract method. Sketch from memory: draw the full flow from raw input to transformed features to prompt render to model call to guardrail to evaluation result.


Bridge. We have designed the AI rooms. Next, let us admit honestly where LLD advice itself becomes fuzzy. → 14-honest-admission.md