Skip to content

05. Lead Frameworks

Hiring rubric

Levels at a glance

Level Years Scope Headline expectation
Mid (L4-equiv) 3-5 Owns features inside an existing AI product Can ship a RAG / agent feature with reasonable evals
Senior (L5-equiv) 5-8 Owns subsystems; mentors mids Designs and ships AI subsystems end-to-end with production discipline
Staff (L6-equiv) 8-12 Cross-team technical influence Sets technical direction; resolves cross-team architecture
Lead / Principal (L7+ or first IC at small company) 10+ Org-level technical vision Defines AI strategy; hires and grows the function

8-area rubric

Area Probe Mid bar Senior / Lead bar Green signal Red signal
LLM fundamentals Explain transformers; what is KV cache? Correct mental model Trade-offs, memory math, architecture implications Derives from first principles Buzzwords only
Prompting & orchestration Reduce hallucinations in a customer assistant Few-shot, structure, basic guardrails Retrieval, validators, refusal policy, cost/latency trade-offs Real production failure mode discussed "Just use better prompts"
RAG Design enterprise wiki RAG; debug 20% wrong answers Chunk, embed, retrieve, augment prompt Hybrid retrieval, rerank, citations, evals, freshness and when not to RAG Separates retrieval vs generation failures Cannot isolate failure class
Agents What are tool calls; how do you stop loops? Single-agent loop Iteration caps, planner/executor, idempotency, escalation, when not to use agents Has specific war stories Treats every problem as agentic
Evals How do you evaluate an LLM feature before launch? Offline eval set, basic metrics LLM-as-judge, human review, drift detection, regression gates, blast radius Evals treated as load-bearing "We tried it and it looked good"
MLOps / production LLM bill up 40% with flat traffic — debug Can name logs/metrics Serving stacks, batching, KV cache, SLOs, dashboards, rollouts Has operated or designed on-call systems Has only called APIs
Judgment / build-vs-buy Prompt vs RAG vs fine-tune vs self-host Understands basic tree Adds vendor risk, data sensitivity, talent cost, real numbers Prefers boring solution first Defaults to trendiest option
Communication & collaboration Behavioral: disagreed with PM about an AI feature Clear ownership and trade-offs Can translate across research, eng, product, legal, finance Structured, calm, explicit about trade-offs Fuzzy or combative

Calibration questions

  • Walk me through the worst AI feature you shipped. What did you learn?
  • What is the smallest change that would have most improved your last AI project?
  • What AI capability do you actively distrust?

Candidate anti-patterns

  • Treats every LLM problem as a prompting problem.
  • Cannot separate retrieval failure from generation failure.
  • Builds agents when deterministic flows would do.
  • Quotes benchmarks without understanding them.
  • Has never operated a production service.

Build / buy / fine-tune decision tree

1. Can prompting alone solve it?
   yes -> stop.
   no  -> 2

2. Is the missing ingredient knowledge?
   yes -> RAG.
   no  -> 3

3. Is the missing ingredient behavior / style / format reliability?
   yes -> 4
   no  -> 5

4. Is prompt engineering + few-shot enough?
   yes -> stop.
   no  -> fine-tune (LoRA first; full FT only if needed).

5. Is the missing ingredient latency / cost / data residency?
   yes -> distill + self-host small model.
   no  -> reframe; you may not need ML.

When each option wins

  • Prompting: fastest iteration; vendor quality improving; no clean dataset.
  • RAG: knowledge changes often; knowledge is private; citations required; clean retrieval index possible.
  • Fine-tuning: stable format/style needed; prompt needs compressing; tight latency budget; good dataset + eval set available.
  • Distill + self-host: high-volume stable workload; cost/latency is binding; data sensitivity matters; GPU + ops capacity exists.

Honest fine-tuning costs

  • Dataset curation is the hard part.
  • Eval set matters more than training set.
  • LoRA is cheaper than full FT.
  • Fine-tuned systems still need ongoing eval.
  • Vendor model improvements can obsolete your FT quickly.

Vendor vs open-weights

Factor Vendor API Open-weights
Time to first prototype Hours Days to weeks
Peak quality Highest available Improving; usually lags frontier vendors
Cost at scale Per-token, predictable Better at high utilization; fixed GPU cost
Latency control Vendor-bounded Yours to tune
Data residency Vendor terms Your environment
Lock-in Higher Lower
Ops burden Minimal Real MLOps + GPUs + on-call
Upgrade cadence Vendor upgrades automatically You choose and manage upgrades
  • Default: prototype on vendor; evaluate self-hosting only when workload justifies it.

Build vs buy across the stack

Layer Default choice Build only when...
Foundation model Buy / use open weights Almost never
Embedding model Buy / open-source Strong domain-specific need
Vector store Buy / OSS Almost never
Eval harness Start with vendor/OSS Own rubrics or workflow demand more
Orchestration Use framework early Long-term abstractions or framework rot justify custom
Observability Buy / existing platform Need product-specific depth
Serving vLLM / TGI / SageMaker Almost never
Prompt management Vendor / OSS Versioning + A/B needs are custom

Phased rollout

Phase Goal Exit criteria Typical duration
Spike Prove the model can do the task One end-to-end demo with messy code 1-2 weeks
PoC Prove value with real data Quality bar agreed; offline evals defined 2-4 weeks
Pilot Real users, narrow scope SLOs met on small cohort; no sev-1 issues for 2 weeks 4-8 weeks
GA Broad rollout Cost / latency / quality budgets defended; on-call in place Ongoing
  • Anti-pattern: skipping phases under deadline pressure.

Team shapes

Shape When to use Headcount Trade-off
Embedded AI engineers Many product surfaces; AI is feature, not product 1-2 per product team Fast velocity; weak shared infra
Central AI / applied ML team Few high-leverage AI surfaces 4-8 in one team Strong craft; risk of bottleneck
Platform + applied split 10+ AI engineers; shared infra justified 4-6 platform, 6-12 applied Scales well; needs good contracts
+ Research arm Frontier moat required 2-4 researchers Expensive; only worth it for true asymmetry
  • Default for Series B / mid-stage: central applied team of 4-6 + 1 platform-leaning engineer.
  • Split into platform + applied around 10 engineers.

Roles inside the team

Role What they own Hire when
AI Engineer (applied) Features, prompts, tools, evals Day 1
AI Platform Engineer Serving, eval harness, observability Once 2+ applied engineers exist
MLOps / Infra GPUs, deployment, cost, environments Self-hosting or multi-model complexity starts
ML Engineer (classical) Traditional ML pipelines Work truly demands it
Research / Applied Scientist Novel methods, papers Competitive asymmetry requires it
Eval / Quality specialist Gold sets, review workflow, rubrics Eval volume justifies specialization

Prioritizing initiatives

Dimension Question
Value If this works, what business/user impact appears?
Confidence Have we or peers shipped this before?
Leverage Does this unblock other work?
Reversibility Can we roll it back cleanly?
Capacity Do we have people, evals, and ops to support prod?
  • Default: ship one high-confidence, high-leverage win first; then take a bigger swing.

Cross-functional contracts

Partner We owe They owe
PM Capability bounds; eval-backed claims Acceptance criteria; user data access
Design Failure-state UX; latency expectations Streaming / refusal patterns
Data Eng Schema + freshness needs Reliable pipelines; quality SLAs
Security / Privacy Threat model; data flow Approved vendors; data classification
Legal / Compliance Use-case and vendor list Approval cadence; risk register
Finance Cost forecast; unit economics Budget visibility; chargeback if needed

Cost equation

$ per call ~= (input_tokens x $/1M_input)
          + (output_tokens x $/1M_output)
          + (cached_input x cache discount)
  • Output tokens are often the dominant cost.
  • Token counts compound fast: system prompt + few-shot + history + retrieved context.
  • Always log actual token usage.

8-lever cost playbook

Lever Effort Typical impact Notes
1. Cap output tokens Minutes 10-40% Lower max_tokens by endpoint
2. Prompt caching Hours to a day 30-70% Put stable prefixes first
3. Model routing Days 30-80% Cheap-for-easy, expensive-for-hard
4. Prompt compression Days 10-30% Remove dead context; summarize history
5. Batch API Hours Up to 50% Use for async workloads
6. RAG over long context Days to weeks 30-70% Retrieve chunks instead of stuffing docs
7. Self-host open weights Weeks + ops 50-90% at high utilization Only if utilization math works
8. Distill / fine-tune small models Weeks 60-90% per inference Needs strong dataset + eval discipline

Self-host break-even

API cost / month = QPS x 86_400 x 30 x avg_tokens x $/token
Self-host cost   = (GPUs x $/hour x 730) + ops_overhead
  • Illustrative heuristics:
  • One H100 ~= \(1.5K-\)3.5K/month depending on pricing/utilization assumptions.
  • Self-hosting usually becomes interesting once spend is well above hobby scale and workload is stable.
  • Practical rule:
  • < $20K/month: stay on vendor APIs.
  • \(50K-\)100K+/month: self-hosting deserves real math.
  • Between: model case by case.

Executive dashboard

Panel Why it matters
$ this month vs forecast Shows budget control
$ by feature / team Shows where spend concentrates
$ per active user Shows unit economics
Cache hit rate Measures prompt-caching leverage
Tokens by model tier Measures routing efficiency
Avg output tokens / call Measures answer-length control
% on Batch API Measures async discount usage

Governance practices

Practice What it means
Per-feature budgets Soft alerts at 75%; hard cutoffs at 100% unless overridden
Pre-launch cost review Every AI feature gets a forecast before code freeze
Quarterly cost-down sprint Dedicated time for lever work
Vendor pricing watch Track price changes; renegotiate when spend warrants
Per-PR cost diff CI estimates $/call delta on prompt or model changes
Eval gates before rollout Prompt/model changes need offline checks
Audit logs for sensitive actions Needed for trust, security, and compliance
Kill switches + rollback paths Required for high-stakes AI features

Useful interview answer templates

  • Build-vs-buy: Default to prompting; then RAG; then fine-tune only with clean dataset + evals; self-host only when volume or residency justifies ops burden.
  • Cost: Make cost visible by feature/model/user; then run levers in order: output caps, caching, routing, compression, batch, retrieval, self-host, distill.
  • Team design: Start central applied team with one platform-leaning engineer; split once shared infra becomes load-bearing.