05. Lead Frameworks¶
Hiring rubric¶
Levels at a glance¶
| Level | Years | Scope | Headline expectation |
|---|---|---|---|
| Mid (L4-equiv) | 3-5 | Owns features inside an existing AI product | Can ship a RAG / agent feature with reasonable evals |
| Senior (L5-equiv) | 5-8 | Owns subsystems; mentors mids | Designs and ships AI subsystems end-to-end with production discipline |
| Staff (L6-equiv) | 8-12 | Cross-team technical influence | Sets technical direction; resolves cross-team architecture |
| Lead / Principal (L7+ or first IC at small company) | 10+ | Org-level technical vision | Defines AI strategy; hires and grows the function |
8-area rubric¶
| Area | Probe | Mid bar | Senior / Lead bar | Green signal | Red signal |
|---|---|---|---|---|---|
| LLM fundamentals | Explain transformers; what is KV cache? | Correct mental model | Trade-offs, memory math, architecture implications | Derives from first principles | Buzzwords only |
| Prompting & orchestration | Reduce hallucinations in a customer assistant | Few-shot, structure, basic guardrails | Retrieval, validators, refusal policy, cost/latency trade-offs | Real production failure mode discussed | "Just use better prompts" |
| RAG | Design enterprise wiki RAG; debug 20% wrong answers | Chunk, embed, retrieve, augment prompt | Hybrid retrieval, rerank, citations, evals, freshness and when not to RAG | Separates retrieval vs generation failures | Cannot isolate failure class |
| Agents | What are tool calls; how do you stop loops? | Single-agent loop | Iteration caps, planner/executor, idempotency, escalation, when not to use agents | Has specific war stories | Treats every problem as agentic |
| Evals | How do you evaluate an LLM feature before launch? | Offline eval set, basic metrics | LLM-as-judge, human review, drift detection, regression gates, blast radius | Evals treated as load-bearing | "We tried it and it looked good" |
| MLOps / production | LLM bill up 40% with flat traffic — debug | Can name logs/metrics | Serving stacks, batching, KV cache, SLOs, dashboards, rollouts | Has operated or designed on-call systems | Has only called APIs |
| Judgment / build-vs-buy | Prompt vs RAG vs fine-tune vs self-host | Understands basic tree | Adds vendor risk, data sensitivity, talent cost, real numbers | Prefers boring solution first | Defaults to trendiest option |
| Communication & collaboration | Behavioral: disagreed with PM about an AI feature | Clear ownership and trade-offs | Can translate across research, eng, product, legal, finance | Structured, calm, explicit about trade-offs | Fuzzy or combative |
Calibration questions¶
- Walk me through the worst AI feature you shipped. What did you learn?
- What is the smallest change that would have most improved your last AI project?
- What AI capability do you actively distrust?
Candidate anti-patterns¶
- Treats every LLM problem as a prompting problem.
- Cannot separate retrieval failure from generation failure.
- Builds agents when deterministic flows would do.
- Quotes benchmarks without understanding them.
- Has never operated a production service.
Build / buy / fine-tune decision tree¶
1. Can prompting alone solve it?
yes -> stop.
no -> 2
2. Is the missing ingredient knowledge?
yes -> RAG.
no -> 3
3. Is the missing ingredient behavior / style / format reliability?
yes -> 4
no -> 5
4. Is prompt engineering + few-shot enough?
yes -> stop.
no -> fine-tune (LoRA first; full FT only if needed).
5. Is the missing ingredient latency / cost / data residency?
yes -> distill + self-host small model.
no -> reframe; you may not need ML.
When each option wins¶
- Prompting: fastest iteration; vendor quality improving; no clean dataset.
- RAG: knowledge changes often; knowledge is private; citations required; clean retrieval index possible.
- Fine-tuning: stable format/style needed; prompt needs compressing; tight latency budget; good dataset + eval set available.
- Distill + self-host: high-volume stable workload; cost/latency is binding; data sensitivity matters; GPU + ops capacity exists.
Honest fine-tuning costs¶
- Dataset curation is the hard part.
- Eval set matters more than training set.
- LoRA is cheaper than full FT.
- Fine-tuned systems still need ongoing eval.
- Vendor model improvements can obsolete your FT quickly.
Vendor vs open-weights¶
| Factor | Vendor API | Open-weights |
|---|---|---|
| Time to first prototype | Hours | Days to weeks |
| Peak quality | Highest available | Improving; usually lags frontier vendors |
| Cost at scale | Per-token, predictable | Better at high utilization; fixed GPU cost |
| Latency control | Vendor-bounded | Yours to tune |
| Data residency | Vendor terms | Your environment |
| Lock-in | Higher | Lower |
| Ops burden | Minimal | Real MLOps + GPUs + on-call |
| Upgrade cadence | Vendor upgrades automatically | You choose and manage upgrades |
- Default: prototype on vendor; evaluate self-hosting only when workload justifies it.
Build vs buy across the stack¶
| Layer | Default choice | Build only when... |
|---|---|---|
| Foundation model | Buy / use open weights | Almost never |
| Embedding model | Buy / open-source | Strong domain-specific need |
| Vector store | Buy / OSS | Almost never |
| Eval harness | Start with vendor/OSS | Own rubrics or workflow demand more |
| Orchestration | Use framework early | Long-term abstractions or framework rot justify custom |
| Observability | Buy / existing platform | Need product-specific depth |
| Serving | vLLM / TGI / SageMaker | Almost never |
| Prompt management | Vendor / OSS | Versioning + A/B needs are custom |
Phased rollout¶
| Phase | Goal | Exit criteria | Typical duration |
|---|---|---|---|
| Spike | Prove the model can do the task | One end-to-end demo with messy code | 1-2 weeks |
| PoC | Prove value with real data | Quality bar agreed; offline evals defined | 2-4 weeks |
| Pilot | Real users, narrow scope | SLOs met on small cohort; no sev-1 issues for 2 weeks | 4-8 weeks |
| GA | Broad rollout | Cost / latency / quality budgets defended; on-call in place | Ongoing |
- Anti-pattern: skipping phases under deadline pressure.
Team shapes¶
| Shape | When to use | Headcount | Trade-off |
|---|---|---|---|
| Embedded AI engineers | Many product surfaces; AI is feature, not product | 1-2 per product team | Fast velocity; weak shared infra |
| Central AI / applied ML team | Few high-leverage AI surfaces | 4-8 in one team | Strong craft; risk of bottleneck |
| Platform + applied split | 10+ AI engineers; shared infra justified | 4-6 platform, 6-12 applied | Scales well; needs good contracts |
| + Research arm | Frontier moat required | 2-4 researchers | Expensive; only worth it for true asymmetry |
- Default for Series B / mid-stage: central applied team of 4-6 + 1 platform-leaning engineer.
- Split into platform + applied around 10 engineers.
Roles inside the team¶
| Role | What they own | Hire when |
|---|---|---|
| AI Engineer (applied) | Features, prompts, tools, evals | Day 1 |
| AI Platform Engineer | Serving, eval harness, observability | Once 2+ applied engineers exist |
| MLOps / Infra | GPUs, deployment, cost, environments | Self-hosting or multi-model complexity starts |
| ML Engineer (classical) | Traditional ML pipelines | Work truly demands it |
| Research / Applied Scientist | Novel methods, papers | Competitive asymmetry requires it |
| Eval / Quality specialist | Gold sets, review workflow, rubrics | Eval volume justifies specialization |
Prioritizing initiatives¶
| Dimension | Question |
|---|---|
| Value | If this works, what business/user impact appears? |
| Confidence | Have we or peers shipped this before? |
| Leverage | Does this unblock other work? |
| Reversibility | Can we roll it back cleanly? |
| Capacity | Do we have people, evals, and ops to support prod? |
- Default: ship one high-confidence, high-leverage win first; then take a bigger swing.
Cross-functional contracts¶
| Partner | We owe | They owe |
|---|---|---|
| PM | Capability bounds; eval-backed claims | Acceptance criteria; user data access |
| Design | Failure-state UX; latency expectations | Streaming / refusal patterns |
| Data Eng | Schema + freshness needs | Reliable pipelines; quality SLAs |
| Security / Privacy | Threat model; data flow | Approved vendors; data classification |
| Legal / Compliance | Use-case and vendor list | Approval cadence; risk register |
| Finance | Cost forecast; unit economics | Budget visibility; chargeback if needed |
Cost equation¶
$ per call ~= (input_tokens x $/1M_input)
+ (output_tokens x $/1M_output)
+ (cached_input x cache discount)
- Output tokens are often the dominant cost.
- Token counts compound fast: system prompt + few-shot + history + retrieved context.
- Always log actual token usage.
8-lever cost playbook¶
| Lever | Effort | Typical impact | Notes |
|---|---|---|---|
| 1. Cap output tokens | Minutes | 10-40% | Lower max_tokens by endpoint |
| 2. Prompt caching | Hours to a day | 30-70% | Put stable prefixes first |
| 3. Model routing | Days | 30-80% | Cheap-for-easy, expensive-for-hard |
| 4. Prompt compression | Days | 10-30% | Remove dead context; summarize history |
| 5. Batch API | Hours | Up to 50% | Use for async workloads |
| 6. RAG over long context | Days to weeks | 30-70% | Retrieve chunks instead of stuffing docs |
| 7. Self-host open weights | Weeks + ops | 50-90% at high utilization | Only if utilization math works |
| 8. Distill / fine-tune small models | Weeks | 60-90% per inference | Needs strong dataset + eval discipline |
Self-host break-even¶
API cost / month = QPS x 86_400 x 30 x avg_tokens x $/token
Self-host cost = (GPUs x $/hour x 730) + ops_overhead
- Illustrative heuristics:
- One H100 ~= \(1.5K-\)3.5K/month depending on pricing/utilization assumptions.
- Self-hosting usually becomes interesting once spend is well above hobby scale and workload is stable.
- Practical rule:
- < $20K/month: stay on vendor APIs.
- \(50K-\)100K+/month: self-hosting deserves real math.
- Between: model case by case.
Executive dashboard¶
| Panel | Why it matters |
|---|---|
| $ this month vs forecast | Shows budget control |
| $ by feature / team | Shows where spend concentrates |
| $ per active user | Shows unit economics |
| Cache hit rate | Measures prompt-caching leverage |
| Tokens by model tier | Measures routing efficiency |
| Avg output tokens / call | Measures answer-length control |
| % on Batch API | Measures async discount usage |
Governance practices¶
| Practice | What it means |
|---|---|
| Per-feature budgets | Soft alerts at 75%; hard cutoffs at 100% unless overridden |
| Pre-launch cost review | Every AI feature gets a forecast before code freeze |
| Quarterly cost-down sprint | Dedicated time for lever work |
| Vendor pricing watch | Track price changes; renegotiate when spend warrants |
| Per-PR cost diff | CI estimates $/call delta on prompt or model changes |
| Eval gates before rollout | Prompt/model changes need offline checks |
| Audit logs for sensitive actions | Needed for trust, security, and compliance |
| Kill switches + rollback paths | Required for high-stakes AI features |
Useful interview answer templates¶
- Build-vs-buy: Default to prompting; then RAG; then fine-tune only with clean dataset + evals; self-host only when volume or residency justifies ops burden.
- Cost: Make cost visible by feature/model/user; then run levers in order: output caps, caching, routing, compression, batch, retrieval, self-host, distill.
- Team design: Start central applied team with one platform-leaning engineer; split once shared infra becomes load-bearing.