05. Lead Frameworks¶

Hiring rubric¶

Levels at a glance¶

Level	Years	Scope	Headline expectation
Mid (L4-equiv)	3-5	Owns features inside an existing AI product	Can ship a RAG / agent feature with reasonable evals
Senior (L5-equiv)	5-8	Owns subsystems; mentors mids	Designs and ships AI subsystems end-to-end with production discipline
Staff (L6-equiv)	8-12	Cross-team technical influence	Sets technical direction; resolves cross-team architecture
Lead / Principal (L7+ or first IC at small company)	10+	Org-level technical vision	Defines AI strategy; hires and grows the function

8-area rubric¶

Area	Probe	Mid bar	Senior / Lead bar	Green signal	Red signal
LLM fundamentals	Explain transformers; what is KV cache?	Correct mental model	Trade-offs, memory math, architecture implications	Derives from first principles	Buzzwords only
Prompting & orchestration	Reduce hallucinations in a customer assistant	Few-shot, structure, basic guardrails	Retrieval, validators, refusal policy, cost/latency trade-offs	Real production failure mode discussed	"Just use better prompts"
RAG	Design enterprise wiki RAG; debug 20% wrong answers	Chunk, embed, retrieve, augment prompt	Hybrid retrieval, rerank, citations, evals, freshness and when not to RAG	Separates retrieval vs generation failures	Cannot isolate failure class
Agents	What are tool calls; how do you stop loops?	Single-agent loop	Iteration caps, planner/executor, idempotency, escalation, when not to use agents	Has specific war stories	Treats every problem as agentic
Evals	How do you evaluate an LLM feature before launch?	Offline eval set, basic metrics	LLM-as-judge, human review, drift detection, regression gates, blast radius	Evals treated as load-bearing	"We tried it and it looked good"
MLOps / production	LLM bill up 40% with flat traffic — debug	Can name logs/metrics	Serving stacks, batching, KV cache, SLOs, dashboards, rollouts	Has operated or designed on-call systems	Has only called APIs
Judgment / build-vs-buy	Prompt vs RAG vs fine-tune vs self-host	Understands basic tree	Adds vendor risk, data sensitivity, talent cost, real numbers	Prefers boring solution first	Defaults to trendiest option
Communication & collaboration	Behavioral: disagreed with PM about an AI feature	Clear ownership and trade-offs	Can translate across research, eng, product, legal, finance	Structured, calm, explicit about trade-offs	Fuzzy or combative

Calibration questions¶

Walk me through the worst AI feature you shipped. What did you learn?
What is the smallest change that would have most improved your last AI project?
What AI capability do you actively distrust?

Candidate anti-patterns¶

Treats every LLM problem as a prompting problem.
Cannot separate retrieval failure from generation failure.
Builds agents when deterministic flows would do.
Quotes benchmarks without understanding them.
Has never operated a production service.

Build / buy / fine-tune decision tree¶

1. Can prompting alone solve it?
   yes -> stop.
   no  -> 2

2. Is the missing ingredient knowledge?
   yes -> RAG.
   no  -> 3

3. Is the missing ingredient behavior / style / format reliability?
   yes -> 4
   no  -> 5

4. Is prompt engineering + few-shot enough?
   yes -> stop.
   no  -> fine-tune (LoRA first; full FT only if needed).

5. Is the missing ingredient latency / cost / data residency?
   yes -> distill + self-host small model.
   no  -> reframe; you may not need ML.

When each option wins¶

Prompting: fastest iteration; vendor quality improving; no clean dataset.
RAG: knowledge changes often; knowledge is private; citations required; clean retrieval index possible.
Fine-tuning: stable format/style needed; prompt needs compressing; tight latency budget; good dataset + eval set available.
Distill + self-host: high-volume stable workload; cost/latency is binding; data sensitivity matters; GPU + ops capacity exists.

Honest fine-tuning costs¶

Dataset curation is the hard part.
Eval set matters more than training set.
LoRA is cheaper than full FT.
Fine-tuned systems still need ongoing eval.
Vendor model improvements can obsolete your FT quickly.

Vendor vs open-weights¶

Factor	Vendor API	Open-weights
Time to first prototype	Hours	Days to weeks
Peak quality	Highest available	Improving; usually lags frontier vendors
Cost at scale	Per-token, predictable	Better at high utilization; fixed GPU cost
Latency control	Vendor-bounded	Yours to tune
Data residency	Vendor terms	Your environment
Lock-in	Higher	Lower
Ops burden	Minimal	Real MLOps + GPUs + on-call
Upgrade cadence	Vendor upgrades automatically	You choose and manage upgrades

Default: prototype on vendor; evaluate self-hosting only when workload justifies it.

Build vs buy across the stack¶

Layer	Default choice	Build only when...
Foundation model	Buy / use open weights	Almost never
Embedding model	Buy / open-source	Strong domain-specific need
Vector store	Buy / OSS	Almost never
Eval harness	Start with vendor/OSS	Own rubrics or workflow demand more
Orchestration	Use framework early	Long-term abstractions or framework rot justify custom
Observability	Buy / existing platform	Need product-specific depth
Serving	vLLM / TGI / SageMaker	Almost never
Prompt management	Vendor / OSS	Versioning + A/B needs are custom

Phased rollout¶

Phase	Goal	Exit criteria	Typical duration
Spike	Prove the model can do the task	One end-to-end demo with messy code	1-2 weeks
PoC	Prove value with real data	Quality bar agreed; offline evals defined	2-4 weeks
Pilot	Real users, narrow scope	SLOs met on small cohort; no sev-1 issues for 2 weeks	4-8 weeks
GA	Broad rollout	Cost / latency / quality budgets defended; on-call in place	Ongoing

Anti-pattern: skipping phases under deadline pressure.

Team shapes¶

Shape	When to use	Headcount	Trade-off
Embedded AI engineers	Many product surfaces; AI is feature, not product	1-2 per product team	Fast velocity; weak shared infra
Central AI / applied ML team	Few high-leverage AI surfaces	4-8 in one team	Strong craft; risk of bottleneck
Platform + applied split	10+ AI engineers; shared infra justified	4-6 platform, 6-12 applied	Scales well; needs good contracts
+ Research arm	Frontier moat required	2-4 researchers	Expensive; only worth it for true asymmetry

Default for Series B / mid-stage: central applied team of 4-6 + 1 platform-leaning engineer.
Split into platform + applied around 10 engineers.

Roles inside the team¶

Role	What they own	Hire when
AI Engineer (applied)	Features, prompts, tools, evals	Day 1
AI Platform Engineer	Serving, eval harness, observability	Once 2+ applied engineers exist
MLOps / Infra	GPUs, deployment, cost, environments	Self-hosting or multi-model complexity starts
ML Engineer (classical)	Traditional ML pipelines	Work truly demands it
Research / Applied Scientist	Novel methods, papers	Competitive asymmetry requires it
Eval / Quality specialist	Gold sets, review workflow, rubrics	Eval volume justifies specialization

Prioritizing initiatives¶

Dimension	Question
Value	If this works, what business/user impact appears?
Confidence	Have we or peers shipped this before?
Leverage	Does this unblock other work?
Reversibility	Can we roll it back cleanly?
Capacity	Do we have people, evals, and ops to support prod?

Default: ship one high-confidence, high-leverage win first; then take a bigger swing.

Cross-functional contracts¶

Partner	We owe	They owe
PM	Capability bounds; eval-backed claims	Acceptance criteria; user data access
Design	Failure-state UX; latency expectations	Streaming / refusal patterns
Data Eng	Schema + freshness needs	Reliable pipelines; quality SLAs
Security / Privacy	Threat model; data flow	Approved vendors; data classification
Legal / Compliance	Use-case and vendor list	Approval cadence; risk register
Finance	Cost forecast; unit economics	Budget visibility; chargeback if needed

Cost equation¶

$ per call ~= (input_tokens x $/1M_input)
          + (output_tokens x $/1M_output)
          + (cached_input x cache discount)

Output tokens are often the dominant cost.
Token counts compound fast: system prompt + few-shot + history + retrieved context.
Always log actual token usage.

8-lever cost playbook¶

Lever	Effort	Typical impact	Notes
1. Cap output tokens	Minutes	10-40%	Lower `max_tokens` by endpoint
2. Prompt caching	Hours to a day	30-70%	Put stable prefixes first
3. Model routing	Days	30-80%	Cheap-for-easy, expensive-for-hard
4. Prompt compression	Days	10-30%	Remove dead context; summarize history
5. Batch API	Hours	Up to 50%	Use for async workloads
6. RAG over long context	Days to weeks	30-70%	Retrieve chunks instead of stuffing docs
7. Self-host open weights	Weeks + ops	50-90% at high utilization	Only if utilization math works
8. Distill / fine-tune small models	Weeks	60-90% per inference	Needs strong dataset + eval discipline

Self-host break-even¶

API cost / month = QPS x 86_400 x 30 x avg_tokens x $/token
Self-host cost   = (GPUs x $/hour x 730) + ops_overhead

Illustrative heuristics:
One H100 ~= $1.5K-$3.5K/month depending on pricing/utilization assumptions.
Self-hosting usually becomes interesting once spend is well above hobby scale and workload is stable.
Practical rule:
< $20K/month: stay on vendor APIs.
$50K-$100K+/month: self-hosting deserves real math.
Between: model case by case.

Executive dashboard¶

Panel	Why it matters
$ this month vs forecast	Shows budget control
$ by feature / team	Shows where spend concentrates
$ per active user	Shows unit economics
Cache hit rate	Measures prompt-caching leverage
Tokens by model tier	Measures routing efficiency
Avg output tokens / call	Measures answer-length control
% on Batch API	Measures async discount usage

Governance practices¶

Practice	What it means
Per-feature budgets	Soft alerts at 75%; hard cutoffs at 100% unless overridden
Pre-launch cost review	Every AI feature gets a forecast before code freeze
Quarterly cost-down sprint	Dedicated time for lever work
Vendor pricing watch	Track price changes; renegotiate when spend warrants
Per-PR cost diff	CI estimates $/call delta on prompt or model changes
Eval gates before rollout	Prompt/model changes need offline checks
Audit logs for sensitive actions	Needed for trust, security, and compliance
Kill switches + rollback paths	Required for high-stakes AI features

Useful interview answer templates¶

Build-vs-buy: Default to prompting; then RAG; then fine-tune only with clean dataset + evals; self-host only when volume or residency justifies ops burden.
Cost: Make cost visible by feature/model/user; then run levers in order: output caps, caching, routing, compression, batch, retrieval, self-host, distill.
Team design: Start central applied team with one platform-leaning engineer; split once shared infra becomes load-bearing.