09. Sprint Planning for Research-Heavy AI Work — Plan the unknown without pretending it is known¶
~17 min read. Good AI sprints answer risky questions before they promise clean delivery.
Built on the ELI5 in 00-eli5.md. The weather check, a reminder that risk must be scanned before sailing, keeps research planning honest.
1) Why normal sprint plans break¶
Look. A feature sprint assumes the path is already visible. Research-heavy AI work rarely behaves that way.
You may know the user problem and the deadline. But the model behavior is still foggy.
One prompt tweak may help, while another fails badly. A vendor API may change cost or latency.
So what happens in weak planning? Teams write feature-looking tasks for question-looking work. Then the sprint board becomes theater.
planned like certainty
│
├── fake estimates
├── hidden risk
├── vague ownership
├── panic near demo
└── broken trust
See. Research does not fail planning because research is bad. It fails planning when we pretend uncertainty is gone.
That is where the weather check matters. Before choosing tasks, name the storms. Before promising dates, name the unknowns.
The course should reflect learning goals. Not wishful engineering fiction. Simple, no?
2) Plan around questions, not fantasy certainty¶
A better sprint plan starts with questions. Not just deliverables.
Ask things like:
- Can this model summarize invoices within acceptable error bounds?
- Can we ground answers from our document set reliably enough?
- Can we hit latency targets with retrieval turned on?
- Can humans review failures without slowing the workflow too much?
Each question needs an owner, a time box, a test method, and exit criteria.
Exit criteria mean, "At the end, we will know enough to continue, pivot, or stop."
That is the real deliverable. Learning with a decision attached. Not just activity.
See the difference. The crew can now row in one direction. The ship's log can capture what was learned.
Without this structure, research turns into wandering. With this structure, research becomes decision-making work.
The weather check appears again here. If the risk is large, shrink the question. Do not stretch the sprint.
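A question card can be sketched as a small data structure. This is only an illustration, not a prescribed format: the class name `ResearchQuestion`, its field names, and the sample owner and time box are all assumptions made for the example. The fields follow what this lesson asks for (owner, time box, exit criteria) plus the test method named in the exercise at the end.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchQuestion:
    """One research question card. All field names here are illustrative."""

    question: str
    owner: str = ""
    time_box_days: int = 0
    test_method: str = ""
    exit_criteria: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of the fields that are still empty."""
        missing = []
        if not self.owner.strip():
            missing.append("owner")
        if self.time_box_days <= 0:
            missing.append("time_box_days")
        if not self.test_method.strip():
            missing.append("test_method")
        if not self.exit_criteria:
            missing.append("exit_criteria")
        return missing

    def is_sprint_ready(self) -> bool:
        """A card enters the sprint only when nothing is missing."""
        return not self.missing_fields()


card = ResearchQuestion(
    question="Can we hit latency targets with retrieval turned on?",
    owner="Priya",      # hypothetical owner
    time_box_days=3,    # hypothetical time box
)
print(card.missing_fields())  # the card still lacks a test method and exit criteria
```

A check like this keeps a half-written question off the board. If the card cannot be filled in, the question is probably still too large and should be shrunk.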
3) Use the four-stage flow¶
Most research-heavy AI delivery fits four stages: exploration spike, prototype, hardening, and launch.
Look at the flow.
┌──────────────┐
│ exploration  │
└──────┬───────┘
       │ answer key questions
       ▼
┌──────────────┐
│  prototype   │
└──────┬───────┘
       │ prove value shape
       ▼
┌──────────────┐
│  hardening   │
└──────┬───────┘
       │ reduce operational risk
       ▼
┌──────────────┐
│    launch    │
└──────────────┘
Stage one is not about polish. It asks whether the idea works and under what conditions.
Stage two proves a narrow user path. Not the whole platform. Not every edge case.
Stage three handles the adult problems: evaluation quality, failure modes, monitoring, cost, fallbacks, and security review.
Stage four means you are ready to carry traffic. Now the course is clearer. Now estimates start behaving better.
A common mistake is jumping straight to hardening talk while the core research question is still unanswered. That only creates expensive confusion.
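The four-stage flow can be sketched as a gate: work moves forward one stage at a time, and only when the current stage's question is answered. This is a minimal sketch, not a process tool. The names `Stage`, `GATE_QUESTION`, and `next_stage` are hypothetical, and the gate questions are paraphrased from the stage descriptions above.

```python
from enum import Enum


class Stage(Enum):
    EXPLORATION = 1
    PROTOTYPE = 2
    HARDENING = 3
    LAUNCH = 4


# The question each stage must answer before the work moves on
# (paraphrased from the lesson text; wording is illustrative).
GATE_QUESTION = {
    Stage.EXPLORATION: "Does the idea work, and under what conditions?",
    Stage.PROTOTYPE: "Does one narrow user path prove value?",
    Stage.HARDENING: "Are evaluation, monitoring, cost, and fallbacks handled?",
}


def next_stage(current: Stage, gate_answered: bool) -> Stage:
    """Advance exactly one stage, and only if the current gate is answered."""
    if current is Stage.LAUNCH or not gate_answered:
        return current
    return Stage(current.value + 1)


stage = Stage.EXPLORATION
stage = next_stage(stage, gate_answered=False)  # stays in exploration
print(stage.name, "->", GATE_QUESTION[stage])
```

Because `next_stage` never skips a stage, hardening work cannot start while the exploration question is still open.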
4) Exit criteria protect the sprint¶
A research sprint needs stronger endings than, "We learned a lot." That sentence sounds wise. It often means nothing actionable happened.
So what to do? Write exit criteria that force a fork: continue, change direction, or stop.
Good exit criteria sound like this:
- If accuracy stays below threshold on the chosen dataset, stop this path.
- If retrieval improves answer quality by a clear margin, move to prototype.
- If latency exceeds the user tolerance, test a smaller context strategy.
- If review load is too high, redesign the human-in-the-loop step.
See. Every line ends in a decision. That is why the compass matters. It turns evidence into movement.
The ship's log matters too. Write what was tested. Write what was excluded. Write what counts as good enough.
Then the next sprint starts from memory, not from repeated debate. The crew wastes less energy. Trust improves.
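Exit criteria that force a fork can be written down as a small function before the sprint starts. The sketch below covers two of the criteria from this section (accuracy and latency). The function name `apply_exit_criteria` and both threshold values are assumptions for illustration; real thresholds come from your own dataset and users.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    PIVOT = "pivot"
    STOP = "stop"


def apply_exit_criteria(
    accuracy: float,
    p95_latency_s: float,
    min_accuracy: float = 0.90,  # hypothetical threshold
    max_latency_s: float = 3.0,  # hypothetical user tolerance
) -> Decision:
    """Map sprint evidence to one of the three forks."""
    if accuracy < min_accuracy:
        return Decision.STOP      # accuracy below threshold: stop this path
    if p95_latency_s > max_latency_s:
        return Decision.PIVOT     # too slow: test a smaller context strategy
    return Decision.CONTINUE      # evidence is good enough: move to prototype


print(apply_exit_criteria(accuracy=0.93, p95_latency_s=4.2).value)
```

Writing the thresholds before the evidence arrives is the point. It stops the team from moving the goalposts near the demo.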
5) A practical board for one sprint¶
Picture one sprint with three research lanes. Question lane. Evidence lane. Decision lane.
┌────────────────┬────────────────┬────────────────┐
│ question       │ evidence       │ decision       │
├────────────────┼────────────────┼────────────────┤
│ can RAG help?  │ eval results   │ prototype / no │
│ can cost fit?  │ usage traces   │ proceed / trim │
│ can review fit?│ reviewer notes │ redesign / no  │
└────────────────┴────────────────┴────────────────┘
This board is useful because it removes fantasy progress. A task is not green because effort happened. A task is green because a decision became possible.
Now add one more habit. Limit active questions. If ten unknowns run together, nothing finishes well.
Better to answer two hard questions cleanly than to touch eight badly. Simple, no?
The course becomes legible. The weather check becomes visible. The ship's log becomes useful for the next sprint.
That is outcome-based planning. Not certainty theater. Disciplined exploration.
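The two board rules above (green means a decision exists, and active questions are limited) can be sketched in a few lines. This is a toy model under stated assumptions: the class `SprintBoard`, its method names, and the default cap of two open questions are all hypothetical.

```python
class SprintBoard:
    """Question / evidence / decision board with a cap on open questions."""

    def __init__(self, max_active: int = 2):  # the cap of 2 is an assumption
        self.max_active = max_active
        self.rows: dict[str, dict] = {}

    def active_questions(self) -> list[str]:
        """Questions that do not have a decision yet."""
        return [q for q, row in self.rows.items() if row["decision"] is None]

    def add_question(self, question: str) -> None:
        if len(self.active_questions()) >= self.max_active:
            raise ValueError("too many open questions; close one first")
        self.rows[question] = {"evidence": [], "decision": None}

    def log_evidence(self, question: str, note: str) -> None:
        self.rows[question]["evidence"].append(note)

    def decide(self, question: str, decision: str) -> None:
        if not self.rows[question]["evidence"]:
            raise ValueError("no evidence logged; a decision needs evidence")
        self.rows[question]["decision"] = decision

    def is_green(self, question: str) -> bool:
        """Green means a decision exists, not that effort happened."""
        return self.rows[question]["decision"] is not None


board = SprintBoard()
board.add_question("can RAG help?")
board.log_evidence("can RAG help?", "eval results on the chosen dataset")
print(board.is_green("can RAG help?"))  # effort alone does not turn the row green
board.decide("can RAG help?", "prototype")
print(board.is_green("can RAG help?"))
```

Closing a question frees a slot, so the cap pushes the crew to finish decisions rather than open new unknowns.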
Where this lives in the wild¶
- GitHub Copilot coding workflow — engineering manager plans a spike on repository indexing quality before promising agent-wide rollout.
- Intercom AI support assistant — product lead time-boxes whether retrieval actually reduces ticket resolution time.
- Banking document copilot — legal and ML lead tests redaction reliability before approving a prototype for analysts.
- Sales enablement proposal writer — solutions architect validates citation quality before hardening CRM integration.
- Healthcare summarization tool — clinical operations owner checks reviewer burden before claiming end-to-end automation.
Pause and recall¶
- Why do research-heavy AI sprints fail when planned like normal feature sprints?
- What five fields should every research question card include?
- Why is an exploration spike not the same as a prototype?
- What makes exit criteria useful instead of decorative?
Interview Q&A¶
Q: Why should research-heavy AI work be planned around questions instead of fixed feature tasks?
A: Because the main uncertainty is whether the approach can work well enough. Question-based planning makes learning, evidence, and decisions visible.
Common wrong answer to avoid: "Because AI engineers are bad at estimation" — the deeper issue is unresolved uncertainty, not weak discipline alone.
Q: What is the purpose of exit criteria in a research sprint?
A: Exit criteria define what evidence is enough to continue, pivot, or stop. They prevent endless exploration without decisions.
Common wrong answer to avoid: "They help people work faster" — speed is secondary to decision quality and scope control.
Q: Why separate exploration, prototype, hardening, and launch?
A: Each stage answers a different question. Mixing them causes premature polish, hidden risk, and bad promises.
Common wrong answer to avoid: "Because that is the standard product lifecycle" — the point is stage-specific uncertainty reduction, not ceremony.
Q: Why should the ship's log be part of sprint planning for AI work?
A: Research findings decay quickly when left in chat threads. Written decisions preserve assumptions, evidence, and rejected paths.
Common wrong answer to avoid: "Only for compliance" — compliance may matter, but team memory and repeatability matter first.
Apply now (5 min)¶
Exercise: Take one AI feature idea from your current work. Write three research questions for the next sprint. Add an owner, time box, test method, and exit criteria for each.
Sketch from memory: Draw the flow from exploration spike to prototype, hardening, and launch. Then label where the weather check, the course, and the ship's log appear.
Bridge. Good planning still fails if the message lands badly. The next skill is translating uncertainty without distortion. → 10-stakeholder-communication.md