13. Honest admission — prompt engineering still has sharp unknowns

~12 min read. Good prompt engineers sound confident about process and honest about limits.

Built on the ELI5 in 00-eli5.md. The Revision ledger — the record of what worked before — helps, but it does not give us a full theory of why prompts behave the way they do.


Prompt behavior is still brittle

Look. Small wording changes can cause large output shifts. Sometimes that is useful. Sometimes it is maddening. One extra sentence can improve refusal behavior. Another tiny sentence can break format compliance. The system feels high leverage, but also brittle. That is an honest description.

Picture first.

prompt v12                         prompt v13
┌──────────────────────┐          ┌──────────────────────┐
│ same model           │          │ same model           │
│ same task            │          │ one tiny wording diff│
└──────────┬───────────┘          └──────────┬───────────┘
           ▼                                 ▼
      stable answer?                    maybe yes, maybe no

Simple, no? The Standing rulebook and Work order give control, but not perfect control. We still do not have a full predictive science for prompt phrasing. We have strong heuristics. We have experiments. We have practice. But we do not have tidy laws that say, "This wording will always beat that wording across models and tasks."

Now what is the problem? Practitioners sometimes oversell certainty. They speak as if one prompt recipe is universal. It is not. Prompt behavior depends on model family, training mix, context length, sampling settings such as temperature, and surrounding system design. That is why humility is part of the craft.
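One way to make "brittle" measurable is to run the same inputs through two prompt versions and count how often the output changes. The sketch below is only an illustration: `fake_v12` and `fake_v13` are hypothetical stand-ins for real model calls, and the sample inputs are invented.

```python
from typing import Callable, Sequence


def drift_rate(
    inputs: Sequence[str],
    run_old: Callable[[str], str],
    run_new: Callable[[str], str],
) -> float:
    """Fraction of inputs whose output differs between two prompt versions."""
    if not inputs:
        raise ValueError("need at least one input")
    changed = sum(
        1 for text in inputs if run_old(text).strip() != run_new(text).strip()
    )
    return changed / len(inputs)


# Hypothetical stand-ins for "same model, prompt v12" and "same model, prompt v13".
def fake_v12(text: str) -> str:
    return "billing" if "charge" in text else "other"


def fake_v13(text: str) -> str:
    label = "billing" if "charge" in text else "other"
    # Simulates a tiny wording diff that adds extra text on some inputs.
    return label + " (see details)" if "refund" in text else label


samples = [
    "double charge on renewal",
    "charge and refund request",
    "password reset",
]
rate = drift_rate(samples, fake_v12, fake_v13)
print(f"outputs changed on {rate:.0%} of inputs")
```

A drift rate says nothing about which version is better. It only shows how much behavior moved, which is the first thing to know before shipping a wording change.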

We still lack a full theory for why some prompts work

See. We know many patterns that often help. Layered system prompts. Few-shot examples. Clear schemas. Ordered reasoning checks. Visible delimiters. But the exact mechanism is often partly empirical. Why does one negative example help a lot on model A, and barely help on model B? Why do XML tags help more in one family than another? Why does a prompt variant that wins offline sometimes lose online? These are live questions.

known practice                        unknown depth
┌──────────────────────┐             ┌────────────────────────┐
│ examples help often  │             │ why this exact phrasing│
│ schemas help often   │             │ transfers poorly?      │
│ structure helps often│             │ interaction with model?│
└──────────────────────┘             └────────────────────────┘

The Revision ledger can tell us what happened. It cannot always tell us why at a mechanistic level. That is okay. Engineering often starts with empirical control before deep theory catches up. Still, we should not pretend the theory is complete.

One visible gap is cross-model transfer. A prompt that works beautifully on Claude may need rework on GPT or Gemini. A prompt tuned for a smaller model may fail on a larger one because the two models interpret the same examples differently. Prompt engineering is still partly model-specific operations work. That is inconvenient, but true.
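Cross-model transfer can be checked with a small table of pass rates per model and prompt version. This is a minimal sketch under stated assumptions: the model names, version names, and pass rates are made-up illustrative numbers, not measurements.

```python
from typing import Dict, List, Tuple

# Made-up illustrative pass rates, keyed by (model, prompt_version).
pass_rates: Dict[Tuple[str, str], float] = {
    ("model_a", "v12"): 0.91,
    ("model_a", "v13"): 0.96,
    ("model_b", "v12"): 0.90,
    ("model_b", "v13"): 0.84,
}


def transfer_report(
    rates: Dict[Tuple[str, str], float], old: str, new: str
) -> Dict[str, float]:
    """Map each model to its change in pass rate when moving from old to new."""
    models = sorted({model for model, _ in rates})
    return {m: round(rates[(m, new)] - rates[(m, old)], 4) for m in models}


def regressing_models(report: Dict[str, float]) -> List[str]:
    """Models where the new prompt version scored lower than the old one."""
    return [model for model, delta in report.items() if delta < 0]


report = transfer_report(pass_rates, "v12", "v13")
print(report)
print("needs rework on:", regressing_models(report))
```

In this invented data the same revision wins on one model and loses on the other, so the prompt change would need a per-model decision rather than a global rollout.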

Evals still miss important failures

Now what is another honest problem? Our eval sets are never the full world. A prompt can pass curated tests, then fail on a weird live input. A security boundary can look good, then collapse under a novel injection pattern. A chain can behave well in English, then break in mixed-language traffic. This is why prompt quality is never final.

Picture first.

offline eval set
 └── looks strong
live traffic surprises
 ├── edge language
 ├── weird formatting
 ├── hostile input
 └── unseen domain mix

Simple, no? We evaluate what we can see. But reality is wider. That is why canaries, monitoring, and rollback remain essential. The Reply form may pass all your golden tests, yet still fail when a customer pastes a spreadsheet dump. The Standing rulebook may look secure, yet still wobble against a new attack string. This is normal. Not comforting. But normal.
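The canary and rollback idea can be sketched as a simple decision rule: compare the failure rate of the new prompt on a small slice of live traffic against the current baseline. Everything here is an assumption for illustration, including the `Window` type, the `max_increase` tolerance, and the `min_requests` threshold. A real system would likely add statistical tests and more signals.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Counts observed for one prompt version over a monitoring window."""
    requests: int
    failures: int  # e.g. parse errors or failed format checks

    @property
    def failure_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Window,
    canary: Window,
    max_increase: float = 0.02,  # hypothetical tolerance
    min_requests: int = 200,     # hypothetical minimum sample size
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary prompt version."""
    if canary.requests < min_requests:
        return "wait"  # not enough live traffic to judge yet
    if canary.failure_rate > baseline.failure_rate + max_increase:
        return "rollback"
    return "promote"


baseline = Window(requests=10_000, failures=100)
print(canary_decision(baseline, Window(requests=50, failures=0)))
print(canary_decision(baseline, Window(requests=500, failures=30)))
print(canary_decision(baseline, Window(requests=500, failures=5)))
```

The point of the rule is not precision. It is that the rollback path exists and is triggered by live evidence, not by confidence in the offline eval set.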

What to say honestly in interviews and design reviews

A strong senior answer sounds like this. "Prompts are high-leverage interfaces, but they are brittle and empirical. I trust measured patterns, not folklore. I version prompts, evaluate them, and keep rollback ready because I do not assume one clever prompt solves the problem permanently." That is a grown-up answer.

Here is a tiny worked example. Suppose version A says, "Return exactly one label." Version B says, "Return one concise label." That one adjective, "concise," may trigger explanatory text on some models. Offline tests may miss it. Live parsers may break. Why exactly did that happen? We can guess. We can inspect behavior. But we may not get a neat universal theorem. So what to do? Measure, then lock the safer version.

Possible outputs.

Prompt A output:
billing

Prompt B output:
billing — likely duplicate charge after renewal

See. Tiny wording. Meaningful behavior shift. That is why prompt engineering demands humility plus discipline. Not cynicism. Not hype. Both are lazy.
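Here is how a strict downstream parser might see the two outputs above. This is a sketch, not a prescribed implementation: the `ALLOWED_LABELS` set and the `parse_label` helper are hypothetical, and only the label `billing` comes from the example.

```python
from typing import Optional

# Hypothetical label set; only "billing" appears in the worked example.
ALLOWED_LABELS = {"billing", "shipping", "account", "other"}


def parse_label(raw: str) -> Optional[str]:
    """Return the label if the output is exactly one allowed label, else None."""
    candidate = raw.strip().lower()
    return candidate if candidate in ALLOWED_LABELS else None


output_a = "billing"
# Prompt B output from the example; \u2014 is the dash character it contains.
output_b = "billing \u2014 likely duplicate charge after renewal"

for name, output in [("Prompt A", output_a), ("Prompt B", output_b)]:
    label = parse_label(output)
    status = f"ok: {label}" if label is not None else "rejected by parser"
    print(f"{name}: {status}")
```

A strict parser like this turns a silent behavior shift into a countable failure, which is what makes "measure, then lock the safer version" possible.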


Where this lives in the wild

  • Anthropic and OpenAI application teams — prompt patterns that work on one model release may need retuning on the next, so prompt owners keep regression suites and rollout caution.
  • GitHub Copilot product engineers — code-assistance prompts can behave differently across languages, repositories, and context sizes, which makes universal prompt recipes unrealistic.
  • Perplexity and RAG builders — citation prompts that look strong offline can still miss odd retrieval failures or web-page injection patterns in live traffic.
  • Intercom Fin and customer-support AI teams — one prompt can improve containment while quietly harming edge-case escalation quality, so metrics and canaries stay necessary.
  • Enterprise AI platform owners — compliance prompts may pass internal review but still face new red-team attacks, mixed-language inputs, or vendor-model shifts.

Pause and recall

  • Why is prompt engineering still described as brittle and empirical?
  • What kinds of unknowns remain even after good versioning and evals?
  • Why do prompt wins often fail to transfer cleanly across models?
  • What is the honest senior-level posture toward prompt reliability?

Interview Q&A

Q: Why is prompt engineering still considered empirical rather than fully theory-driven? A: Because many prompt effects are observed through experiments and heuristics without a complete predictive theory that transfers across models, settings, and tasks.

Common wrong answer to avoid: "Because researchers are lazy." The challenge is structural complexity, not laziness.

Q: Why should senior teams keep rollback ready even after strong offline evals? A: Offline evals cover only sampled scenarios. Live traffic exposes new edge cases, distribution shifts, and adversarial inputs that curated tests may miss.

Common wrong answer to avoid: "Because evals are useless." Evals are useful. They are simply incomplete.

Q: Why might a prompt that wins on one model family fail on another? A: Different models respond differently to wording, delimiters, example style, and context ordering because their training and internal behaviors are not identical.

Common wrong answer to avoid: "Because one model is smarter." Capability differences matter, but prompt transfer is also about behavioral differences, not just raw intelligence.

Q: Why is humility a practical engineering advantage here, not just a personality trait? A: Humility encourages versioning, measurement, canaries, and rollback. Overconfidence leads teams to ship brittle prompt changes without safeguards.

Common wrong answer to avoid: "Humility means not making decisions." It means making evidence-based decisions with uncertainty acknowledged.


Apply now (5 min)

Exercise. Write three honest sentences you could say in a design review. One about brittleness. One about eval limits. One about rollback. Then check whether each sentence still sounds practical, not defeatist.

Sketch from memory. Draw a triangle. Put prompt on one corner, model on one corner, and traffic on one corner. In the middle write, "Empirical system." Underline Revision ledger as the stabilizer.


Bridge. You now have the full prompt-engineering stack: design, structure, testing, debugging, and honest limits. Next, carry that discipline into a real integrated build. → ../../33_capstone_project/00-eli5.md