06. Structured output: make the reply machine-usable

~13 min read. Free text is easy to admire and hard to automate. Structured output fixes that.

Builds on the ELI5 in 00-eli5.md. The Reply form, the exact sheet the contractor must fill in, is what turns a prompt into a reliable interface.

Why free-form output breaks systems

Look. Humans can tolerate variation. Parsers cannot. A customer-support analyst can read, "Probably eligible, contact billing." A downstream workflow may need eligibility_status, reason_code, and next_action. If the model returns prose instead, your integration becomes guesswork.

Picture first.

free text answer                     structured answer
┌──────────────────────┐            ┌────────────────────────┐
│ maybe eligible       │            │ {                      │
│ check with billing   │            │   status: not_eligible │
│ depends on usage     │            │   reason: usage_limit  │
└──────────┬───────────┘            │ }                      │
           ▼                        └──────────┬─────────────┘
   human can interpret                         ▼
                                        code can route safely

Simple, no? The Reply form is not about making answers look tidy. It is about making them dependable. It defines what keys exist, what values are allowed, and what missing information should look like. That is interface design.

Now what is the problem? Teams often ask, "Return JSON please." Then they act surprised when the model adds commentary, uses different field names, or returns invalid nesting. A casual request produces casual compliance. Production systems need stronger guidance.
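To see how quickly a casual request bites, here is a minimal sketch in Python. It assumes the consumer expects a bare JSON object. The names `parse_reply` and `try_parse` are hypothetical helpers, not part of any library.

```python
import json
from typing import Optional


def parse_reply(raw: str) -> dict:
    """Parse a reply that is supposed to be a bare JSON object."""
    # json.loads raises json.JSONDecodeError if commentary surrounds the object.
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object at the top level")
    return obj


def try_parse(raw: str) -> Optional[dict]:
    """Return the parsed object, or None when the reply breaks the contract."""
    try:
        return parse_reply(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


chatty = 'Here is the JSON:\n{"status": "not_eligible"}'
clean = '{"status": "not_eligible"}'

print(try_parse(chatty))  # None: the preamble makes the whole reply unparseable
print(try_parse(clean))   # {'status': 'not_eligible'}
```

One friendly sentence of preamble is enough to turn a correct answer into a parse failure.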

Three ways to force structure

The first method is pure prompt instruction. You say exactly what fields to return. You may show one example. This is better than nothing. But it still relies on text imitation alone.

The second method is schema-guided generation. You define a strict schema. Maybe JSON Schema. Maybe Pydantic. Maybe a typed tool call. The API can then constrain generation to the schema, or validate the output after the fact. This is much stronger. The prompt and the runtime work together.

The third method is delimiter-led structure. Instead of raw JSON, you may use XML tags, Markdown sections, or line-based key-value blocks. This is useful when exact JSON is awkward, or when the model responds well to explicit tags. Anthropic Claude workflows often benefit from XML-tagged blocks because the structure is visually separable.

Here is the comparison.

prompt only                      schema-guided                    tagged structure
┌───────────────────┐            ┌────────────────────┐           ┌──────────────────┐
│ return json       │            │ schema validates   │           │ <answer>...</... │
│ model imitates    │            │ runtime constrains │           │ easy to inspect  │
└─────────┬─────────┘            └──────────┬─────────┘           └─────────┬────────┘
          ▼                                 ▼                                ▼
     flexible, fragile                 stricter, safer                 readable compromise

So what to do? Use the strongest method your stack supports. If the output feeds code, prefer schema-guided generation. If humans read it but still need stable sections, use tagged structure. If you must rely on plain prompting, make the shape extremely explicit.
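When the API cannot constrain generation itself, one possible way to let the prompt and the runtime work together is to validate each reply and ask again on failure. The sketch below assumes a hypothetical `call_model` callable that takes a prompt string and returns the raw reply. The stub stands in for a real client, and the three canned replies are invented for illustration.

```python
import json
from typing import Callable, Optional

ALLOWED_ELIGIBILITY = ("eligible", "not_eligible", "unknown")


def parse_if_valid(raw: str) -> Optional[dict]:
    """Return the object only if it parses and uses an allowed value."""
    try:
        obj = json.loads(raw)
    except ValueError:
        return None
    if isinstance(obj, dict) and obj.get("eligibility") in ALLOWED_ELIGIBILITY:
        return obj
    return None


def generate_validated(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[dict]:
    """Call the model until a reply passes validation or attempts run out."""
    for _ in range(max_attempts):
        obj = parse_if_valid(call_model(prompt))
        if obj is not None:
            return obj
    return None  # the caller decides: fall back, escalate to a human, or log


# Hypothetical stub model: replies are canned, not real model output.
canned = iter([
    'Sure! {"eligibility": "not_eligible"}',  # commentary: rejected
    '{"eligibility": "maybe"}',               # value outside the enum: rejected
    '{"eligibility": "unknown"}',             # passes
])


def stub_model(prompt: str) -> str:
    return next(canned)


result = generate_validated(stub_model, "Classify the refund case.")
print(result)  # {'eligibility': 'unknown'}
```

Retrying is only one policy. A stricter pipeline might reject on the first failure and route the case to a human instead.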

What a good reply form looks like

A good Reply form has clear keys. It has allowed values. It has null handling. It has rules for extra text. Often the best rule is, "Return only the object. No preamble." That one line saves many parser bugs.

Say the task is refund triage. Weak output instruction: "Answer in JSON." Stronger output instruction: "Return a JSON object with keys eligibility, reason_code, requires_human_review, and customer_message. Use eligible, not_eligible, or unknown for eligibility. Return only the JSON object." See the difference? One asks politely. The other defines a contract.

The Standing rulebook should also align with the structure. If the system says, "Use only provided evidence," then the schema might need a missing_information field. If the system allows abstention, then the schema should have a place for abstention. Otherwise the model squeezes uncertainty into the wrong field.
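A minimal sketch of that alignment, assuming the contract adds a `missing_information` list next to `eligibility`. The consistency rule itself (an `unknown` verdict must name what is missing, and a decided verdict must not) is an illustrative assumption, and `check_abstention` is a hypothetical helper.

```python
from typing import List


def check_abstention(reply: dict) -> List[str]:
    """Check that uncertainty sits in its own field, not inside a verdict."""
    missing = reply.get("missing_information")
    if not isinstance(missing, list):
        return ["missing_information must be a list (use [] when nothing is missing)"]
    problems = []
    abstained = reply.get("eligibility") == "unknown"
    if abstained and not missing:
        problems.append("eligibility is unknown but missing_information is empty")
    if not abstained and missing:
        problems.append("a decided verdict should not list missing information")
    return problems


abstaining = {"eligibility": "unknown", "missing_information": ["usage since renewal"]}
squeezed = {"eligibility": "not_eligible", "missing_information": ["usage since renewal"]}

print(check_abstention(abstaining))  # []
print(check_abstention(squeezed))    # ['a decided verdict should not list missing information']
```

With a dedicated field, "I cannot tell" becomes a state the code can route on rather than a hedge buried in a message string.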

Worked example: prose answer to schema-safe answer

Suppose an internal operations tool needs a routing object. Input case: Customer is on an annual enterprise plan. Renewed 12 days ago. Usage since renewal: 7,100 API calls. Request: wants a refund.

Weak prompt first.

Read the case and return JSON about the refund decision.

Possible model response.

Here is the JSON:
{
  "status": "probably not eligible",
  "reason": "usage seems high",
  "next_step": "check with billing"
}

Now what is wrong? It added commentary. The key names were never specified, so they can drift between calls. The status value is not a controlled enum: "probably not eligible" is not a stable state. "Usage seems high" is vague. The output is human-friendly, not system-friendly.
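The same diagnosis can be automated. This sketch compares the weak reply above against the key names and enum of the stronger contract from the refund triage example. `diagnose` is a hypothetical helper, and locating the object by the first `{` and the last `}` is a simplification that only suits a demonstration.

```python
import json
from typing import List

EXPECTED_KEYS = {"eligibility", "reason_code", "requires_human_review", "customer_message"}
ALLOWED_ELIGIBILITY = ("eligible", "not_eligible", "unknown")

WEAK_REPLY = """Here is the JSON:
{
  "status": "probably not eligible",
  "reason": "usage seems high",
  "next_step": "check with billing"
}"""


def diagnose(raw: str) -> List[str]:
    """List every way a raw reply departs from the agreed contract."""
    problems = []
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return ["no JSON object found"]
    if raw[:start].strip() or raw[end + 1:].strip():
        problems.append("text outside the JSON object")
    try:
        obj = json.loads(raw[start:end + 1])
    except ValueError:
        return problems + ["object is not valid JSON"]
    for key in sorted(EXPECTED_KEYS - set(obj)):
        problems.append(f"missing key: {key}")
    for key in sorted(set(obj) - EXPECTED_KEYS):
        problems.append(f"unexpected key: {key}")
    if "eligibility" in obj and obj["eligibility"] not in ALLOWED_ELIGIBILITY:
        problems.append("eligibility is not an allowed value")
    return problems


problems = diagnose(WEAK_REPLY)
for problem in problems:
    print(problem)
```

Every key the weak reply chose is reported as unexpected, and every key the contract needs is reported as missing, which is the "guesswork" problem in list form.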

Now the stronger version.

[SYSTEM]
You are a billing policy classifier.
Use only the provided case facts.
Return only valid JSON.

[SCHEMA]
{
  "type": "object",
  "properties": {
    "eligibility": {"type": "string", "enum": ["eligible", "not_eligible", "unknown"]},
    "reason_code": {"type": "string", "enum": ["window_ok_usage_limit_exceeded", "window_expired", "insufficient_information", "other"]},
    "requires_human_review": {"type": "boolean"},
    "customer_message": {"type": "string"}
  },
  "required": ["eligibility", "reason_code", "requires_human_review", "customer_message"],
  "additionalProperties": false
}

[CASE]
Plan: enterprise annual
Days since renewal: 12
API calls since renewal: 7100
Question: customer requests refund

Possible model response.

{
  "eligibility": "not_eligible",
  "reason_code": "window_ok_usage_limit_exceeded",
  "requires_human_review": false,
  "customer_message": "The request is within the renewal window, but the account has exceeded the usage limit for refunds under the provided policy."
}

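Simple, no? The Reply form did real work. It narrowed the allowed states. It created parser safety. It made analytics easier too. Now you can count reason_code values across traffic. That is hard with loose prose. One caution: the prompt above shows no policy text. A real prompt must also supply the refund window and the usage limit, or "Use only the provided case facts" leaves the model nothing to decide from.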
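To make the parser safety concrete, here is a minimal sketch that checks a reply against the schema from the stronger prompt. The hand-rolled `validate` covers only the keywords this schema uses (`type`, `enum`, `required`, `additionalProperties`). It is an illustration, not a JSON Schema implementation, and a real service would more likely rely on an established validator library. The second item in `traffic` is invented to show the counting.

```python
import json
from collections import Counter
from typing import List

SCHEMA = {
    "type": "object",
    "properties": {
        "eligibility": {"type": "string", "enum": ["eligible", "not_eligible", "unknown"]},
        "reason_code": {"type": "string", "enum": [
            "window_ok_usage_limit_exceeded", "window_expired",
            "insufficient_information", "other"]},
        "requires_human_review": {"type": "boolean"},
        "customer_message": {"type": "string"},
    },
    "required": ["eligibility", "reason_code", "requires_human_review", "customer_message"],
    "additionalProperties": False,
}

PY_TYPES = {"string": str, "boolean": bool}  # only the types this schema uses


def validate(obj: object, schema: dict) -> List[str]:
    """Return a list of contract violations; an empty list means the reply passes."""
    if not isinstance(obj, dict):
        return ["top level must be an object"]
    errors = [f"missing required key: {k}" for k in schema["required"] if k not in obj]
    for key, value in obj.items():
        rule = schema["properties"].get(key)
        if rule is None:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected key: {key}")
            continue
        if not isinstance(value, PY_TYPES[rule["type"]]):
            errors.append(f"{key}: expected {rule['type']}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{key}: {value!r} is not an allowed value")
    return errors


reply = json.loads("""{
  "eligibility": "not_eligible",
  "reason_code": "window_ok_usage_limit_exceeded",
  "requires_human_review": false,
  "customer_message": "The request is within the renewal window, but the account has exceeded the usage limit for refunds under the provided policy."
}""")
print(validate(reply, SCHEMA))  # []

bad = {"eligibility": "probably not eligible", "reason": "usage seems high"}
for error in validate(bad, SCHEMA):
    print(error)

# Controlled enums make analytics a one-liner.
traffic = [reply, {**reply, "reason_code": "window_expired"}]  # second item is invented
counts = Counter(r["reason_code"] for r in traffic)
print(counts)
```

Because the schema is a plain data structure, the same object can be shown to the model in the prompt and enforced by the runtime, so the two cannot drift apart.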

One more production tip. Keep schemas boring. Tiny, stable, and explicit. A gigantic schema inside the prompt can become its own failure source. Use nested objects only when the workflow truly needs them. Otherwise, flat fields are easier to validate, version, and debug.

XML tags and hybrid structures

JSON is not the only useful target. Sometimes you want readable sections. For long-form analysis, XML or tagged blocks work well, for example <answer>, <citations>, and <uncertainty>. Claude models often respond cleanly to such boundaries.

That is still a Reply form. Just a different one. The question is not, "JSON or XML forever?" The question is, "What structure best serves the next consumer?" A parser? A human reviewer? A tool chain? Pick accordingly.
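Tagged replies can be consumed with very little code. The sketch below pulls the three example sections out of a reply with a regular expression. `extract_sections` is a hypothetical helper, the sample reply is invented, and the approach assumes tags are neither nested nor repeated.

```python
import re
from typing import Dict, Optional, Sequence

TAGS = ("answer", "citations", "uncertainty")


def extract_sections(text: str, tags: Sequence[str] = TAGS) -> Dict[str, Optional[str]]:
    """Map each tag name to its inner text, or None if the section is absent."""
    sections: Dict[str, Optional[str]] = {}
    for tag in tags:
        pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
        match = re.search(pattern, text, flags=re.DOTALL)  # DOTALL: sections may span lines
        sections[tag] = match.group(1).strip() if match else None
    return sections


# Invented sample reply for illustration.
reply = """<answer>
Not eligible: usage exceeds the refund limit.
</answer>
<citations>case facts: API calls since renewal</citations>"""

sections = extract_sections(reply)
print(sections["answer"])       # Not eligible: usage exceeds the refund limit.
print(sections["uncertainty"])  # None: a missing section is visible, not silently dropped
```

Returning None for an absent section keeps the "what does missing look like" rule explicit, just as a null-handling rule does for JSON.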

Where this lives in the wild

OpenAI structured outputs — application engineers use JSON schema or function-calling style contracts so the model returns machine-safe objects instead of conversational prose.
Anthropic Claude — prompt designers often use XML-tagged sections like <instructions>, <answer>, and <citations> because Claude tends to track tagged boundaries well in long prompts.
GitHub Copilot extensions and agent flows — tool-calling pipelines depend on stable argument fields, not poetic text, so prompts and runtime schemas work together.
LangChain and Pydantic-based apps — backend engineers wrap model outputs with typed validators to catch or repair invalid fields before business logic runs.
Stripe-style internal support automation — operations tools benefit from enums like eligible, not_eligible, and requires_review because routing logic must be deterministic.

Pause and recall

Why is "return JSON" weaker than a real schema contract?
What three methods can enforce structure?
Why should uncertainty have an explicit field instead of leaking into prose?
When might XML tags be a better reply form than raw JSON?

Interview Q&A

Q: Why prefer schema-guided output over prompt-only JSON instructions for production systems? A: Schema guidance reduces ambiguity, constrains allowed values, and makes validation easier. Prompt-only JSON still depends heavily on imitation quality.

Common wrong answer to avoid: "Because models cannot generate JSON without a schema." They can. The issue is reliability at scale and in edge cases.

Q: Why should enums and required fields be explicit in the output contract? A: Explicit enums shrink the state space and make downstream routing, analytics, and evaluation much more reliable.

Common wrong answer to avoid: "Because longer schemas impress the model." The goal is constraint, not intimidation.

Q: Why can free-form explanations be harmful when an output feeds code? A: Code needs stable fields and predictable values. Free-form text mixes signal, uncertainty, and commentary in ways parsers cannot safely interpret.

Common wrong answer to avoid: "You can always regex it later." Regex repair is brittle and expensive compared with proper structure.

Q: Why might a tagged XML format beat JSON in some prompts? A: Tagged sections can be easier for both models and human reviewers to follow in long analytical outputs, especially when the runtime does not need strict object parsing.

Common wrong answer to avoid: "XML is modern and JSON is old." This is about task fit, not fashion.

Apply now (5 min)

Exercise. Take one free-text AI output from your domain. Turn it into a four-field schema. Add one enum. Add one boolean. Add one string for user-facing text. Then write the rule, "Return only the object."

Sketch from memory. Draw the split between free text and Reply form. On one side write, "human guesses meaning." On the other side write, "code routes by keys." Mark the schema as the contract.

Bridge. A strong reply form fixes shape. But the same prompt can still sound rigid or wildly creative depending on sampling. So next we turn the creativity dial and see what it really changes. → 07-temperature-sampling.md