Structured Outputs from LLMs: JSON Mode and Beyond

2026-04-15·19 min read

Getting an LLM to write a haiku is easy. Getting it to reliably return a JSON object with exactly the right keys, the right types, and no trailing commentary — that is the part that quietly breaks production systems. I have spent the last year wiring structured output into pipelines for data extraction, agentic workflows, and API backends, and the gap between "it works in the playground" and "it works at 2 AM in production" is entirely about how carefully you enforce structure.

This guide covers every practical approach: JSON mode, JSON Schema enforcement, tool use as a schema hack, Pydantic integration, validation strategies, and the error handling that actually saves you. No generic AI filler — just the code and the trade-offs.

Why Structured Output Matters

Free-form LLM output is fine for chat. It is a liability for everything else. When you need to parse a model's response, store it in a database, pass it to another function, or render it in a UI, you need a contract. Without one, you spend half your engineering time writing defensive parsing code — and still get surprised by the model deciding to preface its JSON with "Sure! Here is the JSON you requested:".

The cost of unstructured output compounds fast:

Parsing failures break pipelines silently or loudly depending on how fragile your code is.
Schema drift means a model that worked fine last month suddenly adds a new key or changes a field name, and your downstream breaks.
Retry loops eat latency and money when you have to reprompt because the model formatted things wrong.
Debugging is brutal because the bug is not in your code — it is in a language model's probabilistic output.

Structured output approaches exist to close that gap. The goal is to make the model's output as predictable as a typed function return.

JSON Mode: The Fastest On-Ramp

JSON mode is the simplest form of structure enforcement. You tell the model "output valid JSON" and it constrains its token sampling to ensure the output parses. No hallucinated trailing text, no markdown fences (usually), no half-finished objects.

OpenAI JSON Mode

OpenAI introduced response_format: { type: "json_object" } in the Chat Completions API. Here is a minimal working example:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "You extract structured data. Always respond with valid JSON."
        },
        {
            "role": "user",
            "content": "Extract: Name, email, and company from this: 'Hi, I'm Sarah Chen at Acme Corp. Reach me at sarah@acme.io'"
        }
    ]
)

import json
data = json.loads(response.choices[0].message.content)
print(data)
# {"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp"}

Critical caveat: JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON contains the keys you expect, uses the right types, or matches any particular shape. You still need validation on the output.

Claude Prefilling Trick

Anthropic's Claude API supports a prefill pattern where you inject the start of the assistant's response. By prefilling with {, you force the model to continue from an open brace — which strongly nudges it toward valid JSON output. This works even without an explicit JSON mode flag.

import anthropic
import json

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You extract structured contact data. Return only JSON, no explanation.",
    messages=[
        {
            "role": "user",
            "content": "Extract contact info: 'Ping me — James, james.t@startup.ai, works at NeuralStack'"
        },
        {
            "role": "assistant",
            "content": "{"  # prefill forces JSON continuation
        }
    ]
)

# The response continues from the "{" we prefilled
raw = "{" + response.content[0].text
data = json.loads(raw)
print(data)
# {"name": "James", "email": "james.t@startup.ai", "company": "NeuralStack"}

The prefill trick is surprisingly robust and works across Claude model versions. The trade-off is that it is a hack — you are exploiting the model's autoregressive nature rather than using a native API feature.

JSON Schema Enforcement

JSON mode gets you syntactically valid JSON. JSON Schema enforcement gets you semantically valid JSON — the right keys, the right types, optional vs. required fields, nested objects with their own constraints.

OpenAI's Structured Outputs feature (distinct from JSON mode) lets you pass a full JSON Schema and the API will guarantee the output conforms to it, using constrained decoding:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ContactInfo(BaseModel):
    name: str
    email: str
    company: str
    role: str | None = None

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "Extract contact information from the text."
        },
        {
            "role": "user",
            "content": "Meet Dr. Priya Nair, CTO at QuantumLeap (priya@ql.dev)"
        }
    ],
    response_format=ContactInfo,
)

contact = response.choices[0].message.parsed
print(contact.name)    # Dr. Priya Nair
print(contact.email)   # priya@ql.dev
print(contact.role)    # CTO

The .parse() method (OpenAI SDK v1.40+) accepts a Pydantic model directly and returns a typed object — no manual JSON parsing needed. The schema is derived automatically from the Pydantic model. Under the hood, OpenAI uses constrained beam search to guarantee schema conformance at the token level.

Tool Use for Structured Output

Before native schema enforcement existed, the standard trick was to define a fake "tool" whose parameters match your desired output schema, then force the model to call it. This still works and is often the best option for Claude, Gemini, and any model that has robust function calling but not native structured outputs.

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "save_contact",
        "description": "Save extracted contact information",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name"},
                "email": {"type": "string", "description": "Email address"},
                "company": {"type": "string", "description": "Company name"},
                "role": {"type": "string", "description": "Job title or role"}
            },
            "required": ["name", "email", "company"]
        }
    }
]

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "save_contact"},  # force the tool call
    messages=[
        {
            "role": "user",
            "content": "Extract: 'This is Lena Müller, VP of Eng at DataBridge. lena.m@databridge.com'"
        }
    ]
)

# Tool use response has structured input
tool_use_block = next(b for b in response.content if b.type == "tool_use")
contact = tool_use_block.input
print(contact)
# {"name": "Lena Müller", "email": "lena.m@databridge.com", "company": "DataBridge", "role": "VP of Eng"}

tool_choice: {"type": "tool", "name": "save_contact"} is the important part — it forces the model to call exactly that tool rather than choosing whether to use a tool at all. The model cannot escape into free-form text.

This pattern works on any model that supports function/tool calling. It is my default for Claude-based pipelines.

Comparison: Which Approach to Use When

Approach	Schema Enforced	Type Safety	Model Support	Best For
Prompt only	No	No	All	Prototyping
JSON mode	Syntax only	No	OpenAI, some others	Simple extraction
Claude prefill	Syntax nudge	No	Claude	Quick Claude hacks
Tool use	Yes (JSON Schema)	Partial	OpenAI, Claude, Gemini	Cross-model pipelines
OpenAI Structured Outputs	Yes (full schema)	Yes (with Pydantic)	OpenAI only	Production OpenAI apps

Validation and Error Handling

Even with schema enforcement, you need a validation layer. Constrained decoding handles syntax and shape, but not semantic validity — an email field containing "not-an-email" will still pass JSON Schema unless you add a format constraint or validate after the fact.

Here is the validation pattern I use in production:

from pydantic import BaseModel, EmailStr, field_validator, ValidationError
from openai import OpenAI
import json
import time

client = OpenAI()

class ContactInfo(BaseModel):
    name: str
    email: EmailStr  # Pydantic validates email format
    company: str
    role: str | None = None

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("name cannot be empty")
        return v.strip()

def extract_contact(text: str, max_retries: int = 3) -> ContactInfo:
    messages = [
        {"role": "system", "content": "Extract contact info as JSON. Fields: name, email, company, role (optional)."},
        {"role": "user", "content": text}
    ]

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                response_format={"type": "json_object"},
                messages=messages
            )
            raw = json.loads(response.choices[0].message.content)
            return ContactInfo(**raw)

        except json.JSONDecodeError as e:
            if attempt == max_retries - 1:
                raise RuntimeError(f"JSON parse failed after {max_retries} attempts: {e}")
            messages.append({"role": "assistant", "content": response.choices[0].message.content})
            messages.append({"role": "user", "content": f"That was not valid JSON. Error: {e}. Try again."})

        except ValidationError as e:
            if attempt == max_retries - 1:
                raise RuntimeError(f"Validation failed after {max_retries} attempts: {e}")
            messages.append({"role": "assistant", "content": response.choices[0].message.content})
            messages.append({"role": "user", "content": f"Schema validation failed: {e}. Fix and retry."})

        time.sleep(0.5 * (attempt + 1))  # simple backoff

    raise RuntimeError("Unreachable")

# Usage
contact = extract_contact("Talk to Ben Okafor (ben.o@fintech.io), Head of Product at PayLayer")
print(contact.model_dump())

The retry loop appends the failed output and the error back into the conversation so the model understands what went wrong. This self-correction pattern reduces retry count significantly compared to blind retries from scratch.

Pydantic Integration

Pydantic is the natural pairing for structured LLM output in Python. It handles schema generation, type coercion, custom validators, and serialization — all the things you would otherwise write by hand.

For complex schemas with nested models and conditional fields:

from pydantic import BaseModel, Field
from typing import Literal
from openai import OpenAI

class Address(BaseModel):
    street: str
    city: str
    country: str = "US"

class Lead(BaseModel):
    name: str
    email: str
    company: str
    lead_score: int = Field(ge=0, le=100, description="Estimated fit score 0-100")
    source: Literal["inbound", "outbound", "referral"]
    address: Address | None = None
    tags: list[str] = Field(default_factory=list)

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "Extract lead information. Estimate lead_score based on company signals."
        },
        {
            "role": "user",
            "content": """
            Inbound lead from our website:
            Sofia Esposito, Director of AI at NebulaFinance (sofia.e@nebulafinance.com)
            Based in Milan, Italy. Mentioned they're evaluating 3 vendors.
            Tags: enterprise, fintech, EU
            """
        }
    ],
    response_format=Lead,
)

lead = response.choices[0].message.parsed
print(lead.model_dump_json(indent=2))

The Field(ge=0, le=100) constraint is reflected in the JSON Schema passed to the API — the model cannot return a value outside that range. Nested models like Address are handled recursively.

For generating the JSON Schema manually (useful when you need to pass it to non-OpenAI APIs):

schema = Lead.model_json_schema()
print(schema)  # Full JSON Schema dict — pass this to tool definitions

Validation and Error Handling Workflow

Choosing an Approach

The decision tree I use:

You're using OpenAI and need maximum reliability — use OpenAI Structured Outputs with Pydantic. The constrained decoding guarantee plus typed objects is the highest-confidence path. Works well for schemas with up to ~20 fields.

You're using Claude — use tool use with tool_choice forced to your extraction tool. Define your schema as the tool's input_schema. This is Claude's native structured output mechanism and it is reliable.

You need to support multiple models — tool use is the most portable approach since function calling is standardized across OpenAI, Claude, Gemini, and most open-weight models via APIs like Ollama and vLLM.

You're prototyping fast — JSON mode plus a Pydantic model_validate call is the fastest setup. Accept that you will get occasional schema mismatches in exchange for speed.

You're using open-weight models — look at llama.cpp's --json-schema flag, vLLM's guided decoding (via guided_json), or Outlines for grammar-based constrained generation. These tools bring OpenAI-style schema enforcement to self-hosted models.

# vLLM guided decoding example
from openai import OpenAI  # vLLM exposes OpenAI-compatible API
from pydantic import BaseModel

class Sentiment(BaseModel):
    label: Literal["positive", "negative", "neutral"]
    confidence: float
    reasoning: str

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = Sentiment.model_json_schema()
response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Analyze: 'This API keeps breaking our pipeline'"}],
    extra_body={"guided_json": schema}  # vLLM-specific
)

Common Pitfalls

Forgetting to mention the schema in the system prompt. Even with API-level enforcement, models perform better when the system prompt explicitly describes the expected output. "Return a JSON object with fields: name (string), email (string), company (string)" beats relying on the schema alone.

Overly complex schemas. Schemas with more than 30 fields, deep nesting, or many oneOf/anyOf branches hit model limitations. Constrained decoding gets slower and more prone to errors. Flatten schemas where possible and split complex extractions into multiple calls.

Not handling null vs. missing keys. JSON Schema required tells the model which keys must appear. But a model might return "role": null instead of omitting the key entirely. Your Pydantic model should use str | None = None not just str for optional fields, and your downstream code should handle both cases.

Using JSON mode for streaming. JSON mode responses are only valid at completion — a streaming partial JSON response is not parseable. Either disable streaming or use a streaming-aware JSON parser like ijson.

Assuming schema enforcement is free. Constrained decoding adds latency, especially for complex schemas. On OpenAI's API, Structured Outputs calls can be 10–30% slower than plain JSON mode calls. Budget for this in latency-sensitive paths.

Not logging raw responses. When a validation error hits production, you want the raw model output to debug it. Always log response.choices[0].message.content before parsing, not just the parsed result.

Verdict

Structured output LLM is not a single feature — it is a spectrum, and where you land on it should match your production requirements.

For most teams building in 2026: start with OpenAI Structured Outputs and Pydantic if you're on OpenAI, or tool use with tool_choice forced if you're on Claude. Add a retry loop with error feedback. Log raw responses. Add Pydantic validators for semantic constraints. That stack handles the overwhelming majority of real extraction, classification, and agentic workflows reliably.

The prefill trick and plain JSON mode are fine for exploration, but they are not something I would wire into a production pipeline that runs thousands of times a day. The extra reliability from real schema enforcement is worth the marginal setup cost.

Open-weight models are closing the gap. vLLM's guided decoding and Outlines are genuinely production-ready for many schemas. If you're self-hosting, test them — the latency overhead is lower than you might expect.

FAQ

Does JSON mode work with streaming?

Not directly. Streaming returns partial JSON tokens that are not individually parseable. For streaming use cases you have two options: buffer the full stream before parsing (giving up the latency benefit of streaming), or use a streaming JSON parser that can handle partial documents. The ijson Python library handles the latter. For most extraction use cases, non-streaming is fine — the latency is dominated by the model computation, not token delivery.

What happens if the model's schema enforcement fails anyway?

API-level schema enforcement (OpenAI Structured Outputs, vLLM guided decoding) has a very low failure rate but is not zero — edge cases in very complex schemas can still produce invalid output. Always wrap your parsing in a try/except and have a fallback. The fallback can be a retry with a simpler schema, a human review queue, or a graceful degradation that logs the failure and skips the record.

Can I use structured output with vision models?

Yes. OpenAI's gpt-4o with Structured Outputs supports image inputs alongside schema enforcement. Pass the image in the content array as a image_url block, then specify response_format as usual. This works well for extracting data from receipts, forms, screenshots, and diagrams. Claude's vision models support tool use, so the tool-as-schema pattern works with image inputs on Anthropic's API too.

How do I handle schemas that vary based on input?

Use a discriminated union pattern. Define a base schema with a type field, then define per-type schemas. With Pydantic, Annotated[Union[TypeA, TypeB], Field(discriminator="type")] generates the right JSON Schema. With OpenAI Structured Outputs, pass the discriminated union schema. The model reads the schema, infers which variant applies from the input, and populates the right fields. Avoid asking the model to infer the type internally — explicit discriminator fields produce far more reliable results.

Is tool use for structured output "cheating"?

No — it is the intended design. Anthropic's documentation explicitly recommends using tool use for structured output extraction, even when you have no actual tool to call. The model treats tool arguments as a strongly-typed output channel, which is exactly what you want. The distinction between "real" tool calls and "fake" schema-extraction tool calls is invisible to the model and irrelevant to reliability.