Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Structured outputs in vLLM: Guiding AI responses

Enforce predictability without sacrificing performance

June 3, 2025
Michael Goin Russell Bryant Addie Stevens
Related topics:
Artificial intelligenceOpen source
Related products:
Red Hat AI

    As large language models are increasingly embedded into applications, the ability to control and structure their output is no longer a luxury, it’s a necessity. Whether you're parsing LLM responses in production pipelines, enforcing specific output schemas for downstream tooling, or just ensuring predictable formatting, vLLM's updated structured output feature delivers a robust solution for constraining model responses.

    In this post, we’ll walk through what structured outputs in vLLM enable, how they work under the hood, and what kind of performance you can expect in practice. This feature, available as of vLLM 0.8.5, supports a wide range of output constraints, from simple choice lists to full JSON schemas, with minimal overhead and surprising flexibility.

    Why structured outputs matter

    Structured output support gives you the ability to constrain the output of a language model to a specific format. Instead of generating free-form text, the model is guided (and limited) to return only valid outputs according to user-defined rules.

    This is crucial for applications where models are used as part of a pipeline or system. For instance, you might expect a model to output a color, a date, a JSON object, or even a tool call that conforms to a particular structure. Without constraints, LLMs may “hallucinate” or provide overly verbose or ambiguous results that require expensive post-processing or error handling.

    With structured outputs, vLLM effectively becomes the “format police,” enforcing output conformity at generation time rather than as an afterthought.

    Use cases and examples

    Below are several practical demonstrations of how these constraints can be implemented and what results to expect.

    Choice constraints

    The simplest use case is classification. Suppose you want your model to output one of: "red", "blue", or "green". Without constraints, you might get:

    “While I don't see color, I think green is a lovely option.”

    That’s not helpful if your code expects just the word "green." With structured outputs, you pass an explicit list of allowed values, and vLLM guarantees the result is one of them.

    extra_body = {
        "guided_choice": ["red", "blue", "green"]
    }

    JSON schema enforcement

    For more complex structures, you can define a JSON schema. It's a powerful way to enforce fields, types, and even nested properties.

    Without this, a model might return nearly-correct JSON that fails to parse (e.g., with embedded comments or trailing commas). With schema-based enforcement, vLLM guarantees syntactically and semantically valid JSON.

    {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "is_student": {"type": "boolean"}
      },
      "required": ["name", "age", "is_student"]
    }

    Regex and grammar support

    For use cases requiring more customized formatting, such as dates or identifiers, vLLM supports regular expressions and grammars. For example:

    extra_body = {
        "guided_regex": "\\d{4}-\\d{2}-\\d{2}"
    }

    Or you can define grammars for use cases like generating SQL queries or specific command patterns, depending on the back end you're using (more on this later).

    Structural tags for partial constraints

    Structural tags allow you to enforce schema constraints on just part of the output. For instance, the model can generate free-form natural language, then switch into a structured tool call, and then back to free-form.

    This is particularly powerful for applications involving tool use or interleaved output formats, and it’s a major step toward more advanced interaction patterns in LLM-based systems.

    Under the hood: How it works

    Let's take a look at how vLLM enforces structured outputs during the generation process.

    The mental model

    At generation time, a language model produces probabilities for possible next tokens. Structured output constrains this by masking invalid tokens, ensuring only tokens that comply with the defined constraints remain candidates for sampling.

    This happens dynamically, on a per-token basis. The constraints evolve as output is generated. For example, in a JSON schema, what’s valid after { changes as each field is emitted. A state tracker within vLLM keeps tabs on context and valid token ranges, updating masks accordingly.

    Code integration and back ends

    vLLM integrates structured output support deeply across its inference pipeline:

    • Structured Output Module: Lives under vllm/v1/structured_output, coordinating constraint handling.
    • Back ends:
      • XGrammar (https://github.com/mlc-ai/xgrammar): Optimized for cases where caching structured formats upfront is beneficial.
      • Guidance (https://github.com/guidance-ai/llguidance): Calculates constraints on a per-token basis with fast time-to-first-token.
    • Scheduler: Tracks state and generates bitmasks based on valid tokens.
    • Model Runner: Applies constraints in back end-specific GPU/TPU code.

    There’s also an in-progress back end using Outlines Core, which will offer additional capabilities in the future.

    Performance benchmarks

    Structured output support in vLLM V1 is dramatically faster than in V0. In V0, even a single constrained request could degrade system-wide performance. In contrast, V1 introduces minimal overhead, thanks to back-end optimizations and smarter architecture. See Figure 1. 

    Structured output initialization
    Figure 1: Structured output initialization is non-blocking in vLLM V1, unlike V0 where it stalled the entire engine.

    Test 1: Cached JSON schemas

    • Dataset: Reused a small set of JSON schemas (< 100).
    • Result: Time-per-output-token was only marginally higher for structured output vs. unconstrained.
    • XGrammar slightly outperformed Guidance due to effective caching.

    Test 2: Unique JSON schemas

    • Dataset: Each request used a completely unique schema to disable caching.
    • Result: Guidance had faster time-to-first-token; XGrammar benefited from multithreading tweaks, though over-threading could degrade performance.

    Summary of back-end trade-offs

    Back endStrengthsBest use cases
    XGrammarCaches well, excels at long generationsRepeated schemas, long outputs
    GuidanceLower latency per request, better in unpredictable setupsMulti-tenant, dynamic schemas

    By default, vLLM uses auto mode to choose the best guided decoding back end based on the request. This behavior evolves over time as performance optimizations are added. The xgrammar back end offers low time per output token, making it ideal for longer generations. It performs best when grammars are reused, thanks to effective caching. The guidance backend excels at fast time to first token, even with complex grammars. While its output token speed is slightly slower, it’s well suited for dynamic or multi-tenant workloads.

    Most users can rely on the default auto setting, which intelligently picks the optimal back end.

    What’s next: Jump decoding and beyond

    One exciting optimization in development is jump decoding. When the model is constrained to a known sequence (e.g., structural JSON), vLLM can skip ahead by avoiding unnecessary token sampling and GPU computation. 

    For example, if output must be:

    { "name": "Alice" }

    Once { is chosen, the next token must be ", then name, and so on. No need to sample each step.

    This can significantly accelerate generation and reduce GPU load, especially when output formats are strict and predictable.

    Other upcoming enhancements include:

    • Deeper integration into tool calling workflows.
    • Expanded grammar and back-end support.
    • Ongoing optimizations to improve performance across edge cases.

    Getting started

    To use structured outputs in vLLM, add a single field to your API request:

    • OpenAI-compatible server: Add guided_choice, guided_regex, guided_json, or guided_grammar to the body of your payload.
    • Python API: Include constraints under SamplingParams.guided_decoding.

    Documentation and examples are available in vLLM's structured output docs, covering choice lists, JSON schemas, regex, grammars, and hybrid formats.

    Last updated: June 4, 2025

    Related Posts

    • Llama 4 herd is here with Day 0 inference support in vLLM

    • How we optimized vLLM for DeepSeek-R1

    • How RamaLama runs AI models in isolation by default

    • Introducing Podman AI Lab: Developer tooling for working with LLMs

    • vLLM V1: Accelerating multimodal inference for large language models

    • Deployment-ready reasoning with quantized DeepSeek-R1 models

    Recent Posts

    • Debugging image mode with Red Hat OpenShift 4.20: A practical guide

    • EvalHub: Because "looks good to me" isn't a benchmark

    • SQL Server HA on RHEL: Meet Pacemaker HA Agent v2 (tech preview)

    • Deploy with confidence: Continuous integration and continuous delivery for agentic AI

    • Every layer counts: Defense in depth for AI agents with Red Hat AI

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.