Improve vLLM Semantic Router accuracy with fine-tuning

Closing the accuracy gap

June 2, 2026
Christopher Nuland
Related topics:
AI inference, Artificial intelligence, Data science
Related products:
Red Hat OpenShift AI

    The vLLM Semantic Router solves a real problem. Not every request needs the same model. Some are simple and deterministic. Others require multi-step reasoning, tool use, or long context windows. If everything is sent to your largest model, you burn compute, increase latency, and lose efficiency across the entire system.

    So we introduce a routing layer. As detailed in our guide on getting started with the vLLM Semantic Router Athena release, the router classifies incoming requests and sends them to the appropriate model. Simple queries go to lightweight models that are fast and inexpensive. More complex requests are routed to reasoning-capable models that can handle deeper workloads. On paper, this works well. In small deployments or in a home environment, it is often a clear win.

    In enterprise deployments, however, a performance gap emerges. The router is typically built on a pretrained embedding model, something like all-MiniLM-L6-v2. It compares incoming prompts to anchor examples and selects the closest match based on similarity. That sounds reasonable until you actually measure it under realistic workloads.
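
    The nearest-anchor decision described above can be sketched in a few lines. This is a minimal illustration, not the router's actual implementation: the three-dimensional vectors are toy stand-ins for real embeddings (a model such as all-MiniLM-L6-v2 would produce them), and the names `cosine`, `route`, and `ANCHORS` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(query_vec, anchors):
    """Return (tier, similarity) for the anchor closest to the query embedding."""
    best_tier, best_sim = None, float("-inf")
    for tier, vec in anchors:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_tier, best_sim = tier, sim
    return best_tier, best_sim

# Toy anchor embeddings (assumption: real anchors would be encoded prompts).
ANCHORS = [
    ("SIMPLE", [1.0, 0.0, 0.0]),
    ("MEDIUM", [0.0, 1.0, 0.0]),
    ("COMPLEX", [0.0, 0.0, 1.0]),
    ("REASONING", [0.6, 0.0, 0.8]),
]

tier, sim = route([0.1, 0.9, 0.2], ANCHORS)
print(tier, round(sim, 3))
```

    Because the decision depends only on which anchor is closest, anything that moves a prompt's embedding toward the wrong anchor (length, descriptive wording, structure) changes the route, regardless of what the task actually requires.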

    In our testing, the pretrained model achieved roughly 80% accuracy (80.39%) on a four-tier classification task. That translates to a misrouting rate of about 20%: roughly one in five requests is sent to the wrong model. That is not a tuning issue. That is a system-level limitation.

    Why high availability does not guarantee routing correctness

    We spend a lot of time talking about reliability in distributed systems. We design for high availability. We replicate services across zones. We scale horizontally and build in redundancy.

    The vLLM Semantic Router fits cleanly into that model. It can be deployed as a stateless service, replicated, and scaled like any other control plane component. But there is a disconnect. A system can be highly available and still be consistently wrong. If the router is returning incorrect decisions 20% of the time, uptime does not tell the full story. The system is available, but it is not reliable in the way that matters.

    For routing, correctness is a key service level objective (SLO). From this perspective, the problem becomes much clearer. The architecture is sound, but the decision layer is not accurate enough to support it.

    Where the pretrained model fails

    The failure mode is consistent and predictable. It is not random noise. It is structural. The pretrained model does not understand task complexity. It understands semantic similarity. It groups prompts based on how they look and read, not based on what they require to execute.

    That leads to a specific pattern of misclassification:

    • Longer prompts appear more complex.
    • More descriptive language appears more complex.
    • Structured instructions appear more complex.

    As a result, the router frequently misclassifies MEDIUM queries into the COMPLEX tier because they contain more detail or context, not because they require advanced reasoning. This misrouting increases operational costs by sending requests to larger models unnecessarily. Routing decisions no longer align with actual workload complexity, which makes system behavior unpredictable. Over time, this erodes trust in the routing layer.

    Setting up the fine-tuning pipeline on OpenShift AI

    To run this experiment, we built a simple but repeatable pipeline on Red Hat OpenShift AI that combined synthetic data generation, model training, and evaluation into a single workflow. We started with a small set of seed anchors, which are handwritten example prompts that represent each routing tier (SIMPLE, MEDIUM, COMPLEX, REASONING). These anchors serve as the ground-truth starting point, encoding how we expect different types of requests to be classified.

    From there, we expanded them using a synthetic data generation (SDG) pipeline driven by large models served through vLLM and KServe. That pipeline produced paraphrases, domain transfers, boundary cases, and hard negatives, then applied filtering and cross-model validation to maintain label quality.

    The resulting dataset was fed into a sentence-transformers training job using contrastive learning, running entirely on a CPU to keep it lightweight and accessible. From there, we evaluated the fine-tuned model against a held-out test set and measured routing accuracy, tier-level performance, and calibration.

    The output is a standard embedding model that can be dropped directly into the semantic router, making the entire process straightforward to reproduce and easy to integrate into an existing MLOps pipeline.

    What was tested

    This experiment evaluates whether an embedding model can learn specific routing decisions instead of general semantic similarity. We did not change the router. We did not introduce a new architecture or additional inference components. We focused on the component that makes the routing decision: the embedding model. The goal was to teach the difference between tiers, not just the similarity between sentences.

    We did not have labeled production routing data, so we generated it. Starting with 48 seed anchors, we expanded the dataset using a mix of strategies to capture both clean examples and complex edge cases.

    The expanded dataset included paraphrases, domain transfers, boundary cases, hard negatives, realistic prompt patterns, and noisy inputs like typos and truncation. Rather than building a massive dataset, we focused on accurately reflecting the decisions the router needs to make.

    Stage               Count
    Seed anchors        48
    Generated examples  1,157
    After filtering     1,009
    Training set        805
    Test set            204

    The most important part of this dataset was not the volume but the ambiguity.

    Hard negatives showed about 50% disagreement between models during validation. This variance indicates that these examples sit directly on the boundary between tiers, which is where routing decisions matter most.
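
    One way to measure this kind of cross-model disagreement is to label every generated example with more than one validator model and count how often the labels differ, grouped by generation strategy. The sketch below assumes a hypothetical record layout (`strategy` plus a list of `labels`, one per validator) and made-up sample rows; it is not the pipeline's actual code or data.

```python
from collections import defaultdict

def disagreement_by_strategy(examples):
    """Fraction of examples per strategy where validator labels are not unanimous."""
    totals = defaultdict(int)
    disagreed = defaultdict(int)
    for ex in examples:
        totals[ex["strategy"]] += 1
        if len(set(ex["labels"])) > 1:
            disagreed[ex["strategy"]] += 1
    return {strategy: disagreed[strategy] / totals[strategy] for strategy in totals}

# Illustrative rows only (assumption): two validator labels per example.
SAMPLE = [
    {"strategy": "paraphrase", "labels": ["SIMPLE", "SIMPLE"]},
    {"strategy": "paraphrase", "labels": ["MEDIUM", "MEDIUM"]},
    {"strategy": "hard_negative", "labels": ["MEDIUM", "COMPLEX"]},
    {"strategy": "hard_negative", "labels": ["COMPLEX", "COMPLEX"]},
]

rates = disagreement_by_strategy(SAMPLE)
print(rates)
```

    A high rate for one strategy is a signal that its examples sit near a tier boundary and deserve closer review before they are used as training labels.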

    Training the right behavior

    We kept the same base model to ensure compatibility with existing deployments. The change was in how the model learned. Instead of training for general similarity, we trained the model to distinguish between tiers using BatchAllTripletLoss with GROUP_BY_LABEL sampling. This forces every batch to include examples from all tiers and ensures the model learns the boundaries between them. Rather than passively learning clusters of similar sentences, the model actively learns what separates one tier from another.
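
    To make the training objective concrete, the following is a from-scratch illustration of a batch-all triplet loss. It is a simplified sketch, not the sentence-transformers implementation used in the experiment: it assumes Euclidean distance, a margin of 1.0, and hand-built two-dimensional "embeddings", and it averages only over triplets that still violate the margin.

```python
import math
from itertools import permutations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_all_triplet_loss(embeddings, labels, margin=1.0):
    """Average loss over all (anchor, positive, negative) triplets with positive loss."""
    losses = []
    n = len(labels)
    for a, p in permutations(range(n), 2):
        if labels[a] != labels[p]:
            continue  # anchor and positive must share a tier
        d_ap = euclidean(embeddings[a], embeddings[p])
        for neg in range(n):
            if labels[neg] == labels[a]:
                continue  # negative must come from a different tier
            loss = d_ap - euclidean(embeddings[a], embeddings[neg]) + margin
            if loss > 0:
                losses.append(loss)
    return sum(losses) / len(losses) if losses else 0.0

labels = [0, 0, 1, 1]  # two tiers, two examples each
separated = batch_all_triplet_loss([[0, 0], [0.1, 0], [5, 5], [5.1, 5]], labels)
overlapping = batch_all_triplet_loss([[0, 0], [2, 0], [0.5, 0], [2.5, 0]], labels)
print(separated, overlapping)
```

    The loss is zero once every same-tier pair is closer together than any cross-tier pair by at least the margin, which is why each batch needs several examples per tier: without them there are no valid triplets to learn from.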

    This change significantly improved the routing layer's accuracy, as shown in the following table.

    Metric           Baseline  Fine-tuned  Delta (percentage points)
    Accuracy         80.39%    98.53%      +18.14
    Misrouting rate  19.61%    1.47%       -18.14

    The system moved from roughly one in five requests being misrouted to roughly one in 70.
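
    The rates in the table can be converted back into request counts. The sketch below assumes both accuracy figures were measured on the 204-example test set listed earlier; the counts are derived from the published percentages, not taken from the experiment logs.

```python
TEST_SET_SIZE = 204  # test set size from the dataset table

def implied_counts(accuracy_pct, n=TEST_SET_SIZE):
    """Return (correct, misrouted) counts implied by an accuracy percentage."""
    correct = round(accuracy_pct / 100 * n)
    return correct, n - correct

baseline = implied_counts(80.39)
fine_tuned = implied_counts(98.53)

print("baseline (correct, misrouted):", baseline)
print("fine-tuned (correct, misrouted):", fine_tuned)
print("fine-tuned: one misroute per", TEST_SET_SIZE // fine_tuned[1], "requests")
```

    Under that assumption, the baseline misroutes 40 of 204 test prompts and the fine-tuned model misroutes 3, or one in 68.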

    This shift transforms a system requiring constant correction into one that operates autonomously.

    Accuracy gains at tier boundaries

    The performance gains are concentrated at the tier boundaries, which is exactly where they need to be.

    The fine-tuned model demonstrated high precision and recall across all categories, as shown in the following table.

    Tier       Precision  Recall  F1
    SIMPLE     1.0000     1.0000  1.0000
    MEDIUM     0.9649     0.9821  0.9735
    COMPLEX    0.9825     0.9655  0.9739
    REASONING  1.0000     1.0000  1.0000

    The SIMPLE and REASONING tiers are cleanly separated. The remaining errors occur between the MEDIUM and COMPLEX tiers, where the inputs are genuinely ambiguous. At that point, the model is no longer making obvious mistakes. It is making judgment calls. That is where you want to be.
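
    The per-tier numbers follow directly from a confusion matrix. The matrix below is a reconstruction, not published data: the MEDIUM and COMPLEX rows are one set of counts consistent with the reported precision, recall, and a 204-example test set, while the 45/45 split between SIMPLE and REASONING is purely an assumption.

```python
TIERS = ["SIMPLE", "MEDIUM", "COMPLEX", "REASONING"]

# Rows are true tiers, columns are predicted tiers (reconstructed, see note above).
CONFUSION = [
    [45, 0, 0, 0],
    [0, 55, 1, 0],
    [0, 2, 56, 0],
    [0, 0, 0, 45],
]

def per_tier_metrics(matrix, names):
    """Return {tier: (precision, recall, f1)} rounded to four decimals."""
    metrics = {}
    for i, name in enumerate(names):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = (round(precision, 4), round(recall, 4), round(f1, 4))
    return metrics

METRICS = per_tier_metrics(CONFUSION, TIERS)
for name in TIERS:
    print(name, METRICS[name])
```

    In this reconstruction all three errors are swaps between MEDIUM and COMPLEX: one MEDIUM prompt routed up and two COMPLEX prompts routed down.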

    Why this matters

    Routing is often described as a cost optimization layer. However, controlling costs is only one aspect of routing. Routing is a control plane decision. It determines which model processes the request, where data is sent, what capabilities are applied, and how policies are enforced across the system. When a router misclassifies a request, all downstream components inherit that error. Improving routing accuracy is not just about saving money. It is about making the entire system behave predictably.

    The baseline model consistently over-routes requests to expensive models.

    AI sovereignty and compliance

    This is where routing accuracy becomes critical. In regulated environments, routing is part of the security model. Sending a request to the wrong model does more than degrade efficiency; it can violate corporate compliance policies, data residency requirements, or internal controls. At a 20% misrouting rate, semantic routing is difficult to justify.

    Even with guardrails in place, the system is compensating for too many mistakes. Guardrails should always exist. They enforce policy, protect data, and provide a safety layer for unexpected behavior. But they are not meant to carry the system. When routing accuracy is low, guardrails become overloaded. They spend most of their time catching traffic that should have been handled correctly in the first place.

    As routing improves, that dynamic changes. At a misrouting rate below 2%, the router becomes the primary decision layer. Guardrails return to focusing on edge cases and enforcing policy boundaries. That shift reduces operational overhead and makes the system easier to reason about from a risk perspective.

    Domain-specific routing becomes viable

    Pretrained models cannot understand domain-specific complexity. They cannot distinguish between a simple lookup and a multi-step reasoning task within a specific field. Fine-tuning allows organizations to encode that knowledge directly into the routing layer. The workflow is straightforward. Define anchors that reflect your domain. Generate examples that expand those anchors. Train the model. Deploy it as a drop-in replacement. No changes to the routing architecture are required. This makes domain-specific routing practical without introducing additional system complexity.

    What this tells us

    The key takeaway is not just that fine-tuning improves accuracy. It is how much improvement can be achieved with a relatively small dataset and modest compute. With 805 training examples and under two hours of compute time, the pipeline lowers the barrier significantly. This is no longer a large-scale machine learning project. It is something platform teams can integrate into their existing workflows.

    The next step is to close the feedback loop. This process involves collecting real routing decisions, evaluating them using advanced judge models or human review, and feeding those results back into training. Over time, the router becomes adaptive. It evolves alongside the workloads it supports.
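
    A feedback loop of this kind could start as a simple log of routing decisions paired with a reviewed label. The sketch below is hypothetical throughout: `RoutingRecord`, its fields, and the sample prompts are illustrative names and data, and the reviewed tier is assumed to come from a judge model or a human reviewer.

```python
from dataclasses import dataclass

@dataclass
class RoutingRecord:
    prompt: str
    routed_tier: str  # tier chosen by the router in production
    judged_tier: str  # tier assigned afterwards by a judge model or human review

def observed_accuracy(records):
    """Share of logged decisions where the router agreed with the reviewer."""
    if not records:
        return 0.0
    agreed = sum(1 for r in records if r.routed_tier == r.judged_tier)
    return agreed / len(records)

def corrections_for_retraining(records):
    """Return (prompt, reviewed tier) pairs for decisions the reviewer overturned."""
    return [(r.prompt, r.judged_tier) for r in records if r.routed_tier != r.judged_tier]

LOG = [
    RoutingRecord("What is the capital of France?", "SIMPLE", "SIMPLE"),
    RoutingRecord("Summarize this three-paragraph memo.", "COMPLEX", "MEDIUM"),
    RoutingRecord("Prove the claim step by step.", "REASONING", "REASONING"),
    RoutingRecord("Refactor this module and explain the tradeoffs.", "COMPLEX", "COMPLEX"),
]

new_examples = corrections_for_retraining(LOG)
print(observed_accuracy(LOG), new_examples)
```

    The overturned decisions are the most valuable additions to the next training run, because they are real prompts that sit on the boundaries the current model gets wrong.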

    Final thoughts

    Semantic routing is already the right architectural pattern. The limitation has been accuracy. This experiment shows that the accuracy ceiling is not fixed. Once that constraint is removed, the router becomes more than just a cost-optimization layer. It becomes a control plane for AI systems, deciding not just how inference is executed but which intelligence is applied in the first place.

    Additional material

    You can access the complete SDG pipeline, training code, evaluation framework, and experiment data through these repositories:

    • Fine-tuning pipeline and experiment details
    • vLLM Semantic Router

    This experiment relies on the open source projects listed in the following table:

    Project                Version  Role
    sentence-transformers  5.4.1    Embedding model training and inference
    PyTorch                2.11.0   Deep learning framework (CPU build)
    transformers           5.6.2    Model architecture and tokenizers
    datasets               4.8.4    Dataset loading and processing
    scikit-learn           1.8.0    Evaluation metrics
    vLLM                   -        LLM serving (KServe backend)
    all-MiniLM-L6-v2       -        Base embedding model (23M params)
