The vLLM Semantic Router solves a real problem. Not every request needs the same model. Some are simple and deterministic. Others require multi-step reasoning, tool use, or long context windows. If everything is sent to your largest model, you burn compute, increase latency, and lose efficiency across the entire system.
So we introduce a routing layer. As detailed in our guide on getting started with the vLLM Semantic Router Athena release, the router classifies incoming requests and sends them to the appropriate model. Simple queries go to lightweight models that are fast and inexpensive. More complex requests are routed to reasoning-capable models that can handle deeper workloads. On paper, this works well. In small deployments or a home lab, it is often a clear win.
In enterprise deployments, however, a performance gap emerges. The router is typically built on a pretrained embedding model, something like all-MiniLM-L6-v2. It compares incoming prompts to anchor examples and selects the closest match based on similarity. That sounds reasonable until you actually measure it under realistic workloads.
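To make the mechanism concrete, here is a minimal sketch of nearest-anchor routing. It is not the router's actual implementation: the `cosine` and `route` helpers are hypothetical, and the four-dimensional toy vectors stand in for real embeddings from a model such as all-MiniLM-L6-v2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route(prompt_vec, anchors):
    """Return (tier, score) for the anchor most similar to the prompt.

    anchors is a list of (tier, vector) pairs.
    """
    best_tier, best_score = None, -2.0  # cosine is always >= -1
    for tier, vec in anchors:
        score = cosine(prompt_vec, vec)
        if score > best_score:
            best_tier, best_score = tier, score
    return best_tier, best_score


# Toy vectors for illustration only; real anchors are embedded prompts.
ANCHORS = [
    ("SIMPLE", [1.0, 0.0, 0.0, 0.0]),
    ("MEDIUM", [0.0, 1.0, 0.0, 0.0]),
    ("COMPLEX", [0.0, 0.0, 1.0, 0.0]),
    ("REASONING", [0.0, 0.0, 0.0, 1.0]),
]

tier, score = route([0.9, 0.1, 0.0, 0.0], ANCHORS)
print(tier, round(score, 3))
```

The decision depends entirely on where the embedding model places the prompt relative to the anchors. If the model encodes surface features such as length or wording rather than task complexity, the nearest anchor will reflect those features too.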
In our testing, the pretrained model achieved 80% accuracy on a four-tier classification task. That translates to a 20% misrouting rate. One in five requests is sent to the wrong model. That is not a tuning issue. That is a system-level limitation.
Why high availability does not guarantee routing correctness
We spend a lot of time talking about reliability in distributed systems. We design for high availability. We replicate services across zones. We scale horizontally and build in redundancy.
The vLLM Semantic Router fits cleanly into that model. It can be deployed as a stateless service, replicated, and scaled like any other control plane component. But there is a disconnect. A system can be highly available and still be consistently wrong. If the router is returning incorrect decisions 20% of the time, uptime does not tell the full story. The system is available, but it is not reliable in the way that matters.
For routing, correctness is a key service level objective (SLO). From this perspective, the problem becomes much clearer. The architecture is sound, but the decision layer is not accurate enough to support it.
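One way to treat routing correctness as an SLO is to measure it like any other service level indicator: sample decisions, compare them to a reference label, and check the ratio against a target. The sketch below assumes reference labels are available (for example, from human review). The function names and the 0.98 target are illustrative assumptions, not values from our deployment.

```python
def routing_correctness(decisions):
    """Fraction of decisions where the routed tier matches the expected tier.

    decisions is an iterable of (routed_tier, expected_tier) pairs.
    """
    decisions = list(decisions)
    if not decisions:
        raise ValueError("no decisions to evaluate")
    correct = sum(1 for routed, expected in decisions if routed == expected)
    return correct / len(decisions)


def meets_slo(decisions, target=0.98):
    """True when measured correctness reaches the (hypothetical) target."""
    return routing_correctness(decisions) >= target


# Four correct decisions and one MEDIUM prompt over-routed to COMPLEX.
sample = [("SIMPLE", "SIMPLE")] * 4 + [("COMPLEX", "MEDIUM")]
print(routing_correctness(sample), meets_slo(sample))
```

A router can pass every availability check while failing this one, which is the gap described above.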
Where the pretrained model fails
The failure mode is consistent and predictable. It is not random noise. It is structural. The pretrained model does not understand task complexity. It understands semantic similarity. It groups prompts based on how they look and read, not based on what they require to execute.
That leads to a specific pattern of misclassification:
- Longer prompts appear more complex.
- More descriptive language appears more complex.
- Structured instructions appear more complex.
As a result, the router frequently misclassifies MEDIUM queries into the COMPLEX tier because they contain more detail or context, not because they require advanced reasoning. This misrouting increases operational costs by sending requests to larger models unnecessarily. Routing decisions no longer align with actual workload complexity, which makes system behavior unpredictable. Over time, this erodes trust in the routing layer.
Setting up the fine-tuning pipeline on OpenShift AI
To run this experiment, we built a simple but repeatable pipeline on Red Hat OpenShift AI that combined synthetic data generation, model training, and evaluation into a single workflow. We started with a small set of seed anchors, which are handwritten example prompts that represent each routing tier (SIMPLE, MEDIUM, COMPLEX, REASONING). These anchors serve as the ground-truth starting point, encoding how we expect different types of requests to be classified.
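As an illustration, seed anchors can be represented as simple labeled records. This is a sketch under stated assumptions: the `Anchor` class, the helper functions, and the four example prompts are hypothetical and are not taken from our anchor set.

```python
from collections import Counter
from dataclasses import dataclass

TIERS = ("SIMPLE", "MEDIUM", "COMPLEX", "REASONING")


@dataclass(frozen=True)
class Anchor:
    text: str
    tier: str

    def __post_init__(self):
        # Reject labels outside the four routing tiers and empty prompts.
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier!r}")
        if not self.text.strip():
            raise ValueError("anchor text must not be empty")


# Hypothetical prompts, one per tier, for illustration only.
SEED_ANCHORS = [
    Anchor("What is the default SSH port?", "SIMPLE"),
    Anchor("Summarize this changelog in three bullet points.", "MEDIUM"),
    Anchor("Refactor this module to use async I/O and update its tests.", "COMPLEX"),
    Anchor("Prove that this scheduling algorithm cannot deadlock.", "REASONING"),
]


def tier_counts(anchors):
    """Number of anchors per tier."""
    return Counter(a.tier for a in anchors)


def missing_tiers(anchors):
    """Tiers that have no anchor yet, in canonical order."""
    counts = tier_counts(anchors)
    return [t for t in TIERS if counts[t] == 0]


print(tier_counts(SEED_ANCHORS), missing_tiers(SEED_ANCHORS))
```

Validating labels at construction time catches mislabeled anchors before they propagate into the generated dataset.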
From there, we expanded them using a synthetic data generation (SDG) pipeline driven by large models served through vLLM and KServe. That pipeline produced paraphrases, domain transfers, boundary cases, and hard negatives, then applied filtering and cross-model validation to maintain label quality.
The resulting dataset was fed into a sentence-transformers training job using contrastive learning, running entirely on a CPU to keep it lightweight and accessible. From there, we evaluated the fine-tuned model against a held-out test set and measured routing accuracy, tier-level performance, and calibration.
The output is a standard embedding model that can be dropped directly into the semantic router, making the entire process straightforward to reproduce and easy to integrate into an existing MLOps pipeline.
What was tested
This experiment evaluates whether an embedding model can learn specific routing decisions instead of general semantic similarity. We did not change the router. We did not introduce a new architecture or additional inference components. We focused on the component that makes the routing decision: the embedding model. The goal was to teach the difference between tiers, not just the similarity between sentences.
We did not have labeled production routing data, so we generated it. Starting with 48 seed anchors, we expanded the dataset using a mix of strategies to capture both clean examples and complex edge cases.
The expanded dataset included paraphrases, domain transfers, boundary cases, hard negatives, realistic prompt patterns, and noisy inputs like typos and truncation. Rather than building a massive dataset, we focused on accurately reflecting the decisions the router needs to make.
| Stage | Count |
|---|---|
| Seed anchors | 48 |
| Generated examples | 1,157 |
| After filtering | 1,009 |
| Training set | 805 |
| Test set | 204 |
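The noisy inputs mentioned above (typos and truncation) are the easiest category to picture in code. The sketch below is a rule-based illustration only; it assumes nothing about how our SDG pipeline produced its noisy variants, and the `add_typo` and `truncate` helpers are hypothetical. The key property is that the noisy variant keeps the tier label of the prompt it came from.

```python
import random


def add_typo(text, rng):
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def truncate(text, rng, min_keep=0.5):
    """Keep a random word prefix, at least min_keep of the words, never all."""
    words = text.split()
    if len(words) < 2:
        return text
    upper = len(words) - 1
    lower = min(max(1, int(len(words) * min_keep)), upper)
    return " ".join(words[: rng.randint(lower, upper)])


def noisy_variants(prompt, tier, seed=0):
    """Return (text, tier) pairs; the label is inherited unchanged."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [(add_typo(prompt, rng), tier), (truncate(prompt, rng), tier)]


for text, tier in noisy_variants("Summarize this changelog in three bullet points.", "MEDIUM"):
    print(tier, "|", text)
```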
The most important part of this dataset was not the volume but the ambiguity.
Hard negatives showed about 50% disagreement between models during validation. This variance indicates that these examples sit directly on the boundary between tiers, which is where routing decisions matter most.
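Disagreement between validators is straightforward to measure. The sketch below assumes two validator models have each assigned a tier to the same list of prompts; the function names and the sample labels are hypothetical.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of positions where two validators assigned different tiers."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("no labels to compare")
    disagreements = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return disagreements / len(labels_a)


def boundary_cases(prompts, labels_a, labels_b):
    """Prompts the two validators disagree on, i.e. likely tier-boundary cases."""
    return [p for p, a, b in zip(prompts, labels_a, labels_b) if a != b]


# Hypothetical validator output for four generated prompts.
prompts = ["p1", "p2", "p3", "p4"]
model_a = ["MEDIUM", "COMPLEX", "MEDIUM", "COMPLEX"]
model_b = ["MEDIUM", "MEDIUM", "COMPLEX", "COMPLEX"]

print(disagreement_rate(model_a, model_b))
print(boundary_cases(prompts, model_a, model_b))
```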
Training the right behavior
We kept the same base model to ensure compatibility with existing deployments. The change was in how the model learned. Instead of training for general similarity, we trained the model to distinguish between tiers using BatchAllTripletLoss with GROUP_BY_LABEL sampling. This forces every batch to include examples from all tiers and ensures the model learns the boundaries between them. Rather than passively learning clusters of similar sentences, the model actively learns what separates one tier from another.
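To show what a batch-all triplet objective computes, here is a plain-Python illustration of the idea. It is not the sentence-transformers implementation: the function names, the Euclidean distance, the margin of 1.0, and the choice to average over triplets with positive loss are assumptions made for this sketch.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_all_triplet_loss(embeddings, labels, margin=1.0):
    """Average hinge loss over every valid (anchor, positive, negative) triplet.

    A triplet is valid when anchor and positive share a label and the negative
    has a different one. Only triplets that violate the margin contribute.
    """
    active = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_pos = euclidean(embeddings[a], embeddings[p])
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue
                d_neg = euclidean(embeddings[a], embeddings[neg])
                loss = d_pos - d_neg + margin
                if loss > 0:
                    active.append(loss)
    return sum(active) / len(active) if active else 0.0


labels = ["MEDIUM", "MEDIUM", "COMPLEX", "COMPLEX"]
separated = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
overlapping = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [1.5, 0.0]]

print(batch_all_triplet_loss(separated, labels))
print(batch_all_triplet_loss(overlapping, labels))
```

When the tiers are already well separated, no triplet violates the margin and the loss is zero. When MEDIUM and COMPLEX embeddings interleave, the loss is positive, and minimizing it pushes the tiers apart. A batch needs several examples per label for any valid triplets to exist, which is why label-aware batch sampling matters for this loss.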
This change significantly improved the routing layer's accuracy, as shown in the following table.
| Metric | Baseline | Fine-tuned | Delta (percentage points) |
|---|---|---|---|
| Accuracy | 80.39% | 98.53% | +18.14 |
| Misrouting rate | 19.61% | 1.47% | -18.14 |
The system moved from roughly one in five requests being misrouted to about one in 68 (3 of the 204 test prompts).
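These rates can be checked against the test set size. The sketch below assumes both accuracy figures were measured on the 204-example test set listed earlier; the helper name is hypothetical.

```python
TEST_SET_SIZE = 204  # from the dataset table


def misrouted(accuracy_pct, n=TEST_SET_SIZE):
    """Number of misrouted prompts implied by an accuracy percentage."""
    return round(n * (100 - accuracy_pct) / 100)


baseline_errors = misrouted(80.39)
finetuned_errors = misrouted(98.53)

print(baseline_errors, finetuned_errors)   # 40 3
print(TEST_SET_SIZE / baseline_errors)     # 5.1
print(TEST_SET_SIZE / finetuned_errors)    # 68.0
```

Under that assumption, the baseline misroutes 40 of 204 prompts and the fine-tuned model misroutes 3.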
This shift transforms a system requiring constant correction into one that operates autonomously.
Accuracy gains at tier boundaries
The performance gains are concentrated at the tier boundaries, which is exactly where they need to be.
The fine-tuned model demonstrated high precision and recall across all categories, as shown in the following table.
| Tier | Precision | Recall | F1 |
|---|---|---|---|
| SIMPLE | 1.0000 | 1.0000 | 1.0000 |
| MEDIUM | 0.9649 | 0.9821 | 0.9735 |
| COMPLEX | 0.9825 | 0.9655 | 0.9739 |
| REASONING | 1.0000 | 1.0000 | 1.0000 |
The SIMPLE and REASONING tiers are cleanly separated. The remaining errors occur between the MEDIUM and COMPLEX tiers, where the inputs are genuinely ambiguous. At that point, the model is no longer making obvious mistakes. It is making judgment calls. That is where you want to be.
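The per-tier metrics follow from a confusion matrix. The counts below are an assumption: they are one configuration that reproduces the reported figures on a 204-example test set, not the published per-tier counts. In particular, the 45/45 split between SIMPLE and REASONING is a placeholder.

```python
TIERS = ["SIMPLE", "MEDIUM", "COMPLEX", "REASONING"]

# Rows are true tiers, columns are predicted tiers (assumed counts).
CONFUSION = [
    [45, 0, 0, 0],   # SIMPLE
    [0, 55, 1, 0],   # MEDIUM: one prompt predicted as COMPLEX
    [0, 2, 56, 0],   # COMPLEX: two prompts predicted as MEDIUM
    [0, 0, 0, 45],   # REASONING
]


def per_tier_metrics(matrix, tiers):
    """Precision, recall, and F1 for each tier from a confusion matrix."""
    metrics = {}
    for i, tier in enumerate(tiers):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[tier] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


METRICS = per_tier_metrics(CONFUSION, TIERS)
for tier in TIERS:
    print(tier, {k: round(v, 4) for k, v in METRICS[tier].items()})
print("accuracy", round(accuracy(CONFUSION), 4))
```

Under these assumed counts, all three errors are MEDIUM/COMPLEX confusions, which matches the pattern described above.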
Why this matters
Routing is often described as a cost optimization layer. However, controlling costs is only one aspect of routing. Routing is a control plane decision. It determines which model processes the request, where data is sent, what capabilities are applied, and how policies are enforced across the system. When a router misclassifies a request, all downstream components inherit that error. Improving routing accuracy is not just about saving money. It is about making the entire system behave predictably.
The baseline model consistently over-routes requests to expensive models.
AI sovereignty and compliance
This is where routing accuracy becomes critical. In regulated environments, routing is part of the security model. Sending a request to the wrong model does more than degrade efficiency; it can violate corporate compliance policies, data residency requirements, or internal controls. At a 20% misrouting rate, semantic routing is difficult to justify.
Even with guardrails in place, the system is compensating for too many mistakes. Guardrails should always exist. They enforce policy, protect data, and provide a safety layer for unexpected behavior. But they are not meant to carry the system. When routing accuracy is low, guardrails become overloaded. They spend most of their time catching traffic that should have been handled correctly in the first place.
As routing improves, that dynamic changes. At a misrouting rate below 2%, the router becomes the primary decision layer. Guardrails return to focusing on edge cases and enforcing policy boundaries. That shift reduces operational overhead and makes the system easier to reason about from a risk perspective.
Domain-specific routing becomes viable
General-purpose pretrained embedding models do not capture domain-specific complexity. They cannot reliably distinguish between a simple lookup and a multi-step reasoning task within a specific field. Fine-tuning allows organizations to encode that knowledge directly into the routing layer. The workflow is straightforward. Define anchors that reflect your domain. Generate examples that expand those anchors. Train the model. Deploy it as a drop-in replacement. No changes to the routing architecture are required. This makes domain-specific routing practical without introducing additional system complexity.
What this tells us
The key takeaway is not just that fine-tuning improves accuracy. It is how much improvement can be achieved with a relatively small dataset and modest compute. With 805 training examples and under two hours of compute time, the pipeline lowers the barrier significantly. This is no longer a large-scale machine learning project. It is something platform teams can integrate into their existing workflows.
The next step is to close the feedback loop. This process involves collecting real routing decisions, evaluating them using advanced judge models or human review, and feeding those results back into training. Over time, the router becomes adaptive. It evolves alongside the workloads it supports.
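A feedback loop of this kind could be sketched as follows. This is a design illustration, not something we have built: the `RoutingDecision` record and both helper functions are hypothetical, and `judged_tier` stands for whatever label a judge model or human reviewer assigns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingDecision:
    prompt: str
    routed_tier: str
    judged_tier: Optional[str] = None  # None until a judge or reviewer labels it


def misroutes(decisions):
    """Reviewed decisions where the judge disagrees with the router."""
    return [
        d for d in decisions
        if d.judged_tier is not None and d.judged_tier != d.routed_tier
    ]


def to_training_examples(decisions):
    """(prompt, tier) pairs from reviewed decisions, using the judged label."""
    return [(d.prompt, d.judged_tier) for d in decisions if d.judged_tier is not None]


log = [
    RoutingDecision("List the open ports on this host.", "SIMPLE", "SIMPLE"),
    RoutingDecision("Summarize the attached incident report.", "COMPLEX", "MEDIUM"),
    RoutingDecision("Plan a zero-downtime database migration.", "COMPLEX"),  # unreviewed
]

print(len(misroutes(log)), to_training_examples(log))
```

Keeping unreviewed decisions out of the training set avoids reinforcing the router's own mistakes.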
Final thoughts
Semantic routing is already the right architectural pattern. The limitation has been accuracy. This experiment shows that the accuracy ceiling is not fixed. Once that constraint is removed, the router becomes more than just a cost-optimization layer. It becomes a control plane for AI systems, deciding not just how inference is executed but which intelligence is applied in the first place.
Additional material
You can access the complete SDG pipeline, training code, evaluation framework, and experiment data through these repositories:
This experiment relies on the open source projects listed in the following table:
| Project | Version | Role |
|---|---|---|
| sentence-transformers | 5.4.1 | Embedding model training and inference |
| PyTorch | 2.11.0 | Deep learning framework (CPU build) |
| transformers | 5.6.2 | Model architecture and tokenizers |
| datasets | 4.8.4 | Dataset loading and processing |
| scikit-learn | 1.8.0 | Evaluation metrics |
| vLLM | | LLM serving (KServe backend) |
| all-MiniLM-L6-v2 | | Base embedding model (23M params) |