If you are running agentic workloads in production, you want an API layer between your application and the inference engine for swapping providers without rewriting code, managing multi-turn conversation state, and observability. Every request passes through it as follows:
Client -> API layer -> inference server -> API layer -> client
You would expect this layer to pass everything through faithfully. But it can pass tool schemas incorrectly, handle fields differently across versions, or drop state that carries conversation context. When that happens, the final prompt the model sees differs from the format on which it was trained. In multi-turn tool calling, the client takes the response, updates conversation history, and sends it back through the API layer for the next turn. If information is silently lost, there are no failing tests, just quiet accuracy loss. This is one reason performance reports for the same agentic model vary so widely across different deployments.
The only way to catch this is to run an end-to-end benchmark like Berkeley Function-Calling Leaderboard, BFCL across your full stack and compare the numbers against direct calls to the inference provider.
Our stack: OGX and vLLM
OGX is an open source API server that provides a single OpenAI-compatible interface across multiple inference providers, including vLLM, Ollama, and Bedrock. Red Hat OpenShift AI ships both together.
We ran BFCL for gpt-oss, OpenAI's open-weight model family, on this stack across OpenShift AI 3.3 and 3.4. The surprise was not that 3.4 scored better, but where the gain came from. Upgrading OGX without vLLM actually made things worse, a regression that would have been invisible without the benchmark.
The gpt-oss-120b on OpenShift AI 3.4 scored 51.4% on BFCL multi-turn, up from 44.8% on 3.3: a 6.6 percentage point gain in tool-calling accuracy across multi-step conversations.
OpenShift AI tests OGX and vLLM together before shipping, so a version bump in one does not silently break the other. That means your team can focus on building applications instead of debugging invisible regressions between infrastructure components. When the next round of model and infrastructure improvements ships, you get those gains without requalifying each piece. If you run your own stack with different layers in between, the same lesson applies: test end-to-end.
The source of the gain
Among many changes in OpenShift AI 3.4, the two driving accuracy gains in these experiments are vLLM (v0.13.0 to v0.18.0) and OGX (v0.4.2 to v0.7.1). The BFCL tests run through OGX's responses API, with the BFCL harness managing the multi-turn tool-calling loop client-side.
We tested every combination of old and new as shown in the following table:
Configuration | Overall Accuracy |
|---|---|
OpenShift AI 3.3 OGX + RHOAI 3.3 vLLM | 44.8% |
OpenShift AI 3.3 OGX + RHOAI 3.4 vLLM | 46.0% |
OpenShift AI 3.4 OGX + RHOAI 3.3 vLLM | 43.6% |
OpenShift AI 3.4 OGX + RHOAI 3.4 vLLM | 51.4% |
Upgrading vLLM alone gave a small bump. Upgrading OGX alone actually regressed. Both together jumped to 51.4%. The pieces have to move together.
Full results, including gpt-oss-20b and vLLM-direct baselines, are in the benchmark report.
Reproduce the results
The benchmark report has step-by-step instructions to replicate the results, including OGX and vLLM versions, BFCL test setup, and evaluation commands. The gpt-oss-120b uses MXFP4 quantization and fits on a single 80GB GPU, but for production serving we recommend a multi-GPU setup such as 4x NVIDIA H100 with NVLink. We ran our tests on an AWS g6e.12xlarge (4x NVIDIA L40S). Review the vLLM gpt-oss recipe for detailed hardware guidance.
To get started with OpenShift AI 3.4, see the Red Hat OpenShift AI documentation. The BFCL benchmark is open source if you want to run it against your own stack.