Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Deploy an LLM inference service on OpenShift AI

Powering the Ansible Lightspeed intelligent assistant with Red Hat OpenShift AI, part 2

November 3, 2025
Riya Sharma Elijah DeLee
Related topics:
Artificial intelligenceAutomation and management
Related products:
Red Hat AIRed Hat OpenShift AIRed Hat Ansible Automation PlatformRed Hat Ansible Lightspeed with IBM watsonx Code AssistantRed Hat OpenShift

    Deploying large language models (LLMs) on Red Hat OpenShift AI enables on-premise inference for Red Hat Ansible Lightspeed intelligent assistant. With OpenShift AI, you can containerize, scale, and integrate LLM workloads directly into enterprise environments. This ensures better control over data, compliance with organizational policies, and the ability to optimize resource utilization. The deployment also creates a flexible platform for evaluating model performance, customizing workflows, and extending Ansible Lightspeed with domain-specific intelligence.

    Note: Installing OpenShift AI is a prerequisite for this model deployment.

    Powering the Ansible Lightspeed intelligent assistant with your own inference service

    Ansible Lightspeed intelligent assistant lets you bring your own AI service to power the inference that helps generate answers. These answers are based on the enhanced context from the retrieval-augmented generation (RAG) requests that the intelligent assistant uses.

    The core of these AI services is typically an LLM. An LLM is a type of AI model that uses natural language processing (NLP) to understand and generate human-like text. Trained on massive amounts of text data using transformer architectures, these models learn the deep patterns and nuances of language.

    OpenShift AI provides the necessary interface for this is through an inference service. An inference service is a standardized API endpoint that provides access to a deployed LLM, allowing applications like the intelligent assistant to simply send its RAG-enhanced prompt and receive the generated response.

    Key components of an inference server deployment

    Deploying an inference service involves several critical components:

    • Model storage: KServe supports Amazon S3, PVC, or OCI-based model storage. We have chosen an S3-compatible location where the large LLM files are stored. The service pulls the model from here on startup.
    • A serving runtime: A specialized container image that contains the necessary libraries and server software (like vLLM) to load the model from storage and serve requests.
    • Accelerated compute infrastructure: An underlying worker node equipped with a compatible GPU. We have chosen instances with NVIDIA GPUs, so we need to install the NVIDIA GPU Operator in OpenShift to make this hardware's processing power available to our serving runtime. This is essential for performant inference. Alternatively, OpenShift AI supports AMD GPUs, but we will not cover this in our example.

    Evaluating inference server performance

    How do we know if our inference service is good enough for a real-time, interactive tool like the Ansible Lightspeed intelligent assistant? A slow, lagging assistant provides a poor user experience. To quantify performance, we focus on two key performance indicators (KPIs):

    • Time to First Token (TTFT): This measures the latency from when a user sends a prompt to when the very first piece (token) of the response is generated. A low TTFT is crucial for making the assistant feel responsive and not "stuck."
    • Inter-Token Latency (ITL): This measures the average time delay between each subsequent token in the response. A low ITL ensures the response is streamed smoothly and quickly, rather than appearing in slow, jerky bursts.

    To ensure a quality user experience, we must define service-level objectives (SLOs) for these KPIs. For the purpose of this guide, we'll establish the following SLOs:

    • 99th percentile TTFT: Fewer than 1,500 milliseconds.
    • 99th percentile ITL: Fewer than 200 milliseconds.

    Deployment overview

    This blog post guides you through the following high-level steps:

    1. Configure storage: We will deploy MinIO to create an on-cluster, S3-compatible bucket for our model files.
    2. Prepare the serving runtime: We will build a custom vLLM container image compatible with our GPU infrastructure.
    3. Deploy the model: Using the OpenShift AI interface, we will create the InferenceService, connecting our storage and runtime to launch a live API endpoint.
    4. Benchmark and validate: We will use the guidellm tool to test our service against the SLOs we defined, ensuring it's ready for integration.

    Successfully completing these steps will set the stage perfectly for the next blog post in this series, where we will connect this proven, performant inference service to the Ansible Lightspeed intelligent assistant and measure the end-to-end user experience.

    Before you begin

    We recommend having an OpenShift user with cluster-admin privileges. Refer to this article for more details.

    Serving a model

    This section walks through storage setup, serving runtime preparation, and deployment.

    Set up storage for the model with MinIO

    Setting up the storage for handling the LLM model files is required, as OpenShift doesn't support large-file storage. One option is to connect an external S3-compatible service such as MinIO or a public cloud S3 provider. I went with MinIO because it offers a practical way to set up, manage, and upload the model within the cluster.

    1. Deploy the MinIO instance in the given namespace using the following file: minio-setup.yaml

      Either clone the repo or copy the file to your local system.

    2. Check the current project in OpenShift using the following CLI command:

      oc project
    3. If needed, switch to the desired namespace using the following command:

      oc project <namespace-name>
    4. Run the following command to deploy a pod with the given configurations in the file:

      oc apply -f minio-setup.yaml
    5. Note the minio_root_user and minio_root_password and store them securely. The username and password are required to access the MinIO user dashboard.
    6. After setting up MinIO, you can discover routes to access it in the OpenShift dashboard.
      1. From the OpenShift homepage, navigate to Networking → Routes, which has routes to minio-api and minio-ui.
      2. Use the minio-ui route to access the Minio UI (Figure 1) and the minio_root_user and minio_root_password from step 5 to log in.
    Screenshot of the MinIO Object Browser showing a private models storage bucket containing files for the CodeLlama and granite models.
    Figure 1: The MinIO dashboard.

    With that, our storage for models is ready for use.

    Disabling SSL verification

    A possible obstacle is that if the MinIO Route is deployed with the default self-signed certificate, KServe (the underlying model serving platform) will fail to pull the model because it cannot verify the SSL certificate. The KServe InferenceService does not expose a simple verify_ssl: false option.

    As a workaround, you must manually modify the data connection secret to disable SSL verification. Follow these steps:

    1. Create the initial connection for MinIO in the OpenShift AI dashboard.
    2. Create the InferenceService that will use this connection. This triggers an operator to create a secret named storage-config in your project namespace.
    3. Follow the steps outlined in the Red Hat Knowledgebase article, Disable SSL verification for a data connection.This process involves adding an annotation to the secret to prevent the operator from overwriting your changes and then updating the secret's data to include the SSL verification flag.

    Note

    If you have multiple data connections, you might need to manually base64 decode the secret's contents, edit the specific connection, and re-encode it, as the sed command in the article might not work correctly for multiple entries.

    Upload the model to MinIO

    Before moving ahead with uploading the model, you will need to download it from the Hugging Face library. LLM model sizes are generally large, so you can't download them directly. To download the model, you can either enable Git LFS (Large File Storage) or use the Hugging Face CLI, which aids in downloading the large files. In this setup, I used Git with Git LFS enabled to download the Llama model.

    Once Git LFS is installed, downloading the model is just like cloning any repository:

    git clone https://huggingface.co/<organization>/<repository_name>
    Example: git clone https://huggingface.co/meta-llama/Llama-3.1-8B

    Upload the downloaded model to MinIO using the MinIO UI Upload button in the dashboard, shown in Figure 2.

    The Upload button pictured in the MinIO UI.
    Figure 2: MinIO upload option.

    Ensure you have GPU nodes on your cluster

    For the inference service, we need to make sure we have GPU nodes available on our cluster. Follow this guide to create a GPU-enabled node with OpenShift 4.2 in Amazon EC2.

    Install and configure the Node Feature Discovery operator

    Just adding a worker node with a GPU is not enough to use the GPU-enabled nodes for model serving. The Node Feature Discovery (NFD) operator in OpenShift automates the process of detecting hardware features and system configurations within an OpenShift cluster. Its primary function is to label nodes with hardware-specific information. It helps us label our GPU node and manage resources for workloads.

    To install the Node Feature Discovery operator, follow the Red Hat documentation guidance.

    To verify and configure NFD, follow this reference documentation.

    Install and configure the NVIDIA GPU operator

    The Node Feature Discovery operator helps properly discover and label the capabilities of nodes, but nodes still require additional configuration to be able to run GPU-accelerated workloads. This is where we use the NVIDIA GPU operator for OpenShift to automatically deploy, configure, and manage the NVIDIA software stack needed to run GPU-accelerated workloads.

    1. From the OpenShift homepage, go to Operators → OperatorHub. You can select the namespace where you want to deploy the GPU operator; we suggest nvidia-gpu-operator.
    2. Search for the NVIDIA GPU Operator. Select the operator and click Install.
    3. To verify and configure the NVIDIA GPU operator, follow the instructions in the reference documentation.

    Deploy and debug the InferenceService

    Once you've configured the storage and the Node Feature Discovery and NVIDIA GPU operators are in place, you can now deploy the model in a ServingRuntime on the GPU node we provisioned.

    1. Log in to the OpenShift AI dashboard as the new clusteradmin user (Figure 3).

      Screenshot of the Red Hat OpenShift AI Model deployments page, showing two models deployed: CodeLlama and x86-granite.
      Figure 3: OpenShift AI dashboard.
    2. From the left-hand menu, select Models → Model deployments (Figure 4).

      The Model deployments option shown in the OpenShift AI user interface.
      Figure 4: Navigate to the Model deployments overview in OpenShift AI.
    3. Select the project in which you want to deploy the model.
    4. Click Deploy model (Figure 5).

      The Deploy model button is shown at the top of the Model deployments page, with the lightspeed-aap project selected.
      Figure 5: Deploy model option available.
    5. Fill in the required details (Figure 6):

      1. Add the serving runtime as vLLM NVIDIA GPU ServingRuntime for KServe.
      2. In the Connection drop-down menu (Figure 7), locate the MinIO setup and select that.
      Configuring the model deployment properties, including name, serving runtime, and model framework.
      Figure 6: Configure the model deployment properties: name, serving runtime, model framework, and more.
      Connections drop-down menu showing MinIO setup selected.
      Figure 7: Select the minio connection when configuring the source model location.
    6. Click Deploy to create the InferenceService, pointing to the model artifacts in the MinIO storage.

    Create an external route

    To interact with the service, you need to create an external route.

    1. From the OpenShift, select Networking → Routes → Create Route.
    2. Select the Form view option and fill the details accordingly (Figure 8).
      1. Name: llama
      2. Service: llamas-predictor
      3. Target port: 80 → 8080 (TCP)
      4. Secure Route: Enabled
      5. TLS termination: Edge
      6. Insecure traffic: Redirect
    OpenShift web console Create Route form filled with name llama, service llamas-predictor, target port 8080 (TCP), and Secure Route enabled with Edge termination.
    Figure 8: Add the External Route details.

    Evaluating our inference server's performance

    GuideLLM is a tool for analyzing and evaluating LLM deployments. By simulating real-world inference workloads, GuideLLM enables users to assess the performance, resource requirements, and cost implications of deploying LLMs on various hardware configurations. It is Python-based and can be easily installed with pip:

    pip install guidellm

    GuideLLM offers a rich set of command-line options, enabling users to customize prompt and output lengths in terms of tokens for precise control over benchmarking scenarios. One particularly useful feature is its ability to begin with synchronous requests and gradually scale up to maximum concurrency. This helps you evaluate the inference server's performance under varying, stepwise throughput loads.

    Here is the command I used to kick off the benchmarking in my setup:

    guidellm benchmark run \
    --target "http://localhost:8000" \
    --backend-args '{"verify": false}' \
    --processor ./local_tokenizer \
    --data='{"prompt_tokens": 128, "output_tokens": 2000}' \
    --rate-type synchronous \
    --max-seconds 200

    Results

    We ran a few tests and found that the p99 (99th percentile) Time to First Token (TTFT) was fewer than 1,000 milliseconds and that the p99 of the Inter-Token Latency (ITL) was fewer than 52 milliseconds, which met our SLOs for these KPIs (fewer than 1,500 and 200 milliseconds, respectively). Here are the detailed results:

    "metadata_benchmark": {
    	"synchronous1": {
    		"reques_stats": {
    			"per_second": 0.13,
    "concurrency": 1,
    },
    "output_tokpersec": {
    	"mean": 17.2,
    },
    "tot_tokpersec": {
    	"mean": 34.6,
    },
    "req_latency": {
    	"mean": 7.41,
    	"median": 7.41,
    	"p99": 7.42,
    },
    "ttft_in_ms": {
    	"mean": 860,
    	"median": 866,
    	"p99": 909,
    },
    "itl_in_ms": {
    	"mean": 51.6,
    	"median": 51.8,
    	"p99": 51.8,
    },
    "tpot_in_ms": {
    	"mean": 51.2,
    	"median": 51.2,
    	"p99": 51.4,
    },
    	},
    	"synchronous2": {
    		"reques_stats": {
    			"per_second": 0.14,
    "concurrency": 1,
    },
    "output_tokpersec": {
    	"mean": 17.3,
    },
    "tot_tokpersec": {
    	"mean": 34.7,
    },
    "req_latency": {
    	"mean": 7.39,
    	"median": 7.4,
    	"p99": 7.4,
    },
    "ttft_in_ms": {
    	"mean": 837.7,
    	"median": 838.4,
    	"p99": 854.6,
    },
    "itl_in_ms": {
    	"mean": 51.6,
    	"median": 51.6,
    	"p99": 52,
    },
    "tpot_in_ms": {
    	"mean": 51.2,
    	"median": 51.2,
    	"p99": 51.6,
    },
    	},
    	"synchronous3": {
    		"reques_stats": {
    			"per_second": 0.13,
    "concurrency": 1,
    },
    "output_tokpersec": {
    	"mean": 17.2,
    },
    "tot_tokpersec": {
    	"mean": 34.6,
    },
    "req_latency": {
    	"mean": 7.41,
    	"median": 7.38,
    	"p99": 7.56,
    },
    "ttft_in_ms": {
    	"mean": 861.2,
    	"median": 842.2,
    	"p99": 999,
    },
    "itl_in_ms": {
    	"mean": 51.6,
    	"median": 51.5,
    	"p99": 51.9,
    },
    "tpot_in_ms": {
    	"mean": 51.2,
    	"median": 51.1,
    	"p99": 51.5,
    },
    	},
    }

    Next steps

    In this article, we demonstrated how to set up storage for large language models. We then used it to deploy an LLM with OpenShift AI to provide an inference service. Finally, we evaluated the inference service's performance in terms of our key performance indicators.

    Further resources:

    • How to run vLLM on CPUs with OpenShift for GPU-free inference
    • Red Hat course: Developing and Deploying AI/ML Applications on Red Hat OpenShift AI
    • How to enable Ansible Lightspeed intelligent assistant
    • NVIDIA GPU Operator documentation
    • How to deploy an LLM on RedHat OpenShift on GitHub

    Related Posts

    • Autoscaling vLLM with OpenShift AI

    • How to deploy language models with Red Hat OpenShift AI

    • Batch inference on OpenShift AI with Ray Data, vLLM, and CodeFlare

    • From raw data to model serving with OpenShift AI

    • How to run performance and scale validation for OpenShift AI

    • How to install single node OpenShift on bare metal

    Recent Posts

    • Debugging image mode with Red Hat OpenShift 4.20: A practical guide

    • EvalHub: Because "looks good to me" isn't a benchmark

    • SQL Server HA on RHEL: Meet Pacemaker HA Agent v2 (tech preview)

    • Deploy with confidence: Continuous integration and continuous delivery for agentic AI

    • Every layer counts: Defense in depth for AI agents with Red Hat AI

    What’s up next?

    Open source AI for developers introduces and covers key features of Red Hat OpenShift AI, including Jupyter Notebooks, PyTorch, and enhanced monitoring and observability tools, along with MLOps and continuous integration/continuous deployment (CI/CD) workflows.

    Get the e-book
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.