Introducing Models-as-a-Service in OpenShift AI

November 25, 2025
Dmytro Zaharnytskyi
Related topics:
Artificial intelligence
Related products:
Red Hat AI, Red Hat OpenShift AI

    This article explains how to deploy and manage Models-as-a-Service (MaaS) on Red Hat OpenShift, now available in developer preview. We'll begin by discussing the benefits of MaaS, highlighting how it enables organizations to share AI models at scale. Then, we'll guide you through the process of setting it up on OpenShift, deploying a sample model, and demonstrating how rate limiting protects your resources.

    What is Models-as-a-Service (MaaS)?

    With Models-as-a-Service (MaaS), you can deliver AI models as shared resources that users within an organization can access on demand. MaaS provides a ready-to-go AI foundation with standardized API endpoints, enabling organizations to share private AI models at scale while keeping access fast and controlled.

    Red Hat OpenShift AI already lets you serve AI models and expose them through APIs for sharing. When sharing models with a large user base, though, it can be hard to maintain quality of service without limiting excessive usage. OpenShift AI 3 introduces the Models-as-a-Service pattern, built on Red Hat Connectivity Link, which gives OpenShift AI administrators finer control over model access and rate limiting.

    Quick setup

    Let's prepare your environment for the Models-as-a-Service deployment.

    Prerequisites

    Ensure you have the following components available:

    • An OpenShift cluster (4.19.9 or later)
    • Red Hat OpenShift AI Operator 3
    • Red Hat Connectivity Link 1.2
    • CLI tools: oc, kubectl, jq, kustomize (see the quick check after this list)
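
    Before running the deployment, it can help to confirm the CLI tools are installed and that your session is active. A minimal sanity check:

    # Verify each required CLI tool is on PATH
    for tool in oc kubectl jq kustomize; do
      command -v "$tool" >/dev/null || echo "missing: $tool"
    done

    # Confirm your OpenShift login is still valid
    oc whoami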

    Deploy the MaaS infrastructure

    You can deploy the entire platform with a single script. Run the following commands while logged into your OpenShift cluster as a cluster administrator:

    curl -sSLo deploy-rhoai-stable.sh \
      https://raw.githubusercontent.com/opendatahub-io/maas-billing/refs/tags/0.0.1/deployment/scripts/deploy-rhoai-stable.sh
    chmod +x deploy-rhoai-stable.sh

    MAAS_REF="0.0.1" ./deploy-rhoai-stable.sh

    The deployment script creates a new Gateway object named maas-default-gateway, which serves as the ingress point for the MaaS system.

    oc describe Gateway maas-default-gateway -n openshift-ingress # View Gateway Info
    oc get Gateway maas-default-gateway -n openshift-ingress -o jsonpath='{.spec.listeners[0].hostname}' # Gateway's Hostname
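
    If you want to reuse the gateway's hostname in later commands, you can capture it in a shell variable (reusing the same jsonpath query as above):

    GATEWAY_HOST=$(oc get Gateway maas-default-gateway -n openshift-ingress -o jsonpath='{.spec.listeners[0].hostname}')
    echo "MaaS gateway: ${GATEWAY_HOST}"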

    You can find more information in the MaaS architecture documentation.

    Deploy a sample model and test rate limiting

    Now let's deploy a lightweight GPU model and demonstrate how MaaS enforces rate limits.

    Deploy the IBM Granite model

    Enter the following to start the deployment:

    # Deploy the sample model and immediately watch the pod status
    kustomize build \
      "https://github.com/opendatahub-io/maas-billing//docs/samples/models/ibm-granite-2b-gpu" \
      | kubectl apply -f - && kubectl get pods -n llm -w

    This model is MaaS-enabled through its Gateway reference to maas-default-gateway. You can find more information about this configuration in the model setup documentation.
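
    To see how the model is wired to the gateway, you can inspect the routes in the model's namespace. This is a sketch that assumes the sample manifests create a Gateway API HTTPRoute in the llm namespace; adjust the resource names to whatever the manifests actually produce:

    # Show which gateway each route in the llm namespace attaches to
    kubectl get httproute -n llm \
      -o custom-columns='NAME:.metadata.name,GATEWAY:.spec.parentRefs[*].name'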

    Retrieve access token

    Create an access token for authentication:

    CLUSTER_DOMAIN=$(kubectl get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}')
    
    TOKEN=$(curl -sSk -X POST "https://maas.${CLUSTER_DOMAIN}/maas-api/v1/tokens" \
      -H "Authorization: Bearer $(oc whoami -t)" \
      -H "Content-Type: application/json" \
      -d '{"expiration": "10m"}' | jq -r '.token')
    echo "Token: ${TOKEN:0:50}..."

    Call the model

    Make a simple inference request to validate authentication:

    # List available models
    curl -sSk "https://maas.${CLUSTER_DOMAIN}/maas-api/v1/models" \
      -H "Authorization: Bearer $TOKEN" | jq
    
    # Send an inference request
    curl -sSk -X POST "https://maas.${CLUSTER_DOMAIN}/llm/ibm-granite-2b-gpu/v1/chat/completions" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ibm-granite/granite-3.1-2b-instruct",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
        "max_tokens": 50
      }' | jq
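
    If you only want the generated text rather than the full JSON payload, you can filter the response with jq. This assumes the endpoint returns the OpenAI-style chat completions schema that its path implies:

    # Extract just the assistant's reply
    curl -sSk -X POST "https://maas.${CLUSTER_DOMAIN}/llm/ibm-granite-2b-gpu/v1/chat/completions" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"model": "ibm-granite/granite-3.1-2b-instruct", "messages": [{"role": "user", "content": "Hello! What is your name?"}], "max_tokens": 50}' \
      | jq -r '.choices[0].message.content'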

    View tier information

    Tier information for the Developer Preview is stored in a ConfigMap within the maas-api namespace. You can view this configuration by using the following command. Additional details are available in the tier overview documentation.

    oc describe cm tier-to-group-mapping -n maas-api

    User groups are mapped to tiers. By default, the free tier includes the system:authenticated group, which is automatically granted to all authenticated users. This means your current user is assigned to the free tier.
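
    To see the raw mapping rather than the describe output, you can dump the ConfigMap data as JSON (the key names under .data depend on the Developer Preview release, so inspect them before scripting against them):

    oc get cm tier-to-group-mapping -n maas-api -o json | jq '.data'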

    Experience rate limiting

    You can view the free tier rate limits by using the following commands:

    oc get TokenRateLimitPolicy gateway-token-rate-limits -n openshift-ingress -o jsonpath='{.spec.limits.free-user-tokens.rates}'
    oc get RateLimitPolicy gateway-rate-limits -n openshift-ingress -o jsonpath='{.spec.limits.free.rates}'

    We can see the rate limits are:

    • 5 requests per 2 minutes (request-based limit)
    • 100 tokens per minute (token-based limit)
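
    For a fuller view than the jsonpath one-liners above, you can pretty-print the entire limits section of each policy:

    oc get TokenRateLimitPolicy gateway-token-rate-limits -n openshift-ingress -o json | jq '.spec.limits'
    oc get RateLimitPolicy gateway-rate-limits -n openshift-ingress -o json | jq '.spec.limits'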

    To demonstrate rate limiting in action, execute the following commands to exceed the request limit:

    # Send 10 rapid requests (free tier allows only 5 per 2 minutes)
    for i in {1..10}; do
      HTTP_CODE=$(curl -sSk -o /dev/null -w "%{http_code}" -X POST \
        "https://maas.${CLUSTER_DOMAIN}/llm/ibm-granite-2b-gpu/v1/chat/completions" \
        -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"model": "ibm-granite/granite-3.1-2b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}')
      
      echo "Request $i: HTTP $HTTP_CODE (should be 200 for first 5)"
    done

    Expected output:

    Request 1: HTTP 200
    Request 2: HTTP 200
    Request 3: HTTP 200
    Request 4: HTTP 200
    Request 5: HTTP 429 ← Rate limited!
    Request 6: HTTP 429
    Request 7: HTTP 429
    Request 8: HTTP 429
    Request 9: HTTP 429
    Request 10: HTTP 429

    The HTTP 429 response indicates that the rate limit has been reached. Note that rate limiting is based on the total number of tokens reported by the LLM, so the number of successful requests might vary depending on response token counts. The quota resets after 2 minutes, demonstrating the fair usage controls in action.
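
    Because the quota resets after 2 minutes, a simple way to confirm the reset is to wait out the window and retry. This is a minimal sketch; a real client should back off based on whatever retry guidance the gateway returns rather than hard-coding the wait:

    # Wait for the 2-minute window to reset, then retry once
    sleep 120
    curl -sSk -o /dev/null -w "After reset: HTTP %{http_code}\n" -X POST \
      "https://maas.${CLUSTER_DOMAIN}/llm/ibm-granite-2b-gpu/v1/chat/completions" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"model": "ibm-granite/granite-3.1-2b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 10}'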

    Test a different tier

    To create and test an additional tier, follow this example:

    # Edit the tier configuration to match your organization's needs
    # (for example, map a premium tier to the premium-group created below):
    kubectl edit configmap tier-to-group-mapping -n maas-api
    
    # Create the premium group (suppress the error if it already exists)
    oc adm groups new premium-group 2>/dev/null
    
    # Add the current user to the premium group
    CURRENT_USER=$(oc whoami)
    oc adm groups add-users premium-group $CURRENT_USER
    
    # Verify membership
    oc get group premium-group
    
    # Request a fresh token so it is issued with the new group membership
    CLUSTER_DOMAIN=$(kubectl get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}')
    
    TOKEN=$(curl -sSk -X POST "https://maas.${CLUSTER_DOMAIN}/maas-api/v1/tokens" \
      -H "Authorization: Bearer $(oc whoami -t)" \
      -H "Content-Type: application/json" \
      -d '{"expiration": "10m"}' | jq -r '.token')
    
    # Test PREMIUM tier (20 requests allowed)
    echo "Testing PREMIUM tier (20 requests per 2 minutes):"
    for i in {1..25}; do
      HTTP_CODE=$(curl -sSk -o /dev/null -w "%{http_code}" -X POST \
        "https://maas.${CLUSTER_DOMAIN}/llm/ibm-granite-2b-gpu/v1/chat/completions" \
        -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"model": "ibm-granite/granite-3.1-2b-instruct", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 5}')
      
      if [ "$HTTP_CODE" = "429" ]; then
        echo "Request $i: HTTP $HTTP_CODE ❌ (Rate limit hit)"
        break
      else
        echo "Request $i: HTTP $HTTP_CODE ✅"
      fi
    done

    Available UI

    For convenience, you can also deploy a model using the MaaS user interface (UI). For more information on how to enable the UI, refer to the official repo.

    First, create a project. Inside it, you will find a Deployments tab (Figure 1), where you can deploy your model by specifying the model location, model type, and other deployment details.

    Figure 1: Deploying a model from within the MaaS UI.

    Note that MaaS models can be deployed only on the distributed runtime. Select Distributed Inference Server with llm-d from the Serving runtime drop-down (Figure 2).

    Figure 2: Selecting a serving runtime in the Model deployment interface.

    What's next?

    You now have a working MaaS deployment with a sample model under governance. Here are resources to explore next:

    • Customize Tiers and Limits
    • Enable Model-Specific Access Control

    We've also set up observability on the cluster for you. You can build your own dashboards in Grafana and connect them to the metrics in Prometheus, or use our default dashboard shown in Figure 3.

    Figure 3: The MaaS observability dashboard, showing which users called the model and how many tokens they consumed.

    We encourage you to try the developer preview version of Models-as-a-Service and give us your feedback (refer to the Contributing section in our repo).

    For detailed documentation, visit the MaaS community documentation.
