Breadcrumb

  1. Red Hat Interactive Learning Portal
  2. How to learn AI with Red Hat
  3. Get started with consuming GPU-hosted large language models on Developer Sandbox
  4. Interact with the hosted models using curl

Get started with consuming GPU-hosted large language models on Developer Sandbox

Learn the many ways you can interact with GPU-hosted large language models (LLMs) on Developer Sandbox, including connecting the model endpoints, interacting with the API endpoints using the hosted Red Hat OpenShift AI component, and standing up a web-based chat user interface.

Access the Developer Sandbox

In this lesson, we will make a connection to the hosted models and then prompt them to return some generated text. We will try this using all three of the available inference servers to demonstrate the models' varied behaviors.

We will use a small shell script to get the actual models hosted by each service. The services provide an API endpoint that mimics the OpenAI public endpoint. This design ensures seamless compatibility for clients configured to work with that specific API schema.

Prerequisites:

  • Access to the Developer Sandbox (A free trial is available).

In this lesson, you will:

  • Connect directly to the hosted models and perform a call to get a response.

Command the model

Follow these steps to command the model.

  1. Go back to the terminal as described in the previous lesson. If it has disconnected, click Reconnect in the terminal to reestablish the connection.

  2. Once in the terminal, type the following:

    git clone https://github.com/utherp0/sandboxgpullm
    cd sandboxgpullm
    ./getModels.sh
  3. The output will list the URL endpoint for the services and the model(s) it hosts (Figure 1).

    A terminal showing the models hosted by the inference services
    Figure 1: Models hosted by the inference services.
    Figure 1: This output shows the models hosted by the inference services.
  4. If you’re curious, here is the script to achieve this:

    for hostService in $(oc get isvc -n sandbox-shared-models -o jsonpath='{.items[*].status.address.url}');
    do echo $hostService
       curl -skL -H "Authorization: Bearer "$(oc whoami -t) $hostService/v1/models | jq -r '.data[].id';
       printf "\n\r"
    done

    Now we will try using the models. The curl command may be a little fiddly to type, especially given JSON’s formatting, so it is available in the GitHub repository you have already pulled down in the previous steps.

  5. Go back to the terminal and reconnect it if necessary. Make sure you are still in the directory from the previous command (sandboxgpullm). 

  6. Now type the following:

    ./testModel.sh | jq .choices[0].message.content
  7. The jq command, used in the previous example as well, gives you a nicely formatted version of the JSON returned. It should look something like Figure 2:

    A terminal readout showing a response from the Granite model.
    Figure 2: A response from the Granite model.
    Figure 2: This shows the response from the Granite model.
  8. The script looks like this:

    curl -kL -H "Authorization: Bearer "$(oc whoami -t) -H "Content-type: application/json" https://isvc-granite-31-8b-fp8-predictor.sandbox-shared-models.svc.cluster.local:8443/v1/chat/completions -d '{"model":"isvc-granite-31-8b-fp8", "messages": 
    [{"role":"system", "content":"You are an assistant that speaks in polite english"}, {"role":"user", "content":"What kind of model are you?"}]}'
    In this instance we have called the /chat/completions API endpoint, which takes an array of messages. We have passed in a system definition to tune the response (“You are an assistant that speaks in polite English”) and a quick prompt (“What kind of model are you?”).
  9. Now, repeat the query to all of the models. There is another script in the repository to do that. Type the following:

    ./testAllModels.sh
  10. If you’re curious, the script looks like this:

    curl -kL -H "Authorization: Bearer "$(oc whoami -t) -H "Content-type: application/json" https://isvc-granite-31-8b-fp8-predictor.sandbox-shared-models.svc.cluster.local:8443/v1/chat/completions -d '{"model":"isvc-granite-31-8b-fp8", "messages": [{"role":"system", "content":"You are an assistant that speaks in polite english"}, {"role":"user", "content":"What kind of model are you?"}]}' | jq .choices[0].message.content
    
    curl -kL -H "Authorization: Bearer "$(oc whoami -t) -H "Content-type: application/json" https://isvc-nemotron-nano-9b-v2-fp8-predictor.sandbox-shared-models.svc.cluster.local:8443/v1/chat/completions -d '{"model":"isvc-nemotron-nano-9b-v2-fp8", "messages": [{"role":"system", "content":"You are an assistant that speaks in polite english"}, {"role":"user", "content":"What kind of model are you?"}]}' | jq .choices[0].message.content
    
    curl -kL -H "Authorization: Bearer "$(oc whoami -t) -H "Content-type: application/json" https://isvc-qwen3-8b-fp8-predictor.sandbox-shared-models.svc.cluster.local:8443/v1/chat/completions -d '{"model":"isvc-qwen3-8b-fp8", "messages": [{"role":"system", "content":"You are an assistant that speaks in polite english"}, {"role":"user", "content":"What kind of model are you?"}]}' | jq .choices[0].message.content
    

    This script has the jq already embedded for ease of use.

We have now directly interacted with all three of the GPU-hosted LLMs successfully, but curl isn’t the easiest way to do it. Next, we will use the Red Hat OpenShift AI components to write Python code directly against the models.

Previous resource
Log in and examine the models
Next resource
Interact with the GPU-hosted models using Jupyter notebook and langchain