Use RHEL AI to interact with your model

Now that the reference models have been successfully downloaded, let’s serve one of the models so that it can be used for tasks such as alignment, inference, or chat.
Prerequisites:
- RHEL AI with the bootable container image on compatible hardware. Get the RHEL AI image here.
- A Red Hat registry account on your machine. Create your Red Hat Registry account here.
- Root user access on your machine.
In this learning path, you will:
- Serve a model.
- Serve a model as a service on RHEL AI.
- Chat with the model.
Serve and interact with the model
Note: This step assumes you downloaded the model to the default location used in the previous instructions. If you downloaded it to a different directory, use that directory as the model path.
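If you want to confirm that the model files are in place before serving, you can list the directory first (the path shown assumes the default download location from the previous step):
ls /var/home/instruct/temp/granite-7b-lab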
To start the model serving process, run:
ilab model serve --model-path /var/home/instruct/temp/granite-7b-lab
You should see a log message saying:
vLLM starting up on pid 34 at http://127.0.0.1:8000/v1
This indicates that the model is being served locally using the vLLM backend, and is accessible at http://127.0.0.1:8000/v1.
Depending on the memory capacity of your GPUs, you may see some ValueErrors about swap space sizes; these can be safely ignored.
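Because vLLM exposes an OpenAI-compatible API, you can optionally confirm the endpoint is responding from another terminal on the RHEL AI machine. This is a minimal check rather than part of the required steps; the model name it returns should match the path you served:
curl http://127.0.0.1:8000/v1/models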
Chat with the model
To ensure this example works with any GPU configuration, we will scale back the use of parallel GPUs so that the exercise can run on a single GPU.
To do this in the console of the RHEL AI machine, type:
vi /var/home/instruct/.config/instructlab/config.yaml
Note: In the editor, find every instance of vllm (there will be one per model added) and change each gpus: 8 field to gpus: 1. Also, if any of the vllm entries have a vllm_args: section containing --tensor-parallel-size, set its value to 1. This disables parallelism in the vLLM engine. If you know how many GPUs you have access to, you can set these values to that count instead. Here's an example:
vllm:
  # Number of GPUs to use.
  # Default: None
  gpus: 1
  # Large Language Model Family
  # Default: ''
  # Examples:
  #   - granite
  #   - mixtral
  llm_family: ''
  # Maximum number of attempts to start the vLLM server.
  # Default: 120
  max_startup_attempts: 120
  # vLLM specific arguments. All settings can be passed as a list of strings, see:
  # https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
  # Default: []
  # Examples:
  #   - ['--dtype', 'auto']
  #   - ['--lora-alpha', '32']
  vllm_args:
    - --tensor-parallel-size
    - '1'
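If you are unsure how many GPUs your machine has, one way to check (assuming NVIDIA GPUs with the driver installed) is:
nvidia-smi -L
Each line of output corresponds to one GPU you can count toward the gpus: and --tensor-parallel-size values.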
For this step, we need two console sessions connected to the machine. In the original console (the one we have been using), run:
ilab model serve --model-path /var/home/instruct/temp/granite-7b-lab
Now open another terminal window (or a new tab) and SSH into the RHEL AI instance from it.
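The SSH command will look something like the following; the user name is assumed from the /var/home/instruct home directory shown above, and the hostname is a placeholder you should replace with your own instance's address:
ssh instruct@<your-rhel-ai-host>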
With the service running in the background, you can now interact with the model using the chat interface:
ilab model chat --model /var/home/instruct/temp/granite-7b-lab
In the chat window, type:
tell me about Paris, France
Exit the chat interface and stop the model
Once you've finished interacting with the InstructLab chat model, it's important to exit the chat interface and properly stop the model serving service. This is especially important if you're using cloud resources, because leaving workloads running unnecessarily can lead to unwanted billing for GPU usage.
To exit the chat interface safely, type exit:
exit
After exiting the chat, it's good practice to stop the model serving service to conserve resources. Switch to the other terminal and press Ctrl-C to stop the model serving process.
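If you want to verify that the server is no longer listening, you can check port 8000; this is an optional check using the ss utility included with RHEL:
ss -ltn | grep 8000 || echo "vLLM is no longer listening on port 8000"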
By following these steps, you can effectively manage your resources and avoid incurring additional costs associated with running GPU workloads in the cloud. Properly exiting chat sessions and stopping background services is a key practice in maintaining efficient operations in AI model serving environments.
Summary
Congratulations on successfully setting up and interacting with the InstructLab chat model using RHEL AI! You have achieved several key milestones:
- RHEL AI environment setup: You've learned how to configure Red Hat Registry access and Red Hat Insights.
- Model Download: You successfully downloaded the foundational model from the Red Hat Registry and prepared it for use.
- Model Serving: You configured the model to run as a background service, allowing for easy access without needing multiple SSH sessions.
- Chat with the model: You were able to send messages to the LLM and get responses.
- Exit the environment: You learned the importance of exiting the chat interface and stopping the model serving service to manage costs effectively, especially when using cloud resources.
Ready to learn more?
For more information and to explore the full capabilities of RHEL AI, please visit the RHEL AI product page. Thank you for your engagement, and keep up the great work on your AI journey!