Download, serve, and interact with LLMs on RHEL AI

Configure your RHEL AI machine, download, serve, and interact with large language models (LLM) using RHEL AI and InstructLab, and discover how developers can benefit from AI models tailored to their needs.

Serve and chat with a model on RHEL AI

Now that the reference models have been successfully downloaded, let’s serve one of the models so that it can be used for tasks such as alignment, inference, or chat.

Prerequisites:

  • You have completed the previous resource, Configure RHEL AI and download a foundational model, and the reference models (including granite-7b-lab) are downloaded on your RHEL AI machine.

In this learning path, you will:

  • Serve a model.
  • Serve a model as a service on RHEL AI.
  • Chat with the model.

Serve and interact with the model

Note: This step assumes you downloaded the granite-7b-lab model to the default location used in the previous instructions. If you downloaded it to a different directory, use that directory as the --model-path value instead.
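To confirm the model files are where you expect before serving, you can list the download directory (adjust the path if you used a different location):

ls /var/home/instruct/temp/granite-7b-lab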

To start the model serving process, run:

ilab model serve --model-path /var/home/instruct/temp/granite-7b-lab

You should see a log message saying:

vLLM starting up on pid 34 at http://127.0.0.1:8000/v1

This indicates that the model is being served locally using the vLLM backend, and is accessible at http://127.0.0.1:8000/v1.
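If you want to verify the endpoint from a second SSH session, the vLLM backend exposes an OpenAI-compatible API, so a simple request should list the served model (this assumes curl is available on the instance):

curl http://127.0.0.1:8000/v1/models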

Depending on the memory capacity of your GPUs, you may see some ValueErrors about swap space sizes; these can be safely ignored.

Chat with the model

So that this exercise can be run with any GPU configuration, we are going to disable the use of parallel GPUs.

To do this in the console of the RHEL AI machine, type:

vi /var/home/instruct/.config/instructlab/config.yaml

Note: In the editor, find every instance of vllm (there will be one per model added) and change each gpus: 8 field to gpus: 1. Also, if any of the vllm entries have a vllm_args section with --tensor-parallel-size, set that value to 1. This removes any parallelism from the vLLM engine. If you know how many GPUs you have access to, you can set these values to that GPU count instead. Here's an example:

vllm:
    # Number of GPUs to use.
    # Default: None
    gpus: 1
    # Large Language Model Family
    # Default: ''
    # Examples:
    #   - granite
    #   - mixtral
    llm_family: ''
    # Maximum number of attempts to start the vLLM server.
    # Default: 120
    max_startup_attempts: 120
    # vLLM specific arguments. All settings can be passed as a list of strings, see:
    # https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
    # Default: []
    # Examples:
    #   - ['--dtype', 'auto']
    #   - ['--lora-alpha', '32']
    vllm_args:
      - --tensor-parallel-size
      - '1'
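A quick way to confirm that every vllm entry was updated is to search the configuration file for the relevant keys, for example:

grep -nE -A 1 'gpus:|tensor-parallel-size' /var/home/instruct/.config/instructlab/config.yaml

Each gpus: line, and the line following each --tensor-parallel-size entry, should show the value you chose.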

For this step, we will need two console sessions open on the RHEL AI machine. In the original console (the one we have been using), type:

ilab model serve --model-path /var/home/instruct/temp/granite-7b-lab

Now, open another terminal window (or a new tab in your existing terminal) and, from it, SSH into the RHEL AI instance.
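For example (the key file, user name, and host below are placeholders; substitute the values you used when you first connected to the instance):

ssh -i <path-to-your-private-key> <user>@<rhel-ai-instance-address>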

With the service running in the background, you can now interact with the model using the chat interface:

ilab model chat --model /var/home/instruct/temp/granite-7b-lab

In the chat window, type:

tell me about Paris, France
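If you prefer to send the same question over the OpenAI-compatible API instead of the interactive chat, you can issue a request against the endpoint from the other session. The model name below assumes vLLM registers the model under the path it was served from; you can confirm the exact name with a request to /v1/models:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/var/home/instruct/temp/granite-7b-lab",
        "messages": [{"role": "user", "content": "tell me about Paris, France"}]
      }'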

Exit the chat interface and stop the model

Once you’ve finished interacting with the InstructLab chat model, it’s important to exit the chat interface and stop the model serving process properly. This is especially important if you’re using cloud resources, because leaving workloads running unnecessarily can lead to unwanted billing for GPU usage.

To exit the chat interface safely, type exit:

exit

After exiting the chat, it’s good practice to stop the model serving process to conserve resources. Simply switch to the other terminal and press Ctrl+C to stop the model.
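If your instance uses NVIDIA GPUs, you can also confirm that GPU memory has been released after stopping the server (assuming the nvidia-smi utility is available, as it typically is on GPU-enabled RHEL AI instances):

nvidia-smi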

By following these steps, you can effectively manage your resources and avoid incurring additional costs associated with running GPU workloads in the cloud. Properly exiting chat sessions and stopping background services is a key practice in maintaining efficient operations in AI model serving environments.

Summary

Congratulations on successfully setting up and interacting with the InstructLab chat model using RHEL AI! You have achieved several key milestones:

  • RHEL AI environment setup: You learned how to configure Red Hat Registry access and Red Hat Insights.
  • Model download: You downloaded the foundational model from the Red Hat Registry and prepared it for use.
  • Model serving: You served the model locally with the vLLM backend, making it accessible for chat and API requests.
  • Chat with the model: You sent messages to the LLM and received responses.
  • Exit the environment: You learned the importance of exiting the chat interface and stopping the model serving process to manage costs effectively, especially when using cloud resources.

Ready to learn more?

For more information and to explore the full capabilities of RHEL AI, please visit the RHEL AI product page. Thank you for your engagement, and keep up the great work on your AI journey!
