Overview: Get started with consuming GPU-hosted large language models on Developer Sandbox
Everyone is talking about artificial intelligence (AI) at the moment, and while AI has been around for a long time in the form of predictive AI, the current interest is in generative AI and a technology called the large language model (LLM).
Put simply, an LLM contains billions or trillions of vectors that encode token position probability, token importance, and relevance. In plain language, that means an LLM contains a huge amount of positional information that it uses probabilistically to predict token order from a tokenized prompt.
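The prediction step above can be sketched in a few lines of Python. This is a drastically simplified toy, not how an LLM is implemented: the hand-written probability table and greedy next-token picker stand in for billions of learned parameters and a real decoding strategy.

```python
# Toy next-token probability table. A real LLM derives these probabilities
# from billions of learned parameters; this hand-written dictionary is only
# an illustration of the idea.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"on": 0.9, "down": 0.1},
}

def predict_next(token: str) -> str:
    """Greedily pick the most probable next token, or <eos> if unknown."""
    candidates = NEXT_TOKEN_PROBS.get(token)
    return max(candidates, key=candidates.get) if candidates else "<eos>"

# Generate a short continuation from a one-token "prompt".
sequence = ["the"]
while sequence[-1] != "<eos>" and len(sequence) < 6:
    sequence.append(predict_next(sequence[-1]))
print(" ".join(sequence))  # -> the cat sat on <eos>
```

Real models apply this idea over an entire tokenized prompt at once, which is exactly the vector arithmetic that benefits from acceleration.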
The sheer amount of vector data, trained using huge data sets, gives an LLM the ability to produce contextual and human-like responses. This is a great benefit to a large population for whom dealing with data, primarily through search engines and websites, has been a cold and machine-like experience.
The downside of LLMs is that the sheer volume of data that must be searched and multiplied means that ordinary CPUs, powerful as they are, can slow the process down. CPUs are general-purpose, multi-tasking units that handle registers, memory, and other atomic operations for these models.
GPUs were originally designed for pure graphics processing, engineered around the fast processing of vectors and parallel vector arithmetic. As far back as the late 1990s, organizations dealing with huge datasets were turning to graphics processors to crunch them, with serious increases in performance.
In recent years, some hardware providers have invested in special-purpose AI accelerator chips rather than the more general-purpose GPUs that originally formed the basis of accelerating this type of workload. Many popular libraries and frameworks for working with AI and machine learning have adopted a modular approach to support different GPU and accelerator architectures. These new AI accelerator chips are designed specifically to perform the arithmetic and neural network operations needed by LLMs. Over time, we will probably see a shift from GPUs to these NPUs. For this exercise, we will look at the use of GPUs, specifically those hosted on the Developer Sandbox.
To help developers and interested parties quickly understand how to interact with these large language models (LLMs), Red Hat has developed and shipped the Red Hat AI portfolio. This portfolio includes Red Hat OpenShift AI, Red Hat Enterprise Linux AI, and the Red Hat Inference Server. Red Hat has invested in GPU-enabled worker nodes for the Developer Sandbox to provide users with access to these technologies and fast GPUs. Several large language models have been directly hosted on these nodes, allowing users to consume and test them.
This learning path will walk you through several different ways to interact directly with these hosted models. Using your free Developer Sandbox account, you will connect directly to the model endpoints (via curl), write some quick Python code to interact with the API endpoints using the hosted OpenShift AI component, and finally, stand up a web-based chat user interface that will allow you to compare and contrast the responses from the three currently hosted models.
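The curl step might look roughly like the sketch below. The route URL, model name, and token are hypothetical placeholders (substitute the values from your own Developer Sandbox model deployment), and the OpenAI-style completions request is an assumption about the hosted serving runtime.

```shell
# Hypothetical endpoint and token -- substitute the values from your own
# Developer Sandbox model deployment.
MODEL_URL="${MODEL_URL:-https://<your-model-route>/v1/completions}"
MODEL_TOKEN="${MODEL_TOKEN:-<your-token>}"

# Assumption: the hosted serving runtime exposes an OpenAI-compatible
# completions endpoint.
curl -s -X POST "$MODEL_URL" \
  -H "Authorization: Bearer $MODEL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "<hosted-model-name>", "prompt": "What is an LLM?", "max_tokens": 100}' \
  || echo "Set MODEL_URL and MODEL_TOKEN to your deployment values first."
```

The learning path modules show the exact route and token for each hosted model; this sketch only illustrates the shape of the request.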
There will be another learning path in the near future that walks you through deploying models on a CPU-based worker node in OpenShift to compare against the performance of those hosted on an accelerator-equipped worker node. This hands-on demonstration will quickly get you started interacting with and developing against a GPU-hosted model.
Prerequisites:
- Access to the Developer Sandbox (A free trial is available.)
In this learning path, you will:
- Connect directly to the model endpoints with curl.
- Use Python code to interact with the API endpoints using the hosted OpenShift AI component.
- Stand up a web-based chat user interface that will allow you to compare and contrast the responses from the three currently hosted models.
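The Python step could be sketched along these lines using only the standard library. The endpoint URL, model name, and token are hypothetical placeholders, and the OpenAI-style payload is an assumption about the serving runtime; the learning path modules will show the exact values for the hosted models.

```python
import json
import os
import urllib.request

# Hypothetical placeholders -- substitute the route URL, model name, and
# token from your own Developer Sandbox deployment.
API_URL = os.environ.get("MODEL_API_URL", "https://<your-model-route>/v1/completions")
API_TOKEN = os.environ.get("MODEL_API_TOKEN", "<your-token>")

# Assumption: the hosted serving runtime accepts OpenAI-style requests.
payload = {
    "model": "<hosted-model-name>",
    "prompt": "Explain GPUs versus CPUs for inference in one sentence.",
    "max_tokens": 100,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    },
    method="POST",
)

# Only send the request when a real endpoint has been configured.
if "MODEL_API_URL" in os.environ:
    with urllib.request.urlopen(request) as response:
        print(json.load(response)["choices"][0]["text"])
```

Using the standard library keeps the sketch dependency-free; in the learning path itself you may prefer a client library once the endpoint details are known.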