How to get started with large language models and Node.js

Learn how to access a large language model using Node.js and LangChain.js. You’ll also explore LangChain.js APIs that simplify common requirements like retrieval-augmented generation (RAG).

In the previous lesson, we interacted with a model running locally. While this worked, it was probably slower than you want in your inner development loop. In this lesson, we will look at how to speed up a locally running model by leveraging a graphics processing unit (GPU).

If you are on a newer ARM-based macOS machine, good news: GPU acceleration is already enabled, and you can skip ahead to Lesson 3.

To get the full benefit from this lesson, you need to:

  • Have an NVIDIA GPU.
  • Install the NVIDIA CUDA Toolkit and a C/C++ compiler for your system.

In this lesson, you will:

  • Install the NVIDIA CUDA Toolkit.
  • Install the C/C++ compiler for your platform.
  • Recompile node-llama-cpp to enable GPU acceleration.
  • Run the example from the previous lesson, sending questions to the running model, and see that it now executes much faster.

Set up the environment

  1. First, install the CUDA Toolkit (version 12.x or later).
  2. Next, install the C/C++ compiler for your platform, including support for CMake and CMake.js. For Windows, that would be the Microsoft C++ Build Tools, and for Linux, Clang or GCC, along with Ninja and Make. More detailed instructions are available in the requirements section of the cmake-js README.
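Before recompiling, it can save time to confirm that the toolchain is actually on your PATH. The following is an optional sanity-check sketch; the file name check-toolchain.mjs is just a suggestion, and nvcc ships with the CUDA Toolkit. Run it with node check-toolchain.mjs, and if either tool is missing, revisit the install steps above.

    // check-toolchain.mjs: optional sanity check (sketch).
    // Prints the first line of each tool's version output, or a
    // warning if the tool is not on the PATH.
    import { execSync } from "node:child_process";

    for (const cmd of ["nvcc --version", "cmake --version"]) {
      try {
        console.log(execSync(cmd, { encoding: "utf8" }).split("\n")[0]);
      } catch {
        console.error(`Not found: ${cmd.split(" ")[0]}. Check your install.`);
      }
    }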

Recompile node-llama-cpp

  1. To recompile node-llama-cpp, run the following command in the lesson-1-2 directory from the last lesson:

    npx --no node-llama-cpp download --cuda
  2. This will rebuild node-llama-cpp with CUDA enabled. It might take a few minutes, and you should see the compilation output as it progresses. The default install avoids this compilation step because node-llama-cpp ships pre-built binaries, built with node-addon-api, for Linux, macOS, and Windows. Our compilation finished with:

    √ Compiled llama.cpp
    
    Repo: ggerganov/llama.cpp
    Release: b2249
    Done
  3. If the compilation fails, you should double-check your compiler and CUDA toolkit install. If the compilation has trouble finding the NVIDIA toolkit, we recommend that you restart your machine; this resolved the problem for us.

Profit from GPU acceleration

  1. Look at the langchainjs-basic-gpu.mjs example, in which the only modification from our first example is the addition of the gpuLayers option when creating the model (a fuller sketch of the file appears after these steps):

    const model = await new LlamaCpp({
      modelPath: modelPath,
      gpuLayers: 64,
    });
  2. Now that we have a CUDA-accelerated version of node-llama-cpp, run the same example as before with:

    node langchainjs-basic-gpu.mjs
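For reference, the whole file is only a few lines. Here is a minimal sketch of what langchainjs-basic-gpu.mjs might look like; the import path assumes the @langchain/community package, and the model path and question are placeholders for the ones you used in the previous lesson.

    // langchainjs-basic-gpu.mjs: a minimal sketch, not the exact file.
    // The import path assumes the @langchain/community package; the
    // model path and question below are placeholders.
    import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

    const modelPath = process.env.MODEL_PATH ?? "./model.gguf";

    // Offload up to 64 model layers to the GPU.
    const model = await new LlamaCpp({ modelPath, gpuLayers: 64 });

    const response = await model.invoke(
      "Should I use npm to start a Node.js application?",
    );
    console.log(response);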

In our case, the time needed to answer the question dropped from about 25 seconds to about 3 seconds. That’s much easier to experiment with! Depending on your GPU, you might need to experiment with how many GPU layers you can offload.
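If you want to measure this yourself while tuning, a rough timing harness like the following sketch can help; it reuses the assumed import, model path, and question from the sketch above. If your GPU runs out of memory at higher values, try each configuration in a separate run.

    // time-gpu-layers.mjs: rough timing harness (sketch).
    import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

    const modelPath = process.env.MODEL_PATH ?? "./model.gguf";

    // Try a few offload levels and print how long each answer takes.
    for (const gpuLayers of [0, 32, 64]) {
      const model = await new LlamaCpp({ modelPath, gpuLayers });
      const start = Date.now();
      await model.invoke("Should I use npm to start a Node.js application?");
      console.log(
        `gpuLayers=${gpuLayers}: ${((Date.now() - start) / 1000).toFixed(1)}s`,
      );
    }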

Conclusion

Now that we can experiment locally at a faster pace, we will build on the earlier example by:

  • Building a more complex example that supports retrieval-augmented generation.
  • Showing how LangChain.js lets you develop, experiment, and test in one environment, then deploy to another with minimal changes to your application.