How to get started with large language models and Node.js

Learn how to access a large language model using Node.js and LangChain.js. You’ll also explore LangChain.js APIs that simplify common requirements like retrieval-augmented generation (RAG).

In the previous lesson, we interacted with a model running locally. While this worked, it was probably slower than you want in your inner development loop. In this lesson, we will look at how to speed up a locally running model by leveraging a graphics processing unit (GPU).

If you are on a newer ARM-based macOS machine, good news: GPU acceleration is already enabled, and you can skip ahead to Lesson 3.

To get the full benefit from this lesson, you need to:

  • Have an NVIDIA GPU.
  • Install the NVIDIA CUDA Toolkit and a C/C++ compiler for your system.

In this lesson, you will:

  • Install the NVIDIA CUDA Toolkit.
  • Install the C/C++ compiler for your platform.
  • Recompile node-llama-cpp to enable GPU acceleration.
  • Run the example from the previous lesson, sending questions to the running model, and see that it now executes much faster.

Set up the environment

  1. First, install the CUDA Toolkit (version 12.x or later).
  2. Next, install the C/C++ compiler for your platform, including support for CMake and CMake.js. For Windows, that would be the Microsoft C++ Build Tools, and for Linux, Clang or GCC, along with Ninja and Make. More detailed instructions are available in the requirements section of the cmake-js README.
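Before recompiling, it can save time to confirm that the toolchain is actually on your PATH. The following is an optional sanity-check sketch; the file name check-toolchain.mjs is just a suggestion, and nvcc ships with the CUDA Toolkit. Run it with node check-toolchain.mjs, and if either tool is missing, revisit the install steps above.

    // check-toolchain.mjs: optional sanity check (sketch).
    // Prints the first line of each tool's version output, or a
    // warning if the tool is not on the PATH.
    import { execSync } from "node:child_process";

    for (const cmd of ["nvcc --version", "cmake --version"]) {
      try {
        console.log(execSync(cmd, { encoding: "utf8" }).split("\n")[0]);
      } catch {
        console.error(`Not found: ${cmd.split(" ")[0]}. Check your install.`);
      }
    }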

Recompile node-llama-cpp

  1. To recompile node-llama-cpp, run the following command in the lesson-1-2 directory from the last lesson:

    npx --no node-llama-cpp download --cuda
  2. This will rebuild node-llama-cpp with CUDA enabled. It might take a few minutes, and you should see the compilation output as it progresses. The default install avoids this compilation step because node-llama-cpp ships pre-built binaries, built with node-addon-api, for Linux, macOS, and Windows. Our compilation finished with:

    √ Compiled llama.cpp
    
    Repo: ggerganov/llama.cpp
    Release: b2249
    Done
  3. If the compilation fails, you should double-check your compiler and CUDA toolkit install. If the compilation has trouble finding the NVIDIA toolkit, we recommend that you restart your machine; this resolved the problem for us.

Profit from GPU acceleration

  1. Look at the langchainjs-basic-gpu.mjs example, in which the only modification from our first example is the addition of the gpuLayers option when creating the model (a fuller sketch of the file appears after these steps):

    const model = await new LlamaCpp({
      modelPath: modelPath,
      gpuLayers: 64,
    });
  2. Now that we have a CUDA-accelerated version of node-llama-cpp, run the same example as before with:

    node langchainjs-basic-gpu.mjs
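For reference, the whole file is only a few lines. Here is a minimal sketch of what langchainjs-basic-gpu.mjs might look like; the import path assumes the @langchain/community package, and the model path and question are placeholders for the ones you used in the previous lesson.

    // langchainjs-basic-gpu.mjs: a minimal sketch, not the exact file.
    // The import path assumes the @langchain/community package; the
    // model path and question below are placeholders.
    import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

    const modelPath = process.env.MODEL_PATH ?? "./model.gguf";

    // Offload up to 64 model layers to the GPU.
    const model = await new LlamaCpp({ modelPath, gpuLayers: 64 });

    const response = await model.invoke(
      "Should I use npm to start a Node.js application?",
    );
    console.log(response);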

In our case, the time needed to answer the question dropped from about 25 seconds to about 3 seconds. That’s much easier to experiment with! Depending on your GPU, you might need to experiment with how many GPU layers you can offload.
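If you want to measure this yourself while tuning, a rough timing harness like the following sketch can help; it reuses the assumed import, model path, and question from the sketch above. If your GPU runs out of memory at higher values, try each configuration in a separate run.

    // time-gpu-layers.mjs: rough timing harness (sketch).
    import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

    const modelPath = process.env.MODEL_PATH ?? "./model.gguf";

    // Try a few offload levels and print how long each answer takes.
    for (const gpuLayers of [0, 32, 64]) {
      const model = await new LlamaCpp({ modelPath, gpuLayers });
      const start = Date.now();
      await model.invoke("Should I use npm to start a Node.js application?");
      console.log(
        `gpuLayers=${gpuLayers}: ${((Date.now() - start) / 1000).toFixed(1)}s`,
      );
    }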

Conclusion

Now that we can experiment locally at a faster pace, we will build on the earlier example by:

  • Building a more complex example that supports retrieval-augmented generation.
  • Showing how LangChain.js lets you develop, experiment, and test in one environment, then deploy to another with minimal changes to your application.