How to get started with large language models and Node.js

Learn how to access a large language model using Node.js and LangChain.js. You’ll also explore LangChain.js APIs that simplify common requirements like retrieval-augmented generation (RAG).

In this lesson, you will learn about some of the key APIs in LangChain.js and run a simple application that interacts with a running model using Node.js.

To get the full benefit from this lesson, you need:

  • An environment where you can install and run Node.js.
  • A Git client.

In this lesson, you will:

  • Install Node.js. 
  • Clone the ai-experimentation repository to get the sample lessons.
  • Explore the key LangChain.js APIs that are used in the basic example.
  • Run the example to send questions to the running model and display the responses.

Set up the environment

If you don’t already have Node.js installed, install it using one of the methods outlined on the Nodejs.org download page.

  1. Clone the ai-experimentation repository with:

    git clone https://github.com/mhdawson/ai-experimentation
  2. Change into the ai-experimentation/lesson-1-2 directory with:

    cd ai-experimentation/lesson-1-2
  3. Create a directory called models.
  4. Download the mistral-7b-instruct-v0.1.Q5_K_M.gguf model from Hugging Face and put it into the models directory. This might take a few minutes, as the model is over 5 GB in size. (The commands shown below are one way to handle steps 3 and 4.)
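
If you’d like to do steps 3 and 4 from the command line, something like the following should work. The download URL is an assumption on our part (it points at TheBloke’s Mistral-7B-Instruct-v0.1-GGUF repository on Hugging Face); adjust it if you get the model from somewhere else:

    mkdir models
    curl -L -o models/mistral-7b-instruct-v0.1.Q5_K_M.gguf \
      https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf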

Explore a basic LangChain.js example

We will start by working through the contents of langchainjs-basic.mjs.

  1. The first thing we need to do is to load the model:

    ////////////////////////////////
    // GET THE MODEL
    // (path and fileURLToPath come from the built-in node:path and node:url modules)
    const __dirname = path.dirname(fileURLToPath(import.meta.url));
    const modelPath = path.join(__dirname,
                                "models",
                                "mistral-7b-instruct-v0.1.Q5_K_M.gguf");
    const { LlamaCpp } = await import("@langchain/community/llms/llama_cpp");
    const model = await new LlamaCpp({ modelPath: modelPath });

    This introduces the first LangChain.js API: the one for models. Instances of the models API let you load models using different back ends and then access them through a common interface. (The first sketch after the last step in this walkthrough shows how that common interface lets you point the same chain at a different back end.)

    The example loads the model file that we downloaded earlier and stored in the models directory. We are using node-llama-cpp to load the model into the same process that is running Node.js. We most likely won’t do that in production, where we will usually be accessing an external model, but it’s great for getting started quickly. The magic of Node.js addons, along with node-addon-api (which we help maintain, which is a nice bonus), means that when you run npm install, the node-llama-cpp shared library needed to run the model is either installed from pre-built binaries (available for Linux, Windows, and macOS) or compiled if necessary. If you want to learn more about Node.js addons or node-addon-api, check out the video Building native addons for Node.js (and more JavaScript engines) like it's 2023.

  2. The next step is to create a “chain” (it is called LangChain.js, after all):

    ////////////////////////////////
    // CREATE CHAIN
    // (ChatPromptTemplate is exported by the @langchain/core/prompts module)
    const prompt =
      ChatPromptTemplate.fromTemplate(`Answer the following question if you don't know the answer say so:
    Question: {input}`);
    const chain = prompt.pipe(model);

    This introduces the next two LangChain.js APIs: prompts and chains. A prompt represents the question and related context that you send to the model, and a chain represents the sequence of steps used to build that question and context and pass them to the model. (The second sketch after the last step in this walkthrough pokes at these two APIs a little more.)

  3. Finally we can start asking questions:

    ////////////////////////////////
    // ASK QUESTION
    console.log(new Date());
    let result = await chain.invoke({
      input: "Should I use npm to start a node.js application",
    });
    console.log(result);
    console.log(new Date());

    In this step, we invoke the chain with the input, which asks if we should use npm to start a Node.js application.
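
Before moving on, here is the first of the two hedged sketches mentioned above. Because the prompt and chain only see the common model interface, pointing the same chain at a different back end is mostly a one-line change. This sketch is not part of the lesson repository: it assumes you have installed the @langchain/openai package and have an OpenAI-compatible server running somewhere; the endpoint URL, API key, and model name are placeholders.

    // Hedged sketch: the same chain as langchainjs-basic.mjs, but with the
    // local LlamaCpp back end swapped for a remote, OpenAI-compatible one.
    import { ChatPromptTemplate } from "@langchain/core/prompts";
    import { OpenAI } from "@langchain/openai";

    // Placeholder model name, key, and endpoint -- point these at whatever
    // OpenAI-compatible server you actually have running.
    const model = new OpenAI({
      modelName: "mistral-7b-instruct",
      openAIApiKey: "EMPTY",
      configuration: { baseURL: "http://localhost:8080/v1" },
    });

    // Everything from here on is unchanged from the basic example.
    const prompt =
      ChatPromptTemplate.fromTemplate(`Answer the following question if you don't know the answer say so:
    Question: {input}`);
    const chain = prompt.pipe(model);
    console.log(await chain.invoke({
      input: "Should I use npm to start a node.js application",
    }));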
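
The second sketch makes the prompt and chain APIs a little more concrete. It is also not part of the repository; it simply prints what the template expands to and notes how extra steps can be piped onto a chain.

    // Hedged sketch: poking at the prompt and chain APIs on their own.
    import { ChatPromptTemplate } from "@langchain/core/prompts";

    const prompt =
      ChatPromptTemplate.fromTemplate(`Answer the following question if you don't know the answer say so:
    Question: {input}`);

    // format() fills in the {input} placeholder and returns the text the
    // chain will hand to the model, which is handy for checking a template.
    console.log(await prompt.format({
      input: "Should I use npm to start a node.js application",
    }));

    // pipe() composes runnables, so a chain can have more than two steps.
    // For example, @langchain/core/output_parsers exports a StringOutputParser
    // that could be piped on after the model:
    //   const chain = prompt.pipe(model).pipe(new StringOutputParser());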

At this point you might be wondering: Is that really all you need to run a model locally and ask questions? Surprisingly, yes, although it might be a bit slow. In our case, it took about 25 seconds to answer the question on a Ryzen 5700X with lots of memory.

We’ve kept the example as simple as possible, leaving it as a script rather than bundling it into an HTTP service. Our take is that creating an HTTP-based UI that takes input and displays a response is something Node.js developers will already know how to do; the interesting part is what you need to do behind the scenes to talk to the large language model. (That said, the short sketch below shows one way the chain could sit behind an HTTP endpoint.)
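
For completeness, here is a hedged sketch of what we mean: a minimal HTTP endpoint in front of the chain, using only Node.js built-ins. It is not part of the lesson repository, and it assumes the chain variable has been created exactly as shown in langchainjs-basic.mjs above.

    // Hedged sketch: exposing the chain over HTTP with the built-in http module.
    import http from "node:http";
    // `chain` is the prompt.pipe(model) chain built in langchainjs-basic.mjs;
    // create it the same way before this point.

    http.createServer(async (req, res) => {
      // Read the question from a ?question= query parameter, for example:
      //   http://localhost:8080/?question=Should%20I%20use%20npm
      const url = new URL(req.url, `http://${req.headers.host}`);
      const question = url.searchParams.get("question") ?? "";
      const answer = await chain.invoke({ input: question });
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end(answer);
    }).listen(8080, () => console.log("Listening on http://localhost:8080"));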

Run the basic LangChain.js example

To run the example:

  1. Run npm install.
  2. Run node langchainjs-basic.mjs.
  3. When the example runs, first you’ll see:

    [user1@fedora lesson-1-2]$ node langchainjs-basic.mjs 
    llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/user1/src/learning-path/ai-experimentation/lesson-1-2/models/mistral-7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V2)
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0:                       general.architecture str              = llama
    llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
    llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
    llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
    llama_model_loader: - kv   4:                          llama.block_count u32              = 32
    llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
    llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
    llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
    llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
    llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
    llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
    llama_model_loader: - kv  11:                          general.file_type u32              = 17
    llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
    llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
    llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
    llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
    llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
    llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
    llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
    llama_model_loader: - kv  19:               general.quantization_version u32              = 2
    llama_model_loader: - type  f32:   65 tensors
    llama_model_loader: - type q5_K:  193 tensors
    llama_model_loader: - type q6_K:   33 tensors
    llm_load_vocab: special tokens definition check successful ( 259/32000 ).
    llm_load_print_meta: format           = GGUF V2
    llm_load_print_meta: arch             = llama
    llm_load_print_meta: vocab type       = SPM
    llm_load_print_meta: n_vocab          = 32000
    llm_load_print_meta: n_merges         = 0
    llm_load_print_meta: n_ctx_train      = 32768
    llm_load_print_meta: n_embd           = 4096
    llm_load_print_meta: n_head           = 32
    llm_load_print_meta: n_head_kv        = 8
    llm_load_print_meta: n_layer          = 32
    llm_load_print_meta: n_rot            = 128
    llm_load_print_meta: n_embd_head_k    = 128
    llm_load_print_meta: n_embd_head_v    = 128
    llm_load_print_meta: n_gqa            = 4
    llm_load_print_meta: n_embd_k_gqa     = 1024
    llm_load_print_meta: n_embd_v_gqa     = 1024
    llm_load_print_meta: f_norm_eps       = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff             = 14336
    llm_load_print_meta: n_expert         = 0
    llm_load_print_meta: n_expert_used    = 0
    llm_load_print_meta: rope scaling     = linear
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx  = 32768
    llm_load_print_meta: rope_finetuned   = unknown
    llm_load_print_meta: model type       = 7B
    llm_load_print_meta: model ftype      = Q5_K - Medium
    llm_load_print_meta: model params     = 7.24 B
    llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
    llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
    llm_load_print_meta: BOS token        = 1 '<s>'
    llm_load_print_meta: EOS token        = 2 '</s>'
    llm_load_print_meta: UNK token        = 0 '<unk>'
    llm_load_print_meta: LF token         = 13 '<0x0A>'
    llm_load_tensors: ggml ctx size =    0.11 MiB
    llm_load_tensors:        CPU buffer size =  4892.99 MiB
    ...................................................................................................
    llama_new_context_with_model: n_ctx      = 4096
    llama_new_context_with_model: freq_base  = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
    llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
    llama_new_context_with_model:        CPU input buffer size   =    17.04 MiB
    llama_new_context_with_model:        CPU compute buffer size =   288.00 MiB
    llama_new_context_with_model: graph splits (measure): 1
    2024-03-11T22:06:38.328Z

    That’s the model being loaded, with llama.cpp (via node-llama-cpp) printing out some information about the model.

  4. Next you’ll see it pause for a while (about 25 seconds in our case) and then you should see an answer something like this:

    2024-03-11T22:08:23.372Z
    Assistant: Yes, you should use npm to start a Node.js application. NPM (Node Package Manager) is the default package manager for Node.js and it provides a centralized repository of packages that can be used in your applications. It also allows you to manage dependencies between packages and automate tasks such as testing and deployment. If you are new to Node.js, I would recommend using npm to get started with your application development.
    2024-03-11T22:08:45.774Z

If you’ve read the Node.js Reference Architecture, you’ll know that this is not necessarily the answer we’d like people to get (we will revisit that later), but it is not unexpected given common practice.

Conclusion

In this lesson, we introduced the basic LangChain.js APIs, including those for models, prompts, and chains; worked through a simple example that uses those APIs; and ran the example using a local model.

We’ll build on this in the lessons that follow by:

  • Speeding things up if you have a GPU.
  • Building a more complex example that supports retrieval-augmented generation.
  • Showing how LangChain.js makes it easy to develop, experiment, and test in one environment while being able to easily deploy to another environment with minimal changes to your application.