Utilize Retrieval-Augmented Generation (RAG) with Node.js to optimize your AI applications

In this learning exercise we will use Retrieval-Augmented Generation (RAG) to optimize an AI application built with Node.js. We will draw on the Node.js Reference Architecture to improve how the application answers a question about starting a Node.js application. We will use LangChain.js to simplify interacting with the model and node-llama-cpp to run the model in the same Node.js process as your application.



If you have been reading about large language models, you likely know that two key aspects of working with a model are crafting the question posed to the model and managing the context that goes along with it. These are often covered as part of prompt engineering. Retrieval-Augmented Generation is one way to provide context that helps the model respond with the most appropriate answer. The basic concept is:

  • Data that provides additional context, in our case the markdown files from the Node.js Reference Architecture, is transformed into a format suitable for augmenting the model's input. This often includes breaking the data into chunks of a maximum size that will later be provided as additional context to the model.
  • An embedding model converts the source data into a set of vectors that represent the words in the data. These are stored so that the data can later be retrieved by querying for matching vectors, most commonly in a vector database like Chroma.
  • The application is enhanced so that, before passing a query to the model, it first uses the question to query the database for matching documents. The most relevant documents are then added to the context sent to the model along with the question as part of the prompt.
  • The model returns an answer that is based in part on the context provided.
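The steps above can be sketched end to end in plain JavaScript. Everything here is a toy stand-in for illustration only: the fixed-size chunker, the bag-of-words "embedding", and the in-memory store take the place of the real text splitter, embedding model, and vector database (such as Chroma) that the exercise itself uses through LangChain.js.

```javascript
// 1. Split source documents into maximum-sized chunks.
function chunk(text, maxChars = 200) {
  const chunks = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// 2. Convert text to a vector. A real embedding model produces dense
// semantic vectors; this word-count vector is only a stand-in.
function embed(text) {
  const vector = {};
  for (const word of text.toLowerCase().match(/\w+/g) ?? []) {
    vector[word] = (vector[word] ?? 0) + 1;
  }
  return vector;
}

// Cosine similarity between two sparse vectors.
function similarity(a, b) {
  let dot = 0;
  for (const word of Object.keys(a)) dot += a[word] * (b[word] ?? 0);
  const norm = (v) => Math.sqrt(Object.values(v).reduce((s, n) => s + n * n, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// 3. Retrieve the chunks most relevant to the question.
function retrieve(question, store, k = 2) {
  const q = embed(question);
  return store
    .map((entry) => ({ ...entry, score: similarity(q, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.text);
}

// Index some example documents, then build a prompt that sends the
// retrieved context to the model along with the question.
const docs = [
  'Use npm start to run the start script defined in package.json.',
  'The Node.js event loop handles asynchronous callbacks.',
];
const store = docs.flatMap((doc) =>
  chunk(doc).map((text) => ({ text, vector: embed(text) }))
);

const question = 'How do I start a Node.js application with npm?';
const context = retrieve(question, store).join('\n');
const prompt = `Answer using this context:\n${context}\n\nQuestion: ${question}`;
console.log(prompt);
```

The shape is the same as in the real application: index once, then for each question retrieve the best-matching chunks and prepend them to the prompt.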

So why not just pass the content of all the documents to the model? The problem is that the size of the context that can be passed to a model is limited, so we need to identify the most relevant content and pass only that subset. Now that you understand the concept of Retrieval-Augmented Generation, let's dive into the code.
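To make that size limit concrete, here is a small sketch of fitting ranked documents into a fixed budget. The character budget and the example documents are hypothetical; real applications measure the context window in model tokens, not characters.

```javascript
// Given documents already ranked by relevance, keep adding them to the
// context until a (hypothetical) budget for the model's context window
// is exhausted.
function fitContext(rankedDocs, maxChars) {
  const selected = [];
  let used = 0;
  for (const doc of rankedDocs) {
    if (used + doc.length > maxChars) break;
    selected.push(doc);
    used += doc.length;
  }
  return selected;
}

const ranked = [
  'Most relevant: use npm start, which runs the start script in package.json.',
  'Also relevant: you can run node server.js directly.',
  'Less relevant: background on the Node.js event loop.',
];

// With a small budget, only the most relevant documents make the cut.
const context = fitContext(ranked, 140);
console.log(context.length);
```

Because the budget is limited, ranking the documents first means the content that gets dropped is the content least likely to help the model.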