Boring RAG: When similarity is just a SQL query

Retrieval-augmented generation (RAG) is a practical way to answer questions using your own content (such as policies, docs, tickets, and product descriptions) without assuming a general LLM model already contains that information.

At its core, RAG follows a "retrieve context, then answer" pattern. Retrieval is the part that often becomes overcomplicated. Once you store embeddings alongside text in a database, retrieval becomes a standard nearest-neighbor query. In other words: similarity is a query.

This article demonstrates a "boring" implementation using Apache Camel, PostgreSQL, and pgvector. The goal is to create a baseline that is easy to understand and debug. You can see exactly what was indexed, what was retrieved, and the context provided to the model.

If you want the bigger-picture framing LLMs as semantic processors and keeping the "AI parts" at the edges, read Making LLMs boring: From chatbots to semantic processors.

A quick glossary

An embedding is a vector (a list of numbers) produced from text. Similar text tends to end up near each other in that vector space.

Chunking is splitting a document into smaller pieces before embedding it. It's rarely optional. Without chunking, you retrieve entire documents when you only need a paragraph.

pgvector adds a vector(N) column type and distance operators <=> to PostgreSQL. This allows you to store embeddings and run similarity searches using SQL.

The anatomy of a RAG pipeline

Most RAG systems rely on three primary steps.

First, you index the content. This involves taking the context, chunking it, and storing the text within its vector. This is typically a batch job.
Second, you retrieve information. When a user asks a question, the system embeds the question and queries the database for the nearest chunks. You usually apply a similarity threshold (to avoid weak matches) and a topK (to keep context bounded).
Third, you provide an answer. If the retrieval finds no matches, the system returns a "not found" response or asks a clarifying question. If retrieval found something, you pass the retrieved chunks into the prompt as context and tell the model to answer using only that context.

Let's make those steps concrete.

Step 1: Index (chunk → embed → store)

Indexing transforms static files into a queryable knowledge base. At a minimum, you should store the chunk text, a little metadata to help with tracing (such as the source, section ID, and document name), and the embedding vector.

With pgvector, a basic schema looks like this:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
   id SERIAL PRIMARY KEY,
   content TEXT NOT NULL,
   source VARCHAR(255),
   chunk_index INTEGER,
   embedding vector(768),
   created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

The vector(768) dimension must match your embedding model. If you switch embedding models, you might need a different dimension (and usually a reindex).

Use the following Camel route to implement the chunk → embed → store process, as shown in Figure 1:

- beans:
    - name: markdownSemanticTokenizer
      type: org.apache.camel.tokenizer.MarkdownSemanticTokenizer
      properties:
        headerMode: "RAG_CONTEXT"
- route:
    id: index-files
    description: "Document Indexing"
    from:
      uri: file:documents
      parameters:
        noop: true
        include: ".*\\.md"
      steps:
        - setVariable:
            description: "Save File Name"
            name: fileName
            simple: "${header.CamelFileName}"
        - split:
            description: "Split Markdown"
            method:
              ref: "markdownSemanticTokenizer"
            steps:
              - setVariable:
                  description: "Save Index"
                  name: chunkIndex
                  simple: "${exchangeProperty.CamelSplitIndex}"
              - setVariable:
                  description: "Save Chunk"
                  name: chunkText
                  simple: "${body.trim()}"
              - to:
                  description: "Generate Embedding"
                  uri: openai:embeddings
              - setVariable:
                  description: "Save Vector"
                  name: embeddingVector
                  simple: "${body.toString()}"
              - to:
                  description: "Insert into DB"
                  uri: sql:INSERT INTO chunks (content, source, chunk_index, embedding) VALUES (:#chunkText, :#fileName, :#chunkIndex, :#embeddingVector::vector)

Kaoto Document Q&A Indexing — Figure 1: Document Indexing Pipeline in Kaoto Integration Designer.

Note

This route assumes you are starting with clean Markdown files. In the real world, enterprise knowledge is usually locked in PDFs or Word documents. To handle this, you can drop the camel-docling step docling:CONVERT_TO_MARKDOWN into your route. Powered by IBM's AI document parser, camel-docling understands complex document layouts including reading order, multi-column text, and even tables and seamlessly converts them into structured Markdown.

You can then index the documents by running the following camel command:

camel run index-documents.camel.yaml utils/* application.properties

Apache Camel includes more than 300 components, allowing you to ingest documents and data from wherever your enterprise stores them: Amazon S3, Google Drive, Azure Files, Brokers, Salesforce, JIRA, or secure FTP servers.

Step 2: Retrieve (similarity as SQL)

At query time, the system embeds the user's question and runs a nearest-neighbor query against the stored vectors.

In pgvector, <=> is a distance operator where a smaller value indicates a closer match). A common pattern is to convert distance into a similarity score (often 1 - distance), filter by a threshold, then take the top results.

Using camel-openai for embeddings, this workflow involves calling the openai:embeddings endpoint and then running an sql:SELECT query with the resulting vector.

- route:
    id: document-qa-route
    description: RAG Pipeline
    from:
      description: document-qa
      uri: direct
      parameters:
        name: document-qa
      steps:
        - setVariable:
            description: Save Question
            name: question
            simple: ${body.trim()}
        - log:
            description: Log Question
            message: "Question: ${variable.question}"
        - to:
            description: Get Embeddings
            uri: openai:embeddings
        - setVariable:
            description: Save Embedding
            name: queryEmbedding
            simple: ${body.toString()}
        - to:
            description: Vector Search
            uri: >
              sql:SELECT content, source,
                   1 - (embedding <=> :#queryEmbedding::vector) as similarity
              FROM chunks
              WHERE 1 - (embedding <=> :#queryEmbedding::vector) > {{rag.similarity.threshold}}
              ORDER BY embedding <=> :#queryEmbedding::vector
              LIMIT {{rag.topK}}

Two settings are important during the initial configuration:

threshold: (0.6) This setting prevents the system from adding weakly related chunks to the prompt.
topK: (5) This parameter limits the amount of context provided to the model.

Step 3: Answer (or refuse)

After the system retrieves the rows, it passes them into the model prompt as context and instructs the model to answer using only that information. This grounds the model in the retrieved data and helps prevent it from improvising.

You must decide what happens if the retrieval step returns no results. For internal knowledge bases, forcing an answer is a recipe for hallucinations. A response such as, "I do not have enough information in the provided documents to answer that," is useful because it identifies where to improve your corpus, chunking strategy, or threshold limits.

        - setVariable:
            description: Save Results
            name: searchResults
            simple: ${body}
        - log:
            description: Log Results Count
            message: Found ${body.size()} relevant chunks
        - choice:
            description: Check Results
            otherwise:
              steps:
                - setVariable:
                    description: Prepare Context
                    name: context
                    simple: ${variable.searchResults}
                - setBody:
                    description: Set User Prompt
                    simple: ${variable.question}
                - setHeader:
                    description: System Instructions
                    name: CamelOpenAISystemMessage
                    simple: >
                      Answer the question using ONLY the context below.
                      If the context doesn't contain enough information, say "I
                      don't have complete information on that."
                      Be concise and cite the source when relevant.
                      Context:
                      ${variable.context}
                - to:
                    description: Generate Answer
                    uri: openai:chat-completion

Now you are ready to ask the questions:

echo "What is the return policy?" | camel run document-qa.camel.yaml application.properties

Will the model always stay perfectly inside the lines? Not always. This design makes the failure mode visible by allowing you to log retrieved rows and verify the context provided to the model, as illustrated in Figure 2:

Kaoto Document Q&A — Figure 2: Document Q&A RAG Pipeline in Kaoto Integration Designer.

Beyond document Q&A: Reusing the pattern

The beauty of storing embeddings in PostgreSQL and treating similarity as a SQL query is that you aren't limited to building Q&A chatbots. Once the core foundation of indexing and retrieval is in place, you can adapt the final answer phase to solve various engineering challenges.

Because you are using standard SQL, you can easily join your vector similarity searches with your existing business logic.

Semantic product search

Standard keyword searches often fail if users are unfamiliar with your exact terminology. By embedding your product catalog, you can map fuzzy user inputs ("a large screen for design work") to the closest items in vector space. From there, you have options: you can return the database rows directly to the UI for a fast, deterministic search experience, or you can pass the retrieved rows to an LLM to generate a conversational summary of their options.

Automated ticket deduplication

You don't always need an LLM at the end of a RAG pipeline; sometimes, you can skip the "answer" step entirely. When a new support ticket is submitted, embed the text and run a similarity query against your historical, closed tickets. If the similarity score crosses a high threshold, you can automatically link the new ticket as a duplicate or route it to the exact engineer who solved the previous issue.

By treating vectors as standard database rows, you can transform AI features into common backend engineering tasks.

Limitations

This baseline works well, but there are practical limits you should anticipate as you scale:

Chunking strategies significantly affect retrieval quality. If a chunk boundary separates a rule from its exception, the system might retrieve text that appears relevant but leads to an incorrect answer. This is a data-shaping problem more than a model problem.
Similarity is not correctness. Nearest neighbor means "close in embedding space," not "true," "complete," or "up to date." In practice you often combine vectors with metadata filters (source, version, access control) and keyword search for exact terms.
threshold and topK tuning is unavoidable. Too low and you inject noise. Too high and you refuse too often. You adjust based on real queries and real failure cases.
Cost and latency can add up as the system scales. Many RAG flows are two model calls per request (embeddings plus chat completion). At scale, caching and batching become important.

While a more advanced stack (like hybrid keyword + vector search + reranking) can outperform this baseline on relevance, the trade-off is complexity. Start simple, and only add components when you have a concrete metric that demands it.

Takeaway

The primary advantage of this "boring RAG" approach is that it transforms a complex system into a standard software engineering task. By treating semantic search as a SQL query and keeping an air gap between your AI and your database, you ensure that every failure mode (e.g. a bad retrieval, wrong context, or a bad answer) is completely isolated and debuggable.

Start simple. Once your core pipeline is running smoothly, you can confidently introduce complexity like advanced chunking or hybrid search exactly where the metrics tell you to.

Next steps

You can find fully runnable Apache Camel routes for document Q&A, product similarity, and ticket deduplication in the companion GitHub repository.

Boring RAG: When similarity is just a SQL query

Retrieval-augmented generation with Apache Camel, PostgreSQL, and pgvector

A quick glossary

The anatomy of a RAG pipeline

Step 1: Index (chunk → embed → store)

Note

Step 2: Retrieve (similarity as SQL)

Step 3: Answer (or refuse)

Beyond document Q&A: Reusing the pattern

Semantic product search

Automated ticket deduplication

Limitations

Takeaway

Next steps

Smarter data generation for faster Speculator training

5 anti-patterns that cause Kubernetes operator vulnerabilities

Track model usage with the OpenShift AI 3.4 usage dashboard

Red Hat build of Quarkus 3.33: Stability and performance advancements for enterprise Java

Batch inference on OpenShift AI with llm-d: Architecture, integration, and workflows

How to create a Camel integration and deploy it as a serverless service

Platforms

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links