
Chapter 8: Retrieval-Augmented Generation


Prerequisites for Understanding RAG

  1. Large Language Models (LLMs) as Generative Engines (from Chapter 1):

    • Core Idea: LLMs, especially decoder-only architectures like GPT, are fundamentally next-token predictors. Given some input text (a prompt), they generate a plausible continuation.
    • The “Knowledge” Limitation: Their knowledge is “frozen” at the time of their last training. They don’t know about events or information that occurred after that.
    • The Hallucination Problem: Because they are so good at generating fluent, confident-sounding text, they can sometimes generate incorrect or nonsensical information with high confidence. They are trying to complete a pattern, not necessarily state a verified fact from an internal database.
    • What RAG tries to achieve here: Provide the LLM with fresh, relevant, and factual information at the time of generation to guide its output and make it more accurate and less prone to hallucination.
  2. Embeddings: The Language of Meaning (from Chapter 2):

    • Core Idea: Embeddings are numerical representations (vectors) of text (words, sentences, documents) in a high-dimensional space.
    • Semantic Similarity: The crucial property is that texts with similar meanings will have embeddings that are “close” to each other in this vector space. “The cat is furry” will be closer to “My feline is fluffy” than to “The car is fast.”
    • How they are made: We saw word2vec, and more advanced models like BERT or Sentence-Transformers produce these.
    • What RAG tries to achieve here: Use embeddings to find pieces of text from a knowledge source that are semantically similar (i.e., relevant) to a user’s query.
  3. Dense Retrieval / Semantic Search (Implicit in Chapter 2, core to Chapter 8):

    • Core Idea: The process of finding the most relevant documents (or text chunks) from a large collection (a “text archive” or “knowledge base”) in response to a user query, based on semantic similarity of their embeddings.
    • The Mechanism:
      1. Indexation: Convert all documents/chunks in your knowledge base into embeddings and store them (often in a specialized “vector database”).
      2. Querying: When a user asks a question, convert that question into an embedding.
      3. Search: Compare the query embedding with all the document embeddings in your index and retrieve the “nearest neighbors” – the documents whose embeddings are closest to the query embedding.
    • What RAG tries to achieve here: This is the “Retrieval” part of RAG. It’s the engine that pulls out the relevant context.
  4. Chunking Long Texts (Practical aspect for Retrieval, hinted at in Chapter 8):

    • Core Idea: LLMs have a finite context window (the maximum number of tokens they can process). You often can’t feed an entire large document to an embedding model or into the final LLM prompt.
    • The Solution: Break down large documents into smaller, manageable “chunks” (e.g., paragraphs, sentences, or fixed-size token blocks). Each chunk then gets its own embedding.
    • What RAG tries to achieve here: Ensure that the retrieval step can pinpoint specific relevant pieces of information from large documents, rather than just getting a vague embedding of the whole document.
  5. Prompt Engineering (Rudimentary in Chapter 1, full focus in Chapter 6):

    • Core Idea: Crafting effective input prompts to guide an LLM to produce the desired output.
    • Providing Context: A key technique in prompt engineering is giving the LLM relevant context along with the instruction or question.
    • What RAG tries to achieve here: RAG automates the process of finding highly relevant context (from the retrieval step) and then constructs a new prompt that includes this context along with the original user query to feed into the generative LLM.

So, to recap the prerequisites:

  • LLMs generate text but can hallucinate.
  • Embeddings capture meaning and allow similarity comparison.
  • Dense retrieval uses embeddings to find relevant text for a query.
  • Chunking makes large documents searchable.
  • Prompt engineering is how we tell an LLM what to do, and RAG uses it to provide retrieved context.
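Before moving on, here is what that recap looks like in code: a minimal, self-contained sketch of dense retrieval in plain Python. The three-dimensional vectors are hand-picked stand-ins for real embedding-model output (an actual system would produce them with something like a Sentence-Transformer); only the ranking logic is the point.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "index": one pre-computed embedding per document.
# The vectors are hand-picked stand-ins for embedding-model output.
documents = {
    "The cat is furry":    [0.9, 0.1, 0.0],
    "My feline is fluffy": [0.8, 0.2, 0.1],
    "The car is fast":     [0.0, 0.1, 0.9],
}

# Query step: "embed" the query, then rank documents by similarity.
query_embedding = [0.9, 0.1, 0.0]
ranked = sorted(documents,
                key=lambda text: cosine_similarity(query_embedding, documents[text]),
                reverse=True)
# ranked[0] is the nearest neighbor of the query in embedding space
```

With real embeddings, the cat/feline sentences land close together and the car sentence far away, exactly as the recap describes.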

With these building blocks in mind, we are perfectly set to explore Chapter 8!


Chapter 8: Semantic Search and Retrieval-Augmented Generation

(Gestures to an imaginary slide with the chapter title)

Search engines were some of the very first large-scale applications of language models. Google announced back in 2019 that it was using BERT for Search, calling it a “huge leap.” Microsoft Bing followed suit. Why? Because these models enabled semantic search – searching by meaning, not just keywords.

But then came the generative models, like ChatGPT. People started asking them questions expecting factual answers. And while they’re fluent, they aren’t always correct. This led to the problem of “hallucinations.” One of the best ways to combat this is Retrieval-Augmented Generation (RAG) – building systems that retrieve relevant information before generating an answer. This is one of the hottest applications of LLMs right now.

Overview of Semantic Search and RAG

Chapter 8 looks at three broad categories:

  1. Dense Retrieval: This is what we just discussed as a prerequisite. It relies on embeddings. You embed your query, you embed your documents (or chunks of documents), and you find the ones whose embeddings are closest to your query’s embedding. (Figure 8-1 in the book shows this: query -> dense retrieval -> ranked documents).

    • What it’s trying to achieve: Find semantically relevant documents from a corpus.
  2. Reranking: Often, search is a pipeline. A first-stage retriever (maybe keyword-based, or a fast dense retriever) gets a bunch of potentially relevant documents. A reranker then takes this smaller set and the original query, and scores the relevance of each document much more carefully, reordering them. (Figure 8-2 shows query + initial results -> reranker -> improved order of results).

    • What it’s trying to achieve: Improve the quality and ordering of search results from an initial, possibly less precise, retrieval step.
  3. Retrieval-Augmented Generation (RAG): This is where we combine search with generation. The LLM doesn’t just rely on its internal knowledge; it’s augmented with retrieved information. (Figure 8-3 shows query -> RAG system -> answer + cited sources).

    • What it’s trying to achieve: Generate factual, grounded answers by providing the LLM with relevant context from external sources, reducing hallucinations, and enabling “chat with your data” scenarios.

Let’s dive deeper into these.

Semantic Search with Language Models

(Again, imagine Figure 8-4: texts as points in space, with similar texts closer together.) This is the core idea we’ve built up: embeddings project text into a space where distance tracks dissimilarity. When a user queries, we embed the query into that same space and find the nearest document embeddings (Figure 8-5).

  • Caveats of Dense Retrieval (page 328-329):

    • What if no good results exist? The system might still return the “least bad” ones. We might need a similarity threshold.
    • What if the query and best result aren’t truly semantically similar, just share some keywords? This is why embedding models for retrieval are often fine-tuned on question-answer pairs (more on this in Chapter 10).
    • Keyword matching is still good for exact phrases. Hybrid search (semantic + keyword) is often best.
    • Domain specificity: A model trained on Wikipedia might not do well on legal texts.
  • Chunking Long Texts (page 330-333):

    • Why? LLMs have limited context windows.
    • One vector per document? You could embed just the title, or average all chunk embeddings. Not ideal, as you lose a lot of specific information.
    • Multiple vectors per document (Better!): Chunk the document (sentences, paragraphs, fixed-size, overlapping chunks as shown in Figures 8-7, 8-8, 8-9) and embed each chunk. Your search index then contains chunk embeddings. This allows for more precise retrieval.
      • Strategies include: each sentence, each paragraph, or overlapping chunks to preserve context across chunk boundaries (Figure 8-10).
  • Nearest Neighbor Search vs. Vector Databases (page 333-334):

    • For small archives, calculating all distances (e.g., with NumPy) is fine.
    • For millions of vectors, you need optimized Approximate Nearest Neighbor (ANN) search libraries like FAISS or Annoy. They are fast and can take advantage of GPUs.
    • Vector Databases (e.g., Weaviate, Pinecone, ChromaDB) are even more sophisticated. They allow adding/deleting vectors without rebuilding the whole index, filtering, and more complex querying beyond just vector distance (Figure 8-11).
  • Fine-tuning Embedding Models for Dense Retrieval (page 334-336):

    • Just like in classification (Chapter 4), we can fine-tune embedding models specifically for retrieval.
    • Goal: Make embeddings of relevant query-document pairs closer and irrelevant pairs farther.
    • Training Data: (Query, Relevant Document) pairs as positive examples, and (Query, Irrelevant Document) pairs as negative examples.
    • (Figure 8-12 shows before fine-tuning: “Interstellar release date” and “Interstellar cast” might be equally close to a document about Interstellar’s premiere. Figure 8-13 shows after fine-tuning: “Interstellar release date” is much closer, “Interstellar cast” is pushed away.)
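The overlapping-chunk strategy above can be sketched in a few lines. This toy version approximates tokens with whitespace-separated words (a real pipeline would use the embedding model’s tokenizer), and the `chunk_size`/`overlap` values are illustrative:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size word chunks; consecutive chunks share
    `overlap` words so information at a boundary lands in both neighbors."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks

# A 250-word document becomes three overlapping chunks of <= 100 words;
# each chunk would then be embedded and indexed separately.
chunks = chunk_text(" ".join(f"w{i}" for i in range(250)))
```

Each element of `chunks` gets its own embedding in the search index, which is what makes retrieval precise enough to pinpoint specific passages.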

Reranking

This is often a second stage in a search pipeline.

  • How reranking models work (Figure 8-15): They are often cross-encoders. The query AND a candidate document are fed together into the model (often a BERT-like encoder), which outputs a relevance score (e.g., 0 to 1). This is more computationally expensive than dense retrieval (where query and documents are embedded separately), so it’s typically applied only to a smaller, shortlisted set of documents.
  • Example (page 337-338): The book shows using Cohere’s Rerank endpoint. If a keyword search (BM25) brings up some results, the reranker can significantly improve their order by understanding the semantic relevance more deeply.
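The reranking pattern itself is simple to sketch. Here `score_pair` is a hypothetical stand-in for a real cross-encoder (which would feed the query and document jointly through a model such as BERT); word overlap is used purely to make the example runnable:

```python
def score_pair(query, document):
    """Stand-in for a cross-encoder: a real reranker feeds the query and
    document *together* through a model and outputs a relevance score.
    This toy version scores by word overlap (illustration only)."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def rerank(query, shortlist, top_k=3):
    """Score every shortlisted document against the query, best first."""
    return sorted(shortlist,
                  key=lambda doc: score_pair(query, doc),
                  reverse=True)[:top_k]

results = rerank("film festival in berlin",
                 ["the berlin film festival opens today",
                  "stock markets fell sharply",
                  "a festival of lights in lyon"])
```

The pipeline shape is the takeaway: a cheap first stage produces the shortlist, and the expensive pairwise scorer only reorders those few candidates.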

Retrieval Evaluation Metrics

How do we know if our search system is good? We need:

  1. A text archive.
  2. A set of queries.
  3. Relevance judgments: For each query, which documents in the archive are actually relevant? (Figure 8-16)
  • Mean Average Precision (MAP): A popular metric.
    • Precision@k: Out of the top k results, how many are relevant?
    • Average Precision (AP) for a single query: (Figures 8-20, 8-21, 8-22 show this). It rewards systems that rank relevant documents higher. If the only relevant document is at rank 1, AP is 1.0. If it’s at rank 3 (with 2 irrelevant ones before it), AP is 1/3 ≈ 0.33. If there are multiple relevant documents, it averages the precision at each relevant document’s position.
    • Mean Average Precision (MAP): The average of AP scores across all queries in your test set (Figure 8-23). This gives a single number to compare systems.
    • Another common metric is nDCG (normalized discounted cumulative gain), which handles graded relevance (some documents can be more relevant than others).
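Average Precision and MAP as described above take only a few lines to implement. This sketch assumes rankings are lists of document IDs and relevance judgments are sets of IDs:

```python
def average_precision(ranking, relevant):
    """Average of precision@k at each rank where a relevant doc appears."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at this hit
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgments):
    """MAP: mean of per-query AP across all queries in the test set."""
    return sum(average_precision(rankings[q], judgments[q])
               for q in judgments) / len(judgments)

# The single relevant doc sits at rank 3 -> AP = 1/3, matching the example above.
ap_rank3 = average_precision(["doc_a", "doc_b", "doc_c"], relevant={"doc_c"})
```

`mean_average_precision` then reduces a whole evaluation set to the single comparable number the text mentions.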

Now, the main event for this session!

Retrieval-Augmented Generation (RAG)

(Imagine Figure 8-24: A diagram showing Question -> 1) Retrieval -> 2) Grounded Generation -> Answer)

This is the industry’s leading method to tackle LLM hallucinations and ground them in specific, up-to-date knowledge.

  • What it is: A system that first retrieves relevant information from a knowledge source and then uses that information to augment the prompt given to a generative LLM, which then produces the final answer.

  • Why it’s great:

    • Reduces hallucinations.
    • Improves factuality.
    • Allows LLMs to use information beyond their training data (e.g., internal company documents, recent news).
    • Enables “chat with your data” applications.
  • The Basic RAG Pipeline (Figure 8-24):

    1. Retrieval Step: User asks a question. This question is used to query a knowledge base (using dense retrieval, keyword search, or hybrid). The top N relevant document chunks are retrieved.
    2. Grounded Generation Step: The original question AND the retrieved document chunks are combined into a new, augmented prompt. This prompt is then fed to a generative LLM to produce the final answer. The LLM is instructed to use the provided context. (Figure 8-25 shows this with sources cited, Figure 8-26 shows the context being added to the prompt).
  • Example: Grounded Generation with an LLM API (page 351):

    • The book shows using Cohere’s co.chat endpoint which has built-in RAG capabilities.
    • You provide the message (query) and documents (retrieved chunks).
    • The LLM generates an answer and can even provide citations pointing to which parts of the retrieved documents support its answer.
  • Example: RAG with Local Models (page 352-355):

    • This demonstrates the flow if you’re building it yourself.
    • Load Generation Model: e.g., a quantized Phi-3 using llama-cpp-python and LangChain.
    • Load Embedding Model: e.g., BAAI/bge-small-en-v1.5.
    • Create Vector Database: Use FAISS (or ChromaDB, etc.) to index your document chunks with their embeddings.
    • The RAG Prompt: This is crucial. It typically looks something like:
      <|user|>
      Relevant information:
      {context}  <-- This is where retrieved chunks go
      
      Provide a concise answer to the following question using the
      relevant information provided above:
      {question} <--- This is the original user question
      <|end|>
      <|assistant|>
      
    • LangChain’s RetrievalQA chain can orchestrate this: it takes the LLM, the retriever (from the vector DB), and the prompt template.
  • Advanced RAG Techniques (page 355-357):

    • Query Rewriting: If the user’s question is verbose or conversational (e.g., “I need an essay on dolphins, where do they live?”), an LLM can rewrite it into a more effective search query (“Where do dolphins live”).
    • Multi-Query RAG: For questions like “Compare Nvidia’s financial results in 2020 vs. 2023,” the system might generate multiple search queries (“Nvidia 2020 financial results”, “Nvidia 2023 financial results”) and then synthesize the information.
    • Multi-Hop RAG: For questions requiring sequential reasoning (e.g., “Who are the largest car manufacturers in 2023? Do they each make EVs?”).
      1. Search: “largest car manufacturers 2023” -> Gets Toyota, VW, Hyundai.
      2. Search: “Toyota electric vehicles”, “VW electric vehicles”, “Hyundai electric vehicles”.
    • Query Routing: If you have multiple knowledge bases (e.g., HR documents in Notion, customer data in Salesforce), an LLM can decide which source to query based on the question.
    • Agentic RAG: This is where RAG starts to look like the agents we’ll discuss more (or that the book covers in Chapter 7, which you’ve skipped for now!). The LLM becomes more autonomous, deciding which tools (search, specific databases, etc.) to use and in what order. Cohere’s Command R+ is good at this.
  • RAG Evaluation (page 357-358):

    • How do you know your RAG system is good? It’s not just about search relevance.
    • The paper “Evaluating verifiability in generative search engines” suggests axes like:
      • Fluency: Is the generated text smooth and cohesive?
      • Perceived Utility: Is the answer helpful and informative?
      • Citation Recall: Are statements supported by the cited sources?
      • Citation Precision: Do the citations actually support the statements they’re linked to?
    • LLM-as-a-Judge: Using another capable LLM to evaluate the RAG output.
    • Ragas: A software library for this. It looks at:
      • Faithfulness: Is the answer consistent with the provided context?
      • Answer Relevance: Is the answer relevant to the question?
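To tie the whole pipeline together, here is a minimal end-to-end sketch of the retrieve-then-generate flow. Everything is a toy stand-in: the “embeddings” are hand-picked vectors, and the final LLM call is omitted – in practice the assembled prompt would be sent to a generative model via llama-cpp-python, LangChain, or an API.

```python
def retrieve(query_vec, index, top_n=1):
    """Step 1: nearest-neighbor retrieval over the chunk index (dot product)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(index, key=lambda chunk: dot(query_vec, index[chunk]),
                    reverse=True)
    return ranked[:top_n]

def build_prompt(question, context_chunks):
    """Step 2: augment the prompt with the retrieved context."""
    context = "\n".join(context_chunks)
    return (f"Relevant information:\n{context}\n\n"
            f"Answer the following question using the information above:\n"
            f"{question}")

# Toy chunk index with hand-picked vectors standing in for real embeddings.
index = {
    "Interstellar premiered on October 26, 2014.":           [0.9, 0.1],
    "The film stars Matthew McConaughey and Anne Hathaway.": [0.1, 0.9],
}
question = "When was Interstellar released?"
query_vec = [0.95, 0.05]  # toy embedding of the question
prompt = build_prompt(question, retrieve(query_vec, index))
# `prompt` now carries the grounding context and would go to the LLM.
```

The two functions mirror Figure 8-24 exactly: retrieval pulls the relevant chunk, and grounded generation receives a prompt that already contains the answer’s supporting evidence.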

Summary of RAG

RAG is a powerful technique that combines the strengths of information retrieval with the generative capabilities of LLMs.

  • It retrieves relevant information.
  • It augments the LLM’s prompt with this information.
  • It allows the LLM to generate more factual, grounded, and up-to-date responses.
  • It’s key to reducing hallucinations and making LLMs more trustworthy and useful in real-world applications.

Phew! That was a deep dive into RAG and its foundations. It’s a cornerstone of modern LLM applications. It’s about giving your LLM a library card and teaching it how to read relevant books before answering your question!

What are your thoughts? Does this give you a clearer picture of what RAG is trying to achieve and how it goes about it?