Your AI agent is smart, but it doesn't know your company's products, internal docs, or customer history. It hallucinates when asked about specifics. It confidently makes up answers that sound right but aren't.
RAG (Retrieval-Augmented Generation) fixes this. Instead of relying on the LLM's training data, RAG retrieves relevant information from your own knowledge base and injects it into the prompt. The result: an agent that answers accurately using your actual data.
User asks: "What's the refund policy for enterprise customers?"
Without RAG:
LLM → "Typically, enterprise refund policies vary..." (generic hallucination)
With RAG:
1. Search knowledge base for "refund policy enterprise"
2. Retrieve: "Enterprise customers get 60-day full refund, 90-day pro-rata..."
3. Inject into prompt: "Based on our policy docs: [retrieved text]"
4. LLM → "Enterprise customers receive a 60-day full refund..." (accurate)
A production RAG pipeline has 5 stages:
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1. INGEST │ → │ 2. CHUNK │ → │ 3. EMBED │
│ Documents │ │ Split text │ │ Vectorize │
└─────────────┘ └──────────────┘ └──────────────┘
│
▼
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ 5. GENERATE │ ← │ 4. RETRIEVE │ ← │ Vector DB │
│ LLM answer │ │ Search+Rank │ │ Store │
└─────────────┘ └──────────────┘ └──────────────┘
Load your documents into the pipeline. Common sources:
# Document ingestion with LangChain
from langchain_community.document_loaders import (
    DirectoryLoader, PyPDFLoader,
    UnstructuredMarkdownLoader, WebBaseLoader
)

# Load from multiple sources
pdf_docs = PyPDFLoader("company_handbook.pdf").load()
# UnstructuredMarkdownLoader loads a single file, so wrap it in
# DirectoryLoader to ingest a whole folder of markdown docs
md_docs = DirectoryLoader(
    "docs/", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
).load()
web_docs = WebBaseLoader(["https://docs.company.com/faq"]).load()

all_docs = pdf_docs + md_docs + web_docs
Split documents into smaller pieces that fit in the LLM context window. This is where most RAG pipelines succeed or fail.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | General purpose, simple |
| Semantic | Split at natural boundaries (paragraphs, sections) | Structured documents |
| Recursive | Try paragraph → sentence → character splitting | Mixed-format documents |
| Agentic | LLM decides chunk boundaries | Complex, multi-topic docs |
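The fixed-size strategy is simple enough to sketch in a few lines of plain Python (illustrative, not the LangChain implementation):

```python
def chunk_fixed(text, chunk_size=800, overlap=100):
    # Slide a window of chunk_size characters; each step moves forward
    # by chunk_size - overlap, so neighboring chunks share `overlap` chars
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 2000
chunks = chunk_fixed(doc)
print(len(chunks))  # → 3 (spans 0-800, 700-1500, 1400-2000)
```

The overlap is what prevents a sentence that straddles a boundary from being cut off in both chunks.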
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # characters, roughly 200 tokens per chunk
    chunk_overlap=100,   # overlap prevents losing context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # paragraph first, then sentence, then word
)
chunks = splitter.split_documents(all_docs)
print(f"{len(all_docs)} docs → {len(chunks)} chunks")
Convert each chunk into a vector (a list of numbers) that captures its semantic meaning. Similar texts get similar vectors.
| Embedding Model | Dimensions | Cost | Quality |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | Good |
| OpenAI text-embedding-3-large | 3072 | $0.13/M tokens | Great |
| Cohere embed-v4 | 1024 | $0.10/M tokens | Great |
| Voyage-3 | 1024 | $0.06/M tokens | Excellent (code) |
| BGE-M3 (local) | 1024 | Free | Very good |
| all-MiniLM-L6 (local) | 384 | Free | Good |
from openai import OpenAI
client = OpenAI()
def embed_texts(texts, model="text-embedding-3-small", batch_size=100):
    # The embeddings endpoint caps inputs per request, so batch large corpora
    embeddings = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(input=texts[i:i + batch_size], model=model)
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
# Embed all chunks
chunk_texts = [chunk.page_content for chunk in chunks]
embeddings = embed_texts(chunk_texts)
# Cost: 100K tokens ≈ $0.002 with text-embedding-3-small
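What "similar vectors" means concretely: similarity is usually measured as cosine similarity, the normalized dot product. A toy sketch (the 3-dimensional vectors are made up for illustration; real embeddings have 384-3072 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector lengths:
    # 1.0 means identical direction, near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v_refund = [0.8, 0.1, 0.2]   # toy stand-in for a "refund policy" chunk
v_return = [0.7, 0.2, 0.3]   # toy stand-in for a "return rules" chunk
v_pricing = [0.1, 0.9, 0.1]  # toy stand-in for a "pricing tiers" chunk

print(cosine_similarity(v_refund, v_return)
      > cosine_similarity(v_refund, v_pricing))  # → True
```

This is the comparison the vector database runs, at scale and with approximate-nearest-neighbor indexes, on every query.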
Store embeddings in a vector database, then search by similarity when the agent needs information.
import chromadb
# Store
chroma_client = chromadb.PersistentClient(path="./knowledge_base")
collection = chroma_client.get_or_create_collection("company_docs")
collection.add(
    documents=chunk_texts,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[{"source": c.metadata.get("source", "")} for c in chunks]
)
# Retrieve
def search(query, n_results=5):
    results = collection.query(
        # Embed the query with the same model used for the stored chunks;
        # mixing embedding models silently breaks similarity search
        query_embeddings=embed_texts([query]),
        n_results=n_results
    )
    return results["documents"][0]  # List of relevant chunks
# Example
context = search("enterprise refund policy")
# → ["Enterprise customers receive a 60-day full refund...", ...]
Inject retrieved context into the LLM prompt and generate the answer.
def rag_answer(question):
    # Retrieve relevant context
    context_chunks = search(question, n_results=5)
    context = "\n\n---\n\n".join(context_chunks)
    # Generate answer with context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"""Answer based ONLY on the
provided context. If the context doesn't contain the answer,
say "I don't have that information."

Context:
{context}"""},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
Vector search finds semantically similar content, but misses exact keyword matches. Hybrid search combines both:
# Hybrid search: vector similarity + BM25 keyword matching
from rank_bm25 import BM25Okapi
class HybridSearch:
    def __init__(self, collection, documents):
        self.collection = collection
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents

    def search(self, query, n=5, alpha=0.7):
        # Vector search (semantic)
        vector_results = self.collection.query(
            query_texts=[query], n_results=n * 2
        )
        # BM25 search (keyword)
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_top = sorted(range(len(bm25_scores)),
                          key=lambda i: bm25_scores[i], reverse=True)[:n * 2]
        # Combine with weighted score;
        # alpha controls vector vs keyword weight
        combined = self._merge_results(
            vector_results, bm25_top, bm25_scores, alpha
        )
        return combined[:n]

    def _merge_results(self, vector_results, bm25_top, bm25_scores, alpha):
        # Rank-based score for vector hits, max-normalized score for BM25;
        # documents found by both retrievers accumulate both contributions
        scores = {}
        vec_docs = vector_results["documents"][0]
        for rank, doc in enumerate(vec_docs):
            scores[doc] = alpha * (1 - rank / len(vec_docs))
        max_bm25 = max(bm25_scores) or 1.0
        for i in bm25_top:
            doc = self.documents[i]
            scores[doc] = scores.get(doc, 0.0) + (1 - alpha) * bm25_scores[i] / max_bm25
        return sorted(scores, key=scores.get, reverse=True)
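A common alternative to weighted score fusion is Reciprocal Rank Fusion (RRF), which combines ranked lists without any score normalization. A minimal sketch (the doc IDs are hypothetical; k=60 is the conventional constant):

```python
def rrf_merge(rankings, k=60):
    # Each retriever contributes 1/(k + rank) per document, so documents
    # ranked highly by several retrievers float to the top of the merged list
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["d3", "d1", "d2"]  # hypothetical IDs from vector search
bm25_ranking = ["d1", "d4", "d3"]    # hypothetical IDs from BM25
print(rrf_merge([vector_ranking, bm25_ranking]))  # → ['d1', 'd3', 'd4', 'd2']
```

Because RRF only looks at ranks, it sidesteps the problem that vector similarities and BM25 scores live on incompatible scales.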
Initial retrieval is fast but rough. A reranker (cross-encoder model) re-scores the top results for much better precision:
# Retrieve 20, rerank to top 5
initial_results = search(query, n_results=20)
# Rerank with Cohere or a cross-encoder
import cohere
co = cohere.Client()
reranked = co.rerank(
query=query,
documents=initial_results,
top_n=5,
model="rerank-english-v3.0"
)
# Results carry indices into the input list, not the documents themselves
final_context = [initial_results[r.index] for r in reranked.results]
Sometimes the user's question doesn't match the vocabulary in your documents. Query expansion generates alternative queries:
import json

def expand_query(original_query):
    prompt = f"""Generate 3 alternative search queries for:
"{original_query}"
Return ONLY a JSON array of strings. Focus on synonyms and
different phrasings that might match relevant documents."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    alternatives = json.loads(response.choices[0].message.content)
    # Search with all queries, merge and deduplicate results
    all_results = []
    for q in [original_query] + alternatives:
        all_results.extend(search(q, n_results=3))
    return list(dict.fromkeys(all_results))  # dedupe, preserve order
Retrieved chunks often contain irrelevant sentences. Compress them to extract only the relevant parts:
def compress_context(chunks, question):
    joined = "\n\n".join(chunks)
    prompt = f"""Given this question: "{question}"
Extract ONLY the sentences from each chunk that are
directly relevant to answering the question.
Remove everything else.

Chunks:
{joined}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content  # much shorter, more relevant context
Instead of a fixed retrieve-then-generate pipeline, let the agent decide when and what to retrieve:
# The agent has search as a tool
tools = [
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": "Search company docs for specific information",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"source_filter": {
"type": "string",
"enum": ["all", "policies", "products", "technical"]
}
}
}
}
}
]
# The agent decides:
# - WHETHER to search (maybe it already knows)
# - WHAT to search for (reformulates the query)
# - HOW MANY times to search (iterative refinement)
# - WHICH sources to filter by
| Database | Type | Best For | Free Tier |
|---|---|---|---|
| ChromaDB | Embedded (local) | Prototyping, small datasets | Open source |
| Qdrant | Embedded or cloud | Performance-critical, Rust speed | Open source + free cloud |
| Weaviate | Embedded or cloud | Hybrid search, multi-modal | Open source + free cloud |
| Pinecone | Cloud only | Managed, zero-ops | Free tier (1 index) |
| pgvector | Postgres extension | Already using Postgres | Open source |
| LanceDB | Embedded (local) | Serverless, multi-modal | Open source |
Large chunks (2000+ tokens) dilute the relevant information with noise. Small chunks (50 tokens) lose context. Sweet spot: 200-400 tokens with 50-token overlap.
If your agent searches all documents equally, it might retrieve a 2-year-old policy when the current one exists. Add metadata (date, source, category) and filter at query time.
results = collection.query(
query_texts=[query],
n_results=5,
where={"source": "current_policies"} # Filter by metadata
)
When the knowledge base doesn't contain the answer, the LLM will hallucinate one. Your system prompt must explicitly handle this: "If the context doesn't contain the answer, say so."
Most teams test the LLM's answer quality but never test whether the right chunks were retrieved. If retrieval is wrong, the answer will be wrong regardless of the LLM.
# Retrieval eval: check if the right chunks are found
def eval_retrieval(test_questions, expected_chunks):
hits = 0
for q, expected in zip(test_questions, expected_chunks):
retrieved = search(q, n_results=5)
if any(exp in ret for exp in expected for ret in retrieved):
hits += 1
return hits / len(test_questions) # Recall@5
A product FAQ and a legal disclaimer have very different importance. Weight your chunks: high-priority content gets boosted in retrieval scoring.
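A minimal sketch of that idea, assuming each chunk's metadata carries a priority label (the field name and multipliers here are illustrative, not a library API):

```python
PRIORITY_BOOST = {"high": 1.3, "normal": 1.0, "low": 0.8}

def boost_results(results):
    # results: list of (chunk_text, similarity_score, metadata) tuples;
    # multiply each score by its priority weight, then re-sort
    boosted = [
        (text, score * PRIORITY_BOOST.get(meta.get("priority", "normal"), 1.0), meta)
        for text, score, meta in results
    ]
    return sorted(boosted, key=lambda r: r[1], reverse=True)

results = [
    ("Legal disclaimer: liability is limited...", 0.82, {"priority": "low"}),
    ("Product FAQ: refunds are processed in...", 0.78, {"priority": "high"}),
]
print(boost_results(results)[0][0])  # the FAQ wins after boosting
```

Here the disclaimer's raw similarity (0.82) beat the FAQ's (0.78), but after boosting the FAQ ranks first: 0.78 × 1.3 = 1.014 versus 0.82 × 0.8 = 0.656.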
Our AI Agent Playbook includes RAG pipeline templates, chunking configs, and eval frameworks for production agents.
Get the Playbook — $29

RAG techniques, agent frameworks, and production patterns. 3x/week, no spam.
Subscribe to AI Agents Weekly