Why RAG Exists
Claude knows everything in its training data. It can explain the French Revolution, debug your Python, and write a sonnet. But it doesn't know your data — your company docs, your database, your internal wiki, your product changelog.
If you ask Claude "What's our refund policy?" it will either make something up or tell you it doesn't have that information. Neither is useful in a real application.
RAG solves this. The name stands for Retrieval-Augmented Generation, which sounds academic but describes something intuitive: first retrieve relevant information from your data, then generate an answer grounded in that information.
This is how every enterprise AI assistant works. Salesforce Einstein, Microsoft Copilot, the AI inside your company's Confluence — all RAG at the core. Today you build one from scratch.
Before you start
Create a folder called knowledge_base in your project directory, then add 3–5 text files (.txt) with content on different topics. For example: refund-policy.txt, pricing.txt, onboarding.txt. Put real paragraphs in them — the more content, the better the demo.
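If you want something runnable right away, a small script like this can scaffold placeholder files. The file names and contents below are invented examples, not prescribed data — swap in your own material before the demo:

```python
import os

# Hypothetical starter documents; replace these with your own real content.
samples = {
    "refund-policy.txt": (
        "Refunds are available within 30 days of purchase. "
        "Items must be unused and in their original packaging. "
        "Refunds are issued to the original payment method within 5 business days."
    ),
    "pricing.txt": (
        "The Starter plan costs $10 per month. "
        "The Pro plan costs $30 per month and includes priority support."
    ),
    "onboarding.txt": (
        "New employees receive a company laptop on day one "
        "and complete security training during their first week."
    ),
}

os.makedirs("knowledge_base", exist_ok=True)
for fname, text in samples.items():
    with open(os.path.join("knowledge_base", fname), "w") as f:
        f.write(text)
print(f"Created {len(samples)} sample documents")
```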
The Simplest RAG — Keyword Search
You don't need a vector database to build RAG. The simplest version works by counting how many query words appear in each document, then feeding the most relevant ones to Claude.
Build this file-by-file and run it after each section:
```python
import anthropic
import os

client = anthropic.Anthropic()

# Load all .txt documents into memory
documents = {}
for fname in os.listdir("knowledge_base"):
    if fname.endswith(".txt"):
        with open(os.path.join("knowledge_base", fname)) as f:
            documents[fname] = f.read()

print(f"Loaded {len(documents)} documents")


def simple_search(query, top_k=3):
    """Find documents containing the most query words"""
    query_words = set(query.lower().split())
    scores = {}
    for name, content in documents.items():
        content_lower = content.lower()
        score = sum(1 for word in query_words if word in content_lower)
        if score > 0:
            scores[name] = score
    ranked = sorted(scores.items(), key=lambda x: -x[1])[:top_k]
    return [(name, documents[name]) for name, _ in ranked]


def ask(question):
    # Step 1: Retrieve relevant documents
    results = simple_search(question)
    if not results:
        print("No relevant documents found.")
        return

    # Step 2: Build context from retrieved docs
    context = "\n\n---\n\n".join(
        [f"[{name}]\n{content}" for name, content in results]
    )

    # Step 3: Generate an answer grounded in the context
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="""Answer questions based ONLY on the provided documents.
If the documents don't contain the answer, say so clearly.
Always cite which document you're referencing.""",
        messages=[{
            "role": "user",
            "content": f"Documents:\n{context}\n\nQuestion: {question}"
        }]
    )
    print(message.content[0].text)


# Test it
ask("What is our refund policy?")
ask("How do I get started as a new employee?")
```
Run it: `python rag_simple.py`
Notice what's happening in three clean steps: search your docs, stuff the relevant ones into the prompt, ask Claude to answer from them only. That's RAG. No vector database, no external service, no complexity. Just text search plus a good system prompt.
Why the system prompt matters here
The instruction "Answer based ONLY on the provided documents" is what prevents hallucination. Without it, Claude might blend your documents with its training knowledge and produce answers you can't verify. With it, if the document doesn't say it, Claude says so.
The limitation: keyword mismatch
This works well when users happen to use the same words as your documents. But what happens when someone asks "return policy" and your document is titled "refunds and exchanges"? The keywords don't overlap. Score: zero. Document never retrieved. Answer: "No relevant documents found."
That's a real problem in production. Semantic search fixes it.
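The failure is easy to reproduce in isolation. Here the document text is an invented example, but the scoring logic is the same word-overlap count used above:

```python
# "return policy" vs. a document that only talks about "refunds and exchanges":
# no shared words, so the overlap score is zero and the doc is never retrieved.
query_words = set("return policy".lower().split())
doc = "Refunds and exchanges are processed within 14 days of delivery."
score = sum(1 for word in query_words if word in doc.lower())
print(score)  # 0: relevant document, zero keyword overlap
```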
Embeddings and Semantic Search
Embeddings solve the synonym problem by converting text into a list of numbers — a vector — that represents its meaning. Two texts about the same concept will produce similar vectors, even if they use completely different words.
"refund" and "return" produce nearby vectors. "pizza" and "astrophysics" produce vectors far apart. You measure the distance between vectors using cosine similarity — a number between 0 and 1 where 1 means identical meaning.
Install NumPy first: `pip install numpy`
```python
import anthropic
import numpy as np
import json
import os

client = anthropic.Anthropic()

# Load documents (same as before)
documents = {}
for fname in os.listdir("knowledge_base"):
    if fname.endswith(".txt"):
        with open(os.path.join("knowledge_base", fname)) as f:
            documents[fname] = f.read()


def get_embedding(text):
    """
    Create a topic-based embedding vector using Claude.
    In production, you'd use voyage-3 or text-embedding-3-large.
    This approximation works well for learning purposes.
    """
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                f"Rate these 10 topics 0-10 for relevance to this text. "
                f"Return ONLY a JSON array of 10 numbers, nothing else.\n"
                f"Topics: [technology, business, science, health, law, "
                f"finance, education, government, engineering, communication]\n"
                f"Text: {text[:500]}"
            )
        }]
    )
    try:
        return np.array(json.loads(message.content[0].text), dtype=float)
    except (json.JSONDecodeError, ValueError):
        # Fall back to a zero vector if Claude returns anything but a JSON array
        return np.zeros(10)


def cosine_similarity(a, b):
    """Measure semantic similarity between two vectors (0 = unrelated, 1 = identical)"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)


# Embed all documents upfront (do this once at startup)
print("Embedding documents...")
doc_embeddings = {}
for name, content in documents.items():
    doc_embeddings[name] = get_embedding(content)
    print(f"  ✓ {name}")
print("Ready.\n")


def semantic_search(query, top_k=3, threshold=0.3):
    """Find documents most semantically similar to the query"""
    query_emb = get_embedding(query)
    scores = [
        (name, cosine_similarity(query_emb, emb))
        for name, emb in doc_embeddings.items()
    ]
    ranked = sorted(scores, key=lambda x: -x[1])[:top_k]
    return [(name, documents[name]) for name, score in ranked if score > threshold]


# Quick test
print("Testing semantic search...")
results = semantic_search("returns and exchanges")
print(f"'returns and exchanges' found: {[r[0] for r in results]}")
```
Why this embedding approach works
We're asking Claude to rate how much a piece of text relates to 10 broad topics. Two semantically similar texts will get similar ratings across those 10 dimensions, producing similar vectors. It's a simplified approximation — real embedding models (voyage-3, text-embedding-3-large) work on hundreds or thousands of dimensions trained specifically for this task. But the math and the concept are identical.
Understanding cosine similarity
Imagine each document as a vector in 10-dimensional space, where each dimension measures how strongly the text relates to one topic. Cosine similarity measures the angle between two vectors: a score near 1 means they point in the same direction (same meaning); a score near 0 means they're unrelated.
The `threshold=0.3` parameter filters out documents that are likely unrelated. Lower it to be more permissive, raise it to be stricter.
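To make this concrete, here is the same cosine math applied to hand-made 10-dimensional topic vectors. The ratings are invented for illustration, not produced by Claude:

```python
import numpy as np

def cosine_similarity(a, b):
    # Same formula as in the lesson: dot product over the product of lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Hypothetical topic ratings (technology, business, ..., communication).
# Similar texts rate the topics similarly, so their vectors nearly align.
refund_doc   = np.array([1, 9, 0, 0, 3, 8, 0, 0, 0, 2], dtype=float)
return_query = np.array([0, 8, 0, 0, 2, 9, 0, 0, 0, 3], dtype=float)
pizza_doc    = np.array([0, 2, 1, 7, 0, 0, 0, 0, 0, 1], dtype=float)

print(cosine_similarity(refund_doc, return_query))  # high, ~0.98
print(cosine_similarity(refund_doc, pizza_doc))     # low, ~0.21
```

Even though "refund" and "return" share no letters beyond coincidence, their topic profiles line up, so the angle between them is tiny.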
Put It All Together
Now combine semantic search with the Claude API call to build the complete RAG system:
```python
# ... (paste the embedding setup from rag_semantic.py above) ...

# Then add this function:
def rag_ask(question):
    print(f"Question: {question}\n")

    # 1. Retrieve semantically relevant documents
    results = semantic_search(question)
    print(f"Found {len(results)} relevant document(s)\n")
    if not results:
        print("No relevant documents found.\n")
        return

    # 2. Format context with source labels
    context = "\n\n---\n\n".join(
        [f"Source: {name}\n{content}" for name, content in results]
    )

    # 3. Generate grounded answer
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="""You are a helpful assistant that answers questions using
ONLY the provided source documents.

Rules:
1. Only use information from the provided documents
2. Cite your sources by document name
3. If the documents don't contain the answer, say:
   "I don't have information about that in my knowledge base"
4. Be specific and precise""",
        messages=[{
            "role": "user",
            "content": f"Sources:\n{context}\n\nQuestion: {question}"
        }]
    )
    print("Answer:", message.content[0].text)
    print("\n" + "=" * 60 + "\n")


# Test with several questions
rag_ask("What are our company's AI policies?")
rag_ask("How does the refund process work?")
rag_ask("What laptop do new employees get?")
rag_ask("What is the capital of France?")  # Should say it doesn't know
```
That last test question — "What is the capital of France?" — is important. It should return "I don't have information about that in my knowledge base." If it does, your RAG system is working correctly: it's grounded to your data, not Claude's general knowledge.
Congratulations — you just built a RAG system
This exact architecture — retrieve, then generate — powers the AI assistants at Goldman Sachs, the Department of Defense, and every enterprise chatbot that needs to reason about internal data. The production versions use better embedding models and dedicated vector databases, but the architecture is what you built.
What You'd Do Differently in Production
What you built today is the correct architecture for RAG. What changes in production is the quality and scale of each component:
Production upgrades (don't build these today — just understand them)
- Real vector database: Pinecone, Weaviate, or pgvector. Scales to millions of documents, millisecond retrieval.
- Proper embeddings: voyage-3 (from Voyage AI, the embeddings provider Anthropic recommends) or text-embedding-3-large (OpenAI). Hundreds to thousands of dimensions trained specifically for semantic search.
- Chunking: Large documents get split into paragraphs before embedding, so you retrieve the right section — not the entire 50-page manual.
- Metadata filtering: "Only search documents from the last 6 months" or "Only search HR policy docs." Narrows the search space before semantic scoring.
- Conversation memory: Keep track of what was said earlier in the conversation. "What else does that policy say?" needs to know what "that policy" refers to.
- Hybrid search: Combine keyword search and semantic search for better recall. Some queries (product SKUs, names, acronyms) are better served by exact match.
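Of these upgrades, chunking is the easiest to prototype. Here is a minimal sketch: split on blank lines, then pack paragraphs into chunks under a size budget. The blank-line rule and the 800-character budget are arbitrary choices for illustration, not a prescribed production strategy:

```python
def chunk_document(text, max_chars=800):
    """Split text into paragraph-based chunks of at most roughly max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Made-up document: a short intro, one long body paragraph, a short footer
doc = "Refund policy overview." + "\n\n" + "Details " * 100 + "\n\n" + "Contact support."
print(len(chunk_document(doc)))  # 3 chunks
```

Each chunk would then get its own embedding, so retrieval returns the specific paragraph that matches rather than the whole file.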
This is exactly what we build on Day 2 of the bootcamp — the production version of what you built today. Real vector database, real embeddings, chunking strategy, metadata filters, and evaluation with RAGAS.
What you built today
- Keyword RAG: Load documents, score by word overlap, retrieve top matches, generate grounded answers
- Semantic embeddings: Convert text to topic-score vectors that capture meaning, not just keywords
- Cosine similarity: Measure how semantically related a query is to each document
- Complete RAG pipeline: Retrieve → format context → generate with source constraints
- Grounded answers: Claude answers from your data only — no hallucination from training knowledge
Build a Q&A system on your own knowledge base
Create a knowledge_base folder with 5 text files about different topics you actually care about — your team's documentation, product specs, meeting notes, whatever is useful to you. Then:
- Build the complete RAG system using the code from this lesson
- Ask at least 10 questions — some that should match documents, some that shouldn't
- Test the edge cases: what happens when two documents are relevant? What happens when none are?
- Modify the `threshold` value in `semantic_search` — what changes?
- Add a feature: print which document was retrieved alongside each answer, and its similarity score
If you get it working, you have an AI assistant grounded in data Claude has never seen. That's the whole point.
Ready to build the production version?
Day 4 of the bootcamp is three full hours on production RAG: real vector databases, voyage-3 embeddings, chunking strategies, hybrid search, and RAGAS evaluation. You leave with a system that could go live Monday.
Reserve Your Bootcamp Seat