RAG Tutorial 2026: Build a Retrieval-Augmented Generation System in Python

In This Article

  1. What Is RAG and Why Does It Matter?
  2. RAG vs Fine-Tuning: When to Use Each
  3. RAG Architecture: The Full Pipeline
  4. Step 1: Loading Documents with LangChain
  5. Step 2: Chunking Strategies
  6. Step 3: Creating Embeddings
  7. Step 4: Storing in ChromaDB or Pinecone
  8. Step 5: Querying and Generating Answers
  9. Advanced RAG: Hybrid Search, Reranking, Query Expansion
  10. Evaluating RAG Quality with RAGAS
  11. RAG for Government and Enterprise
  12. Frequently Asked Questions

Key Takeaways

Plain LLMs are impressive. They can write, reason, summarize, and explain. But they have one fundamental problem that makes them unreliable for enterprise work: they only know what was in their training data. Ask a model about your company's Q4 2025 policy update, a contract you signed last week, or a regulation that changed after its cutoff — and it will either confess ignorance or, worse, confidently fabricate an answer.

Retrieval-Augmented Generation solves this. RAG connects an LLM to your own documents at query time, retrieving the most relevant passages and injecting them as context before the model generates a response. A model instructed to answer only from passages it has explicitly been given is far less likely to fabricate. The result is an AI system that is simultaneously powerful and grounded — and in 2026, it is the foundational architecture for almost every production-grade enterprise AI application.

This tutorial builds a complete RAG system from scratch in Python. We cover every step: loading documents, chunking strategies, creating embeddings, storing in a vector database, retrieving and generating, and evaluating quality with the RAGAS framework. By the end, you will have working, production-ready code you can adapt to your own documents.

73% of enterprise AI deployments in 2026 use RAG or a RAG-hybrid architecture.
10x lower cost to update a RAG system versus retraining or fine-tuning a model.
85% reduction in hallucination rate when RAG is properly implemented vs plain prompting.

What Is RAG and Why Does It Matter?

Retrieval-Augmented Generation was introduced in a 2020 Meta paper and has since become the dominant pattern for production LLM applications. The core idea is deceptively simple: instead of relying only on knowledge baked into model weights, the system retrieves relevant information from an external knowledge base and gives it to the model as context.

The pipeline has two phases. At indexing time, you load your documents, split them into chunks, convert each chunk into a numerical vector (an embedding), and store those vectors in a vector database. At query time, you convert the user's question into a vector, find the chunks with the most similar vectors, and pass those chunks plus the question to the LLM as a prompt. The LLM generates its answer using the retrieved context — not just its training data.
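The two phases can be sketched end to end in plain Python. Everything below is an illustrative toy, not any library's API: a keyword-overlap score stands in for embedding similarity, and a plain list stands in for the vector database.

```python
# Toy two-phase RAG pipeline. The scoring function and "index" are
# deliberately simplistic stand-ins for an embedding model and vector DB.

# --- Indexing phase: store chunks (a real system stores vectors) ---
CHUNKS = [
    "A refund is issued within 30 days of purchase.",
    "Travel requests over $5,000 require director approval.",
]

def relevance(question: str, chunk: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

# --- Query phase: rank chunks, take top-k, build the prompt ---
def retrieve(question: str, k: int = 1) -> list[str]:
    return sorted(CHUNKS, key=lambda c: relevance(question, c), reverse=True)[:k]

question = "When is a refund issued?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

Swap in a real embedding model and vector store and the shape of the code stays the same: score, rank, take top-k, build the prompt.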

"RAG turns a general-purpose language model into a domain expert on your specific documents — without retraining anything."

Why does this matter more in 2026 than it did even two years ago? Three reasons. First, the volume of enterprise documents that organizations want to make queryable has exploded — policies, contracts, technical manuals, research reports, emails. Second, model context windows have grown large enough to accommodate meaningful retrieved context without degrading quality. Third, the tooling (LangChain, LlamaIndex, ChromaDB, Pinecone) has matured to the point where you can build a production-quality RAG system in a day rather than a month.

RAG vs Fine-Tuning: When to Use Each

This is the most common architectural question teams face when building LLM applications. The answer is not always RAG — but it usually is for enterprise and government use cases. Here is the honest comparison.

Dimension | RAG | Fine-Tuning
Primary use case | ✓ Q&A over specific documents | ⚠ Style, tone, task format changes
Keeps data current | ✓ Add/update docs any time | ✗ Requires retraining for new data
Works with confidential data | ✓ Data never enters model weights | ⚠ Data used in training run
Auditability | ✓ Can cite source documents | ✗ No traceability to source
Implementation cost | ✓ Days to weeks | ✗ Weeks to months, plus GPU cost
Update cost | ✓ Re-index new documents only | ✗ Partial or full retrain required
Reduces hallucination | ✓ Strongly, on in-context topics | ⚠ Only for memorized facts
Changes model behavior | ✗ Generator model unchanged | ✓ Can reshape output format/style
Best for compliance | ✓ Traceable, auditable responses | ⚠ Harder to audit
Can be combined | ✓ Yes — fine-tuned model as generator | ✓ Yes — RAG as retrieval layer

Rule of Thumb for 2026

Default to RAG. Use fine-tuning only when you need the model to produce a specific output format, adopt a domain-specific vocabulary consistently, or perform a structured classification/extraction task where prompt engineering alone fails. For most document Q&A, policy lookup, and knowledge base applications — RAG is faster, cheaper, safer, and more auditable.

RAG Architecture: The Full Pipeline

Before writing a single line of code, understand the two distinct phases and the six steps within them. Most RAG bugs come from misunderstanding which phase a component belongs to.

1. Document Loading

Ingest raw files — PDFs, Word docs, HTML pages, Markdown, plain text — and convert them to a uniform text format. LangChain and LlamaIndex both provide loaders for every common format.

2. Chunking

Split long documents into smaller passages that fit within embedding model limits and carry a coherent unit of meaning. Chunk size and overlap choices have a larger impact on RAG quality than most teams expect.

3. Embedding

Convert each chunk into a dense numerical vector using an embedding model. Similar passages will have similar vectors — this is what makes semantic search possible. OpenAI's text-embedding-3-small is the standard choice; sentence-transformers work well for on-premise or cost-sensitive deployments.

4. Vector Storage

Store the chunk text and its embedding vector in a vector database. At query time the database performs approximate nearest-neighbor search to find the most relevant chunks in milliseconds.

5. Retrieval

At query time, embed the user's question using the same model, search the vector store for the top-k most similar chunks, and return them as context. Advanced retrieval adds hybrid search and reranking here.

6. Generation

Build a prompt that includes the retrieved chunks and the user's question, and call the LLM (Claude, GPT-4, Gemini, or a local model). The model answers using only the provided context, dramatically reducing hallucination.

Step 1: Loading Documents with LangChain

LangChain's document loaders handle the messy work of parsing different file formats into a consistent Document object with page_content and metadata. Start by installing the core dependencies.

bash — install dependencies
pip install langchain langchain-community langchain-openai \
    chromadb pypdf sentence-transformers ragas openai tiktoken

Loading PDFs — the most common enterprise document format — is two lines of code. The loader preserves page numbers in metadata automatically, which is useful for citation later.

python — load pdfs and web pages
from langchain_community.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    WebBaseLoader,
    UnstructuredWordDocumentLoader,
)

# Load a single PDF
loader = PyPDFLoader("policy_manual.pdf")
docs = loader.load()
print(f"Loaded {len(docs)} pages")
# Each doc.metadata includes: {'source': 'policy_manual.pdf', 'page': 0}

# Load all PDFs in a directory
dir_loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
)
all_docs = dir_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://example.com/policy")
web_docs = web_loader.load()

# Load a Word document
word_loader = UnstructuredWordDocumentLoader("contract.docx")
word_docs = word_loader.load()

For large document collections, load in batches and persist to disk so you do not re-process documents on every run. The LangChain DirectoryLoader accepts any loader class, so the same pattern works for Word docs, CSV files, HTML, and plain text by swapping the loader_cls.
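One simple way to persist parsed documents between runs is a pickle cache, sketched here with only the standard library. The load_with_cache helper and cache path are hypothetical conveniences, not a LangChain API.

```python
import pickle
from pathlib import Path

def load_with_cache(loader_fn, cache_path: Path):
    """Run loader_fn once and pickle the result; later runs read from disk."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    docs = loader_fn()
    with cache_path.open("wb") as f:
        pickle.dump(docs, f)
    return docs

# First run parses; subsequent runs hit the cache. In a real pipeline,
# loader_fn would be something like `DirectoryLoader(...).load`.
docs = load_with_cache(lambda: ["page one text", "page two text"],
                       Path("parsed_docs.pkl"))
print(f"{len(docs)} documents ready")
```

Delete the cache file whenever the source documents change, or key the cache path on a hash of the source directory.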

Step 2: Chunking Strategies

Chunking is where most RAG systems fail silently. For most document types, 256-512 token chunks with 50-100 tokens of overlap is the right default. Smaller chunks (around 128 tokens) work better for precise factual Q&A; larger chunks (around 1024 tokens) work better for thematic summarization. RecursiveCharacterTextSplitter is LangChain's default splitter; SemanticChunker splits on meaning boundaries and outperforms it on narrative or legal text, at a slightly higher embedding cost.

Fixed-Size Chunking

The simplest strategy: split every document into chunks of exactly N tokens, with M tokens of overlap between adjacent chunks. The overlap prevents answers that span a chunk boundary from being missed. A good starting point for most document types is 512 tokens with 64 tokens of overlap.

python — fixed-size and recursive character splitting
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

# Token-based fixed-size splitter (most reliable for embedding models)
token_splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
)
token_chunks = token_splitter.split_documents(all_docs)

# Recursive character splitter — tries to split on paragraphs first,
# then sentences, then words. Produces more semantically coherent chunks.
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
recursive_chunks = recursive_splitter.split_documents(all_docs)

print(f"Fixed-size: {len(token_chunks)} chunks")
print(f"Recursive: {len(recursive_chunks)} chunks")
print(f"Sample chunk:\n{recursive_chunks[5].page_content[:300]}")

Semantic Chunking

Semantic chunking uses embeddings to detect natural topic boundaries in the text and splits there instead of at fixed character counts. It produces the most coherent chunks but is significantly slower — appropriate for offline indexing where quality matters more than speed.

python — semantic chunking with langchain
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=95,  # split when cosine distance > 95th pctile
)
semantic_chunks = semantic_splitter.split_documents(all_docs[:10])  # sample
print(f"Semantic chunks: {len(semantic_chunks)}")


Step 3: Creating Embeddings

Embeddings are the numerical representations that make semantic search possible. Two sentences with the same meaning will have similar embedding vectors even if they share no words. Two sentences on different topics will have vectors that are far apart in the embedding space.
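Similarity between embedding vectors is typically measured with cosine similarity. A minimal illustration with hand-made three-dimensional vectors (real embedding models emit hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the two "refund" vectors point roughly the same way,
# the "weather" vector points elsewhere in the space.
refund_question = [0.9, 0.1, 0.2]
refund_passage  = [0.85, 0.15, 0.25]
weather_passage = [0.1, 0.9, 0.3]

print(cosine_similarity(refund_question, refund_passage))   # near 1.0
print(cosine_similarity(refund_question, weather_passage))  # much lower
```

Vector databases run this same comparison, just approximately and over millions of vectors at once.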

python — openai embeddings and sentence-transformers
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
import os

os.environ["OPENAI_API_KEY"] = "your-key-here"

# Option 1: OpenAI embeddings (best quality, requires API key)
openai_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dimensions, cheap
    # model="text-embedding-3-large",  # 3072 dimensions, better quality
)

# Quick test
test_vector = openai_embeddings.embed_query("What is the refund policy?")
print(f"Embedding dimensions: {len(test_vector)}")  # 1536

# Option 2: sentence-transformers (free, runs locally, no API needed)
local_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",  # top-performing open model
    model_kwargs={"device": "cpu"},  # use "cuda" if GPU available
    encode_kwargs={"normalize_embeddings": True},
)

# For air-gapped or on-prem government deployments, local embeddings
# are often required. BAAI/bge-large-en-v1.5 and all-MiniLM-L6-v2
# are the most commonly used open-source embedding models in 2026.

Step 4: Storing in ChromaDB or Pinecone

With your chunks and embedding model ready, you can build the vector index. ChromaDB runs locally with zero infrastructure — ideal for development. Pinecone is the managed production choice for most teams. Both integrate identically with LangChain.

python — chromadb (local development)
from langchain_community.vectorstores import Chroma

# Build the index from chunks — this embeds every chunk and stores results
# Takes a few minutes for large document sets; results persist to disk
vectorstore = Chroma.from_documents(
    documents=recursive_chunks,
    embedding=openai_embeddings,
    persist_directory="./chroma_db",  # omit for in-memory only
    collection_name="company_docs",
)
print(f"Indexed {vectorstore._collection.count()} chunks")

# Load an existing index from disk (subsequent runs)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=openai_embeddings,
    collection_name="company_docs",
)
python — pinecone (production)
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec
import os

os.environ["PINECONE_API_KEY"] = "your-pinecone-key"

# Create index (one-time setup)
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
if "company-docs" not in pc.list_indexes().names():
    pc.create_index(
        name="company-docs",
        dimension=1536,  # match your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Index documents
vectorstore = PineconeVectorStore.from_documents(
    documents=recursive_chunks,
    embedding=openai_embeddings,
    index_name="company-docs",
)

# Or connect to existing index
vectorstore = PineconeVectorStore(
    index_name="company-docs",
    embedding=openai_embeddings,
)

Step 5: Querying and Passing Context to the LLM

The full retrieval-generation pipeline is where it all comes together. LangChain's RetrievalQA and ConversationalRetrievalChain handle the plumbing — embedding the query, retrieving chunks, building the prompt, and calling the LLM. The code below works with Claude or GPT-4 by swapping one line.

python — full rag chain with claude and gpt-4
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain.prompts import PromptTemplate

# Custom prompt — forces the model to answer only from context
RAG_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful assistant. Answer the question using ONLY
the context provided below. If the answer is not in the context, say
"I don't have information on that in the provided documents."

Context:
{context}

Question: {question}

Answer:""",
)

# Use Claude as the generator (recommended for enterprise)
llm_claude = ChatAnthropic(
    model="claude-opus-4-5",
    anthropic_api_key="your-anthropic-key",
    temperature=0,  # deterministic answers for enterprise Q&A
)

# Or use GPT-4
llm_gpt = ChatOpenAI(model="gpt-4o", temperature=0)

# Build the RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm_claude,
    chain_type="stuff",  # "stuff" = all chunks in one prompt
    retriever=vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},  # retrieve top 5 chunks
    ),
    chain_type_kwargs={"prompt": RAG_PROMPT},
    return_source_documents=True,  # include source citations
)

# Ask a question
result = rag_chain.invoke({"query": "What is the employee refund policy?"})
print("Answer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}, "
          f"page {doc.metadata.get('page', 'N/A')}")

The return_source_documents=True flag is critical for enterprise and government deployments. Every answer comes with the exact document passages that support it — enabling auditors, compliance teams, and end users to verify the answer against source material.

Advanced RAG: Hybrid Search, Reranking, Query Expansion

Three techniques take a basic RAG system from demo quality to production quality: hybrid search (combine BM25 keyword matching with vector similarity — captures both exact term matches and semantic similarity), reranking (use a cross-encoder like Cohere Rerank to re-score the top-20 retrieved chunks and keep only the top-5 most relevant), and query expansion (rephrase the user's question multiple ways before retrieving — improves recall for ambiguous or poorly-worded queries).

Hybrid Search (Dense + Sparse)

Pure vector search misses exact keyword matches — critical when users ask about specific product names, regulation codes, or proper nouns that may not encode well semantically. Hybrid search combines vector similarity (dense retrieval) with BM25 keyword search (sparse retrieval), then merges the results. This is the standard approach at production scale.

python — hybrid search with pinecone
from langchain_community.retrievers import PineconeHybridSearchRetriever
from pinecone_text.sparse import BM25Encoder

# Fit BM25 on your corpus
bm25_encoder = BM25Encoder().default()
bm25_encoder.fit([doc.page_content for doc in recursive_chunks])
bm25_encoder.dump("bm25_values.json")

# Hybrid retriever — alpha=0.5 weights dense and sparse equally
# alpha=0.75 biases toward semantic; alpha=0.25 biases toward keyword
hybrid_retriever = PineconeHybridSearchRetriever(
    embeddings=openai_embeddings,
    sparse_encoder=bm25_encoder,
    index=pc.Index("company-docs"),
    top_k=5,
    alpha=0.5,
)
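When your vector store lacks native hybrid support, you can fuse a dense ranking and a BM25 ranking yourself. Reciprocal rank fusion is one common approach; this sketch (all chunk IDs are illustrative) merges two ranked lists without needing the raw scores:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Merge ranked lists: each item scores sum(1 / (k + rank)) across lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["chunk_7", "chunk_2", "chunk_9"]   # vector similarity order
sparse_hits = ["chunk_2", "chunk_5", "chunk_7"]   # BM25 keyword order

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Because it works on ranks rather than scores, reciprocal rank fusion needs no score normalization between the dense and sparse retrievers, which is why it is a popular default.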

Reranking

Vector search retrieves the top-k candidates, but similarity ranking is imperfect. A cross-encoder reranker reads the full query and each candidate chunk together and assigns a more accurate relevance score. This two-stage approach (fast ANN search followed by precise reranking) is used in production by most high-quality RAG deployments.

python — reranking with cohere or sentence-transformers
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Option 1: Cohere Rerank API (easiest, production-grade)
cohere_reranker = CohereRerank(
    cohere_api_key="your-cohere-key",
    model="rerank-english-v3.0",
    top_n=3,
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# Option 2: Local cross-encoder (free, works on-prem)
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=3)
local_compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# Use exactly like a normal retriever
docs = compression_retriever.invoke("What are the security clearance requirements?")

Query Expansion

Short or ambiguous queries often miss relevant chunks. Query expansion uses the LLM to generate multiple phrasings of the same question, retrieves chunks for each, deduplicates, and merges. This increases recall substantially for enterprise queries where users do not know the exact terminology used in the source documents.

python — multi-query retriever for query expansion
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm_claude,
    # By default, generates 3 alternative phrasings of the query.
    # All results are deduplicated before being passed to the generator.
)

# This single call triggers 3 sub-queries internally
docs = multi_query_retriever.invoke(
    "How does the company handle employee complaints?"
)
print(f"Retrieved {len(docs)} unique chunks via query expansion")

Evaluating RAG Quality with RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) evaluates RAG systems across four dimensions: faithfulness (is the answer grounded in the retrieved chunks?), answer relevancy (does the answer actually address the question?), context precision (are the retrieved chunks actually relevant?), and context recall (did retrieval find all the chunks needed to answer?). A production RAG system should score above 0.7 on all four before going live.

python — ragas evaluation pipeline
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Build an evaluation dataset
# question: what the user asked
# answer: what your RAG system returned
# contexts: the chunks that were retrieved
# ground_truth: the correct answer (optional, needed for context_recall)
# answers, contexts, truths are parallel lists you collect beforehand,
# one entry per question.
eval_data = {
    "question": [
        "What is the maximum reimbursable meal allowance per day?",
        "Who approves travel requests over $5,000?",
        "What is the notice period for contract termination?",
    ],
    "answer": answers,      # your RAG system's outputs (one string each)
    "contexts": contexts,   # list of retrieved chunk strings per question
    "ground_truth": truths, # correct answers from your SME
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation — uses Claude/GPT as judge for faithfulness and relevancy
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=llm_claude,
    embeddings=openai_embeddings,
)
print(results)
# faithfulness: 0.91 (answers stick to retrieved context)
# answer_relevancy: 0.87 (answers address the actual question)
# context_precision: 0.83 (retrieved chunks are relevant)
# context_recall: 0.79 (retrieved chunks contain the answer)

df = results.to_pandas()
df.to_csv("rag_eval_results.csv", index=False)
Target a faithfulness score of 0.85+ before shipping a RAG system to production users. Below 0.80 indicates significant hallucination — revisit chunking strategy or retrieval count (k).

If your faithfulness score is low, the most common causes are: chunks that are too large (diluting relevance), retrieving too many chunks (adding noise), or a prompt that does not strongly constrain the model to use only the provided context. If context precision is low, your embedding model or chunking strategy is producing poor matches — try semantic chunking or a stronger embedding model. If context recall is low, increase k or add query expansion.

RAG for Government and Enterprise Use Cases

The same Python code you built above powers some of the highest-value AI applications deployed in government and enterprise today. RAG is not a toy — it is the architecture behind contract analysis systems, policy Q&A assistants, regulatory compliance tools, and intelligence report summarization platforms.


Security Considerations for Government Deployments

For government and regulated industry deployments, three additional requirements shape the architecture. First, air-gapped environments require local embedding models (BAAI/bge-large or all-MiniLM-L6-v2) and locally hosted LLMs (Llama 3, Mistral, or a NIST-approved deployment of a commercial model). No data should leave the network boundary. Second, chunk-level access control is required when different users have clearances for different document sets — tag each chunk's metadata with its classification level and filter at retrieval time. Third, every response must be auditable: log the query, the retrieved chunk IDs, and the model response with a timestamp for compliance review.
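The chunk-level filter can be as simple as a metadata check before chunks reach the prompt. A minimal sketch follows; the clearance ladder and chunk records are hypothetical, and stores such as Chroma and Pinecone expose an equivalent metadata filter argument at query time so the filtering happens inside the database rather than in your code.

```python
# Hypothetical clearance ladder, lowest to highest
CLEARANCE_LEVELS = ["public", "internal", "secret"]

chunks = [
    {"text": "Published budget summary.", "metadata": {"classification": "public"}},
    {"text": "Internal HR procedure.",    "metadata": {"classification": "internal"}},
    {"text": "Restricted ops detail.",    "metadata": {"classification": "secret"}},
]

def filter_by_clearance(chunks, user_clearance: str):
    """Keep only chunks classified at or below the user's clearance level."""
    limit = CLEARANCE_LEVELS.index(user_clearance)
    return [c for c in chunks
            if CLEARANCE_LEVELS.index(c["metadata"]["classification"]) <= limit]

visible = filter_by_clearance(chunks, "internal")
print([c["metadata"]["classification"] for c in visible])  # ['public', 'internal']
```

The audit-log requirement pairs naturally with this: log the user, the clearance level applied, and the IDs of the chunks that survived the filter alongside the query and response.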

On-Premise RAG Stack for Government

The entire stack described in this tutorial — from document loading through RAGAS evaluation — can run 100% on-premise with no external API dependencies. Swap OpenAIEmbeddings for HuggingFaceEmbeddings, swap ChatOpenAI for ChatOllama, and replace Pinecone with Qdrant. The Python interfaces are identical.

Build RAG systems in three days.

Precision AI Academy's hands-on bootcamp teaches RAG, agents, and AI integration from working code to deployed application. Small cohort. Real projects. Five cities, October 2026.

Reserve Your Seat

Denver · New York City · Dallas · Los Angeles · Chicago · $1,490 · October 2026

The bottom line: Building a production-grade RAG system in Python requires five steps: load documents with LangChain's document loaders, chunk with RecursiveCharacterTextSplitter (512-token chunks with 64-token overlap is a solid default), embed with OpenAI text-embedding-3-small or a local HuggingFace model, store in ChromaDB (local) or Pinecone (production), and query with a RetrievalQA chain. Evaluate with RAGAS before going live. The entire stack can run fully on-premise by swapping cloud services for local equivalents — no API dependencies required.

Frequently Asked Questions

What is RAG and why is it better than just using an LLM?

RAG (Retrieval-Augmented Generation) grounds an LLM's answers in your specific documents instead of relying on the model's frozen training data. A plain LLM will hallucinate facts it does not know, cannot reference documents that post-date its training cutoff, and has no access to proprietary or confidential content. RAG solves all three problems by retrieving relevant passages from your own document store at query time and injecting them as verified context. For enterprise use cases — policy documents, technical manuals, legal contracts, government regulations — RAG is the correct architecture in nearly every situation.

Should I use RAG or fine-tuning for my use case?

Default to RAG. Use fine-tuning only when you need the model to produce a specific output format consistently, adopt domain-specific terminology that does not appear in source documents, or perform a structured task (classification, extraction) where prompt engineering alone fails. For most document Q&A, policy lookup, and knowledge base applications, RAG is faster to implement, cheaper to update, easier to audit, and safer with confidential data. Fine-tuning and RAG can also be combined — use a fine-tuned model as the generator in your RAG pipeline.

What is the best vector database for RAG in 2026?

For local development and prototyping, ChromaDB is the standard choice — it runs in-process with zero infrastructure setup and has excellent LangChain integration. For production deployments, Pinecone is the most widely used managed service with strong filtering and hybrid search support. Weaviate and Qdrant are strong open-source alternatives with more deployment flexibility and better support for on-premise government requirements. For teams already in the AWS ecosystem, OpenSearch with its vector engine is a natural fit. Start with ChromaDB locally and migrate when you need production scale.

How do I evaluate whether my RAG system is working well?

Use the RAGAS framework. It measures four dimensions without requiring human-labeled ground truth for most metrics: Context Precision (are the retrieved chunks relevant?), Context Recall (does the retrieved context contain the answer?), Faithfulness (does the generated answer stick to the retrieved context?), and Answer Relevance (does the answer address the question?). Run RAGAS evaluations after any change to chunking strategy, embedding model, or retrieval configuration. Target Faithfulness above 0.85 and Context Precision above 0.80 before shipping to production users.



Bo Peng

AI Instructor & Founder, Precision AI Academy

Bo has trained 400+ professionals in applied AI across federal agencies and Fortune 500 companies. Former university instructor specializing in practical AI tools for non-programmers. Kaggle competitor and builder of production AI systems. He founded Precision AI Academy to bridge the gap between AI theory and real-world professional application.
