Learn what Retrieval Augmented Generation is, why LLMs hallucinate, and how RAG fixes it. Build a working RAG pipeline with LangChain, ChromaDB, and OpenAI in under an hour.
Build a working RAG pipeline that loads a text document, splits it into chunks, stores embeddings in ChromaDB, retrieves relevant passages for a user question, and generates a grounded answer with GPT-4o-mini. You will understand every piece of the RAG architecture by the end of this lesson.
Every large language model has the same fundamental problem: it only knows what it was trained on. Ask GPT-4 about your company's internal handbook and it will confidently fabricate an answer that sounds plausible but is completely made up. This is hallucination, and it is the single biggest barrier to deploying LLMs in production. Retrieval Augmented Generation (RAG) solves it by giving the model access to your actual data at query time.
The idea is elegantly simple. Instead of asking the LLM to answer from memory, you first search your own documents for relevant passages, then pass those passages into the prompt alongside the user's question. The model generates an answer grounded in real evidence rather than its parametric knowledge. That is RAG in one sentence: search first, then generate.
Without RAG: the user asks a question and the model answers from training data only. No access to private docs, no citations, high hallucination risk on domain-specific topics.
With RAG: the user asks a question, the system searches your documents first, retrieves relevant passages, and feeds them to the model. Grounded answers with traceable sources.
Every RAG system has two phases: indexing (done once, ahead of time) and querying (done every time a user asks a question). Understanding this separation is the key to understanding RAG.
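The two phases can be sketched in plain Python. This toy version uses a bag-of-words "embedding" and a dot-product score as stand-ins for a real embedding model and vector store, purely to make the indexing/querying separation concrete:

```python
# Toy sketch of the two RAG phases. The "embedding" is just word counts,
# so this is an illustration of the architecture, not a real retriever.

def embed(text):
    # Stand-in for an embedding model: lowercase word counts.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    # Dot product over the shared vocabulary of two sparse vectors.
    return sum(a[w] * b.get(w, 0) for w in a)

# --- Indexing phase: done once, ahead of time ---
def build_index(chunks):
    return [(embed(c), c) for c in chunks]

# --- Query phase: done on every user question ---
def retrieve(question, index, k=2):
    q_vec = embed(question)
    ranked = sorted(index, key=lambda pair: similarity(q_vec, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

chunks = [
    "The company reimburses up to $100/month for internet.",
    "Core hours are 10 AM to 3 PM Eastern Time.",
    "VPN must be active when accessing internal resources.",
]
index = build_index(chunks)
print(retrieve("What are the core hours", index, k=1))
```

Everything after the `build_index` call belongs to the query phase; in a real system the index would already be sitting on disk when the first question arrives.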
Let's install everything you need. We are using LangChain as the orchestration framework, ChromaDB as the vector store (it runs locally, no account needed), and OpenAI for both embeddings and the chat model.
```bash
# Create a project directory
mkdir rag-course && cd rag-course
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install langchain langchain-openai langchain-community chromadb

# Set your API key
export OPENAI_API_KEY="sk-..."

# Verify the install
python -c "import langchain; import chromadb; print('Ready')"
```
Before building, it is worth understanding exactly why hallucination happens so you can appreciate what RAG fixes.
LLMs are trained on massive text corpora to predict the next token. They learn patterns, facts, and reasoning from that data. But they have no mechanism to verify whether a generated statement is true. The model does not "know" things the way a database does — it has compressed statistical patterns from training data. When asked about information it has not seen, or has seen insufficiently, it generates the most statistically likely continuation. That continuation often sounds authoritative but is wrong.
There are three main sources of hallucination:
- Missing knowledge: the information was never in the training data at all, such as your private documents.
- Stale knowledge: the information existed at training time but has since changed, so the model confidently reproduces an outdated version.
- Lossy compression: the model saw the information too rarely to store it reliably, so it reconstructs a plausible-sounding approximation instead.
RAG attacks all three problems by supplying the relevant information directly. The model no longer needs to recall facts from compressed parameters — the facts are right there in the context window.
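"The facts are right there in the context window" means something very literal: the retrieved passages are pasted into the prompt string, and the model reads them like any other input. A minimal sketch of that prompt assembly (the passages and question here are invented for illustration):

```python
# The core RAG move: stuff retrieved passages directly into the prompt,
# so the model answers from them instead of from its parameters.

retrieved_passages = [
    "Section 2: The IT department will reimburse up to $100/month for internet costs.",
    "Section 3: Core hours are 10 AM to 3 PM Eastern Time.",
]
question = "Does the company reimburse internet costs?"

prompt = (
    "Answer using ONLY the context below.\n\n"
    "Context:\n" + "\n\n".join(retrieved_passages) + "\n\n"
    "Question: " + question
)
print(prompt)
```

Everything the model needs to answer correctly is now inside `prompt`; no recall from training data is required.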
Now let's build the complete pipeline. We will create a simple knowledge base from a text string, embed it, store it in ChromaDB, and query it. This is a minimal but fully functional RAG system.
In a real system you would load PDFs, web pages, or databases. For this first example, we will use a text string representing a company's remote work policy. This keeps the focus on the RAG mechanics, not the data loading (that is Day 2).
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Our "knowledge base" — a company remote work policy
policy_text = """
Remote Work Policy — Effective January 2026

Section 1: Eligibility
All full-time employees who have completed their 90-day probation period
are eligible for remote work. Contractors and part-time employees must
obtain written approval from their department head.

Section 2: Equipment
The company provides a laptop, monitor, and keyboard for remote workers.
Employees are responsible for their own internet connection, which must
be at least 50 Mbps download speed. The IT department will reimburse up
to $100/month for internet costs upon submission of receipts.

Section 3: Work Hours
Remote employees must be available during core hours: 10 AM to 3 PM
Eastern Time. Outside of core hours, employees may set their own
schedule as long as they complete 40 hours per week. All meetings must
be attended with camera on.

Section 4: Security
All work must be performed on company-issued devices. Personal devices
may not be used to access company systems. VPN must be active at all
times when accessing internal resources. Two-factor authentication is
required for all company accounts.

Section 5: Performance
Remote employees are evaluated on output, not hours logged. Managers
will conduct monthly 1:1 check-ins. If performance declines, the
employee may be required to return to the office for a 30-day
improvement period.
"""

# Wrap in a LangChain Document object
doc = Document(
    page_content=policy_text,
    metadata={"source": "remote-work-policy.pdf", "version": "2026-01"}
)

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents([doc])

print(f"Split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk.page_content)} chars):")
    print(chunk.page_content[:100] + "...")
```
The RecursiveCharacterTextSplitter tries to split at paragraph boundaries first (\n\n), then sentences (. ), then words. The chunk_overlap of 50 characters means adjacent chunks share some text, which preserves context across boundaries. We will explore chunking strategies in detail on Day 2.
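To see what the overlap buys you, here is a stripped-down fixed-size chunker: each new chunk starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share their boundary text. (LangChain's splitter additionally prefers paragraph and sentence boundaries, which this toy version ignores.)

```python
# Minimal sliding-window chunker to illustrate chunk_overlap: adjacent
# chunks repeat `overlap` characters so no sentence is cut off blind.

def chunk_text(text, chunk_size=20, overlap=5):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Remote employees must be available during core hours."
for c in chunk_text(text):
    print(repr(c))
```

Notice in the output that the last 5 characters of each chunk reappear at the start of the next one; that shared text is what keeps a sentence readable even when a chunk boundary lands in the middle of it.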
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create ChromaDB vector store from our chunks
# This embeds all chunks and stores them locally
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="company-policies",
    persist_directory="./chroma_db"
)
print(f"Stored {vectorstore._collection.count()} vectors in ChromaDB")

# Test a similarity search
results = vectorstore.similarity_search(
    "What equipment does the company provide?",
    k=3
)
print("\nTop 3 results:")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
Now connect retrieval to generation. This is where the magic happens — the LLM answers using your documents, not its training data.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load the existing vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="company-policies"
)

# Create a retriever (returns top 3 chunks)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# The RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have that information."

Context:
{context}

Question: {question}

Answer:""")

# Helper to format retrieved docs into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# The RAG chain using LCEL
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)

# Ask questions!
questions = [
    "What internet speed is required for remote work?",
    "Can contractors work remotely?",
    "What are the core hours?",
    "What happens if performance declines?",
]

for q in questions:
    answer = rag_chain.invoke(q)
    print(f"\nQ: {q}")
    print(f"A: {answer}")
```
Look at the chain construction carefully. The retriever | format_docs pipeline takes the user's question, searches ChromaDB, and formats the results. RunnablePassthrough() passes the question through unchanged. Both feed into the prompt template, which goes to the model, which goes to the output parser. That is the entire RAG pipeline in one expression.
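If the LCEL pipe syntax feels opaque, here is roughly what the chain desugars to, written as a plain function. The retriever and model are stubs here (the real ones call ChromaDB and OpenAI), and the names are illustrative, not LangChain APIs:

```python
# Plain-function sketch of the LCEL chain's data flow. The stubs stand
# in for the real vector store search and chat-model call.

def retriever(question):
    # Stub: a real retriever would search ChromaDB for the top-k chunks.
    return ["Internet connection must be at least 50 Mbps download speed."]

def format_docs(docs):
    return "\n\n".join(docs)

def model(prompt):
    # Stub: a real call would go to gpt-4o-mini.
    return f"[model answer, prompt was {len(prompt)} chars]"

def rag_chain(question):
    inputs = {
        "context": format_docs(retriever(question)),  # retriever | format_docs
        "question": question,                          # RunnablePassthrough()
    }
    prompt = (
        "Answer the question based ONLY on the following context.\n\n"
        f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\n\nAnswer:"
    )
    return model(prompt)

print(rag_chain("What internet speed is required?"))
```

The dict at the head of the LCEL expression plays the role of `inputs` here: both keys are computed from the same incoming question, then handed to the prompt template together.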
Use temperature=0 when you want factual, grounded answers. Higher temperatures introduce randomness, which is the opposite of what you want when the model should stick to retrieved context.
When you run this code, you will see answers that are directly traceable to the policy text. Ask "What internet speed is required?" and the answer will say "50 Mbps download speed" — because that exact phrase is in the retrieved context. Ask "What is the company's stock price?" and the model should respond "I don't have that information" because nothing in the policy mentions stock prices.
This is the fundamental value proposition of RAG: the model can only answer from what it retrieves. No retrieval, no answer. That constraint is what makes RAG trustworthy.
But RAG is not magic. The quality of your answers depends entirely on:
- Chunking: if relevant information is split awkwardly across chunks, retrieval returns fragments instead of usable passages (Day 2 covers this in depth).
- Retrieval: if the right chunk does not land in the top-k results, the model never sees it and cannot answer.
- The prompt: if the instructions do not force the model to stay within the context, it may quietly fall back on its parametric knowledge.
Production RAG systems need citations so users can verify answers. Here is how to return the source documents alongside the answer:
```python
from langchain_core.runnables import RunnableParallel

# Chain that returns both the answer and source documents
rag_chain_with_sources = RunnableParallel(
    {
        "answer": (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | rag_prompt
            | model
            | StrOutputParser()
        ),
        "sources": retriever,
    }
)

result = rag_chain_with_sources.invoke(
    "Does the company reimburse internet costs?"
)

print(f"Answer: {result['answer']}\n")
print("Sources:")
for doc in result["sources"]:
    print(f"  - [{doc.metadata['source']}] {doc.page_content[:60]}...")
```
RunnableParallel runs both branches at the same time — one generates the answer, the other returns the raw retrieved documents. The caller gets a dictionary with both the answer and the source chunks that produced it.
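One practical follow-up: the three retrieved chunks often come from the same file, so a user-facing answer should deduplicate the citations. A small sketch of that step, with a `SimpleDoc` dataclass standing in for LangChain's Document and an invented result dict:

```python
# Turn the {"answer": ..., "sources": [...]} result into a cited answer,
# listing each source file only once. SimpleDoc mimics LangChain's
# Document shape (page_content + metadata) for illustration.

from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def render_with_citations(result):
    seen, citations = set(), []
    for doc in result["sources"]:
        src = doc.metadata.get("source", "unknown")
        if src not in seen:
            seen.add(src)
            citations.append(src)
    return result["answer"] + "\n\nSources: " + ", ".join(citations)

result = {
    "answer": "Yes, up to $100/month with receipts.",
    "sources": [
        SimpleDoc("The IT department will reimburse...", {"source": "remote-work-policy.pdf"}),
        SimpleDoc("Employees are responsible for...", {"source": "remote-work-policy.pdf"}),
    ],
}
print(render_with_citations(result))
```

With a multi-document knowledge base the same helper would list each distinct file once, giving users a short, scannable citation line.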
Before moving on, confirm you understand these core concepts:
- Why does RecursiveCharacterTextSplitter use chunk_overlap?
- What does similarity_search actually do under the hood (embed query, cosine similarity, top-k)?
- Why set temperature=0 for RAG chains?
- How does RunnablePassthrough work in the LCEL chain?
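For the similarity-search question, the three steps are easy to replay by hand. This sketch uses tiny made-up vectors in place of real embeddings, purely to show the mechanics:

```python
# The three steps behind a similarity search: (1) embed the query,
# (2) score every stored vector with cosine similarity, (3) keep top-k.
# Vectors here are hand-made stand-ins for real embeddings.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

stored = {
    "equipment chunk":  [0.9, 0.1, 0.0],
    "core hours chunk": [0.1, 0.9, 0.1],
    "security chunk":   [0.0, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # step 1: embed the query (stubbed)

# steps 2 and 3: score every stored vector, keep the k best
k = 2
ranked = sorted(stored, key=lambda name: cosine(query_vector, stored[name]), reverse=True)
print(ranked[:k])  # → ['equipment chunk', 'core hours chunk']
```

A real vector store does exactly this (often with an approximate-nearest-neighbor index so it scales past brute-force scoring), which is why `similarity_search` needs no keyword match at all: closeness in embedding space is the whole ranking signal.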