Day 01 Foundations

What RAG Is & Your First RAG App

Learn what Retrieval Augmented Generation is, why LLMs hallucinate, and how RAG fixes it. Build a working RAG pipeline with LangChain, ChromaDB, and OpenAI in under an hour.

~1 hour · Hands-on · Precision AI Academy

Today's Objective

Build a working RAG pipeline that loads a text document, splits it into chunks, stores embeddings in ChromaDB, retrieves relevant passages for a user question, and generates a grounded answer with GPT-4o-mini. You will understand every piece of the RAG architecture by the end of this lesson.

Every large language model has the same fundamental problem: it only knows what it was trained on. Ask GPT-4 about your company's internal handbook and it will confidently fabricate an answer that sounds plausible but is completely made up. This is hallucination, and it is the single biggest barrier to deploying LLMs in production. Retrieval Augmented Generation (RAG) solves it by giving the model access to your actual data at query time.

The idea is elegantly simple. Instead of asking the LLM to answer from memory, you first search your own documents for relevant passages, then pass those passages into the prompt alongside the user's question. The model generates an answer grounded in real evidence rather than its parametric knowledge. That is RAG in one sentence: search first, then generate.

Without RAG

Pure LLM

User asks a question. The model answers from training data only. No access to private docs, no citations, high hallucination risk on domain-specific topics.

With RAG

LLM + Retrieval

User asks a question. The system searches your documents first, retrieves relevant passages, and feeds them to the model. Grounded answers with traceable sources.

01

The RAG Architecture

Every RAG system has two phases: indexing (done once, ahead of time) and querying (done every time a user asks a question). Understanding this separation is the key to understanding RAG.

Indexing Phase

  1. Load documents — Read your source data (PDFs, web pages, databases, text files).
  2. Split into chunks — Break documents into smaller pieces (typically 500–1000 tokens). LLMs have context limits, and smaller chunks mean more precise retrieval.
  3. Generate embeddings — Convert each chunk into a vector (a list of numbers) that captures its semantic meaning. Similar content produces similar vectors.
  4. Store in a vector database — Save the vectors and their associated text in a database optimized for similarity search (ChromaDB, Pinecone, Weaviate, etc.).

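That "similar content produces similar vectors" idea is easiest to see with cosine similarity, the distance measure most vector stores use by default. The vectors below are made-up toy values, not real embeddings:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models use ~1536 dimensions)
vec_remote_work = [0.9, 0.1, 0.8, 0.2]   # "remote work policy"
vec_wfh_rules   = [0.8, 0.2, 0.9, 0.1]   # "work from home rules"
vec_stock_price = [0.1, 0.9, 0.0, 0.7]   # "quarterly stock price"

# Related phrases score close to 1.0; unrelated ones score much lower
print(cosine_similarity(vec_remote_work, vec_wfh_rules))
print(cosine_similarity(vec_remote_work, vec_stock_price))
```

A score near 1.0 means the vectors point in nearly the same direction; that is what "semantically similar" means to a vector database.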
Query Phase

  1. Embed the question — Convert the user's question into a vector using the same embedding model.
  2. Similarity search — Find the chunks whose vectors are most similar to the question vector.
  3. Build the prompt — Combine the retrieved chunks with the user's question in a prompt template.
  4. Generate answer — Send the augmented prompt to the LLM. The model answers using the retrieved context.

Why not just stuff everything into the prompt? Context windows are large now (128K+ tokens for GPT-4o), but stuffing everything in is slow, expensive, and dilutes relevance. RAG retrieves only the most relevant passages, keeping costs low and accuracy high. A 100-page manual might have 200 chunks, but only 3–5 are relevant to any given question.

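
The cost difference is easy to quantify. A back-of-envelope sketch, using the gpt-4o-mini input price quoted in the setup section and an assumed ~500 tokens per page:

```python
# Illustrative cost comparison: stuff the whole manual vs. retrieve 3 chunks.
# Assumed numbers: a 100-page manual at ~500 tokens/page, gpt-4o-mini input
# pricing of $0.15 per million tokens (check current pricing before relying on this).
PRICE_PER_M_TOKENS = 0.15

manual_tokens = 100 * 500          # ~50,000 tokens for the full manual
retrieved_tokens = 3 * 500         # 3 retrieved chunks of ~500 tokens each

cost_stuffing = manual_tokens / 1_000_000 * PRICE_PER_M_TOKENS
cost_rag = retrieved_tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"Stuff everything: ${cost_stuffing:.6f} per question")
print(f"RAG (3 chunks):   ${cost_rag:.6f} per question")
print(f"RAG is about {cost_stuffing / cost_rag:.0f}x cheaper per question")
```

Fractions of a cent either way, but the ratio compounds quickly at thousands of queries per day, and latency and relevance dilution scale with input size too.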
02

Environment Setup

Let's install everything you need. We are using LangChain as the orchestration framework, ChromaDB as the vector store (it runs locally, no account needed), and OpenAI for both embeddings and the chat model.

setup.sh
bash
# Create a project directory
mkdir rag-course && cd rag-course
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install langchain langchain-openai langchain-community chromadb

# Set your API key
export OPENAI_API_KEY="sk-..."

# Verify the install
python -c "import langchain; import chromadb; print('Ready')"

Cost note: OpenAI embeddings (text-embedding-3-small) cost $0.02 per million tokens. For this course, you will spend less than $0.10 total. The chat model (gpt-4o-mini) is similarly cheap at $0.15 per million input tokens.

03

Why LLMs Hallucinate

Before building, it is worth understanding exactly why hallucination happens so you can appreciate what RAG fixes.

LLMs are trained on massive text corpora to predict the next token. They learn patterns, facts, and reasoning from that data. But they have no mechanism to verify whether a generated statement is true. The model does not "know" things the way a database does — it has compressed statistical patterns from training data. When asked about information it has not seen, or has seen insufficiently, it generates the most statistically likely continuation. That continuation often sounds authoritative but is wrong.

There are three main sources of hallucination:

  1. Knowledge gaps — the information was never in the training data (private documents, events after the training cutoff), so the model invents a plausible-sounding answer.
  2. Lossy compression — the model saw the information, but only rarely; what survives in its parameters is a fuzzy approximation that gets reconstructed incorrectly.
  3. No verification mechanism — the model is optimized to produce likely text, not true text, and has no way to check a statement against a source before emitting it.

RAG attacks all three problems by supplying the relevant information directly. The model no longer needs to recall facts from compressed parameters — the facts are right there in the context window.

04

Building Your First RAG Pipeline

Now let's build the complete pipeline. We will create a simple knowledge base from a text string, embed it, store it in ChromaDB, and query it. This is a minimal but fully functional RAG system.

Step 1: Prepare the Documents

In a real system you would load PDFs, web pages, or databases. For this first example, we will use a text string representing a company's remote work policy. This keeps the focus on the RAG mechanics, not the data loading (that is Day 2).

rag_basic.py
python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Our "knowledge base" — a company remote work policy
policy_text = """
Remote Work Policy — Effective January 2026

Section 1: Eligibility
All full-time employees who have completed their 90-day probation period
are eligible for remote work. Contractors and part-time employees must
obtain written approval from their department head.

Section 2: Equipment
The company provides a laptop, monitor, and keyboard for remote workers.
Employees are responsible for their own internet connection, which must
be at least 50 Mbps download speed. The IT department will reimburse up
to $100/month for internet costs upon submission of receipts.

Section 3: Work Hours
Remote employees must be available during core hours: 10 AM to 3 PM
Eastern Time. Outside of core hours, employees may set their own
schedule as long as they complete 40 hours per week. All meetings
must be attended with camera on.

Section 4: Security
All work must be performed on company-issued devices. Personal devices
may not be used to access company systems. VPN must be active at all
times when accessing internal resources. Two-factor authentication is
required for all company accounts.

Section 5: Performance
Remote employees are evaluated on output, not hours logged. Managers
will conduct monthly 1:1 check-ins. If performance declines, the
employee may be required to return to the office for a 30-day
improvement period.
"""

# Wrap in a LangChain Document object
doc = Document(
    page_content=policy_text,
    metadata={"source": "remote-work-policy.pdf", "version": "2026-01"}
)

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents([doc])

print(f"Split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk.page_content)} chars):")
    print(chunk.page_content[:100] + "...")

The RecursiveCharacterTextSplitter tries to split at paragraph boundaries first (\n\n), then line breaks (\n), then sentences (. ), then words. The chunk_overlap of 50 characters means adjacent chunks share some text, which preserves context across boundaries. We will explore chunking strategies in detail on Day 2.
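
To see why overlap matters, here is a deliberately simplified fixed-size splitter. This is not LangChain's recursive algorithm (which prefers natural boundaries), just the overlap mechanic in isolation:

```python
def split_with_overlap(text, chunk_size, overlap):
    # Advance by (chunk_size - overlap) so adjacent chunks share text
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The IT department will reimburse up to $100/month for internet costs."
chunks = split_with_overlap(text, chunk_size=40, overlap=10)

for c in chunks:
    print(repr(c))
# The last 10 characters of each full chunk reappear at the start of the
# next one, so a sentence cut mid-thought is still recoverable from its
# neighbor — that is what the overlap buys you.
```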

Step 2: Create the Vector Store

rag_basic.py (continued)
python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create ChromaDB vector store from our chunks
# This embeds all chunks and stores them locally
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="company-policies",
    persist_directory="./chroma_db"
)

print(f"Stored {vectorstore._collection.count()} vectors in ChromaDB")

# Test a similarity search
results = vectorstore.similarity_search(
    "What equipment does the company provide?",
    k=3
)

print("\nTop 3 results:")
for r in results:
    print(f"- {r.page_content[:80]}...")

What just happened: ChromaDB took each chunk, sent it to OpenAI's embedding API to get a 1536-dimensional vector, and stored the vector alongside the original text. When you search, it embeds your query the same way and finds chunks whose vectors are closest (cosine similarity).
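
Conceptually, the search step is just "rank every stored vector by similarity to the query". Here is a toy sketch with made-up three-dimensional vectors; ChromaDB does the same thing, with optimized indexing, over 1536 dimensions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# A toy "vector store": (vector, original chunk text) pairs
store = [
    ([0.9, 0.1, 0.3], "The company provides a laptop, monitor, and keyboard."),
    ([0.2, 0.8, 0.1], "Core hours are 10 AM to 3 PM Eastern Time."),
    ([0.1, 0.2, 0.9], "Two-factor authentication is required."),
]

def similarity_search(query_vector, k=2):
    # Rank every stored chunk by similarity to the query, highest first
    ranked = sorted(store, key=lambda item: cosine_similarity(item[0], query_vector), reverse=True)
    return [text for _, text in ranked[:k]]

# A query vector close to the "equipment" chunk comes back first
print(similarity_search([0.8, 0.2, 0.2], k=2))
```

Real systems replace the linear scan with an approximate nearest-neighbor index so search stays fast over millions of vectors, but the ranking logic is the same.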

Step 3: Build the RAG Chain

Now connect retrieval to generation. This is where the magic happens — the LLM answers using your documents, not its training data.

rag_chain.py
python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load the existing vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="company-policies"
)

# Create a retriever (returns top 3 chunks)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# The RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have that information."

Context:
{context}

Question: {question}

Answer:""")

# Helper to format retrieved docs into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# The RAG chain using LCEL
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)

# Ask questions!
questions = [
    "What internet speed is required for remote work?",
    "Can contractors work remotely?",
    "What are the core hours?",
    "What happens if performance declines?",
]

for q in questions:
    answer = rag_chain.invoke(q)
    print(f"\nQ: {q}")
    print(f"A: {answer}")

Look at the chain construction carefully. The retriever | format_docs pipeline takes the user's question, searches ChromaDB, and formats the results. RunnablePassthrough() passes the question through unchanged. Both feed into the prompt template, which goes to the model, which goes to the output parser. That is the entire RAG pipeline in one expression.
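
The | operator is ordinary function composition under the hood. This stripped-down sketch is not LangChain's actual Runnable implementation, but it shows why a chain reads left to right:

```python
class Step:
    """A minimal stand-in for a LangChain Runnable: wraps a function
    and overloads | so steps compose into a pipeline."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # (a | b) is a new Step that runs a, then feeds its output to b
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# Fake stages standing in for the retriever, prompt template, and model
retrieve = Step(lambda q: {"context": "50 Mbps required", "question": q})
build_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_model = Step(lambda prompt: f"Answer based on: {prompt}")

chain = retrieve | build_prompt | fake_model
print(chain.invoke("What internet speed is required?"))
```

Each stage's output becomes the next stage's input, which is exactly how the real chain threads the question through retrieval, the prompt, the model, and the parser.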

Temperature 0 for RAG. Set temperature=0 when you want factual, grounded answers. Higher temperatures introduce randomness, which is the opposite of what you want when the model should stick to retrieved context.

05

Understanding the Output

When you run this code, you will see answers that are directly traceable to the policy text. Ask "What internet speed is required?" and the answer will say "50 Mbps download speed" — because that exact phrase is in the retrieved context. Ask "What is the company's stock price?" and the model should respond "I don't have that information" because nothing in the policy mentions stock prices.

This is the fundamental value proposition of RAG: the model can only answer from what it retrieves. No retrieval, no answer. That constraint is what makes RAG trustworthy.

But RAG is not magic. The quality of your answers depends entirely on:

  1. Chunking — chunks that split mid-thought or bundle unrelated topics retrieve poorly (covered in depth on Day 2).
  2. Retrieval — if the right chunk is not among the top-k results, the model cannot use it, no matter how good the prompt is.
  3. The prompt — without an explicit instruction to answer only from context, the model will happily fall back on its training data.

06

Adding Source Citations

Production RAG systems need citations so users can verify answers. Here is how to return the source documents alongside the answer:

rag_with_sources.py
python
from langchain_core.runnables import RunnableParallel

# Chain that returns both the answer and source documents
rag_chain_with_sources = RunnableParallel(
    {
        "answer": (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | rag_prompt
            | model
            | StrOutputParser()
        ),
        "sources": retriever
    }
)

result = rag_chain_with_sources.invoke(
    "Does the company reimburse internet costs?"
)

print(f"Answer: {result['answer']}\n")
print("Sources:")
for doc in result["sources"]:
    print(f"  - [{doc.metadata['source']}] {doc.page_content[:60]}...")

RunnableParallel runs both branches at the same time — one generates the answer, the other returns the raw retrieved documents. The caller gets a dictionary with both the answer and the source chunks that produced it.
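
Stripped of the LangChain machinery, the pattern is "apply several callables to the same input and collect the results by key". A toy sketch (LangChain's real implementation can also run the branches concurrently):

```python
def run_parallel(branches, value):
    # Apply every branch to the same input; collect results by key
    return {name: fn(value) for name, fn in branches.items()}

# Fake branches standing in for the answer chain and the retriever
result = run_parallel(
    {
        "answer": lambda q: f"Yes, up to $100/month (answering: {q})",
        "sources": lambda q: ["Section 2: Equipment"],
    },
    "Does the company reimburse internet costs?",
)
print(result["answer"])
print(result["sources"])
```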

Day 1 Checkpoint

Before moving on, confirm you understand these core concepts:

  1. The two phases — indexing happens once (load, split, embed, store); querying happens per question (embed, search, prompt, generate).
  2. Embeddings — vectors that place semantically similar text close together, which is what makes similarity search possible.
  3. Chunking — documents are split into overlapping pieces so retrieval can return precise, focused passages.
  4. Grounding — the prompt restricts the model to retrieved context, which is what makes answers traceable and reduces hallucination.


Continue To Day 2
Document Loading & Chunking Strategies