Day 02: Data Ingestion

Document Loading & Chunking Strategies

Load PDFs, web pages, Markdown, and CSV files into your RAG pipeline. Master text splitting strategies and understand how chunk size directly impacts retrieval quality.

~1 hour · Hands-on · Precision AI Academy

Today's Objective

Build a multi-source document ingestion pipeline that loads PDFs, web pages, Markdown, and CSV files, applies the right chunking strategy for each, and stores everything in a single vector store. By the end, you will have a reusable data pipeline for any RAG application.

Yesterday you built a RAG pipeline from a hardcoded string. That is fine for learning, but real RAG systems ingest thousands of documents from dozens of sources. The quality of your document loading and chunking is the single biggest factor in RAG performance — even more than the choice of embedding model or LLM. A bad chunking strategy will split key information across chunks, confuse the retriever, and produce terrible answers no matter how good your model is.

Today you will learn to load data from the four most common sources in enterprise RAG, and you will experiment with two chunking strategies (recursive character and token-based) and three chunk sizes to understand when each one shines.

01

Document Loaders

LangChain provides loaders for virtually every document format. Each loader returns a list of Document objects with page_content (the text) and metadata (source, page number, etc.). Metadata is critical for citations in production RAG.
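As a mental model, that two-field shape can be sketched in plain Python (this is an illustrative stand-in; LangChain's actual Document is a Pydantic model with the same fields):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Illustrative stand-in: text plus a metadata dict, like LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Document(
    page_content="Remote employees receive a $500 equipment stipend.",
    metadata={"source": "company-handbook.pdf", "page": 12},
)
print(doc.metadata["source"])  # company-handbook.pdf
```

Every loader below produces a list of objects with exactly this shape, which is why one splitter and one vector store can handle them all.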

Loading PDFs

PDF is the most common enterprise document format. LangChain offers several PDF loaders. PyPDFLoader is the simplest and most reliable for text-based PDFs.

load_pdf.py
python
from langchain_community.document_loaders import PyPDFLoader

# pip install pypdf

# Load a PDF — each page becomes a Document
loader = PyPDFLoader("company-handbook.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(f"Page 1 metadata: {pages[0].metadata}")
# {'source': 'company-handbook.pdf', 'page': 0}

print(f"Page 1 content (first 200 chars):")
print(pages[0].page_content[:200])

# For scanned PDFs (images), use OCR-based loaders
# pip install unstructured
from langchain_community.document_loaders import UnstructuredPDFLoader

loader_ocr = UnstructuredPDFLoader(
    "scanned-document.pdf",
    mode="elements"  # Preserves structure (headings, tables, etc.)
)
elements = loader_ocr.load()

# Each element has a category: Title, NarrativeText, Table, etc.
for el in elements[:5]:
    print(f"[{el.metadata.get('category', 'unknown')}] {el.page_content[:60]}")
PyPDF vs. Unstructured: Use PyPDFLoader for text-based PDFs (most modern documents). Use UnstructuredPDFLoader when PDFs contain scanned images, complex tables, or mixed layouts. Unstructured is slower but handles messy documents much better.
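If your corpus mixes text-based and scanned PDFs, one way to choose a loader automatically is to check how much text PyPDF actually extracts. A minimal heuristic sketch (the function name and the 50-character threshold are my own; tune them for your corpus):

```python
def looks_scanned(page_texts, min_chars_per_page=50):
    """Heuristic: near-zero extractable text suggests an image-based (scanned) PDF."""
    if not page_texts:
        return True
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page

# Usage sketch:
# texts = [p.page_content for p in PyPDFLoader("doc.pdf").load()]
# loader = UnstructuredPDFLoader("doc.pdf") if looks_scanned(texts) else PyPDFLoader("doc.pdf")
```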

Loading Web Pages

load_web.py
python
from langchain_community.document_loaders import WebBaseLoader
import bs4

# pip install beautifulsoup4

# Load a single web page
loader = WebBaseLoader(
    web_path="https://docs.python.org/3/tutorial/classes.html",
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("body-content",)  # Site-specific: match the main-content class of the target page
        )
    }
)
docs = loader.load()
print(f"Loaded {len(docs)} document(s), {len(docs[0].page_content)} chars")

# Load multiple URLs at once
urls = [
    "https://docs.python.org/3/tutorial/classes.html",
    "https://docs.python.org/3/tutorial/modules.html",
    "https://docs.python.org/3/tutorial/errors.html",
]
loader_multi = WebBaseLoader(web_paths=urls)
all_docs = loader_multi.load()
print(f"Loaded {len(all_docs)} pages total")
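One practical note: WebBaseLoader may warn if the USER_AGENT environment variable is unset, and some sites block clients with a default user agent. Setting one before loading is a reasonable precaution (the value below is a placeholder; use your own identifier):

```python
import os

# Identify your crawler to the sites you load from
os.environ.setdefault("USER_AGENT", "my-rag-ingester/0.1 (admin@example.com)")
```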

Loading Markdown and CSV

load_markdown_csv.py
python
from langchain_community.document_loaders import (
    UnstructuredMarkdownLoader,
    CSVLoader,
    DirectoryLoader,
)

# Load a Markdown file — preserves headers and structure
md_loader = UnstructuredMarkdownLoader(
    "README.md",
    mode="elements"  # Each section becomes a Document
)
md_docs = md_loader.load()

# Load a CSV — each row becomes a Document
csv_loader = CSVLoader(
    "products.csv",
    csv_args={"delimiter": ","},
    source_column="product_name"  # Use this column as the source metadata
)
csv_docs = csv_loader.load()
print(f"CSV row 1: {csv_docs[0].page_content[:100]}")

# Load an entire directory of mixed files
dir_loader = DirectoryLoader(
    "./knowledge-base/",
    glob="**/*.md",  # Only Markdown files
    show_progress=True,
    use_multithreading=True
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} documents from directory")
Metadata matters. Every loader automatically sets a source field in metadata. For PDFs, you also get page. For CSVs, you get row. Always preserve this metadata — it powers the source citations in your RAG responses.
02

Text Splitting Strategies

Once documents are loaded, you need to split them into chunks small enough for precise retrieval but large enough to contain complete thoughts. This is the most consequential decision in your RAG pipeline.

Strategy 1: Recursive Character Splitting

This is the default and best general-purpose strategy. It tries to split at natural boundaries (paragraphs, sentences, words) rather than cutting at arbitrary character positions.

recursive_splitter.py
python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# The workhorse splitter — use this by default
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Target chunk size in characters
    chunk_overlap=50,     # Overlap between adjacent chunks
    separators=[
        "\n\n",   # First try: paragraph boundaries
        "\n",     # Then: line breaks
        ". ",     # Then: sentence boundaries
        ", ",     # Then: clause boundaries
        " ",      # Then: word boundaries
        ""        # Last resort: character boundaries
    ],
    length_function=len,
)

# Split your loaded documents
chunks = splitter.split_documents(pages)  # pages from PyPDFLoader

print(f"Input: {len(pages)} pages → Output: {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")

# Inspect a few chunks to verify quality
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print(f"Metadata: {chunk.metadata}")
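To make "recursive" concrete, here is a simplified pure-Python sketch of the idea (not LangChain's implementation; it ignores chunk overlap and the library's merging heuristics):

```python
def recursive_split(text, chunk_size, separators):
    """Simplified sketch of recursive splitting (no chunk overlap)."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate               # still fits: keep accumulating
        else:
            if current:
                chunks.append(current)        # flush the full chunk
            if len(part) > chunk_size:
                # This piece alone is too big: retry with finer separators
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

text = "para one.\n\npara two is a bit longer here.\n\npara three."
chunks = recursive_split(text, 30, ["\n\n", "\n", ". ", " ", ""])
# Splits at paragraph boundaries first; every chunk stays within 30 chars
```

The key property: a paragraph is only ever broken at a sentence, then a word, then a character boundary, and only when it will not fit whole.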

Strategy 2: Token-Based Splitting

Character counts do not map cleanly to token counts, and LLMs think in tokens. If you need precise control over context window usage, split by tokens instead.

token_splitter.py
python
from langchain.text_splitter import TokenTextSplitter

# pip install tiktoken

# Split by token count — matches how the LLM processes text
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-4 tokenizer (GPT-4o uses o200k_base)
    chunk_size=256,               # 256 tokens per chunk
    chunk_overlap=32,             # 32 token overlap
)

token_chunks = token_splitter.split_documents(pages)
print(f"Token-based: {len(token_chunks)} chunks")

# Compare: same document, character vs. token splitting
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # same encoding the splitter used

for chunk in token_chunks[:3]:
    tokens = len(enc.encode(chunk.page_content))
    chars = len(chunk.page_content)
    print(f"Tokens: {tokens}, Chars: {chars}, Ratio: {chars/tokens:.1f} chars/token")
When to use token splitting: When you are stuffing retrieved chunks into a prompt with a hard token limit and need to guarantee you never exceed it. For most RAG applications, recursive character splitting is simpler and works just as well.
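If you do go token-based for that reason, the chunk budget is simple arithmetic. A sketch (the numbers are illustrative assumptions, not recommendations):

```python
def chunk_token_budget(context_window: int, prompt_overhead: int,
                       answer_reserve: int, k: int) -> int:
    """Largest chunk size (in tokens) such that k retrieved chunks always fit."""
    available = context_window - prompt_overhead - answer_reserve
    return available // k

# Illustrative: 8,192-token window, ~200-token prompt template,
# 1,000 tokens reserved for the answer, retrieving k=4 chunks
budget = chunk_token_budget(8192, 200, 1000, 4)
print(budget)  # 1748
```

Set the splitter's chunk_size at or below this budget and the prompt can never overflow, regardless of which chunks are retrieved.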
03

Chunk Size Experiments

Chunk size is the most important hyperparameter in RAG. Too small, and chunks lack context. Too large, and retrieval becomes imprecise because unrelated information gets bundled together. Let's run a controlled experiment.

Small Chunks (200 chars)

High Precision

Each chunk covers a single fact. Retrieval is very precise. But chunks may lack the context needed to answer complex questions. Works best for FAQ-style Q&A.

Large Chunks (1500 chars)

More Context

Each chunk covers a full section. Good for complex questions needing multiple facts. But retrieval is less precise — irrelevant content comes along for the ride.

chunk_experiment.py
python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

rag_prompt = ChatPromptTemplate.from_template("""Answer based ONLY on this context:
{context}

Question: {question}
Answer:""")

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

# Test three chunk sizes
for size in [200, 500, 1000]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size, chunk_overlap=size // 10
    )
    chunks = splitter.split_documents(pages)

    # Create a separate collection for each experiment
    vs = Chroma.from_documents(
        chunks, embeddings, collection_name=f"exp-{size}"
    )
    retriever = vs.as_retriever(search_kwargs={"k": 3})

    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt | model | StrOutputParser()
    )

    question = "What equipment does the company provide for remote workers?"
    answer = chain.invoke(question)

    print(f"\n--- Chunk size: {size} ({len(chunks)} chunks) ---")
    print(f"Answer: {answer}")

Run this experiment with your own documents and a set of test questions. You will find that 400–600 characters (roughly 100–150 tokens) is the sweet spot for most document types. Smaller chunks for short factoid questions, larger chunks for questions that require synthesizing multiple pieces of information.
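A quick way to make that experiment systematic is a crude keyword-hit score over a small test set. This is only a rough proxy for answer quality (the keywords below are hypothetical); use it to compare chunk sizes against each other, not as an absolute metric:

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords present in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords)

# Hypothetical test set: question -> keywords a good answer should mention
test_set = {
    "What equipment does the company provide for remote workers?":
        ["laptop", "monitor", "stipend"],
}

# For each chunk size, average keyword_score(chain.invoke(q), kws) over the set
score = keyword_score("Employees get a laptop and a $500 stipend.",
                      ["laptop", "monitor", "stipend"])
print(f"{score:.2f}")  # 0.67
```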

04

Metadata Enrichment

Raw text chunks lose structural context. A chunk from page 47 of a 200-page manual means nothing without knowing which section it came from. Metadata enrichment adds that context back.

enrich_metadata.py
python
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re
from datetime import datetime

def enrich_chunks(chunks, doc_title, doc_type="policy"):
    """Add rich metadata to each chunk for better filtering and citations."""
    enriched = []
    for i, chunk in enumerate(chunks):
        # Detect section headers in the content
        section_match = re.search(
            r"(?:Section|Chapter|Part)\s+\d+[:\s]+(.*?)(?:\n|$)",
            chunk.page_content
        )
        section = section_match.group(1).strip() if section_match else "Unknown"

        chunk.metadata.update({
            "doc_title": doc_title,
            "doc_type": doc_type,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "section": section,
            "char_count": len(chunk.page_content),
            "indexed_at": datetime.now().isoformat(),
        })
        enriched.append(chunk)
    return enriched

# Usage
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)
enriched = enrich_chunks(chunks, "Remote Work Policy", "policy")

# Now you can filter by metadata in retrieval
# vectorstore.similarity_search("VPN", filter={"doc_type": "policy"})
Do not skip metadata. In production RAG with hundreds of documents, metadata filtering is how you scope retrieval to relevant sources. Without it, a question about your security policy might retrieve chunks from your vacation policy just because they share similar vocabulary.
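For more than one condition, Chroma requires an explicit $and in its where-filter syntax. A small helper, assuming Chroma's operator syntax ($eq, $and; the helper name is my own):

```python
def build_filter(**conditions):
    """Build a Chroma-style metadata filter; multiple conditions need an explicit $and."""
    clauses = [{key: {"$eq": value}} for key, value in conditions.items()]
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

f = build_filter(doc_type="policy", section="Security")
# vectorstore.similarity_search("VPN", filter=f)
```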
05

The Complete Ingestion Pipeline

Here is a reusable pipeline that loads documents from multiple sources, applies appropriate chunking, enriches metadata, and stores everything in ChromaDB.

ingest_pipeline.py
python
from langchain_community.document_loaders import (
    PyPDFLoader, WebBaseLoader, CSVLoader, DirectoryLoader,
    UnstructuredMarkdownLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from pathlib import Path

def load_documents(source_dir: str) -> list:
    """Load all supported document types from a directory."""
    all_docs = []
    source = Path(source_dir)

    # PDFs
    for pdf in source.glob("**/*.pdf"):
        loader = PyPDFLoader(str(pdf))
        all_docs.extend(loader.load())

    # Markdown
    for md in source.glob("**/*.md"):
        loader = UnstructuredMarkdownLoader(str(md))
        all_docs.extend(loader.load())

    # CSVs
    for csv_file in source.glob("**/*.csv"):
        loader = CSVLoader(str(csv_file))
        all_docs.extend(loader.load())

    print(f"Loaded {len(all_docs)} documents from {source_dir}")
    return all_docs

def chunk_and_store(docs, persist_dir="./chroma_db", collection="knowledge-base"):
    """Chunk documents and store in ChromaDB."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50
    )
    chunks = splitter.split_documents(docs)
    print(f"Created {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection,
        persist_directory=persist_dir,
    )
    print(f"Stored {vectorstore._collection.count()} vectors")
    return vectorstore

# Run the pipeline
docs = load_documents("./knowledge-base")
vectorstore = chunk_and_store(docs)
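One caveat: re-running this pipeline re-adds every chunk, duplicating vectors. A common fix is deterministic chunk IDs, since recent Chroma versions overwrite an existing ID rather than duplicating it; a sketch, assuming you pass them via the ids parameter of from_documents:

```python
import hashlib

def chunk_id(source: str, chunk_index: int, content: str) -> str:
    """Deterministic ID so re-ingesting the same chunk maps to the same vector."""
    raw = f"{source}:{chunk_index}:{content}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

# ids = [chunk_id(c.metadata.get("source", ""), i, c.page_content)
#        for i, c in enumerate(chunks)]
# Chroma.from_documents(chunks, embeddings, ids=ids, ...)
```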

Day 2 Checkpoint

Before moving on, confirm you can do the following:

- Load PDFs, web pages, Markdown, and CSV files into Document objects with their metadata intact
- Explain when to use PyPDFLoader versus UnstructuredPDFLoader
- Split documents with RecursiveCharacterTextSplitter and TokenTextSplitter, and explain when each applies
- Describe how chunk size trades retrieval precision against context
- Enrich chunks with metadata for filtering and citations
- Run the complete multi-source ingestion pipeline into ChromaDB

Continue To Day 3
Vector Databases & Embeddings Deep Dive