Load PDFs, web pages, Markdown, and CSV files into your RAG pipeline. Master text splitting strategies and understand how chunk size directly impacts retrieval quality.
Build a multi-source document ingestion pipeline that loads PDFs, web pages, Markdown, and CSV files, applies the right chunking strategy for each, and stores everything in a single vector store. By the end, you will have a reusable data pipeline for any RAG application.
Yesterday you built a RAG pipeline from a hardcoded string. That is fine for learning, but real RAG systems ingest thousands of documents from dozens of sources. The quality of your document loading and chunking is the single biggest factor in RAG performance — even more than the choice of embedding model or LLM. A bad chunking strategy will split key information across chunks, confuse the retriever, and produce terrible answers no matter how good your model is.
Today you will learn to load data from the four most common sources in enterprise RAG, and you will experiment with three different chunking strategies to understand when each one shines.
LangChain provides loaders for virtually every document format. Each loader returns a list of Document objects with page_content (the text) and metadata (source, page number, etc.). Metadata is critical for citations in production RAG.
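To make that shape concrete, here is a rough plain-Python stand-in for the Document class (a sketch of the structure, not the real LangChain implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Rough shape of what every LangChain loader returns."""
    page_content: str                             # the extracted text
    metadata: dict = field(default_factory=dict)  # source, page number, etc.

doc = Document(
    page_content="Employees may work remotely up to three days per week.",
    metadata={"source": "handbook.pdf", "page": 12},
)
print(doc.metadata["source"])  # this is what powers citations later
```

Every loader in this lesson produces a list of objects with exactly these two fields; the rest is just how each format fills them in.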
PDF is the most common enterprise document format. LangChain offers several PDF loaders. PyPDFLoader is the simplest and most reliable for text-based PDFs.
```python
from langchain_community.document_loaders import PyPDFLoader  # pip install pypdf

# Load a PDF — each page becomes a Document
loader = PyPDFLoader("company-handbook.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(f"Page 1 metadata: {pages[0].metadata}")
# {'source': 'company-handbook.pdf', 'page': 0}
print("Page 1 content (first 200 chars):")
print(pages[0].page_content[:200])

# For scanned PDFs (images), use OCR-based loaders
# pip install unstructured
from langchain_community.document_loaders import UnstructuredPDFLoader

loader_ocr = UnstructuredPDFLoader(
    "scanned-document.pdf",
    mode="elements"  # Preserves structure (headings, tables, etc.)
)
elements = loader_ocr.load()

# Each element has a category: Title, NarrativeText, Table, etc.
for el in elements[:5]:
    print(f"[{el.metadata.get('category', 'unknown')}] {el.page_content[:60]}")
```
Use PyPDFLoader for text-based PDFs (most modern documents). Use UnstructuredPDFLoader when PDFs contain scanned images, complex tables, or mixed layouts. Unstructured is slower but handles messy documents much better.

For web pages, WebBaseLoader fetches the HTML and a BeautifulSoup SoupStrainer filters it down to the main content:

```python
from langchain_community.document_loaders import WebBaseLoader
import bs4  # pip install beautifulsoup4

# Load a single web page
loader = WebBaseLoader(
    web_path="https://docs.python.org/3/tutorial/classes.html",
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("body-content",)  # Only parse the main content
        )
    }
)
docs = loader.load()
print(f"Loaded {len(docs)} document(s), {len(docs[0].page_content)} chars")

# Load multiple URLs at once
urls = [
    "https://docs.python.org/3/tutorial/classes.html",
    "https://docs.python.org/3/tutorial/modules.html",
    "https://docs.python.org/3/tutorial/errors.html",
]
loader_multi = WebBaseLoader(web_paths=urls)
all_docs = loader_multi.load()
print(f"Loaded {len(all_docs)} pages total")
```
```python
from langchain_community.document_loaders import (
    UnstructuredMarkdownLoader,
    CSVLoader,
    DirectoryLoader,
)

# Load a Markdown file — preserves headers and structure
md_loader = UnstructuredMarkdownLoader(
    "README.md",
    mode="elements"  # Each section becomes a Document
)
md_docs = md_loader.load()

# Load a CSV — each row becomes a Document
csv_loader = CSVLoader(
    "products.csv",
    csv_args={"delimiter": ","},
    source_column="product_name"  # Use this column as the source metadata
)
csv_docs = csv_loader.load()
print(f"CSV row 1: {csv_docs[0].page_content[:100]}")

# Load an entire directory of mixed files
dir_loader = DirectoryLoader(
    "./knowledge-base/",
    glob="**/*.md",  # Only Markdown files
    show_progress=True,
    use_multithreading=True
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} documents from directory")
```
Every loader populates the source field in metadata. For PDFs, you also get page. For CSVs, you get row. Always preserve this metadata — it powers the source citations in your RAG responses.

Once documents are loaded, you need to split them into chunks small enough for precise retrieval but large enough to contain complete thoughts. This is the most consequential decision in your RAG pipeline.
RecursiveCharacterTextSplitter is the default and best general-purpose strategy. It tries to split at natural boundaries (paragraphs, sentences, words) rather than cutting at arbitrary character positions.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# The workhorse splitter — use this by default
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Target chunk size in characters
    chunk_overlap=50,  # Overlap between adjacent chunks
    separators=[
        "\n\n",  # First try: paragraph boundaries
        "\n",    # Then: line breaks
        ". ",    # Then: sentence boundaries
        ", ",    # Then: clause boundaries
        " ",     # Then: word boundaries
        ""       # Last resort: character boundaries
    ],
    length_function=len,
)

# Split your loaded documents
chunks = splitter.split_documents(pages)  # pages from PyPDFLoader
print(f"Input: {len(pages)} pages → Output: {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")

# Inspect a few chunks to verify quality
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print(f"Metadata: {chunk.metadata}")
```
Character counts do not map cleanly to token counts, and LLMs think in tokens. If you need precise control over context window usage, split by tokens instead.
```python
from langchain.text_splitter import TokenTextSplitter  # pip install tiktoken

# Split by token count — matches how the LLM processes text
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-4 tokenizer
    chunk_size=256,    # 256 tokens per chunk
    chunk_overlap=32,  # 32 token overlap
)
token_chunks = token_splitter.split_documents(pages)
print(f"Token-based: {len(token_chunks)} chunks")

# Compare: same document, character vs. token splitting
# (use the same encoding the splitter used, so the counts line up)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
for chunk in token_chunks[:3]:
    tokens = len(enc.encode(chunk.page_content))
    chars = len(chunk.page_content)
    print(f"Tokens: {tokens}, Chars: {chars}, Ratio: {chars/tokens:.1f} chars/token")
```
Chunk size is the most important hyperparameter in RAG. Too small, and chunks lack context. Too large, and retrieval becomes imprecise because unrelated information gets bundled together. Let's run a controlled experiment.
With small chunks (around 200 characters), each chunk covers a single fact. Retrieval is very precise, but chunks may lack the context needed to answer complex questions. This works best for FAQ-style Q&A.
With large chunks (around 1,000 characters), each chunk covers a full section. This is good for complex questions needing multiple facts, but retrieval is less precise — irrelevant content comes along for the ride.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

rag_prompt = ChatPromptTemplate.from_template("""Answer based ONLY on this context:

{context}

Question: {question}

Answer:""")

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

# Test three chunk sizes
for size in [200, 500, 1000]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=size // 10
    )
    chunks = splitter.split_documents(pages)

    # Create a separate collection for each experiment
    vs = Chroma.from_documents(
        chunks,
        embeddings,
        collection_name=f"exp-{size}"
    )
    retriever = vs.as_retriever(search_kwargs={"k": 3})

    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt
        | model
        | StrOutputParser()
    )

    question = "What equipment does the company provide for remote workers?"
    answer = chain.invoke(question)
    print(f"\n--- Chunk size: {size} ({len(chunks)} chunks) ---")
    print(f"Answer: {answer}")
```
Run this experiment with your own documents and a set of test questions. You will find that 400–600 characters (roughly 100–150 tokens) is the sweet spot for most document types. Smaller chunks for short factoid questions, larger chunks for questions that require synthesizing multiple pieces of information.
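As a quick rule of thumb (assuming the roughly 4-characters-per-token ratio typical of English prose), you can translate a character-based chunk_size into an approximate token budget before reaching for tiktoken:

```python
def approx_tokens(chunk_size_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English prose; use tiktoken when you need precision."""
    return round(chunk_size_chars / chars_per_token)

# The 400–600 char sweet spot maps to roughly 100–150 tokens
for size in (200, 400, 600, 1000):
    print(f"{size} chars ≈ {approx_tokens(size)} tokens")
```

The ratio varies by language and content (code and non-English text pack fewer characters per token), so treat this as a back-of-the-envelope check, not a guarantee.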
Raw text chunks lose structural context. A chunk from page 47 of a 200-page manual means nothing without knowing which section it came from. Metadata enrichment adds that context back.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re
from datetime import datetime

def enrich_chunks(chunks, doc_title, doc_type="policy"):
    """Add rich metadata to each chunk for better filtering and citations."""
    enriched = []
    for i, chunk in enumerate(chunks):
        # Detect section headers in the content
        section_match = re.search(
            r"(?:Section|Chapter|Part)\s+\d+[:\s]+(.*?)(?:\n|$)",
            chunk.page_content
        )
        section = section_match.group(1).strip() if section_match else "Unknown"

        chunk.metadata.update({
            "doc_title": doc_title,
            "doc_type": doc_type,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "section": section,
            "char_count": len(chunk.page_content),
            "indexed_at": datetime.now().isoformat(),
        })
        enriched.append(chunk)
    return enriched

# Usage
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)
enriched = enrich_chunks(chunks, "Remote Work Policy", "policy")

# Now you can filter by metadata in retrieval
# vectorstore.similarity_search("VPN", filter={"doc_type": "policy"})
```
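One payoff of enrichment: the metadata can be rendered as inline citation tags when you format retrieved chunks for the LLM. Here is a sketch (the helper name and tag format are illustrative choices, not a LangChain API; SimpleNamespace stands in for retrieved Document objects):

```python
from types import SimpleNamespace

def format_docs_with_citations(docs):
    """Prefix each chunk with a [title, section] tag the LLM can cite."""
    parts = []
    for d in docs:
        title = d.metadata.get("doc_title", d.metadata.get("source", "unknown"))
        section = d.metadata.get("section")
        tag = f"[{title}, {section}]" if section else f"[{title}]"
        parts.append(f"{tag}\n{d.page_content}")
    return "\n\n".join(parts)

# Demo with a fake enriched chunk
chunk = SimpleNamespace(
    page_content="The company provides a laptop and monitor.",
    metadata={"doc_title": "Remote Work Policy", "section": "Equipment"},
)
print(format_docs_with_citations([chunk]))
```

Swap a helper like this in for the plain format_docs used earlier, and instruct the prompt to cite the bracketed tags, and your answers come back with traceable sources.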
Here is a reusable pipeline that loads documents from multiple sources, applies appropriate chunking, enriches metadata, and stores everything in ChromaDB.
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    CSVLoader,
    UnstructuredMarkdownLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from pathlib import Path

def load_documents(source_dir: str) -> list:
    """Load all supported document types from a directory."""
    all_docs = []
    source = Path(source_dir)

    # PDFs
    for pdf in source.glob("**/*.pdf"):
        loader = PyPDFLoader(str(pdf))
        all_docs.extend(loader.load())

    # Markdown
    for md in source.glob("**/*.md"):
        loader = UnstructuredMarkdownLoader(str(md))
        all_docs.extend(loader.load())

    # CSVs
    for csv_file in source.glob("**/*.csv"):
        loader = CSVLoader(str(csv_file))
        all_docs.extend(loader.load())

    print(f"Loaded {len(all_docs)} documents from {source_dir}")
    return all_docs

def chunk_and_store(docs, persist_dir="./chroma_db", collection="knowledge-base"):
    """Chunk documents and store in ChromaDB."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )
    chunks = splitter.split_documents(docs)
    print(f"Created {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection,
        persist_directory=persist_dir,
    )
    print(f"Stored {vectorstore._collection.count()} vectors")
    return vectorstore

# Run the pipeline
docs = load_documents("./knowledge-base")
vectorstore = chunk_and_store(docs)
```
Before moving on, confirm you can do the following:
- Load a PDF with PyPDFLoader and inspect the page-level metadata.
- Load a web page with WebBaseLoader and filter to the main content using BeautifulSoup.
- Explain why RecursiveCharacterTextSplitter is better than splitting at fixed character positions.