Go beyond basic vector search. Combine keyword and semantic retrieval, rerank with cross-encoders, and transform queries with HyDE, multi-query, and contextual compression for dramatically better results.
Build an advanced retrieval pipeline that uses hybrid search (BM25 + vector), reranks results with a cross-encoder, and generates multiple query variations to maximize recall. This is the kind of retrieval stack that production RAG systems rely on.
Simple similarity search works surprisingly well for many use cases, but it has real limitations. It misses exact keyword matches (searching for "HIPAA" might not rank a chunk containing "HIPAA" highest if other chunks are semantically similar). It struggles with short, ambiguous queries. And it returns results in order of vector similarity, which is not always the same as order of relevance. Today you learn the techniques that production RAG systems use to overcome these limitations.
Vector search captures semantic meaning ("laptop" matches "computer"). Keyword search (BM25) captures exact term matches ("HIPAA" matches "HIPAA"). Hybrid search combines both to get the best of both worlds.
Vector search alone: great at understanding meaning and paraphrases, but it misses exact term matches, can return vaguely related results, and struggles with acronyms, product names, and codes.

Hybrid search: catches both meaning and exact terms, is more robust to query variation, handles acronyms and domain-specific terms, and is the industry standard for production RAG.
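Under the hood, LangChain's EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF): each document earns `weight / (c + rank)` from every list it appears in, and documents found by both retrievers rise to the top. A minimal sketch of that fusion, using the conventional constant c = 60 and made-up document IDs:

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each doc scores weight / (c + rank) per list."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: BM25 and vector search each return doc IDs in rank order
bm25_hits = ["hipaa_policy", "security_faq", "onboarding"]
vector_hits = ["security_faq", "remote_work", "hipaa_policy"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(fused)  # "security_faq" wins: rank 2 for BM25 AND rank 1 for vector
```

Note how `security_faq` outranks `hipaa_policy` even though the latter is BM25's top hit: appearing high in the heavier-weighted vector list matters more.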
```python
# pip install rank-bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# 1. BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(
    chunks,
    k=5,  # Return top 5 keyword matches
)

# 2. Vector retriever (from ChromaDB)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 3. Combine with EnsembleRetriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # 40% keyword, 60% semantic
)

# Test: exact term match
results = hybrid_retriever.invoke("HIPAA compliance requirements")
for r in results:
    print(f"- {r.page_content[:80]}...")

# Test: semantic match
results2 = hybrid_retriever.invoke("What health data rules apply?")
for r in results2:
    print(f"- {r.page_content[:80]}...")
```
Retrievers return results fast but imprecisely. They score each document independently against the query. A cross-encoder reranker takes the query and each candidate document together, scores the pair for relevance, and re-orders the results. This is dramatically more accurate because the model sees both the query and document simultaneously.
The typical pattern is: retrieve many candidates cheaply (say 10–20, fast), then rerank down to the best 3–5 (accurate).
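The two-stage shape is independent of any library. Here is a hypothetical sketch of the pattern, where `retrieve` and `score_pair` are toy stand-ins (word-overlap heuristics, not real retriever or cross-encoder APIs):

```python
def retrieve_then_rerank(query, retrieve, score_pair, n_candidates=20, top_k=3):
    """Stage 1: cheap, high-recall retrieval. Stage 2: precise pairwise scoring."""
    candidates = retrieve(query, n_candidates)  # fast, approximate
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_k]                       # keep only the best

# Toy corpus and stand-in scoring functions
corpus = [
    "VPN access is mandatory for all remote workers.",
    "Remote workers must use company-issued laptops.",
    "The office cafeteria is open from 8am to 3pm.",
]

def retrieve(query, n):
    # Crude recall stage: rank by raw term overlap
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))[:n]

def score_pair(query, doc):
    # Finer-grained precision stage: overlap normalized by document length
    terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(terms & doc_terms) / len(doc_terms)

top = retrieve_then_rerank("remote VPN requirements", retrieve, score_pair, top_k=2)
print(top)
```

In the real pipeline, `retrieve` is the hybrid retriever and `score_pair` is a cross-encoder forward pass; the structure is identical.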
```python
# Option 1: Cohere Reranker (API-based, highest quality)
# pip install cohere langchain-cohere
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

# Wrap the base retriever with a reranker
cohere_reranker = CohereRerank(
    model="rerank-english-v3.0",
    cohere_api_key="your-key",
    top_n=3,  # Return top 3 after reranking
)

reranked_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=hybrid_retriever,  # From the hybrid search above
)

# The retriever now: hybrid search → 10 candidates → rerank → top 3
results = reranked_retriever.invoke("What are the VPN requirements?")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
```python
# Option 2: Local cross-encoder (free, no API key needed)
# pip install sentence-transformers
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

# Load a cross-encoder model
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

reranker = CrossEncoderReranker(
    model=cross_encoder,
    top_n=3,
)

# Same pattern: wrap base retriever
local_reranked = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vector_retriever,  # Retrieves 5 candidates, reranks to 3
)

results = local_reranked.invoke("What is the internet speed requirement?")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
A single query might not capture everything the user needs. Multi-query retrieval uses the LLM to generate several variations of the user's question, runs all of them through the retriever, and deduplicates the results. This substantially improves recall, because each phrasing surfaces chunks the others miss.
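The merge step itself is simple: run each query variation through the retriever, then combine the result lists while dropping duplicates. A minimal sketch with strings standing in for `Document` objects (the real `MultiQueryRetriever` dedupes on document content; the doc IDs below are made up):

```python
def merge_unique(result_lists):
    """Merge per-query result lists, keeping the first occurrence of each doc."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hits for three hypothetical query variations, with overlap
per_query_hits = [
    ["remote_policy", "equipment"],     # "What are the rules about working from home?"
    ["remote_policy", "vpn_setup"],     # "What are the requirements for working remotely?"
    ["telecommute_faq", "equipment"],   # "What are the guidelines for telecommuting?"
]

docs = merge_unique(per_query_hits)
print(docs)  # 6 raw hits collapse to 4 unique documents
```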
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

# Create multi-query retriever
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_retriever,
    llm=model,
)

# Enable logging to see the generated queries
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# User asks one question, the LLM generates 3 variations
results = multi_retriever.invoke(
    "What are the rules about working from home?"
)

# You'll see in the logs something like:
# Generated queries:
# 1. What is the company's remote work policy?
# 2. What are the requirements for working remotely?
# 3. What are the guidelines for telecommuting?

print(f"Retrieved {len(results)} unique documents from 3 query variations")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
HyDE is a clever technique: instead of embedding the raw question, ask the LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. The hypothetical answer is often closer in embedding space to the actual answers in your corpus than the original question is.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Step 1: Generate a hypothetical document
hyde_prompt = ChatPromptTemplate.from_template(
    """Write a short paragraph that would answer this question.
Write as if you are an authoritative source. Be specific and detailed.

Question: {question}

Hypothetical answer:"""
)

hyde_chain = hyde_prompt | model | StrOutputParser()

# Step 2: Embed the hypothetical answer (not the question!)
def hyde_search(question: str, vectorstore, k=3):
    # Generate hypothetical answer
    hypothesis = hyde_chain.invoke({"question": question})
    print(f"Hypothesis: {hypothesis[:100]}...")
    # Search using the hypothesis as the query
    return vectorstore.similarity_search(hypothesis, k=k)

# Test: compare standard vs. HyDE retrieval
question = "Can a part-time employee work from home?"

print("--- Standard search ---")
standard = vectorstore.similarity_search(question, k=3)
for r in standard:
    print(f"  {r.page_content[:60]}...")

print("\n--- HyDE search ---")
hyde_results = hyde_search(question, vectorstore)
for r in hyde_results:
    print(f"  {r.page_content[:60]}...")
```
Retrieved chunks often contain irrelevant information alongside the relevant part. Contextual compression uses an LLM to extract only the relevant portions from each retrieved chunk, reducing noise in the final prompt.
```python
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# The compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(model)

compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever,
)

# Compare: standard vs. compressed retrieval
question = "What is the internet reimbursement policy?"

print("--- Standard (full chunks) ---")
standard = vector_retriever.invoke(question)
for r in standard:
    print(f"  [{len(r.page_content)} chars] {r.page_content[:80]}...")

print("\n--- Compressed (relevant parts only) ---")
compressed = compressed_retriever.invoke(question)
for r in compressed:
    print(f"  [{len(r.page_content)} chars] {r.page_content}")
```
Here is how to combine all the techniques into a single, production-grade retrieval pipeline:
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Layer 1: Hybrid search (BM25 + vector)
bm25 = BM25Retriever.from_documents(chunks, k=10)
vectorstore = Chroma.from_documents(chunks, embeddings)
vector_ret = vectorstore.as_retriever(search_kwargs={"k": 10})
hybrid = EnsembleRetriever(
    retrievers=[bm25, vector_ret],
    weights=[0.4, 0.6],
)

# Layer 2: Multi-query (generates 3 question variations)
multi_query = MultiQueryRetriever.from_llm(retriever=hybrid, llm=model)

# Layer 3: Rerank to top 5
cross_enc = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
reranker = CrossEncoderReranker(model=cross_enc, top_n=5)
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=multi_query,
)

# Build the RAG chain
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_prompt = ChatPromptTemplate.from_template(
    """Answer based ONLY on:

{context}

Question: {question}

Answer:"""
)

chain = (
    {"context": final_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)

answer = chain.invoke("What security measures are required for remote workers?")
print(answer)
```
Before moving on, make sure you can answer these: