Go beyond basic vector search. Combine keyword and semantic retrieval, rerank with cross-encoders, and transform queries with HyDE, multi-query, and contextual compression for dramatically better results.
Build an advanced retrieval pipeline that uses hybrid search (BM25 + vector), reranks results with a cross-encoder, and generates multiple query variations to maximize recall. This is the kind of retrieval stack that production RAG systems rely on.
Simple similarity search works surprisingly well for many use cases, but it has real limitations. It misses exact keyword matches (searching for "HIPAA" might not rank a chunk containing "HIPAA" highest if other chunks are semantically similar). It struggles with short, ambiguous queries. And it returns results in order of vector similarity, which is not always the same as order of relevance. Today you learn the techniques that production RAG systems use to overcome these limitations.
Vector search captures semantic meaning ("laptop" matches "computer"). Keyword search (BM25) captures exact term matches ("HIPAA" matches "HIPAA"). Hybrid search combines both to get the best of both worlds.
Vector search alone: great at understanding meaning and paraphrases, but it misses exact term matches, can return vaguely related results, and struggles with acronyms, product names, and codes.

Hybrid search: catches both meaning and exact terms, is more robust to query variation, handles acronyms and domain-specific terms, and is the industry standard for production RAG.
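Under the hood, LangChain's EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF): each document earns `weight / (c + rank)` from every list it appears in, and documents found by both retrievers rise to the top. A minimal sketch of that fusion, using the conventional constant c = 60 and made-up document IDs:

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each doc scores weight / (c + rank) per list."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: BM25 and vector search each return doc IDs in rank order
bm25_hits = ["hipaa_policy", "security_faq", "onboarding"]
vector_hits = ["security_faq", "remote_work", "hipaa_policy"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(fused)  # "security_faq" wins: rank 2 for BM25 AND rank 1 for vector
```

Note how `security_faq` outranks `hipaa_policy` even though the latter is BM25's top hit: appearing high in the heavier-weighted vector list matters more.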
```python
# pip install rank-bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# 1. BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(
    chunks,
    k=5,  # Return top 5 keyword matches
)

# 2. Vector retriever (from ChromaDB)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 3. Combine with EnsembleRetriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # 40% keyword, 60% semantic
)

# Test: exact term match
results = hybrid_retriever.invoke("HIPAA compliance requirements")
for r in results:
    print(f"- {r.page_content[:80]}...")

# Test: semantic match
results2 = hybrid_retriever.invoke("What health data rules apply?")
for r in results2:
    print(f"- {r.page_content[:80]}...")
```
Retrievers return results fast but imprecisely. They score each document independently against the query. A cross-encoder reranker takes the query and each candidate document together, scores the pair for relevance, and re-orders the results. This is dramatically more accurate because the model sees both the query and document simultaneously.
The typical pattern is: retrieve many candidates cheaply (say 10–20, fast), then rerank down to the best 3–5 (accurate).
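The two-stage shape is independent of any library. Here is a hypothetical sketch of the pattern, where `retrieve` and `score_pair` are toy stand-ins (word-overlap heuristics, not real retriever or cross-encoder APIs):

```python
def retrieve_then_rerank(query, retrieve, score_pair, n_candidates=20, top_k=3):
    """Stage 1: cheap, high-recall retrieval. Stage 2: precise pairwise scoring."""
    candidates = retrieve(query, n_candidates)  # fast, approximate
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_k]                       # keep only the best

# Toy corpus and stand-in scoring functions
corpus = [
    "VPN access is mandatory for all remote workers.",
    "Remote workers must use company-issued laptops.",
    "The office cafeteria is open from 8am to 3pm.",
]

def retrieve(query, n):
    # Crude recall stage: rank by raw term overlap
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))[:n]

def score_pair(query, doc):
    # Finer-grained precision stage: overlap normalized by document length
    terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(terms & doc_terms) / len(doc_terms)

top = retrieve_then_rerank("remote VPN requirements", retrieve, score_pair, top_k=2)
print(top)
```

In the real pipeline, `retrieve` is the hybrid retriever and `score_pair` is a cross-encoder forward pass; the structure is identical.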
```python
# Option 1: Cohere Reranker (API-based, highest quality)
# pip install cohere langchain-cohere
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

# Wrap the base retriever with a reranker
cohere_reranker = CohereRerank(
    model="rerank-english-v3.0",
    cohere_api_key="your-key",
    top_n=3,  # Return top 3 after reranking
)

reranked_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=hybrid_retriever,  # From the hybrid search above
)

# The retriever now: hybrid search → 10 candidates → rerank → top 3
results = reranked_retriever.invoke("What are the VPN requirements?")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
```python
# Option 2: Local cross-encoder (free, no API key needed)
# pip install sentence-transformers
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

# Load a cross-encoder model
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

reranker = CrossEncoderReranker(
    model=cross_encoder,
    top_n=3,
)

# Same pattern: wrap base retriever
local_reranked = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vector_retriever,  # Retrieves 5 candidates, reranks to 3
)

results = local_reranked.invoke("What is the internet speed requirement?")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
A single query might not capture everything the user needs. Multi-query retrieval uses the LLM to generate several variations of the user's question, runs all of them through the retriever, and deduplicates the results. This substantially improves recall, because each phrasing surfaces chunks the others miss.
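The merge step itself is simple: run each query variation through the retriever, then combine the result lists while dropping duplicates. A minimal sketch with strings standing in for `Document` objects (the real `MultiQueryRetriever` dedupes on document content; the doc IDs below are made up):

```python
def merge_unique(result_lists):
    """Merge per-query result lists, keeping the first occurrence of each doc."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hits for three hypothetical query variations, with overlap
per_query_hits = [
    ["remote_policy", "equipment"],     # "What are the rules about working from home?"
    ["remote_policy", "vpn_setup"],     # "What are the requirements for working remotely?"
    ["telecommute_faq", "equipment"],   # "What are the guidelines for telecommuting?"
]

docs = merge_unique(per_query_hits)
print(docs)  # 6 raw hits collapse to 4 unique documents
```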
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

# Create multi-query retriever
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_retriever,
    llm=model,
)

# Enable logging to see the generated queries
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# User asks one question, the LLM generates 3 variations
results = multi_retriever.invoke(
    "What are the rules about working from home?"
)

# You'll see in the logs something like:
# Generated queries:
# 1. What is the company's remote work policy?
# 2. What are the requirements for working remotely?
# 3. What are the guidelines for telecommuting?

print(f"Retrieved {len(results)} unique documents from 3 query variations")
for r in results:
    print(f"- {r.page_content[:80]}...")
```
HyDE is a clever technique: instead of embedding the raw question, ask the LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. The hypothetical answer is often closer in embedding space to the actual answers in your corpus than the original question is.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Step 1: Generate a hypothetical document
hyde_prompt = ChatPromptTemplate.from_template(
    """Write a short paragraph that would answer this question.
Write as if you are an authoritative source. Be specific and detailed.

Question: {question}

Hypothetical answer:"""
)

hyde_chain = hyde_prompt | model | StrOutputParser()

# Step 2: Embed the hypothetical answer (not the question!)
def hyde_search(question: str, vectorstore, k=3):
    # Generate hypothetical answer
    hypothesis = hyde_chain.invoke({"question": question})
    print(f"Hypothesis: {hypothesis[:100]}...")
    # Search using the hypothesis as the query
    return vectorstore.similarity_search(hypothesis, k=k)

# Test: compare standard vs. HyDE retrieval
question = "Can a part-time employee work from home?"

print("--- Standard search ---")
standard = vectorstore.similarity_search(question, k=3)
for r in standard:
    print(f"  {r.page_content[:60]}...")

print("\n--- HyDE search ---")
hyde_results = hyde_search(question, vectorstore)
for r in hyde_results:
    print(f"  {r.page_content[:60]}...")
```
Retrieved chunks often contain irrelevant information alongside the relevant part. Contextual compression uses an LLM to extract only the relevant portions from each retrieved chunk, reducing noise in the final prompt.
```python
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# The compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(model)

compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever,
)

# Compare: standard vs. compressed retrieval
question = "What is the internet reimbursement policy?"

print("--- Standard (full chunks) ---")
standard = vector_retriever.invoke(question)
for r in standard:
    print(f"  [{len(r.page_content)} chars] {r.page_content[:80]}...")

print("\n--- Compressed (relevant parts only) ---")
compressed = compressed_retriever.invoke(question)
for r in compressed:
    print(f"  [{len(r.page_content)} chars] {r.page_content}")
```
Here is how to combine all the techniques into a single, production-grade retrieval pipeline:
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Layer 1: Hybrid search (BM25 + vector)
bm25 = BM25Retriever.from_documents(chunks, k=10)
vectorstore = Chroma.from_documents(chunks, embeddings)
vector_ret = vectorstore.as_retriever(search_kwargs={"k": 10})
hybrid = EnsembleRetriever(
    retrievers=[bm25, vector_ret],
    weights=[0.4, 0.6],
)

# Layer 2: Multi-query (generates 3 question variations)
multi_query = MultiQueryRetriever.from_llm(retriever=hybrid, llm=model)

# Layer 3: Rerank to top 5
cross_enc = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
reranker = CrossEncoderReranker(model=cross_enc, top_n=5)
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=multi_query,
)

# Build the RAG chain
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_prompt = ChatPromptTemplate.from_template(
    """Answer based ONLY on:

{context}

Question: {question}

Answer:"""
)

chain = (
    {"context": final_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)

answer = chain.invoke("What security measures are required for remote workers?")
print(answer)
```
Before moving on, make sure you can answer these: