Understand how embeddings work under the hood, compare embedding models, and work hands-on with vector databases. Learn similarity search algorithms and metadata filtering for precise retrieval.
Compare three embedding models on the same dataset, build vector stores with ChromaDB and Pinecone, and implement metadata-filtered similarity search. You will know exactly which embedding model and vector database to choose for your project.
Embeddings are the engine of every RAG system. When you run a similarity search, you are comparing the mathematical representation of your question against the mathematical representations of your document chunks. If those representations are poor — if the embedding model does not capture semantic meaning accurately — your retriever will return irrelevant chunks and your answers will be wrong, regardless of how good your LLM is.
Today we go deep on how embeddings work, how to choose between models, and how vector databases store and search these embeddings efficiently. This is the most technically dense day of the course, but it is also the most important for building RAG systems that actually work well.
An embedding is a vector — a list of floating-point numbers — that represents the semantic meaning of a piece of text. The key insight is that similar meanings produce similar vectors. "The cat sat on the mat" and "A feline rested on the rug" will have nearly identical embeddings, even though they share almost no words.
Embedding models are neural networks trained on massive text corpora. They learn to map text into a high-dimensional space where geometric distance corresponds to semantic similarity. OpenAI's text-embedding-3-small produces 1536-dimensional vectors. That means each chunk is represented by 1536 numbers, and similarity is measured by the angle between these vectors (cosine similarity).
```python
from langchain_openai import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed some text
texts = [
    "The company provides a laptop for remote workers.",
    "Remote employees receive a computer from the organization.",
    "The weather in Denver is sunny today.",
]
vectors = embeddings.embed_documents(texts)
print(f"Vector dimensions: {len(vectors[0])}")
print(f"First 5 values: {vectors[0][:5]}")

# Calculate cosine similarity between pairs
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar meaning → high similarity
sim_01 = cosine_sim(vectors[0], vectors[1])
print(f"\nLaptop ↔ Computer: {sim_01:.4f}")  # ~0.92

# Different meaning → low similarity
sim_02 = cosine_sim(vectors[0], vectors[2])
print(f"Laptop ↔ Weather: {sim_02:.4f}")  # ~0.45

# Embed a query the same way
query_vector = embeddings.embed_query("What equipment do remote workers get?")
for i, text in enumerate(texts):
    sim = cosine_sim(query_vector, vectors[i])
    print(f"Query ↔ '{text[:50]}...': {sim:.4f}")
```
Note the two methods: use `embed_documents` for your corpus and `embed_query` for search queries. Some embedding models encode documents and queries differently, so using the right method matters for retrieval quality.

Not all embedding models are equal. The choice of model directly impacts retrieval quality. Here are the major options and when to use each.
Proprietary APIs (OpenAI, Cohere): highest quality, simplest integration, per-token pricing. Best for most production systems. No GPU needed. OpenAI: $0.02/M tokens. Cohere: $0.10/M tokens.

Open-source models (Sentence Transformers, BGE, E5): free, run locally, full data privacy. Slightly lower quality for English but competitive on multilingual. Require a GPU for production speed. Best for air-gapped or high-volume systems.
```python
# 1. OpenAI embeddings (best general-purpose)
from langchain_openai import OpenAIEmbeddings

openai_emb = OpenAIEmbeddings(
    model="text-embedding-3-small"  # 1536 dims, $0.02/M tokens
)
# or "text-embedding-3-large" — 3072 dims, $0.13/M tokens

# 2. Cohere embeddings (best for retrieval-specific tasks)
# pip install langchain-cohere
from langchain_cohere import CohereEmbeddings

cohere_emb = CohereEmbeddings(
    model="embed-english-v3.0",  # 1024 dims
    cohere_api_key="your-key"
)

# 3. Open-source: Sentence Transformers (free, local)
# pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings

local_emb = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",  # 384 dims, fast
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Benchmark: embed the same 100 chunks with each model
import time

for name, emb in [("OpenAI", openai_emb), ("Local-BGE", local_emb)]:
    start = time.time()
    vectors = emb.embed_documents([c.page_content for c in chunks[:100]])
    elapsed = time.time() - start
    print(f"{name}: {len(vectors[0])} dims, {elapsed:.2f}s for 100 chunks")
```
Recommendation: start with `text-embedding-3-small`. It is cheap, fast, and high quality. Switch to Cohere `embed-english-v3.0` if you need the absolute best retrieval quality. Use open-source (BGE or E5) if you need full data privacy, or if you are embedding millions of documents and cost is a concern.

A vector database stores embeddings and enables fast similarity search. The choice depends on your scale, deployment model, and feature requirements.
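To make the pricing above concrete, here is a small back-of-the-envelope cost estimator. It uses the rough heuristic of ~4 characters per token; the heuristic, the example corpus size, and the helper name `estimate_embedding_cost` are illustrative, not part of the course code:

```python
# Rough embedding-cost estimator based on the per-token prices quoted above.
# Assumes ~4 characters per token — a common heuristic, not an exact count.

PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-english-v3.0": 0.10,  # Cohere
}

def estimate_embedding_cost(total_chars: int, model: str) -> float:
    """Return the approximate one-time cost in USD to embed a corpus."""
    tokens = total_chars / 4  # heuristic: ~4 chars per token
    return (tokens / 1_000_000) * PRICE_PER_MILLION_TOKENS[model]

# Example: 10,000 chunks of ~1,000 characters each = 10M characters
corpus_chars = 10_000 * 1_000
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ~${estimate_embedding_cost(corpus_chars, model):.2f}")
```

For that example corpus, `text-embedding-3-small` comes out around $0.05, which is why cost only becomes a deciding factor at very large document volumes.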
ChromaDB runs in-process (no server needed), stores data on disk, and is perfect for development and small-to-medium production workloads (up to a few million vectors). You have been using it since Day 1.
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create with persistence
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my-docs",
    collection_metadata={"hnsw:space": "cosine"}  # Similarity metric
)

# Basic similarity search
results = vectorstore.similarity_search("VPN requirements", k=3)

# Similarity search with relevance scores (0–1, higher = more relevant)
results_with_scores = vectorstore.similarity_search_with_relevance_scores(
    "VPN requirements", k=3
)
for doc, score in results_with_scores:
    print(f"Score: {score:.4f} | {doc.page_content[:60]}...")

# Metadata filtering — only search specific documents
filtered = vectorstore.similarity_search(
    "equipment policy",
    k=3,
    filter={"doc_type": "policy"}
)

# Complex filters with $and / $or
complex_filter = vectorstore.similarity_search(
    "security requirements",
    k=5,
    filter={
        "$and": [
            {"doc_type": {"$eq": "policy"}},
            {"version": {"$gte": "2025"}}
        ]
    }
)
```
Pinecone is a fully managed vector database. No infrastructure to maintain, automatic scaling, and a generous free tier. Best for production systems where you do not want to manage servers.
```python
# pip install langchain-pinecone pinecone-client
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone client
pc = Pinecone(api_key="your-pinecone-api-key")

# Create an index (run once)
index_name = "rag-course"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create LangChain vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
    namespace="company-docs"  # Namespaces isolate data within an index
)

# Search with metadata filter
results = vectorstore.similarity_search(
    "security requirements",
    k=3,
    filter={"doc_type": {"$eq": "policy"}}
)

# Use as a retriever in a RAG chain
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5, "namespace": "company-docs"}
)
```
When you have millions of vectors, you cannot compare every one to the query vector — that would be far too slow. Vector databases use approximate nearest neighbor (ANN) algorithms to search efficiently.
The three most common algorithms:

- HNSW (Hierarchical Navigable Small World): builds a layered graph of vectors; a search navigates greedily from a sparse top layer down to the dense bottom layer. Excellent speed/recall tradeoff, at the cost of higher memory use.
- IVF (Inverted File index): clusters the vectors and searches only the few clusters nearest to the query. Lower memory and fast to build, but needs a training step and gives slightly lower recall.
- LSH (Locality-Sensitive Hashing): hashes similar vectors into the same buckets so only one bucket needs scanning. Simple and memory-light, but generally lower recall than HNSW or IVF.

For RAG applications with up to tens of millions of documents, HNSW is the standard choice. You get sub-millisecond search times with 99%+ recall accuracy.
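To see why approximate search wins at scale, here is a toy sketch of the IVF idea in pure NumPy: cluster the corpus once, then probe only the few clusters closest to the query instead of scanning every vector. The vector counts, cluster counts, and the `ivf_search` helper are all invented for illustration; this is not a production index:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors, n_clusters, n_probe = 64, 5000, 50, 5

# Corpus of unit-normalized vectors (cosine similarity = dot product)
data = rng.normal(size=(n_vectors, dim))
data /= np.linalg.norm(data, axis=1, keepdims=True)

# "Train" the index: a few Lloyd iterations of k-means to place centroids
centroids = data[rng.choice(n_vectors, n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmax(data @ centroids.T, axis=1)  # nearest centroid per vector
    for c in range(n_clusters):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Build the inverted lists: cluster id -> ids of vectors assigned to it
assign = np.argmax(data @ centroids.T, axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=10):
    """Scan only the n_probe clusters whose centroids are nearest the query."""
    best_clusters = np.argsort(query @ centroids.T)[-n_probe:]
    candidates = np.concatenate([inverted[c] for c in best_clusters])
    sims = data[candidates] @ query
    return candidates[np.argsort(sims)[-k:][::-1]]  # top-k, best first

query = data[123]  # reuse a corpus vector as the query
approx = set(ivf_search(query, k=10))
exact = set(np.argsort(data @ query)[-10:])  # brute-force ground truth
print(f"Scanned ~{n_probe}/{n_clusters} clusters; "
      f"recall@10 = {len(approx & exact) / 10:.0%}")
```

The search touches roughly a tenth of the corpus yet usually recovers most of the true top-10; production HNSW indexes push that tradeoff much further.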
Pure vector similarity has a major limitation: it only considers semantic meaning, not document attributes. A query about "2026 security policy" will match any chunk about security, including outdated 2023 policies. Metadata filtering solves this by applying hard constraints before or after the vector search.
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Documents with rich metadata
docs = [
    Document(page_content="All employees must use VPN...",
             metadata={"dept": "IT", "year": 2026, "type": "policy"}),
    Document(page_content="VPN was optional before 2024...",
             metadata={"dept": "IT", "year": 2023, "type": "policy"}),
    Document(page_content="Q4 revenue exceeded targets...",
             metadata={"dept": "Finance", "year": 2026, "type": "report"}),
]
vs = Chroma.from_documents(docs, embeddings, collection_name="meta-demo")

# Without filter — returns both VPN docs AND finance doc
results_all = vs.similarity_search("VPN policy", k=3)
print("No filter:", [r.metadata["year"] for r in results_all])

# With filter — only current year
results_2026 = vs.similarity_search(
    "VPN policy", k=3, filter={"year": 2026}
)
print("2026 only:", [r.page_content[:40] for r in results_2026])

# Combined filter: IT department + current year
results_it = vs.similarity_search(
    "security requirements", k=3,
    filter={"$and": [{"dept": "IT"}, {"year": {"$gte": 2025}}]}
)
print("IT 2025+:", [r.page_content[:40] for r in results_it])
```
Here is a practical decision framework, pulling together the guidance from this lesson:

- Embedding model: start with OpenAI `text-embedding-3-small`. Move to Cohere `embed-english-v3.0` for maximum retrieval quality, or to a local model like BGE when data privacy or embedding cost at scale is the deciding factor.
- Vector database: use ChromaDB for development and workloads up to a few million vectors; use Pinecone when you want a managed, auto-scaling service with no infrastructure to run.
- Metadata: attach it to your chunks at ingestion time, so you can filter by document type, department, or date at query time.
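The decision framework can be sketched as a tiny helper function. The function name and flags are invented for illustration; the recommendations simply encode the guidance from this lesson:

```python
def choose_stack(needs_privacy: bool, n_vectors: int, want_managed: bool) -> dict:
    """Toy decision helper encoding this lesson's recommendations."""
    if needs_privacy:
        embedding = "BAAI/bge-small-en-v1.5 (local, free, private)"
    else:
        embedding = "text-embedding-3-small (OpenAI, $0.02/M tokens)"

    # ChromaDB handles up to a few million vectors in-process;
    # beyond that, or when you want zero ops, go managed.
    if want_managed or n_vectors > 5_000_000:
        database = "Pinecone (managed, auto-scaling)"
    else:
        database = "ChromaDB (in-process, persisted to disk)"

    return {"embedding": embedding, "vector_db": database}

print(choose_stack(needs_privacy=False, n_vectors=100_000, want_managed=False))
# A small, non-sensitive corpus lands on text-embedding-3-small + ChromaDB
```

The 5M-vector threshold is a rough placeholder; benchmark on your own data before committing.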
Before moving on, confirm you understand these concepts:
- What is the difference between `embed_documents` and `embed_query`, and why does it matter?