Understand how embeddings work under the hood, compare embedding models, and work hands-on with vector databases. Learn similarity search algorithms and metadata filtering for precise retrieval.
Compare three embedding models on the same dataset, build vector stores with ChromaDB and Pinecone, and implement metadata-filtered similarity search. You will know exactly which embedding model and vector database to choose for your project.
Embeddings are the engine of every RAG system. When you run a similarity search, you are comparing the mathematical representation of your question against the mathematical representations of your document chunks. If those representations are poor — if the embedding model does not capture semantic meaning accurately — your retriever will return irrelevant chunks and your answers will be wrong, regardless of how good your LLM is.
Today we go deep on how embeddings work, how to choose between models, and how vector databases store and search these embeddings efficiently. This is the most technically dense day of the course, but it is also the most important for building RAG systems that actually work well.
An embedding is a vector — a list of floating-point numbers — that represents the semantic meaning of a piece of text. The key insight is that similar meanings produce similar vectors. "The cat sat on the mat" and "A feline rested on the rug" will have nearly identical embeddings, even though they share almost no words.
Embedding models are neural networks trained on massive text corpora. They learn to map text into a high-dimensional space where geometric distance corresponds to semantic similarity. OpenAI's text-embedding-3-small produces 1536-dimensional vectors. That means each chunk is represented by 1536 numbers, and similarity is measured by the angle between these vectors (cosine similarity).
```python
from langchain_openai import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed some text
texts = [
    "The company provides a laptop for remote workers.",
    "Remote employees receive a computer from the organization.",
    "The weather in Denver is sunny today.",
]
vectors = embeddings.embed_documents(texts)
print(f"Vector dimensions: {len(vectors[0])}")
print(f"First 5 values: {vectors[0][:5]}")

# Calculate cosine similarity between pairs
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar meaning → high similarity
sim_01 = cosine_sim(vectors[0], vectors[1])
print(f"\nLaptop ↔ Computer: {sim_01:.4f}")  # ~0.92

# Different meaning → low similarity
sim_02 = cosine_sim(vectors[0], vectors[2])
print(f"Laptop ↔ Weather: {sim_02:.4f}")  # ~0.45

# Embed a query the same way
query_vector = embeddings.embed_query("What equipment do remote workers get?")
for i, text in enumerate(texts):
    sim = cosine_sim(query_vector, vectors[i])
    print(f"Query ↔ '{text[:50]}...': {sim:.4f}")
```
Note the two methods: use `embed_documents` for your corpus and `embed_query` for search queries. Some embedding models encode documents and queries differently, so using the right method matters for retrieval quality.

Not all embedding models are equal. The choice of model directly impacts retrieval quality. Here are the major options and when to use each.
Proprietary APIs (OpenAI, Cohere): highest quality, simplest integration, per-token pricing. Best for most production systems. No GPU needed. OpenAI: $0.02/M tokens. Cohere: $0.10/M tokens.

Open-source models (Sentence Transformers, BGE, E5): free, run locally, full data privacy. Slightly lower quality for English but competitive on multilingual. Require a GPU for production speed. Best for air-gapped or high-volume systems.
```python
# 1. OpenAI embeddings (best general-purpose)
from langchain_openai import OpenAIEmbeddings

openai_emb = OpenAIEmbeddings(
    model="text-embedding-3-small"  # 1536 dims, $0.02/M tokens
)
# or "text-embedding-3-large" — 3072 dims, $0.13/M tokens

# 2. Cohere embeddings (best for retrieval-specific tasks)
# pip install langchain-cohere
from langchain_cohere import CohereEmbeddings

cohere_emb = CohereEmbeddings(
    model="embed-english-v3.0",  # 1024 dims
    cohere_api_key="your-key"
)

# 3. Open-source: Sentence Transformers (free, local)
# pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings

local_emb = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",  # 384 dims, fast
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Benchmark: embed the same 100 chunks with each model
import time

for name, emb in [("OpenAI", openai_emb), ("Local-BGE", local_emb)]:
    start = time.time()
    vectors = emb.embed_documents([c.page_content for c in chunks[:100]])
    elapsed = time.time() - start
    print(f"{name}: {len(vectors[0])} dims, {elapsed:.2f}s for 100 chunks")
```
Recommendation: start with `text-embedding-3-small`. It is cheap, fast, and high quality. Switch to Cohere `embed-english-v3.0` if you need the absolute best retrieval quality. Use open-source (BGE or E5) if you need full data privacy, or if you are embedding millions of documents and cost is a concern.

A vector database stores embeddings and enables fast similarity search. The choice depends on your scale, deployment model, and feature requirements.
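To make the pricing above concrete, here is a small back-of-the-envelope cost estimator. It uses the rough heuristic of ~4 characters per token; the heuristic, the example corpus size, and the helper name `estimate_embedding_cost` are illustrative, not part of the course code:

```python
# Rough embedding-cost estimator based on the per-token prices quoted above.
# Assumes ~4 characters per token — a common heuristic, not an exact count.

PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-english-v3.0": 0.10,  # Cohere
}

def estimate_embedding_cost(total_chars: int, model: str) -> float:
    """Return the approximate one-time cost in USD to embed a corpus."""
    tokens = total_chars / 4  # heuristic: ~4 chars per token
    return (tokens / 1_000_000) * PRICE_PER_MILLION_TOKENS[model]

# Example: 10,000 chunks of ~1,000 characters each = 10M characters
corpus_chars = 10_000 * 1_000
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ~${estimate_embedding_cost(corpus_chars, model):.2f}")
```

For that example corpus, `text-embedding-3-small` comes out around $0.05, which is why cost only becomes a deciding factor at very large document volumes.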
ChromaDB runs in-process (no server needed), stores data on disk, and is perfect for development and small-to-medium production workloads (up to a few million vectors). You have been using it since Day 1.
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create with persistence
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my-docs",
    collection_metadata={"hnsw:space": "cosine"}  # Similarity metric
)

# Basic similarity search
results = vectorstore.similarity_search("VPN requirements", k=3)

# Similarity search with relevance scores (0–1, higher = more relevant)
results_with_scores = vectorstore.similarity_search_with_relevance_scores(
    "VPN requirements", k=3
)
for doc, score in results_with_scores:
    print(f"Score: {score:.4f} | {doc.page_content[:60]}...")

# Metadata filtering — only search specific documents
filtered = vectorstore.similarity_search(
    "equipment policy",
    k=3,
    filter={"doc_type": "policy"}
)

# Complex filters with $and / $or
complex_filter = vectorstore.similarity_search(
    "security requirements",
    k=5,
    filter={
        "$and": [
            {"doc_type": {"$eq": "policy"}},
            {"version": {"$gte": "2025"}}
        ]
    }
)
```
Pinecone is a fully managed vector database. No infrastructure to maintain, automatic scaling, and a generous free tier. Best for production systems where you do not want to manage servers.
```python
# pip install langchain-pinecone pinecone-client
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone client
pc = Pinecone(api_key="your-pinecone-api-key")

# Create an index (run once)
index_name = "rag-course"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create LangChain vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
    namespace="company-docs"  # Namespaces isolate data within an index
)

# Search with metadata filter
results = vectorstore.similarity_search(
    "security requirements",
    k=3,
    filter={"doc_type": {"$eq": "policy"}}
)

# Use as a retriever in a RAG chain
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5, "namespace": "company-docs"}
)
```
When you have millions of vectors, you cannot compare every one to the query vector — that would be far too slow. Vector databases use approximate nearest neighbor (ANN) algorithms to search efficiently.
The three most common algorithms:

- HNSW (Hierarchical Navigable Small World): builds a layered graph of vectors; a search navigates greedily from a sparse top layer down to the dense bottom layer. Excellent speed/recall tradeoff, at the cost of higher memory use.
- IVF (Inverted File index): clusters the vectors and searches only the few clusters nearest to the query. Lower memory and fast to build, but needs a training step and gives slightly lower recall.
- LSH (Locality-Sensitive Hashing): hashes similar vectors into the same buckets so only one bucket needs scanning. Simple and memory-light, but generally lower recall than HNSW or IVF.

For RAG applications with up to tens of millions of documents, HNSW is the standard choice. You get sub-millisecond search times with 99%+ recall accuracy.
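To see why approximate search wins at scale, here is a toy sketch of the IVF idea in pure NumPy: cluster the corpus once, then probe only the few clusters closest to the query instead of scanning every vector. The vector counts, cluster counts, and the `ivf_search` helper are all invented for illustration; this is not a production index:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors, n_clusters, n_probe = 64, 5000, 50, 5

# Corpus of unit-normalized vectors (cosine similarity = dot product)
data = rng.normal(size=(n_vectors, dim))
data /= np.linalg.norm(data, axis=1, keepdims=True)

# "Train" the index: a few Lloyd iterations of k-means to place centroids
centroids = data[rng.choice(n_vectors, n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmax(data @ centroids.T, axis=1)  # nearest centroid per vector
    for c in range(n_clusters):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Build the inverted lists: cluster id -> ids of vectors assigned to it
assign = np.argmax(data @ centroids.T, axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=10):
    """Scan only the n_probe clusters whose centroids are nearest the query."""
    best_clusters = np.argsort(query @ centroids.T)[-n_probe:]
    candidates = np.concatenate([inverted[c] for c in best_clusters])
    sims = data[candidates] @ query
    return candidates[np.argsort(sims)[-k:][::-1]]  # top-k, best first

query = data[123]  # reuse a corpus vector as the query
approx = set(ivf_search(query, k=10))
exact = set(np.argsort(data @ query)[-10:])  # brute-force ground truth
print(f"Scanned ~{n_probe}/{n_clusters} clusters; "
      f"recall@10 = {len(approx & exact) / 10:.0%}")
```

The search touches roughly a tenth of the corpus yet usually recovers most of the true top-10; production HNSW indexes push that tradeoff much further.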
Pure vector similarity has a major limitation: it only considers semantic meaning, not document attributes. A query about "2026 security policy" will match any chunk about security, including outdated 2023 policies. Metadata filtering solves this by applying hard constraints before or after the vector search.
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Documents with rich metadata
docs = [
    Document(page_content="All employees must use VPN...",
             metadata={"dept": "IT", "year": 2026, "type": "policy"}),
    Document(page_content="VPN was optional before 2024...",
             metadata={"dept": "IT", "year": 2023, "type": "policy"}),
    Document(page_content="Q4 revenue exceeded targets...",
             metadata={"dept": "Finance", "year": 2026, "type": "report"}),
]
vs = Chroma.from_documents(docs, embeddings, collection_name="meta-demo")

# Without filter — returns both VPN docs AND finance doc
results_all = vs.similarity_search("VPN policy", k=3)
print("No filter:", [r.metadata["year"] for r in results_all])

# With filter — only current year
results_2026 = vs.similarity_search(
    "VPN policy", k=3, filter={"year": 2026}
)
print("2026 only:", [r.page_content[:40] for r in results_2026])

# Combined filter: IT department + current year
results_it = vs.similarity_search(
    "security requirements", k=3,
    filter={"$and": [{"dept": "IT"}, {"year": {"$gte": 2025}}]}
)
print("IT 2025+:", [r.page_content[:40] for r in results_it])
```
Here is a practical decision framework, pulling together the guidance from this lesson:

- Embedding model: start with OpenAI `text-embedding-3-small`. Move to Cohere `embed-english-v3.0` for maximum retrieval quality, or to a local model like BGE when data privacy or embedding cost at scale is the deciding factor.
- Vector database: use ChromaDB for development and workloads up to a few million vectors; use Pinecone when you want a managed, auto-scaling service with no infrastructure to run.
- Metadata: attach it to your chunks at ingestion time, so you can filter by document type, department, or date at query time.
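The decision framework can be sketched as a tiny helper function. The function name and flags are invented for illustration; the recommendations simply encode the guidance from this lesson:

```python
def choose_stack(needs_privacy: bool, n_vectors: int, want_managed: bool) -> dict:
    """Toy decision helper encoding this lesson's recommendations."""
    if needs_privacy:
        embedding = "BAAI/bge-small-en-v1.5 (local, free, private)"
    else:
        embedding = "text-embedding-3-small (OpenAI, $0.02/M tokens)"

    # ChromaDB handles up to a few million vectors in-process;
    # beyond that, or when you want zero ops, go managed.
    if want_managed or n_vectors > 5_000_000:
        database = "Pinecone (managed, auto-scaling)"
    else:
        database = "ChromaDB (in-process, persisted to disk)"

    return {"embedding": embedding, "vector_db": database}

print(choose_stack(needs_privacy=False, n_vectors=100_000, want_managed=False))
# A small, non-sensitive corpus lands on text-embedding-3-small + ChromaDB
```

The 5M-vector threshold is a rough placeholder; benchmark on your own data before committing.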
Before moving on, confirm you understand these concepts:
- What is the difference between `embed_documents` and `embed_query`, and why does it matter?