Database Internals Explained: How Databases Actually Work

Key Takeaways

  • Databases store data in pages (typically 8KB blocks) — understanding this explains index and query behavior
  • B-tree indexes reduce query time from O(n) to O(log n) — the difference between 10 million lookups and 24
  • ACID transactions guarantee atomicity, consistency, isolation, and durability
  • MVCC (Multi-Version Concurrency Control) enables reads without blocking writes — key to PostgreSQL's performance
  • Write-Ahead Logging (WAL) is how databases survive crashes without losing committed data

How Databases Store Data: Pages and Storage Engines

Every database reads and writes data in fixed-size blocks called pages — typically 8KB in PostgreSQL, 16KB in MySQL. Understanding this is the foundation of understanding everything else about databases.

When you run SELECT * FROM users WHERE id = 5, the database doesn't read individual rows — it reads whole pages from disk into a memory buffer pool (the buffer cache), then finds your row within the page. This is why returning one row from a million-row table with a good index isn't much slower than returning one row from a 100-row table — the work is in the page reads, and the index tells you exactly which page to read.

Storage engines handle the physical layer:

Indexes and B-Trees: The Core of Query Performance

A B-tree (Balanced Tree) index is the data structure behind virtually every database index. It's a self-balancing tree where every path from root to leaf has the same length, ensuring O(log n) search time regardless of which value you look for.

B-tree structure for a simple index on user_id:

For a table with 10 million rows, a B-tree of order 100 has height about 3-4. That's 3-4 page reads to find any row — compared to scanning all 10 million rows without an index.

Index types in PostgreSQL:

Index TypeStructureBest For
B-tree (default)Balanced treeEquality, range queries, sorting
HashHash tableEquality lookups only, faster than B-tree for exact match
GiSTGeneralized search treeGeometric data, full-text search, range types
GINGeneralized inverted indexFull-text search, JSONB, array columns
BRINBlock range indexVery large tables with natural ordering (timestamps)

Query Execution: From SQL to Data

When you submit a SQL query, the database goes through several stages:

  1. Parsing — SQL text is parsed into an abstract syntax tree (AST)
  2. Rewriting — Views are expanded, rules are applied
  3. Planning/Optimization — The query planner generates multiple execution plans and estimates their cost using table statistics. Picks the cheapest plan.
  4. Execution — The chosen plan is executed. Each node in the plan tree (Seq Scan, Index Scan, Hash Join, etc.) processes data and passes results up.

Always use EXPLAIN ANALYZE to understand what your database is actually doing:

EXPLAIN ANALYZE SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2026-01-01'
GROUP BY u.id, u.name;

-- Output shows:
-- Hash Aggregate (actual rows, actual time)
-- -> Hash Left Join (hash cond: o.user_id = u.id)
--    -> Index Scan on users (using idx_users_created_at)
--    -> Seq Scan on orders

Key things to look for: Seq Scan on large tables (needs an index), high row estimates vs actual (stale statistics — run ANALYZE), expensive Sort operations (add index for the ORDER BY column).

Transactions and ACID

A transaction is a unit of work that either completes entirely or not at all. The ACID properties guarantee reliability:

Isolation levels (weakest to strongest):

LevelDirty ReadNon-Repeatable ReadPhantom ReadUse When
Read UncommittedPossiblePossiblePossibleRarely — analytics only
Read Committed (default)NoPossiblePossibleMost OLTP workloads
Repeatable ReadNoNoPossibleWhen consistent reads matter
SerializableNoNoNoFinancial, auditing systems

MVCC: How PostgreSQL Handles Concurrent Access

Multi-Version Concurrency Control (MVCC) is the technique PostgreSQL uses to allow reads and writes to happen concurrently without blocking each other. Instead of locking rows for reads, the database keeps multiple versions of each row.

Each row version has hidden system columns: xmin (the transaction ID that created it) and xmax (the transaction ID that deleted/updated it, or 0 if current). When you read, you see only row versions that were committed before your transaction started — a consistent snapshot in time.

The tradeoff: old row versions accumulate and must be cleaned up by the VACUUM process. AUTOVACUUM handles this automatically, but understanding it matters for high-write systems.

Write-Ahead Logging: Durability Without Sacrifice

The Write-Ahead Log (WAL) is how databases guarantee durability. Before any change is written to the actual data files, the change is written to the WAL — a sequential append-only log on disk. Sequential writes are fast; random writes are slow. WAL gets durability cheaply.

On crash recovery, the database replays WAL entries to reconstruct any changes that made it to the log but not to the data files. This is why committed transactions survive crashes.

WAL also enables streaming replication — the primary server sends WAL records to replica servers in real time. Replicas apply the WAL and stay in sync. This is how PostgreSQL, MySQL, and most databases do high availability.

SQL vs NoSQL: Choosing the Right Tool

NeedReach ForWhy
Complex queries, joins, reportingPostgreSQL, MySQLSQL is purpose-built for relational queries
Flexible document storageMongoDB, DynamoDBSchema-less, good for varying document shapes
Key-value cacheRedis, MemcachedIn-memory, sub-millisecond reads
Time-series dataInfluxDB, TimescaleDBOptimized ingestion and range queries over time
Graph relationshipsNeo4j, Amazon NeptuneTraversals that would be JOIN hell in SQL
Write-heavy scaleCassandra, DynamoDBLSM trees optimized for high-volume writes

Master Databases at Precision AI Academy

Our bootcamp covers SQL, query optimization, database design, and modern data stacks — skills every backend developer and data engineer needs. Five cities, October 2026.

$1,490 · October 2026 · Denver, LA, NYC, Chicago, Dallas
Reserve Your Seat

Frequently Asked Questions

How do database indexes actually work?

B-tree indexes store sorted key values with row pointers. Searching is O(log n) vs O(n) full table scan. A 10M-row table takes ~24 B-tree lookups vs 10M row reads without an index. Indexes speed up reads but slow down writes.

What is ACID and why does it matter?

ACID guarantees that database transactions are processed reliably — all or nothing (atomicity), consistent state (consistency), concurrent isolation, and crash durability. Critical for financial, medical, and any mission-critical system.

What is the difference between a clustered and non-clustered index?

A clustered index stores the actual row data in the B-tree leaf nodes — one per table. Non-clustered indexes store key values plus pointers, requiring a second lookup to fetch the row. Covering indexes include all needed columns to avoid that second lookup.

BP
Bo Peng

Founder of Precision AI Academy. Software engineer with production database experience at scale. Teaches data engineering, SQL optimization, and modern database architectures.