How do database indexes actually work?

Most database indexes are implemented as B-trees (balanced tree data structures). A B-tree index stores sorted key values with pointers to the corresponding rows on disk. When you query WHERE id = 1234, the database traverses the B-tree in O(log n) time to find the row pointer, then fetches the row. Without an index, the database does a full table scan, reading every row in O(n) time. On a table with 10 million rows, that means reading 10 million rows without an index versus roughly 24 key comparisons (log2 of 10 million) with one. Because each B-tree node holds hundreds of keys, those comparisons usually touch only three or four pages. The tradeoff: indexes speed up reads but slow down writes, because the index must be updated on INSERT, UPDATE, and DELETE.
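The 10-million-row figure can be checked with a few lines of arithmetic. This is a rough sketch, not a model of any particular engine: btree_levels is a hypothetical helper, and the fanout of 256 keys per node is an assumption (real fanout depends on page size and key width).

```python
def btree_levels(n_keys: int, fanout: int) -> int:
    """Smallest number of levels whose capacity (fanout ** levels) covers n_keys."""
    levels, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


rows = 10_000_000

# Fanout 2 is a plain binary search: one key comparison per level.
binary_steps = btree_levels(rows, fanout=2)

# Assumed fanout of 256 keys per node: one page read per level.
page_reads = btree_levels(rows, fanout=256)

print(binary_steps, page_reads)  # 24 3
```

The wide fanout is the point of the structure: the number of comparisons stays logarithmic, but the number of disk pages touched is far smaller than the number of comparisons.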

What is ACID and why does it matter?

ACID is a set of properties that guarantee database transactions are processed reliably: Atomicity (the entire transaction succeeds or none of it does, with no partial commits), Consistency (transactions bring the database from one valid state to another, enforcing constraints), Isolation (concurrent transactions do not see each other's intermediate states, to the degree set by the isolation level), and Durability (committed transactions survive crashes, typically implemented with write-ahead logging). ACID compliance is what makes relational databases trustworthy for financial transactions, medical records, and any system where partial failures are unacceptable.
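Atomicity and consistency are easy to see in a small experiment. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the accounts table, the CHECK constraint, and the transfer helper are all made up for illustration. The credit runs before the debit on purpose, so that a failed debit would leave a partial change behind if the transaction were not rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src: int, dst: int, amount: int) -> bool:
    """Move amount from src to dst as one transaction; True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected the debit


def balances() -> list:
    return conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()


failed = transfer(1, 2, 500)   # overdraws account 1, so it must roll back
after_failed = balances()
ok = transfer(1, 2, 30)
after_ok = balances()
print(failed, after_failed)    # False [(1, 100), (2, 50)]
print(ok, after_ok)            # True [(1, 70), (2, 80)]
```

After the failed transfer, account 2 does not keep the 500 it was briefly credited: the rollback undoes both statements together.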

What is the difference between a clustered and non-clustered index?

A clustered index determines the physical order of data on disk. The table data itself is stored in the B-tree leaf nodes. There can be only one clustered index per table. In MySQL's InnoDB the table is clustered on the primary key, and in SQL Server a primary key is clustered by default. PostgreSQL has no clustered indexes: tables are stored as unordered heaps, and every index, including the primary key, is a secondary index. A non-clustered (secondary) index is a separate B-tree structure that stores the index key plus a pointer back to the actual row. Reading through a non-clustered index requires two lookups: find the row pointer in the index, then fetch the actual row data. Covering indexes include all needed columns in the index itself, eliminating the second lookup.
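The three access paths show up directly in a query plan. The sketch below uses SQLite through Python's sqlite3 module as a convenient stand-in, not one of the engines named above: an ordinary SQLite table is itself a B-tree keyed by the integer primary key, so it behaves like a clustered index. The users table, its columns, and the index names are invented for the example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT, city TEXT)"
)
conn.execute("CREATE INDEX idx_users_email ON users (email)")
conn.execute("CREATE INDEX idx_users_city_name ON users (city, name)")


def plan(sql: str) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Lookup on the table's own B-tree (the clustered case).
clustered = plan("SELECT name FROM users WHERE id = 1")

# Secondary index: find the key in the index, then fetch the row for name.
secondary = plan("SELECT name FROM users WHERE email = 'a@example.com'")

# Covering index: city and name both live in the index, so no row fetch.
covering = plan("SELECT name FROM users WHERE city = 'Denver'")

for label, text in [("clustered", clustered), ("secondary", secondary), ("covering", covering)]:
    print(f"{label}: {text}")
```

The second and third queries differ only in whether every requested column is present in the index, which is what decides whether the second lookup is needed.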

Database Internals Explained 2026: How Databases Really Work

Bottom Line

Database internals explained: B-trees, indexes, query execution, transactions, MVCC, and WAL — the core concepts behind how relational databases work under the hood.

Our Take

MVCC is the reason your database feels fast — and the reason it fills up disk quietly.

Multi-Version Concurrency Control is the most consequential internal mechanism that most developers never think about until something breaks. MVCC lets PostgreSQL serve concurrent readers and writers without them blocking each other: an update writes a new row version and keeps the old one alive until no active transaction needs it. Readers never wait on writers, and writers never wait on readers. The hidden cost is table bloat: dead tuples accumulate until VACUUM clears them. Autovacuum runs in the background, but on high-write tables it can fall behind. When it does, table size explodes, scans wade through dead tuples, and query plans degrade because statistics go stale.
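A toy model makes the version-keeping and the bloat concrete. This is a deliberately simplified sketch, not PostgreSQL's actual visibility rules: it assumes every transaction with an id at or below the reader's snapshot has already committed, and the names RowVersion, visible, update, read, and vacuum are all hypothetical. The xmin and xmax fields are named after the PostgreSQL system columns that play a similar role.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowVersion:
    value: str
    xmin: int                   # id of the transaction that created this version
    xmax: Optional[int] = None  # id of the transaction that replaced it, if any


def visible(v: RowVersion, snapshot_xid: int) -> bool:
    """Toy rule: created at or before the snapshot and not yet replaced as of it."""
    created = v.xmin <= snapshot_xid
    replaced = v.xmax is not None and v.xmax <= snapshot_xid
    return created and not replaced


def update(versions: List[RowVersion], new_value: str, xid: int) -> None:
    """An UPDATE never overwrites: it retires the live version and appends a new one."""
    for v in versions:
        if v.xmax is None:
            v.xmax = xid
    versions.append(RowVersion(new_value, xmin=xid))


def read(versions: List[RowVersion], snapshot_xid: int) -> Optional[str]:
    for v in versions:
        if visible(v, snapshot_xid):
            return v.value
    return None


def vacuum(versions: List[RowVersion], oldest_active_xid: int) -> List[RowVersion]:
    """Drop versions that no snapshot at or after oldest_active_xid can still see."""
    return [v for v in versions if v.xmax is None or v.xmax > oldest_active_xid]


row = [RowVersion("v1", xmin=1)]
update(row, "v2", xid=5)
update(row, "v3", xid=9)

print(read(row, 3), read(row, 7), read(row, 10))  # v1 v2 v3
print(len(row))                                   # 3 (two of them are dead tuples)

row = vacuum(row, oldest_active_xid=7)
print(len(row), read(row, 7))                     # 2 v2
```

Note that vacuum cannot remove the version a snapshot at 7 still needs. In this model, as in the text above, a long-running reader is what keeps dead versions on disk.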

The WAL (Write-Ahead Log) is similarly underappreciated. It is not just for crash recovery; it is also the foundation of streaming replication in PostgreSQL. Every change applied on a streaming replica arrives through the WAL. Understanding WAL is the key to understanding replication lag, point-in-time recovery, and logical decoding for change-data-capture pipelines. Debezium's PostgreSQL connector, for example, captures changes through logical decoding of the WAL rather than by querying tables. The internals are not academic; they are load-bearing.
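The core idea, log the change first and treat the log as the source of truth, fits in a short sketch. Everything here is a toy under stated assumptions: TinyStore and replay are hypothetical names, the log is an in-memory list standing in for an append-only file, and the JSON record format bears no relation to PostgreSQL's binary WAL format.

```python
import json
from typing import Dict, List, Optional


class TinyStore:
    """Toy key-value store that appends to a log before changing its data."""

    def __init__(self) -> None:
        self.wal: List[str] = []       # stand-in for an append-only log file
        self.data: Dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        record = {"lsn": len(self.wal) + 1, "key": key, "value": value}
        self.wal.append(json.dumps(record))  # 1. write the log record first
        self.data[key] = value               # 2. only then apply the change


def replay(wal: List[str], upto_lsn: Optional[int] = None) -> Dict[str, int]:
    """Rebuild state from the log alone, optionally stopping at a given record."""
    state: Dict[str, int] = {}
    for line in wal:
        record = json.loads(line)
        if upto_lsn is not None and record["lsn"] > upto_lsn:
            break
        state[record["key"]] = record["value"]
    return state


primary = TinyStore()
primary.put("a", 1)
primary.put("b", 2)
primary.put("a", 3)

replica_state = replay(primary.wal)          # crash recovery or a replica catching up
earlier_state = replay(primary.wal, upto_lsn=2)  # point-in-time recovery

print(replica_state)  # {'a': 3, 'b': 2}
print(earlier_state)  # {'a': 1, 'b': 2}
```

The same replay function serves three roles depending on who calls it and where it stops, which mirrors why one log underlies recovery, replication, and change data capture.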

For developers who want to move from "I can write queries" to "I understand what the database is doing," reading the PostgreSQL source code's storage/page directory is one of the highest-leverage hours you can spend. Alternatively, Alex Petrov's book Database Internals covers this terrain rigorously without requiring you to read C.

Published By

Precision AI Academy

Practitioner-focused AI education · 2-day in-person bootcamp in 5 U.S. cities

Precision AI Academy publishes deep-dives on applied AI engineering for working professionals. Founded by Bo Peng (Kaggle Top 200) who leads the in-person bootcamp in Denver, NYC, Dallas, LA, and Chicago.

