Key Takeaways
- Start with conceptual modeling (entities and relationships), then translate to logical, then physical schema
- Normalize to 3NF for OLTP systems — redundancy is the enemy of data integrity
- Every many-to-many relationship needs a junction table in the physical schema
- Index foreign keys and query predicates — poor indexing is the most common production performance problem
- OLAP (analytics) schemas intentionally denormalize into star or snowflake schemas for read speed
Start with Entities and Relationships — Not Tables
Good data modeling starts conceptually, before any SQL. Entity-Relationship (ER) modeling captures the business domain in terms of entities (things) and relationships (how things connect).
An entity is a noun in the business domain: Customer, Order, Product, Employee, Account. Entities become tables. An attribute is a property of an entity: customer name, order date, product price. Attributes become columns.
The three levels of data modeling:
- Conceptual — Entities and relationships only, no technical details. Tools: hand-drawn or Lucidchart ER diagrams.
- Logical — Adds attributes, primary/foreign keys, data types (conceptual). Still database-independent.
- Physical — Actual SQL DDL: specific data types, constraints, indexes, partitioning. Database-specific.
The Three Relationship Types
One-to-One (1:1) — One entity on each side. Rare and often signals the entities should merge. Example: a User and their Profile (if every user has exactly one profile). Implement with a foreign key + unique constraint on the child table.
One-to-Many (1:N) — One entity relates to many on the other side. The most common relationship. Example: a Customer has many Orders. The "many" side gets the foreign key: orders.customer_id REFERENCES customers(id).
Many-to-Many (M:N) — Many entities on both sides. Example: Students and Courses (a student takes many courses; a course has many students). Cannot be represented directly — requires a junction table:
-- Junction table for M:N student-course relationship
CREATE TABLE enrollments (
student_id BIGINT NOT NULL REFERENCES students(id),
course_id BIGINT NOT NULL REFERENCES courses(id),
enrolled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
grade CHAR(2),
PRIMARY KEY (student_id, course_id)
);
CREATE INDEX idx_enrollments_course ON enrollments(course_id);
Normalization: Eliminating Redundancy
Normalization is the formal process of structuring a schema to reduce redundancy. Each normal form fixes a specific class of problem:
| Normal Form | Rule | Problem Fixed |
|---|---|---|
| 1NF | Atomic values, no repeating groups | Arrays/lists in a column (store tags as separate rows, not CSV) |
| 2NF | No partial dependencies (every non-key attr depends on the full PK) | Composite key tables where some columns depend on only part of the key |
| 3NF | No transitive dependencies (non-key attrs don't depend on other non-key attrs) | Storing both zip_code and city — city depends on zip, not directly on the PK |
| BCNF | Every functional dependency has a superkey on the left | Edge cases in 3NF with multiple overlapping candidate keys |
For most production OLTP systems, 3NF is the right target. Don't go to 4NF/5NF unless you have a specific multi-valued dependency problem.
Translating an ER Diagram to SQL Schema
A real e-commerce example — customers, orders, products, and order line items:
CREATE TABLE customers (
id BIGSERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
sku TEXT NOT NULL UNIQUE,
name TEXT NOT NULL,
price_cents INTEGER NOT NULL CHECK (price_cents >= 0),
stock_qty INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL REFERENCES customers(id),
status TEXT NOT NULL DEFAULT 'pending',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE order_items (
id BIGSERIAL PRIMARY KEY,
order_id BIGINT NOT NULL REFERENCES orders(id),
product_id BIGINT NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price INTEGER NOT NULL -- snapshot price at order time
);
-- Indexes on foreign keys (not created automatically in PostgreSQL)
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
Index Strategy: The Most Impactful Decision After Schema
After getting the schema right, index strategy is the biggest lever on query performance. Rules of thumb:
- Always index foreign keys in PostgreSQL — they are not indexed automatically, and unindexed FKs cause full table scans on joins.
- Index columns in WHERE clauses that filter large tables.
- Index ORDER BY columns if the query sorts results and doesn't filter first.
- Composite indexes — if you filter on (user_id, status), a composite index on (user_id, status) is better than two separate indexes. Column order matters: put the most selective column first.
- Covering indexes — if a query only needs columns that are all in the index, the database never touches the table. Extremely fast.
Common Schema Patterns
Soft deletes — Instead of DELETE, set a deleted_at TIMESTAMPTZ column. Add partial index: CREATE INDEX idx_users_active ON users(email) WHERE deleted_at IS NULL — the index only covers non-deleted rows.
Audit log — Create an audit_log table with table_name, record_id, action, old_values (JSONB), new_values (JSONB), changed_by, changed_at. Write to it via triggers.
Polymorphic associations — A comment that can belong to a Post or a Video. Options: separate tables per type (cleanest), abstract base table, or JSONB. The separate tables approach is usually best for schema clarity and referential integrity.
Hierarchical data — Categories with subcategories. Options: adjacency list (parent_id FK), nested sets, closure table, or PostgreSQL LTREE extension. Closure table is the most flexible for arbitrary-depth trees with efficient queries.
OLTP vs OLAP Schema Design
| Aspect | OLTP (Normalized) | OLAP (Denormalized) |
|---|---|---|
| Goal | Fast inserts, updates, deletes | Fast aggregation queries |
| Structure | 3NF normalized tables | Star/snowflake schema |
| Joins | Many joins — that's fine | Minimize joins — pre-join into wide tables |
| Redundancy | Eliminated | Intentional (copies of data for read speed) |
| Typical databases | PostgreSQL, MySQL | Snowflake, BigQuery, Redshift, ClickHouse |
Learn Data Modeling and SQL at Precision AI Academy
Our bootcamp teaches database design, schema optimization, and modern data engineering — from ER diagrams to production systems that scale. Five cities, October 2026.
Frequently Asked Questions
What is normalization and how far should I normalize?
Normalization eliminates redundancy by applying progressive rules (1NF → 2NF → 3NF). For most OLTP systems, 3NF is the target. Analytics schemas intentionally denormalize for read performance. Normalize first, then denormalize only where profiling shows a real problem.
When should I use a many-to-many relationship?
When entities on both sides can relate to multiple entities on the other side. Always implement with a junction table. The junction table often carries relationship attributes (enrollment date, role, etc.) making it a full entity in its own right.
Should I use UUIDs or auto-increment integers as primary keys?
Auto-increment integers are simpler and faster for single-database systems. UUIDs work better in distributed systems. Use UUID v7 (time-ordered) if you go the UUID route — it avoids B-tree fragmentation from random inserts.