Data Pipeline Guide: ETL, ELT, and Modern Architectures

Key Takeaways

A data pipeline is the infrastructure that moves data from where it originates to where it creates value. Modern data engineering has standardized around ELT architectures, SQL-based transformations with dbt, and orchestration with Airflow. This guide covers the patterns and tools that make up the modern data stack.

ETL vs ELT

ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was necessary when storage was expensive and transformation compute was on dedicated servers outside the warehouse.

ELT (Extract, Load, Transform) loads raw data directly into the warehouse first, then transforms it using the warehouse's own compute. This is the modern standard because cloud storage is cheap, warehouse compute is powerful and cost-effective, and keeping raw data enables rebuilding transformations when requirements change.

In the modern data stack, an ingestion tool (Fivetran, Airbyte) extracts and loads raw data into the warehouse, dbt transforms it, and BI tools (Looker, Metabase, Tableau) query the transformed models.

dbt: SQL Transformations as Code

-- models/marts/revenue_by_month.sql
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,  -- BigQuery argument order; Snowflake/Postgres use DATE_TRUNC('month', order_date)
  SUM(order_total) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1

# models/marts/schema.yml — data quality tests: unique, not_null, accepted_values
version: 2
models:
  - name: revenue_by_month
    columns:
      - name: month
        tests:
          - unique
          - not_null
      - name: total_revenue
        tests:
          - not_null

The ref() function creates a dependency between models. dbt builds a DAG of all models, runs them in dependency order, and generates documentation with lineage graphs automatically. dbt tests validate data quality on every run.
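The dependency-ordered run can be pictured with a toy resolver. This is an illustrative sketch, not dbt's implementation: the model names and the `deps` mapping are invented, and a real project's graph comes from parsing the `ref()` calls in each model.

```python
from graphlib import TopologicalSorter

# Hypothetical model graph: each model maps to the models it ref()s.
deps = {
    "stg_orders": set(),                 # staging model reads a raw source
    "stg_customers": set(),
    "revenue_by_month": {"stg_orders"},  # FROM {{ ref('stg_orders') }}
    "customer_ltv": {"stg_orders", "stg_customers"},
}

# Run models in dependency order, the way dbt schedules its DAG:
# every model runs only after everything it ref()s has run.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

Because the graph is explicit, dbt can also parallelize models whose dependencies are already satisfied, which a simple linear run order hides.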

Orchestration with Apache Airflow

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from datetime import datetime

# run_fivetran_sync and send_slack_alert are callables defined elsewhere in the project
with DAG('daily_pipeline', schedule_interval='@daily', start_date=datetime(2026, 1, 1), catchup=False) as dag:
    extract = PythonOperator(task_id='extract', python_callable=run_fivetran_sync)
    transform = DbtCloudRunJobOperator(task_id='dbt_run', job_id=12345)  # dbt Cloud job ID (an integer)
    notify = PythonOperator(task_id='notify', python_callable=send_slack_alert)
    extract >> transform >> notify  # run order: extract, then transform, then notify

Airflow handles retries, alerting, SLA monitoring, and backfilling historical runs. Set retries and a retry_delay on each task, or once via default_args. Monitor DAG health in the Airflow UI.
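A retry policy is usually declared once and shared across tasks. A minimal sketch; the specific values here (3 retries, a 5-minute delay, a 2-hour SLA) are illustrative choices, while the keys are standard Airflow task parameters passed as `default_args` to the DAG.

```python
from datetime import timedelta

# Shared task settings; Airflow applies these to every task in the DAG
# unless an individual task overrides them.
default_args = {
    "retries": 3,                         # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
    "sla": timedelta(hours=2),            # flag task runs that exceed 2 hours
}

print(default_args["retry_delay"].total_seconds())
```

Putting these in `default_args` keeps the policy in one place, so a task that genuinely needs different behavior stands out as an explicit override.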

Real-Time with Kafka

Kafka is a distributed event streaming platform. Producers write events to topics; consumers read and process them. Kafka retains events for a configurable period (days to weeks), enabling multiple independent consumers and event replay.
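The retention-plus-offsets model can be illustrated with a toy in-memory log. This is a conceptual sketch, not a Kafka client: the Topic class and consumer group names are invented, and a real partition adds replication, time-based retention, and committed offsets.

```python
class Topic:
    """Toy append-only log mimicking a single Kafka partition."""

    def __init__(self):
        self.log = []      # retained events
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, event):
        self.log.append(event)

    def consume(self, group):
        """Return a group's unread events and advance its offset."""
        start = self.offsets.get(group, 0)
        events = self.log[start:]
        self.offsets[group] = len(self.log)
        return events


topic = Topic()
topic.produce({"type": "order", "total": 42})
topic.produce({"type": "signup"})

# Two independent consumer groups each see every event,
# because reads track per-group offsets instead of removing events.
warehouse = topic.consume("warehouse-loader")
dashboard = topic.consume("dashboard")
assert warehouse == dashboard == topic.log

# Replay: reset a group's offset to re-process from the beginning.
topic.offsets["dashboard"] = 0
assert topic.consume("dashboard") == topic.log
```

The key property is that consuming does not delete: retention is what makes both the multi-consumer pattern and replay possible.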

A common pattern: application events (orders, clicks, signups) are streamed to Kafka, consumed by a Flink or Spark Streaming job, aggregated, and written to the warehouse in near real-time. A separate consumer reads the same events and writes to an operational database for real-time dashboards.

Kafka complexity is real: managing partitions, consumer groups, schema registry, and offset management requires operational expertise. Cloud managed options (Confluent Cloud, AWS MSK, Redpanda Cloud) significantly reduce this burden.

Frequently Asked Questions

What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse. ELT is the modern standard for cloud data warehouses because storage is cheap, warehouse compute is powerful, and preserving raw data enables rebuilding transformations when requirements change.

What is dbt?

dbt (data build tool) is the standard tool for data transformation in modern data stacks. You write SQL SELECT statements as dbt models; dbt wraps them in CREATE TABLE or VIEW statements and runs them in your data warehouse. dbt manages dependencies between models, generates documentation, runs data quality tests, and provides complete lineage from source tables to final analytics models.
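The "wraps your SELECT in DDL" step can be sketched as a one-line string transform. This is a deliberate simplification of what dbt actually compiles (no Jinja rendering, schemas, or materialization configs), and the function name is invented for illustration.

```python
def materialize(model_name: str, select_sql: str, as_view: bool = False) -> str:
    """Wrap a model's SELECT in the kind of DDL dbt emits (simplified)."""
    kind = "VIEW" if as_view else "TABLE"
    return f"CREATE OR REPLACE {kind} {model_name} AS (\n{select_sql}\n)"


sql = materialize("revenue_by_month", "SELECT 1 AS month")
print(sql)
```

The point is that analysts only write the SELECT; the surrounding DDL, and the choice of table versus view, is configuration dbt applies uniformly.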

Do I need Apache Kafka for data pipelines?

Not unless you need near real-time data (latency under minutes). Most analytics workloads are well-served by daily or hourly batch pipelines. Use Kafka when: you need to process events in real-time (fraud detection, real-time recommendations), your data volumes exceed what batch ingestion can handle within your latency budget, or you need multiple independent consumers of the same event stream.

What tools does a data engineer need to know in 2026?

Core skills: SQL (essential), Python (for pipeline code and orchestration), dbt (transformation), Airflow or Prefect (orchestration), and one cloud data warehouse (BigQuery, Snowflake, or Redshift). Secondary skills: Kafka (streaming), Spark (large-scale processing), Terraform (infrastructure as code), and at least one ingestion tool (Fivetran or Airbyte). Start with SQL, Python, dbt, and Airflow before adding streaming tools.

Data pipelines are the infrastructure of every data-driven company. Get the skills.

Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.

Reserve Your Seat

Note: Information reflects early 2026.


Bo Peng

AI Instructor & Founder, Precision AI Academy

Bo has trained 400+ professionals in applied AI across federal agencies and Fortune 500 companies.