Key Takeaways
- ELT is the modern standard: Modern cloud warehouses (BigQuery, Snowflake) are powerful enough to run transformations. Load raw data first, then transform inside the warehouse; this preserves source data and lets you rebuild transformations when requirements change.
- dbt is the transformation layer: dbt turns SQL SELECT statements into version-controlled, tested, documented warehouse transformations. It is the standard for the T in ELT in 2026.
- Airflow orchestrates everything: Apache Airflow defines pipelines as Python DAGs. It schedules runs, monitors execution, retries failures, and sends alerts. Managed options: MWAA (AWS), Cloud Composer (GCP), Astronomer.
- Streaming requires Kafka: When you need data fresher than hourly batch runs, Kafka is the standard streaming backbone. Producers publish events to topics; consumers process them in near real-time.
A data pipeline is the infrastructure that moves data from where it originates to where it creates value. Modern data engineering has standardized around ELT architectures, SQL-based transformations with dbt, and orchestration with Airflow. This guide covers the patterns and tools that make up the modern data stack.
ETL vs ELT
ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was necessary when storage was expensive and transformation compute was on dedicated servers outside the warehouse.
ELT (Extract, Load, Transform) loads raw data directly into the warehouse first, then transforms it using the warehouse's own compute. This is the modern standard because cloud storage is cheap, warehouse compute is powerful and cost-effective, and keeping raw data enables rebuilding transformations when requirements change.
The modern data stack: ingestion tool (Fivetran, Airbyte) extracts and loads raw data into the warehouse. dbt transforms it. BI tools (Looker, Metabase, Tableau) query the transformed models.
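To make the load-then-transform pattern concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for a cloud warehouse. The table and column names (`raw_orders`, `orders_clean`) are illustrative, not from any particular stack; the point is that raw data lands untouched, and the transform is just SQL run inside the "warehouse" that can be rewritten and re-run at any time.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# 1. Extract + Load: land the raw rows untouched, bad values and all.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, order_total TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.99", "completed"), ("2", "bad-value", "completed"), ("3", "5.00", "cancelled")],
)

# 2. Transform inside the warehouse: cast, filter, and materialize a clean model.
# Because raw_orders is preserved, this step can be rewritten and re-run later.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT order_id, CAST(order_total AS REAL) AS order_total
    FROM raw_orders
    WHERE status = 'completed' AND order_total GLOB '[0-9]*.[0-9]*'
""")

clean_count = conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0]
print(clean_count)  # 1: only the valid, completed order survives the transform
```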
dbt: SQL Transformations as Code
```sql
-- models/marts/revenue_by_month.sql
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,
  SUM(order_total) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
```

```yaml
# tests/schema.yml
# Define data quality tests: unique, not_null, accepted_values
```
The ref() function creates a dependency between models. dbt builds a DAG of all models, runs them in dependency order, and generates documentation with lineage graphs automatically. dbt tests validate data quality on every run.
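The dependency-graph idea can be sketched in a few lines: scan each model's SQL for `ref()` calls, build a graph, and topologically sort it to get a valid run order. This is a toy illustration of the concept, not dbt's implementation, and the model names are made up.

```python
import re
from graphlib import TopologicalSorter

# Toy model files keyed by name; {{ ref('...') }} marks a dependency.
models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "revenue_by_month": "SELECT 1 FROM {{ ref('stg_orders') }}",
    "customer_ltv": "SELECT 1 FROM {{ ref('stg_customers') }} JOIN {{ ref('revenue_by_month') }}",
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

# node -> set of models it depends on (its predecessors in the DAG)
graph = {name: set(REF.findall(sql)) for name, sql in models.items()}

run_order = list(TopologicalSorter(graph).static_order())
print(run_order)  # staging models come before the marts that ref() them
```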
Orchestration with Apache Airflow
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

# run_fivetran_sync and send_slack_alert are assumed to be defined elsewhere.
with DAG(
    'daily_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2026, 1, 1),
) as dag:
    extract = PythonOperator(task_id='extract', python_callable=run_fivetran_sync)
    transform = DbtCloudRunJobOperator(task_id='dbt_run', job_id=12345)
    notify = PythonOperator(task_id='notify', python_callable=send_slack_alert)

    extract >> transform >> notify
```
Airflow handles retries, alerting, SLA monitoring, and backfilling historical runs. Set retries=3 and retry_delay on each task (or once in the DAG's default_args). Monitor DAG health in the Airflow UI.
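The retry semantics are worth understanding even outside Airflow. The sketch below shows the general pattern (re-run a failing task up to `retries` extra times, pausing `retry_delay` between attempts); it is illustrative, not Airflow's internals, and `flaky_extract` is a made-up task.

```python
import time

def run_with_retries(task, retries=3, retry_delay=0.0):
    """Sketch of per-task retry semantics: up to `retries` extra attempts."""
    attempts = 0
    while True:
        try:
            return task()
        except Exception:
            attempts += 1
            if attempts > retries:
                raise  # retries exhausted; surface the failure (and alert)
            time.sleep(retry_delay)

calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:  # fail twice, then succeed, like a transient API error
        raise RuntimeError("transient API error")
    return "loaded"

result = run_with_retries(flaky_extract, retries=3, retry_delay=0)
print(result)  # prints "loaded" on the third attempt
```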
Real-Time with Kafka
Kafka is a distributed event streaming platform. Producers write events to topics; consumers read and process them. Kafka retains events for a configurable period (days to weeks), enabling multiple independent consumers and event replay.
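The key properties (an append-only log, retention, independent consumers with their own offsets, and replay) can be illustrated with a toy in-memory topic. This is a conceptual sketch only, nothing like Kafka's real client API, and the event names are invented.

```python
from collections import defaultdict

class ToyTopic:
    """In-memory sketch of a single-partition topic: an append-only log
    with per-consumer-group offsets. Illustrative, not Kafka's API."""
    def __init__(self):
        self.log = []                    # retained events
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, event):
        self.log.append(event)

    def consume(self, group, max_events=10):
        start = self.offsets[group]
        batch = self.log[start:start + max_events]
        self.offsets[group] += len(batch)  # commit the new offset
        return batch

topic = ToyTopic()
for e in ["order_created", "order_paid", "order_shipped"]:
    topic.produce(e)

# Two independent consumer groups read the same retained events.
print(topic.consume("warehouse_loader"))  # all three events
print(topic.consume("dashboard"))         # same three events, independent offset

# Replay: reset a group's offset to re-read history.
topic.offsets["dashboard"] = 0
replayed = topic.consume("dashboard")
print(replayed)
```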
Common patterns: application events (orders, clicks, signups) streamed to Kafka, consumed by a Flink or Spark Streaming job, aggregated, and written to the warehouse in near real-time. Separate consumer reads the same events and writes to an operational database for real-time dashboards.
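The aggregation step in that pattern often amounts to windowing: bucketing events into fixed time windows and reducing per window. Here is a plain-Python sketch of a tumbling-window sum, standing in for what a Flink or Spark Streaming job would do continuously; the `(timestamp, amount)` event shape is an assumption for illustration.

```python
from collections import defaultdict

def tumbling_window_totals(events, window_seconds=60):
    """Bucket (timestamp, amount) events into fixed windows and sum each one.
    A sketch of a streaming job's aggregation step, run here over a batch."""
    totals = defaultdict(float)
    for ts, amount in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        totals[window_start] += amount
    return dict(totals)

events = [(0, 10.0), (30, 5.0), (65, 2.5), (130, 1.0)]  # (epoch seconds, order total)
print(tumbling_window_totals(events))  # {0: 15.0, 60: 2.5, 120: 1.0}
```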
Kafka complexity is real: managing partitions, consumer groups, schema registry, and offset management requires operational expertise. Cloud managed options (Confluent Cloud, AWS MSK, Redpanda Cloud) significantly reduce this burden.
Frequently Asked Questions
What is the difference between ETL and ELT?
ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse. ELT is the modern standard for cloud data warehouses because storage is cheap, warehouse compute is powerful, and preserving raw data enables rebuilding transformations when requirements change.
What is dbt?
dbt (data build tool) is the standard tool for data transformation in modern data stacks. You write SQL SELECT statements as dbt models; dbt wraps them in CREATE TABLE or VIEW statements and runs them in your data warehouse. dbt manages dependencies between models, generates documentation, runs data quality tests, and provides complete lineage from source tables to final analytics models.
Do I need Apache Kafka for data pipelines?
Not unless you need near real-time data (latency under minutes). Most analytics workloads are well-served by daily or hourly batch pipelines. Use Kafka when: you need to process events in real-time (fraud detection, real-time recommendations), your data volumes exceed what batch ingestion can handle within your latency budget, or you need multiple independent consumers of the same event stream.
What tools does a data engineer need to know in 2026?
Core skills: SQL (essential), Python (for pipeline code and orchestration), dbt (transformation), Airflow or Prefect (orchestration), and one cloud data warehouse (BigQuery, Snowflake, or Redshift). Secondary skills: Kafka (streaming), Spark (large-scale processing), Terraform (infrastructure as code), and at least one ingestion tool (Fivetran or Airbyte). Start with SQL, Python, dbt, and Airflow before adding streaming tools.
Data pipelines are the infrastructure of the data-driven company. Get the skills.
Note: Information reflects early 2026.