Key Takeaways
- Pandas is the foundation: Pandas is the most-used data manipulation library in Python. Every data scientist and analyst needs to know it — it is the first tool you reach for when data arrives as a CSV, Excel file, or SQL query result.
- loc vs iloc: loc selects by label (row/column names). iloc selects by integer position. These are the two most important selection methods and the most commonly confused by beginners.
- Apply is slow — use vectorization: Using .apply() with a Python function is often 10-100x slower than the equivalent vectorized operation. Before using apply, check if a built-in Pandas or NumPy function does the same thing.
- Chain carefully: Method chaining (df.dropna().reset_index().rename(...)) is readable but can hide bugs. Break the chain at key checkpoints by assigning to a variable, and call .copy() when you create a subset you intend to modify, to avoid SettingWithCopyWarning.
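To make the vectorization takeaway concrete, here is a minimal sketch with made-up data: both lines compute the same result, but the second runs as a single NumPy operation instead of 100,000 Python function calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": np.random.default_rng(0).integers(1, 1000, 100_000)})

# Slow: .apply() calls a Python function once per row
slow = df["amount"].apply(lambda x: x * 1.08)

# Fast: one vectorized operation over the whole column
fast = df["amount"] * 1.08

assert slow.equals(fast)
```

Timing this with %timeit in a notebook typically shows the vectorized version winning by one to two orders of magnitude, which is where the 10-100x figure comes from.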
Pandas is the first library every data scientist or analyst needs to learn. It loads your data, lets you inspect it, clean it, transform it, aggregate it, and export it — all with a consistent, chainable API.
This cheat sheet covers the commands you will use every day, with copy-paste examples that work on real data.
Reading Data Into a DataFrame
```python
import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', parse_dates=['date_col'], index_col='id')

# Read Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from SQL
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@host/db')
df = pd.read_sql("SELECT * FROM orders WHERE status = 'active'", engine)

# Quick inspection
df.shape        # (rows, cols)
df.dtypes       # column data types
df.describe()   # summary statistics for numeric columns
df.info()       # memory usage + null counts
df.head(10)     # first 10 rows
df.sample(5)    # 5 random rows
```
Selecting and Filtering Data
```python
# Select columns
df['col']              # Series
df[['col1', 'col2']]   # DataFrame

# loc: label-based selection
df.loc[0]                          # row with label 0
df.loc[0:5, 'col1':'col3']         # rows 0-5, cols col1 to col3 (both ends inclusive)
df.loc[df['status'] == 'active']   # boolean filter

# iloc: position-based selection
df.iloc[0]        # first row
df.iloc[:5, :3]   # first 5 rows, first 3 columns

# Boolean filtering
df[df['amount'] > 100]
df[(df['amount'] > 100) & (df['status'] == 'active')]
df[df['category'].isin(['A', 'B'])]
df[df['name'].str.contains('Smith', case=False)]
```
Cleaning: Nulls, Duplicates, Types
```python
# Check for nulls
df.isnull().sum()             # null count per column
df.isnull().sum() / len(df)   # null fraction per column

# Handle nulls
df.dropna()                           # drop rows with ANY null
df.dropna(subset=['col1'])            # drop only where col1 is null
df['col'].fillna(0)                   # fill nulls with a value
df['col'].ffill()                     # forward fill (fillna(method='ffill') is deprecated)
df['col'].fillna(df['col'].median())  # fill with median

# Duplicates
df.duplicated().sum()                  # count duplicate rows
df.drop_duplicates()                   # remove duplicate rows
df.drop_duplicates(subset=['email'])   # dedupe by column

# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['category'] = df['category'].astype('category')   # saves memory
```
GroupBy and Aggregation
```python
# Basic groupby
df.groupby('category')['amount'].sum()
df.groupby('category')['amount'].agg(['sum', 'mean', 'count'])

# Multiple aggregations with named outputs
df.groupby('category').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    order_count=('id', 'count')
).reset_index()

# GroupBy with transform (add aggregate back to original df)
df['category_total'] = df.groupby('category')['amount'].transform('sum')

# Value counts
df['status'].value_counts()
df['status'].value_counts(normalize=True)   # as proportions (0-1)
```
Merging and Joining DataFrames
```python
# Merge (like SQL JOIN)
merged = pd.merge(df1, df2, on='user_id')               # inner join (default)
merged = pd.merge(df1, df2, on='user_id', how='left')   # left join
merged = pd.merge(df1, df2, left_on='id', right_on='user_id')

# Concatenate
combined = pd.concat([df1, df2])           # stack vertically
combined = pd.concat([df1, df2], axis=1)   # concatenate columns

# Pivot table
pivot = df.pivot_table(
    values='amount',
    index='category',
    columns='month',
    aggfunc='sum',
    fill_value=0
)
```
Time Series Operations
```python
# Parse dates and set as index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Resample: aggregate by time period
df.resample('ME').sum()    # monthly totals ('M' in pandas < 2.2)
df.resample('W').mean()    # weekly averages
df.resample('QE').agg({'revenue': 'sum', 'customers': 'count'})   # quarterly

# Extract date components from the DatetimeIndex
# (on a datetime column, use df['date'].dt.year etc. instead)
df['year'] = df.index.year
df['month'] = df.index.month
df['day_of_week'] = df.index.dayofweek
df['quarter'] = df.index.quarter

# Rolling averages
df['7day_avg'] = df['revenue'].rolling(7).mean()
df['30day_avg'] = df['revenue'].rolling(30).mean()
```
Frequently Asked Questions
What is the difference between loc and iloc in Pandas?
loc selects rows and columns by label (index names, column names). iloc selects by integer position (0, 1, 2...). Use loc when you know the index or column names. Use iloc when you want to select by position regardless of labels. The key difference: loc includes the end of a slice (df.loc[0:5] includes row 5), while iloc excludes it (df.iloc[0:5] returns rows 0-4).
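The inclusive-vs-exclusive slicing difference is easy to see on a toy Series with a default integer index:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

# loc slices by label: the end label 5 IS included
print(s.loc[0:5].tolist())    # [10, 20, 30, 40, 50, 60]

# iloc slices by position: the end position 5 is NOT included
print(s.iloc[0:5].tolist())   # [10, 20, 30, 40, 50]
```

With a default RangeIndex the labels and positions coincide, which is exactly when beginners get surprised: the two calls look identical but return different lengths.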
How do I handle missing values in Pandas?
Use df.isnull().sum() to find null counts per column. Then choose a strategy: dropna() to remove rows with nulls, fillna(value) to fill with a constant, fillna(df[col].mean()) to fill with the column mean, or .ffill() to forward-fill from the previous non-null value (fillna(method='ffill') is deprecated in recent pandas). The right strategy depends on why the data is missing — do not blindly drop rows or fill with zeros without understanding the data.
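For example, filling with the column median on a toy Series (median of the non-null values 100 and 300 is 200):

```python
import pandas as pd

df = pd.DataFrame({'amount': [100.0, None, 300.0, None]})

# Median ignores nulls, so it is computed from [100.0, 300.0]
filled = df['amount'].fillna(df['amount'].median())
print(filled.tolist())   # [100.0, 200.0, 300.0, 200.0]
```

Median is often preferred over mean here because it is robust to outliers in the non-null values.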
Why is my Pandas code slow?
Common causes: using .apply() with a Python function (10-100x slower than vectorized operations), reading a large CSV without specifying dtypes (wastes memory and time), loading more columns than needed (use usecols parameter), and not using categorical dtype for low-cardinality string columns. For large datasets (>1 GB), consider Polars, DuckDB, or Dask as faster alternatives.
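The categorical-dtype point is easy to verify: for a column with few unique values, pandas stores each unique string once plus small integer codes, instead of one Python string object per row. A quick sketch with synthetic data:

```python
import pandas as pd

# 100,000 rows but only 2 unique values: ideal for the category dtype
s_obj = pd.Series(['active', 'inactive'] * 50_000)
s_cat = s_obj.astype('category')

# deep=True counts the actual string storage, not just pointers
print(s_obj.memory_usage(deep=True))   # large: one Python string per row
print(s_cat.memory_usage(deep=True))   # small: 2 strings + int8 codes

assert s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True)
```

The same idea applies at load time: passing dtype={'status': 'category'} (and usecols for the columns you need) to pd.read_csv avoids materializing the wasteful representation in the first place.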
How do I save a DataFrame to CSV or Excel?
Use df.to_csv('output.csv', index=False) to save without the row index. Use df.to_excel('output.xlsx', index=False) for Excel. For large files, use df.to_parquet('output.parquet') — Parquet is 2-10x smaller than CSV, much faster to read, and preserves data types.
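A minimal round trip with hypothetical data, written to a temporary directory, showing why index=False matters: without it, the row index is written as an extra unnamed column that reappears on the next read.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'amount': [9.99, 24.50]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'output.csv'
    df.to_csv(path, index=False)   # index=False keeps the row index out of the file
    back = pd.read_csv(path)

assert back.equals(df)   # same columns, values, and inferred dtypes
```

The same index=False logic applies to to_excel; to_parquet preserves the index (and exact dtypes) natively, which is one reason it round-trips more faithfully than CSV.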
Data manipulation is the first skill every analyst needs. Get the skills.
Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.
Note: Information reflects early 2026. Verify details directly with relevant sources.