Pandas Cheat Sheet [2026]: Every Command You Need

In This Guide

  1. Reading Data Into a DataFrame
  2. Selecting and Filtering Data
  3. Cleaning: Nulls, Duplicates, Types
  4. GroupBy and Aggregation
  5. Merging and Joining DataFrames
  6. Time Series Operations
  7. Frequently Asked Questions

Key Takeaways

Pandas is the first library every data scientist or analyst needs to learn. It lets you load, inspect, clean, transform, aggregate, and export data with a consistent, chainable API.

This cheat sheet covers the commands you will use every day, with copy-paste examples that work on real data.

Reading Data Into a DataFrame

import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', parse_dates=['date_col'], index_col='id')

# Read Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from SQL
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@host/db')
df = pd.read_sql("SELECT * FROM orders WHERE status = 'active'", engine)

# Quick inspection
df.shape          # (rows, cols)
df.dtypes         # column data types
df.describe()     # summary statistics for numeric columns
df.info()         # memory usage + null counts
df.head(10)       # first 10 rows
df.sample(5)      # 5 random rows

Selecting and Filtering Data

# Select columns
df['col']              # Series
df[['col1', 'col2']]  # DataFrame

# loc: label-based selection
df.loc[0]                          # row with label 0
df.loc[0:5, 'col1':'col3']       # rows 0-5, cols col1 to col3
df.loc[df['status'] == 'active'] # boolean filter

# iloc: position-based selection
df.iloc[0]      # first row
df.iloc[:5, :3] # first 5 rows, first 3 columns

# Boolean filtering
df[df['amount'] > 100]
df[(df['amount'] > 100) & (df['status'] == 'active')]
df[df['category'].isin(['A', 'B'])]
df[df['name'].str.contains('Smith', case=False, na=False)] # na=False treats null names as non-matches

Cleaning: Nulls, Duplicates, Types

# Check for nulls
df.isnull().sum()            # null count per column
df.isnull().sum() / len(df)  # null percentage per column

# Handle nulls
df.dropna()                     # drop rows with ANY null
df.dropna(subset=['col1'])     # drop only where col1 is null
df['col'].fillna(0)           # fill nulls with value
df['col'].ffill()             # forward fill (fillna(method='ffill') is deprecated)
df['col'].fillna(df['col'].median()) # fill with median

# Duplicates
df.duplicated().sum()        # count duplicate rows
df.drop_duplicates()         # remove duplicate rows
df.drop_duplicates(subset=['email']) # dedupe by column

# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['category'] = df['category'].astype('category') # saves memory

GroupBy and Aggregation

# Basic groupby
df.groupby('category')['amount'].sum()
df.groupby('category')['amount'].agg(['sum', 'mean', 'count'])

# Multiple aggregations with named outputs
df.groupby('category').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    order_count=('id', 'count')
).reset_index()

# GroupBy with transform (add aggregate back to original df)
df['category_total'] = df.groupby('category')['amount'].transform('sum')

# Value counts
df['status'].value_counts()
df['status'].value_counts(normalize=True) # as percentages

Merging and Joining DataFrames

# Merge (like SQL JOIN)
merged = pd.merge(df1, df2, on='user_id')         # inner join
merged = pd.merge(df1, df2, on='user_id', how='left') # left join
merged = pd.merge(df1, df2, left_on='id', right_on='user_id')

# Concatenate
combined = pd.concat([df1, df2], ignore_index=True) # stack vertically, renumber index
combined = pd.concat([df1, df2], axis=1)            # side by side, aligned on index

# Pivot table
pivot = df.pivot_table(
    values='amount',
    index='category',
    columns='month',
    aggfunc='sum',
    fill_value=0
)

Time Series Operations

# Parse dates and set as index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Resample: aggregate by time period
df.resample('ME').sum()  # month-end totals ('M' is deprecated since pandas 2.2)
df.resample('W').mean()  # weekly averages
df.resample('QE').agg({'revenue': 'sum', 'customers': 'count'}) # quarterly

# Extract date components (after set_index above, 'date' lives on the index)
df['year'] = df.index.year
df['month'] = df.index.month
df['day_of_week'] = df.index.dayofweek
df['quarter'] = df.index.quarter

# Rolling averages
df['7day_avg'] = df['revenue'].rolling(7).mean()
df['30day_avg'] = df['revenue'].rolling(30).mean()

Frequently Asked Questions

What is the difference between loc and iloc in Pandas?

loc selects rows and columns by label (index names, column names). iloc selects by integer position (0, 1, 2...). Use loc when you know the index or column names. Use iloc when you want to select by position regardless of labels. The key difference: loc includes the end of a slice (df.loc[0:5] includes row 5), while iloc excludes it (df.iloc[0:5] returns rows 0-4).
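A quick way to see the slicing difference, using a small hypothetical frame:

```python
import pandas as pd

# Hypothetical 6-row frame with the default 0..5 integer index.
df = pd.DataFrame({'val': [10, 20, 30, 40, 50, 60]})

rows_loc = df.loc[0:5]    # 6 rows: loc includes the end label
rows_iloc = df.iloc[0:5]  # 5 rows: iloc excludes the end position
```

With a default integer index the labels and positions happen to coincide, which is exactly why the off-by-one in slice endpoints surprises people.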

How do I handle missing values in Pandas?

Use df.isnull().sum() to find null counts per column. Then choose a strategy: dropna() to remove rows with nulls, fillna(value) to fill with a constant, fillna(df[col].mean()) to fill with the column mean, or ffill() to forward-fill from the previous non-null value (fillna(method='ffill') is deprecated in pandas 2.x). The right strategy depends on why the data is missing; do not blindly drop rows or fill with zeros without understanding the data.
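A minimal sketch of these strategies on a hypothetical Series with gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

null_count = s.isnull().sum()        # 2
filled_zero = s.fillna(0)            # [1, 0, 3, 0, 5]
filled_mean = s.fillna(s.mean())     # mean of non-nulls is 3.0
filled_ffill = s.ffill()             # carry last observation forward
```

Note how each strategy produces a different series from the same input; which one is right depends entirely on what the missing values mean in your data.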

Why is my Pandas code slow?

Common causes: using .apply() with a Python function (10-100x slower than vectorized operations), reading a large CSV without specifying dtypes (wastes memory and time), loading more columns than needed (use usecols parameter), and not using categorical dtype for low-cardinality string columns. For large datasets (>1 GB), consider Polars, DuckDB, or Dask as faster alternatives.
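A small sketch contrasting a vectorized column operation with a per-row apply, and showing the categorical memory win. The column names and sizes here are invented for illustration:

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    'amount': np.random.rand(n),
    'status': np.random.choice(['active', 'inactive'], n),
})

# Vectorized: one C-level operation over the whole column.
df['with_tax'] = df['amount'] * 1.08

# Equivalent result, but a Python function call per row (much slower):
# df['with_tax'] = df['amount'].apply(lambda x: x * 1.08)

# Categorical dtype stores each distinct string once and keeps
# a small integer code per row, shrinking low-cardinality columns.
bytes_object = df['status'].memory_usage(deep=True)
bytes_category = df['status'].astype('category').memory_usage(deep=True)
```

The categorical version of a two-value string column is a fraction of the size of the object-dtype original, which also speeds up groupby and comparisons on that column.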

How do I save a DataFrame to CSV or Excel?

Use df.to_csv('output.csv', index=False) to save without the row index. Use df.to_excel('output.xlsx', index=False) for Excel. For large files, use df.to_parquet('output.parquet') — Parquet is 2-10x smaller than CSV, much faster to read, and preserves data types.
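One detail worth demonstrating is why index=False matters: writing the default index produces a spurious "Unnamed: 0" column when the file is read back. In-memory buffers stand in for real files in this sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Default to_csv writes the row index as an unnamed first column...
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
cols_with_index = pd.read_csv(buf).columns.tolist()  # ['Unnamed: 0', 'a', 'b']

# ...index=False keeps the round trip clean.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
cols_clean = pd.read_csv(buf).columns.tolist()       # ['a', 'b']
```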

Data manipulation is the first skill every analyst needs. Get the skills.

Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.

Reserve Your Seat

Note: Information reflects early 2026. Verify details directly with relevant sources.


Bo Peng

AI Instructor & Founder, Precision AI Academy

Bo has trained 400+ professionals in applied AI across federal agencies and Fortune 500 companies. He founded Precision AI Academy to bridge the gap between AI theory and real-world application.