Key Takeaways
- Pandas is the foundation: Pandas is the most-used data manipulation library in Python. Every data scientist and analyst needs to know it — it is the first tool you reach for when data arrives as a CSV, Excel file, or SQL query result.
- loc vs iloc: loc selects by label (row/column names). iloc selects by integer position. These are the two most important selection methods and the most commonly confused by beginners.
- Apply is slow — use vectorization: Using .apply() with a Python function is often 10-100x slower than the equivalent vectorized operation. Before using apply, check if a built-in Pandas or NumPy function does the same thing.
- Chain carefully: Method chaining (df.dropna().reset_index().rename(...)) is readable but can hide bugs. Break the chain at key checkpoints by assigning to a variable, and call .copy() when you create a subset you intend to modify, to avoid SettingWithCopyWarning.
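To make the vectorization takeaway concrete, here is a minimal sketch with made-up data: both lines compute the same result, but the second runs as a single NumPy operation instead of 100,000 Python function calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": np.random.default_rng(0).integers(1, 1000, 100_000)})

# Slow: .apply() calls a Python function once per row
slow = df["amount"].apply(lambda x: x * 1.08)

# Fast: one vectorized operation over the whole column
fast = df["amount"] * 1.08

assert slow.equals(fast)
```

Timing this with %timeit in a notebook typically shows the vectorized version winning by one to two orders of magnitude, which is where the 10-100x figure comes from.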
Pandas is the first library every data scientist or analyst needs to learn. It loads your data, lets you inspect it, clean it, transform it, aggregate it, and export it — all with a consistent, chainable API.
This cheat sheet covers the commands you will use every day, with copy-paste examples that work on real data.
Reading Data Into a DataFrame
```python
import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', parse_dates=['date_col'], index_col='id')

# Read Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from SQL
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@host/db')
df = pd.read_sql("SELECT * FROM orders WHERE status = 'active'", engine)

# Quick inspection
df.shape        # (rows, cols)
df.dtypes       # column data types
df.describe()   # summary statistics for numeric columns
df.info()       # memory usage + null counts
df.head(10)     # first 10 rows
df.sample(5)    # 5 random rows
```
Selecting and Filtering Data
```python
# Select columns
df['col']              # Series
df[['col1', 'col2']]   # DataFrame

# loc: label-based selection
df.loc[0]                          # row with label 0
df.loc[0:5, 'col1':'col3']         # rows 0-5, cols col1 to col3 (both ends inclusive)
df.loc[df['status'] == 'active']   # boolean filter

# iloc: position-based selection
df.iloc[0]        # first row
df.iloc[:5, :3]   # first 5 rows, first 3 columns

# Boolean filtering
df[df['amount'] > 100]
df[(df['amount'] > 100) & (df['status'] == 'active')]
df[df['category'].isin(['A', 'B'])]
df[df['name'].str.contains('Smith', case=False)]
```
Cleaning: Nulls, Duplicates, Types
```python
# Check for nulls
df.isnull().sum()             # null count per column
df.isnull().sum() / len(df)   # null fraction per column

# Handle nulls
df.dropna()                           # drop rows with ANY null
df.dropna(subset=['col1'])            # drop only where col1 is null
df['col'].fillna(0)                   # fill nulls with a value
df['col'].ffill()                     # forward fill (fillna(method='ffill') is deprecated)
df['col'].fillna(df['col'].median())  # fill with median

# Duplicates
df.duplicated().sum()                  # count duplicate rows
df.drop_duplicates()                   # remove duplicate rows
df.drop_duplicates(subset=['email'])   # dedupe by column

# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['category'] = df['category'].astype('category')   # saves memory
```
GroupBy and Aggregation
```python
# Basic groupby
df.groupby('category')['amount'].sum()
df.groupby('category')['amount'].agg(['sum', 'mean', 'count'])

# Multiple aggregations with named outputs
df.groupby('category').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    order_count=('id', 'count')
).reset_index()

# GroupBy with transform (add aggregate back to original df)
df['category_total'] = df.groupby('category')['amount'].transform('sum')

# Value counts
df['status'].value_counts()
df['status'].value_counts(normalize=True)   # as proportions (0-1)
```
Merging and Joining DataFrames
```python
# Merge (like SQL JOIN)
merged = pd.merge(df1, df2, on='user_id')               # inner join (default)
merged = pd.merge(df1, df2, on='user_id', how='left')   # left join
merged = pd.merge(df1, df2, left_on='id', right_on='user_id')

# Concatenate
combined = pd.concat([df1, df2])           # stack vertically
combined = pd.concat([df1, df2], axis=1)   # concatenate columns

# Pivot table
pivot = df.pivot_table(
    values='amount',
    index='category',
    columns='month',
    aggfunc='sum',
    fill_value=0
)
```
Time Series Operations
```python
# Parse dates and set as index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Resample: aggregate by time period
df.resample('ME').sum()    # monthly totals ('M' in pandas < 2.2)
df.resample('W').mean()    # weekly averages
df.resample('QE').agg({'revenue': 'sum', 'customers': 'count'})   # quarterly

# Extract date components from the DatetimeIndex
# (on a datetime column, use df['date'].dt.year etc. instead)
df['year'] = df.index.year
df['month'] = df.index.month
df['day_of_week'] = df.index.dayofweek
df['quarter'] = df.index.quarter

# Rolling averages
df['7day_avg'] = df['revenue'].rolling(7).mean()
df['30day_avg'] = df['revenue'].rolling(30).mean()
```
Frequently Asked Questions
What is the difference between loc and iloc in Pandas?
loc selects rows and columns by label (index names, column names). iloc selects by integer position (0, 1, 2...). Use loc when you know the index or column names. Use iloc when you want to select by position regardless of labels. The key difference: loc includes the end of a slice (df.loc[0:5] includes row 5), while iloc excludes it (df.iloc[0:5] returns rows 0-4).
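The inclusive-vs-exclusive slicing difference is easy to see on a toy Series with a default integer index:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

# loc slices by label: the end label 5 IS included
print(s.loc[0:5].tolist())    # [10, 20, 30, 40, 50, 60]

# iloc slices by position: the end position 5 is NOT included
print(s.iloc[0:5].tolist())   # [10, 20, 30, 40, 50]
```

With a default RangeIndex the labels and positions coincide, which is exactly when beginners get surprised: the two calls look identical but return different lengths.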
How do I handle missing values in Pandas?
Use df.isnull().sum() to find null counts per column. Then choose a strategy: dropna() to remove rows with nulls, fillna(value) to fill with a constant, fillna(df[col].mean()) to fill with the column mean, or .ffill() to forward-fill from the previous non-null value (fillna(method='ffill') is deprecated in recent pandas). The right strategy depends on why the data is missing — do not blindly drop rows or fill with zeros without understanding the data.
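For example, filling with the column median on a toy Series (median of the non-null values 100 and 300 is 200):

```python
import pandas as pd

df = pd.DataFrame({'amount': [100.0, None, 300.0, None]})

# Median ignores nulls, so it is computed from [100.0, 300.0]
filled = df['amount'].fillna(df['amount'].median())
print(filled.tolist())   # [100.0, 200.0, 300.0, 200.0]
```

Median is often preferred over mean here because it is robust to outliers in the non-null values.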
Why is my Pandas code slow?
Common causes: using .apply() with a Python function (10-100x slower than vectorized operations), reading a large CSV without specifying dtypes (wastes memory and time), loading more columns than needed (use usecols parameter), and not using categorical dtype for low-cardinality string columns. For large datasets (>1 GB), consider Polars, DuckDB, or Dask as faster alternatives.
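The categorical-dtype point is easy to verify: for a column with few unique values, pandas stores each unique string once plus small integer codes, instead of one Python string object per row. A quick sketch with synthetic data:

```python
import pandas as pd

# 100,000 rows but only 2 unique values: ideal for the category dtype
s_obj = pd.Series(['active', 'inactive'] * 50_000)
s_cat = s_obj.astype('category')

# deep=True counts the actual string storage, not just pointers
print(s_obj.memory_usage(deep=True))   # large: one Python string per row
print(s_cat.memory_usage(deep=True))   # small: 2 strings + int8 codes

assert s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True)
```

The same idea applies at load time: passing dtype={'status': 'category'} (and usecols for the columns you need) to pd.read_csv avoids materializing the wasteful representation in the first place.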
How do I save a DataFrame to CSV or Excel?
Use df.to_csv('output.csv', index=False) to save without the row index. Use df.to_excel('output.xlsx', index=False) for Excel. For large files, use df.to_parquet('output.parquet') — Parquet is 2-10x smaller than CSV, much faster to read, and preserves data types.
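A minimal round trip with hypothetical data, written to a temporary directory, showing why index=False matters: without it, the row index is written as an extra unnamed column that reappears on the next read.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'amount': [9.99, 24.50]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'output.csv'
    df.to_csv(path, index=False)   # index=False keeps the row index out of the file
    back = pd.read_csv(path)

assert back.equals(df)   # same columns, values, and inferred dtypes
```

The same index=False logic applies to to_excel; to_parquet preserves the index (and exact dtypes) natively, which is one reason it round-trips more faithfully than CSV.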
Data manipulation is the first skill every analyst needs. Get the skills.
Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.
Note: Information reflects early 2026. Verify details directly with relevant sources.