In this lesson you'll build a data exploration script that loads a sales CSV, prints a full diagnostic summary (shape, dtypes, null counts, top values), then filters and exports two subsets: high-value orders and orders from a specific region.
Installing pandas
If you haven't already, install pandas in your Python environment:
pip install pandas openpyxl
The openpyxl package lets pandas read Excel files. You'll need it for today's Excel example.
Check your version: Run python -c "import pandas as pd; print(pd.__version__)" — you want 1.5 or higher. 2.x is ideal.
Reading files into DataFrames
A DataFrame is a table — rows and columns — held in memory. You create one by reading a file.
import pandas as pd
# CSV — the most common format
df = pd.read_csv("sales.csv")
# CSV with custom separator (semicolons, tabs, etc.)
df = pd.read_csv("data.csv", sep=";")
# Excel — specify the sheet name
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")
# JSON — works for both arrays and nested objects
df = pd.read_json("orders.json")
# From a URL (no download needed)
url = "https://raw.githubusercontent.com/.../data.csv"
df = pd.read_csv(url)
All of these return the same thing: a DataFrame. The API is consistent no matter where the data comes from.
Quick tip: read_csv accepts file paths, URLs, and even file-like objects. If your data is behind an API, you can fetch it with requests and pass the content directly to pandas.
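For instance, CSV text already in memory (say, the body of an API response) can be wrapped in io.StringIO, which gives read_csv the file-like object it expects. The column names below are invented for illustration:

```python
import io
import pandas as pd

# Stand-in for text you might have fetched with requests (e.g. resp.text)
csv_text = "order_id,revenue\n1,250\n2,1200\n"

# StringIO wraps the string in a file-like object read_csv can consume
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```

The same trick works for any source that hands you raw bytes or text instead of a file on disk.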
Understanding what you loaded
Before you do anything with data, you need to know what you have. These five commands tell you everything:
# Shape: (rows, columns)
print(df.shape) # e.g. (10420, 8)
# First 5 rows
print(df.head())
# Column names, data types, and non-null counts
df.info()
# Numeric summary: mean, std, min, max, quartiles
print(df.describe())
# Count nulls per column
print(df.isnull().sum())
Run these on any new dataset before touching anything else. They answer: how big is this? what types are the columns? where are the holes? A few more commands round out the picture:
# Value counts for a categorical column
print(df["region"].value_counts())
# Unique values in a column
print(df["status"].unique())
# Data types of every column
print(df.dtypes)
# Percentage of nulls per column
print(df.isnull().mean() * 100)
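The snippets above assume a df already exists, so here is a self-contained run on a tiny hand-made DataFrame (the data is invented) showing the null diagnostics end to end:

```python
import pandas as pd

# Tiny invented dataset with a deliberate hole in each column
df = pd.DataFrame({
    "region": ["West", "East", "West", None],
    "revenue": [100.0, 250.0, None, 75.0],
})

print(df.shape)                    # (4, 2)
print(df.isnull().sum())           # one null per column
print(df.isnull().mean() * 100)    # 25.0% nulls in each column
print(df["region"].value_counts()) # West: 2, East: 1 — nulls excluded
```

Note that value_counts() silently drops missing values, which is why the region counts sum to 3, not 4.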
loc, iloc, and boolean filtering
Three ways to pull data from a DataFrame. Each has a specific use case.
Column selection
# Single column → returns a Series
revenue = df["revenue"]
# Multiple columns → returns a DataFrame
subset = df[["order_id", "revenue", "region"]]
loc — label-based selection
# loc[row_label, column_label]
# Select rows labeled 0–4 — loc slices include BOTH endpoints, so this is 5 rows
df.loc[0:4]
# Select specific rows and columns by name
df.loc[0:4, ["order_id", "revenue"]]
# Select a single cell
df.loc[5, "revenue"]
iloc — position-based selection
# iloc[row_position, column_position] — uses integers; slices exclude the stop, like Python lists
df.iloc[0:5] # first 5 rows
df.iloc[:, 0:3] # all rows, first 3 columns
df.iloc[-5:] # last 5 rows
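The inclusive-vs-exclusive difference between loc and iloc trips people up, and it's easy to verify on a throwaway DataFrame:

```python
import pandas as pd

# Default integer index 0..9, so labels and positions happen to coincide
df = pd.DataFrame({"x": range(10)})

print(len(df.loc[0:4]))   # 5 — labels 0 through 4, endpoint included
print(len(df.iloc[0:4]))  # 4 — positions 0 through 3, endpoint excluded
```

With a non-integer index (e.g. dates or names) only loc slicing works by label, which is the main reason both exist.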
Boolean filtering — the one you'll use most
# Rows where revenue > 1000
high_value = df[df["revenue"] > 1000]
# Multiple conditions — use & (and) and | (or)
west_high = df[(df["region"] == "West") & (df["revenue"] > 500)]
# Filter by list of values
key_regions = df[df["region"].isin(["West", "East"])]
# Filter rows where a column contains a string
# na=False treats missing values as non-matches instead of raising an error
returns = df[df["status"].str.contains("returned", case=False, na=False)]
Common mistake: When combining conditions, always wrap each condition in parentheses. Because & binds more tightly than ==, df[a == 1 & b == 2] is parsed as df[a == (1 & b) == 2] and raises a confusing error. df[(a == 1) & (b == 2)] works.
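To see the masks act on real rows, here is a miniature version of the filters above (column names match the earlier examples; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["West", "East", "South", "West"],
    "revenue": [1200, 400, 900, 700],
})

# Each condition parenthesized, joined with &
west_high = df[(df["region"] == "West") & (df["revenue"] > 500)]
print(len(west_high))  # 2 — both West rows clear the 500 bar

# isin keeps any row whose region is in the list
key_regions = df[df["region"].isin(["West", "East"])]
print(len(key_regions))  # 3
```

The mask itself (df["revenue"] > 500) is just a Series of True/False values; indexing with it keeps the True rows.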
Build a data diagnostic script
Download any CSV from Kaggle or use your own data. Write a script that:
- Loads the file and prints shape, column names, and dtypes
- Prints the count and percentage of nulls for every column
- Filters to rows where a numeric column is above its mean
- Saves those rows to a new CSV with df.to_csv("output.csv", index=False)
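A minimal sketch of the exercise, assuming a numeric column named "revenue" (it writes its own tiny sample CSV first so it runs anywhere; swap in your Kaggle file and column name):

```python
import pandas as pd

# Stand-in for your downloaded CSV — replace with your own file
pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "revenue": [100.0, 900.0, None, 400.0],
}).to_csv("sales.csv", index=False)

df = pd.read_csv("sales.csv")

# Diagnostics: size, columns, types
print(df.shape)
print(list(df.columns))
print(df.dtypes)

# Nulls: count and percentage per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)

# Rows above the column mean (NaNs are skipped by .mean() and fail the comparison)
above_mean = df[df["revenue"] > df["revenue"].mean()]
above_mean.to_csv("output.csv", index=False)
print(len(above_mean))  # 1 — mean is ~466.7, only the 900 row clears it
```

The structure is the whole point: inspect first, filter second, export last. Real datasets will need more cleaning before the filter step, which is exactly what Day 2 covers.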
What You Learned Today
- Load CSV, Excel, and JSON files into DataFrames
- Inspect shape, dtypes, nulls, and value distributions
- Select columns by name with bracket notation
- Use loc (label-based) and iloc (position-based) indexing
- Filter rows with boolean masks and isin()
Day 2: Cleaning Data
Tomorrow you'll handle real-world messy data: missing values, duplicates, and wrong data types.
Start Day 2