Pandas Masterclass · Day 1 of 5 ~60 minutes

DataFrames — Loading, Inspecting, Selecting

Get data into pandas, understand what you're working with, and pull out exactly the rows and columns you need.

What You'll Build Today

A data exploration script that loads a sales CSV, prints a full diagnostic summary (shape, dtypes, null counts, top values), then filters and exports two subsets: high-value orders and orders from a specific region.

1. Setup

Installing pandas

If you haven't already, install pandas in your Python environment:

bash
pip install pandas openpyxl

The openpyxl package lets pandas read Excel files. You'll need it for read_excel later in this lesson.

Check your version: Run python -c "import pandas as pd; print(pd.__version__)" — you want 1.5 or higher. 2.x is ideal.

2. Loading Data

Reading files into DataFrames

A DataFrame is a table — rows and columns — held in memory. You create one by reading a file.

python · load.py
import pandas as pd

# CSV — the most common format
df = pd.read_csv("sales.csv")

# CSV with custom separator (semicolons, tabs, etc.)
df = pd.read_csv("data.csv", sep=";")

# Excel — specify the sheet name
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")

# JSON — works for both arrays and nested objects
df = pd.read_json("orders.json")

# From a URL (no download needed)
url = "https://raw.githubusercontent.com/.../data.csv"
df = pd.read_csv(url)

All of these return the same thing: a DataFrame. The API is consistent no matter where the data comes from.

Quick tip: read_csv accepts file paths, URLs, and even file-like objects. If your data is behind an API, you can fetch it with requests and pass the content directly to pandas.
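A minimal sketch of that last idea: read_csv accepts any file-like object, so text fetched from an API can be wrapped in io.StringIO and handed straight to pandas. The csv_text payload below is a made-up stand-in for something like requests.get(url).text.

```python
import io

import pandas as pd

# Hypothetical CSV payload; in practice this might come from requests.get(url).text
csv_text = "order_id,revenue\n1,120.5\n2,89.0\n"

# read_csv accepts file-like objects, not just paths and URLs
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```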

3. Inspection

Understanding what you loaded

Before you do anything with data, you need to know what you have. These five commands tell you everything:

python · inspect.py
# Shape: (rows, columns)
print(df.shape)         # (10420, 8)

# First 5 rows
print(df.head())

# Column names, data types, and non-null counts
df.info()

# Numeric summary: mean, std, min, max, quartiles
print(df.describe())

# Count nulls per column
print(df.isnull().sum())

Run these on any new dataset before touching anything else. They answer the three questions that matter up front: how big is it, what type is each column, and where are the holes?
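Here's a runnable sketch of those checks on a tiny in-memory frame standing in for sales.csv (the columns are made up for illustration):

```python
import pandas as pd

# Tiny stand-in for a loaded sales.csv (hypothetical columns)
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "revenue": [120.0, None, 560.5, 90.0],
    "region": ["West", "East", "West", "South"],
})

print(df.shape)           # (4, 3)
print(df.isnull().sum())  # revenue: 1, the other columns: 0
```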

python · inspect2.py
# Value counts for a categorical column
print(df["region"].value_counts())

# Unique values in a column
print(df["status"].unique())

# Data types of every column
print(df.dtypes)

# Percentage of nulls per column
print(df.isnull().mean() * 100)
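value_counts can also report proportions instead of raw counts via normalize=True, which is useful when shares matter more than absolute numbers. A quick sketch on made-up regions:

```python
import pandas as pd

df = pd.DataFrame({"region": ["West", "West", "East", "South"]})

# Raw counts, then the same distribution as percentages
print(df["region"].value_counts())
print(df["region"].value_counts(normalize=True) * 100)  # West 50.0, ...
```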
4. Selection

loc, iloc, and boolean filtering

Three ways to pull data from a DataFrame. Each has a specific use case.

Column selection

python
# Single column → returns a Series
revenue = df["revenue"]

# Multiple columns → returns a DataFrame
subset = df[["order_id", "revenue", "region"]]
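The single-vs-double bracket distinction is easy to verify directly. A small sketch (hypothetical columns):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "revenue": [10, 20]})

# Single brackets give a 1-D Series; double brackets give a 1-column DataFrame
print(type(df["revenue"]).__name__)    # Series
print(type(df[["revenue"]]).__name__)  # DataFrame
```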

loc — label-based selection

python
# loc[row_label, column_label]
# Select rows labeled 0–4, all columns (loc slices include both endpoints)
df.loc[0:4]

# Select specific rows and columns by name
df.loc[0:4, ["order_id", "revenue"]]

# Select a single cell
df.loc[5, "revenue"]

iloc — position-based selection

python
# iloc[row_position, column_position] — uses integers
df.iloc[0:5]           # first 5 rows
df.iloc[:, 0:3]         # all rows, first 3 columns
df.iloc[-5:]            # last 5 rows
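The key difference between the two: loc slices by label and includes both endpoints, while iloc follows Python's usual half-open slicing. A quick demo on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# loc slices include both endpoints; iloc uses half-open slices like lists do
print(len(df.loc[0:4]))   # 5 rows
print(len(df.iloc[0:4]))  # 4 rows
```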

Boolean filtering — the one you'll use most

python
# Rows where revenue > 1000
high_value = df[df["revenue"] > 1000]

# Multiple conditions — use & (and) and | (or)
west_high = df[(df["region"] == "West") & (df["revenue"] > 500)]

# Filter by list of values
key_regions = df[df["region"].isin(["West", "East"])]

# Filter rows where a column contains a string (na=False treats nulls as non-matches)
returns = df[df["status"].str.contains("returned", case=False, na=False)]

Common mistake: When combining conditions, always wrap each condition in parentheses. df[a == 1 & b == 2] raises a cryptic error because & binds more tightly than ==. df[(a == 1) & (b == 2)] works.
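A runnable sketch of the correctly parenthesized form on a few made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["West", "East", "West"],
    "revenue": [600, 300, 400],
})

# Parentheses keep & from grabbing the bare values before == runs
west_high = df[(df["region"] == "West") & (df["revenue"] > 500)]
print(len(west_high))  # 1
```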

Practice Challenge

Build a data diagnostic script

Download any CSV from Kaggle or use your own data. Write a script that:

  • Loads the file and prints shape, column names, and dtypes
  • Prints the count and percentage of nulls for every column
  • Filters to rows where a numeric column is above its mean
  • Saves those rows to a new CSV with df.to_csv("output.csv", index=False)
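If you get stuck, here's one possible skeleton, using a tiny in-memory frame in place of your downloaded CSV (the revenue column is a placeholder for whichever numeric column your data has):

```python
import pandas as pd

# Stand-in for df = pd.read_csv("your_file.csv")
df = pd.DataFrame({"revenue": [100.0, 900.0, 50.0, 700.0]})

print(df.shape, list(df.columns))
print(df.dtypes)
print(df.isnull().sum())
print((df.isnull().mean() * 100).round(1))

# Keep rows above the column's mean, then export
above = df[df["revenue"] > df["revenue"].mean()]
above.to_csv("output.csv", index=False)
```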

What You Learned Today

  • Load CSV, Excel, and JSON files into DataFrames
  • Inspect shape, dtypes, nulls, and value distributions
  • Select columns by name with bracket notation
  • Use loc (label-based) and iloc (position-based) indexing
  • Filter rows with boolean masks and isin()
Day 1 Complete

Day 2: Cleaning Data

Tomorrow you'll handle real-world messy data: missing values, duplicates, and wrong data types.
