A script that combines any number of monthly CSV files into one clean dataset, removes duplicates, standardizes inconsistent formatting, and exports a ready-to-analyze file — automatically.
Install openpyxl to Work with Excel Files
Python can read .xlsx files with the help of a single library. Install it:
$ pip install openpyxl
You'll use this to read actual Excel workbooks. Tomorrow we add pandas, which makes this even simpler — but today you learn the foundation.
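To see openpyxl in action before the CSV work below, here is a minimal, self-contained sketch: it builds a tiny workbook and reads it back row by row. The filename `sales.xlsx` and the sample data are placeholders, not files from this lesson.

```python
from openpyxl import Workbook, load_workbook

# Create a tiny workbook so the example is self-contained
# ("sales.xlsx" is just a placeholder filename)
wb = Workbook()
ws = wb.active
ws.append(["month", "revenue"])
ws.append(["Jan", 1200])
ws.append(["Feb", 1450])
wb.save("sales.xlsx")

# Read it back: values_only=True yields each row as a plain tuple
rows = []
for row in load_workbook("sales.xlsx").active.iter_rows(values_only=True):
    rows.append(row)

print(rows[0])   # header row
print(rows[1:])  # data rows
```

Note that openpyxl hands you Python values directly: the integers come back as `int`, not as strings you have to convert.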
Combine 50 CSV Files Into One
In Excel: open each file, copy data, paste into master sheet, repeat 50 times. In Python: one loop.
First, create a folder called monthly_data/ and put a few CSV files in it. Then:
```python
import csv
from pathlib import Path

# Find all CSV files in the folder (sorted for a stable order)
data_folder = Path("monthly_data")
csv_files = sorted(data_folder.glob("*.csv"))
print(f"Found {len(csv_files)} files")

all_rows = []
headers = None

for filepath in csv_files:
    with open(filepath, newline="") as f:
        reader = csv.DictReader(f)
        if headers is None:
            headers = reader.fieldnames
        for row in reader:
            row["source_file"] = filepath.name  # track origin
            all_rows.append(row)

print(f"Combined {len(all_rows)} total rows")

# Write combined file
with open("combined.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=headers + ["source_file"])
    writer.writeheader()
    writer.writerows(all_rows)

print("Saved to combined.csv")
```
This works on 5 files or 500. The loop doesn't care.
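One caveat: the loop above assumes every monthly file has the same columns. If one file adds or renames a column, `csv.DictWriter` raises a `ValueError` for the unexpected key. A hedged sketch that handles drifting headers by collecting the union of all column names first (the `combine_csvs` name and the `restval=""` choice are my own, not part of the lesson's script):

```python
import csv
from pathlib import Path

def combine_csvs(folder, out_path):
    """Combine CSVs even when columns differ between files."""
    files = sorted(Path(folder).glob("*.csv"))
    headers, all_rows = [], []
    for filepath in files:
        with open(filepath, newline="") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in headers:
                    headers.append(name)  # preserve first-seen column order
            for row in reader:
                row["source_file"] = filepath.name
                all_rows.append(row)
    with open(out_path, "w", newline="") as f:
        # restval="" fills columns a given file didn't have
        writer = csv.DictWriter(f, fieldnames=headers + ["source_file"], restval="")
        writer.writeheader()
        writer.writerows(all_rows)
    return len(all_rows)
```

The two-pass idea (headers first, then rows) costs nothing here because we already hold every row in memory before writing.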
Automated Data Cleaning
Real data is messy: names with inconsistent capitalization, phone numbers in a dozen different formats, blank rows. In Excel you fix all of this by hand. Python fixes it automatically, every time.
```python
import csv
import re

def clean_phone(phone):
    # Strip everything except digits
    digits = re.sub(r"\D", "", str(phone))
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone  # return the original if it can't be formatted

def clean_row(row):
    return {
        "name": row["name"].strip().title(),    # "JOHN DOE" → "John Doe"
        "email": row["email"].strip().lower(),  # standardize email case
        "phone": clean_phone(row["phone"]),     # normalize phone format
        "revenue": row["revenue"].replace("$", "").replace(",", "").strip(),
    }

with open("messy_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Remove empty rows, apply cleaning
cleaned = [clean_row(r) for r in rows if r["name"].strip()]
print(f"Cleaned {len(cleaned)} rows (removed {len(rows) - len(cleaned)} empty)")
```
This runs in milliseconds on 100,000 rows. The same cleaning in Excel — opening Find & Replace, running macros, manually checking — takes hours and introduces human error every time.
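The intro also promised duplicate removal, and the same building blocks handle it: track keys you have already seen in a set, keep only the first occurrence. A minimal sketch; the `dedupe` helper and the choice of `email` as the identifying column are illustrative, so pick whatever column uniquely identifies a record in your data.

```python
def dedupe(rows, key="email"):
    # Keep the first occurrence of each key, drop the rest.
    # Lowercasing catches "A@X.com" vs "a@x.com" style duplicates.
    seen = set()
    unique = []
    for row in rows:
        k = row[key].strip().lower()
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

rows = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "A@X.com", "name": "Ann"},  # duplicate, different case
    {"email": "b@x.com", "name": "Bob"},
]
print(len(dedupe(rows)))  # 2
```

Set membership checks are O(1), so this stays fast even on hundreds of thousands of rows.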
Process Files Too Large for Excel
Excel stops at 1,048,576 rows. Python has no such ceiling, because it can stream a file: reading one row at a time without ever loading the whole thing into memory.
```python
import csv
from collections import defaultdict

# Process a 10M-row file without loading it all into memory
totals_by_region = defaultdict(float)
row_count = 0

with open("huge_sales_file.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals_by_region[row["region"]] += float(row["revenue"])
        row_count += 1

print(f"Processed {row_count:,} rows")
for region, total in sorted(totals_by_region.items()):
    print(f"{region}: ${total:,.0f}")
```
This reads the file one row at a time. Your RAM never holds the whole thing. Tomorrow, pandas makes this even cleaner — and handles all the edge cases automatically.
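The same streaming pattern also works for transforming a huge file, not just summarizing it: read one row, decide whether to write it, move on, so memory use stays flat no matter the file size. A hedged sketch; the `filter_rows` function, the filenames, and the `region` filter are illustrative, not part of the lesson's script.

```python
import csv

def filter_rows(src, dst, region):
    """Copy only rows matching `region` from src to dst, streaming."""
    kept = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["region"] == region:
                writer.writerow(row)  # written immediately, never buffered
                kept += 1
    return kept
```

Because reader and writer are open at the same time, a 10 GB input produces its filtered output with only one row in memory at any moment.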
Day 2 Complete
- Combined multiple CSV files into one master dataset automatically
- Built a data cleaning pipeline that runs in milliseconds
- Processed files too large for Excel to open
- Understood why Python is not just "Excel but harder" — it solves problems Excel can't
Tomorrow: Pandas — Excel But 100x More Powerful
Day 3 introduces the library that data scientists live in. Every Excel operation you know — faster, more powerful, and scriptable.