Key Takeaways
- Audit first, clean second: Before modifying anything, profile your data. Know how many nulls exist, what the value distributions look like, and what data types are stored where. Cleaning without auditing creates new problems while fixing old ones.
- Missing values are not always errors: A null value can mean the data was not collected, the field does not apply (no second address line), or the event has not happened yet (no cancellation date for an active subscription). The right handling depends on what null means in context.
- Outliers require judgment: An outlier might be a data entry error (age=999), a legitimate extreme value (a billion-dollar transaction), or a signal worth investigating (anomalously high server error rate). Never delete outliers without understanding them.
- Document every cleaning decision: Data cleaning decisions are business decisions. Document why you handled each issue the way you did. Future analysts (including your future self) need to understand the assumptions behind a cleaned dataset.
Data scientists spend approximately 80% of their time cleaning and preparing data. This is not a complaint — it is a description of reality that every experienced practitioner accepts. Raw data from the real world is messy: fields have missing values, categories are inconsistently spelled, dates are in six different formats, and record IDs do not match across tables.
The good news: data cleaning is a learnable skill with consistent patterns. This guide gives you the systematic approach.
Why Data Cleaning Takes 80% of Your Time
Real-world data has problems that laboratory-created toy datasets do not. It comes from systems built by different teams over different years, exported by different processes, entered by humans who make mistakes, and merged from sources with incompatible schemas.
Common data quality problems by category:
- Missing values: Nulls, empty strings, "N/A", "Unknown", 0 used as a proxy for null
- Inconsistent formats: Dates as "01/15/2026", "January 15, 2026", "2026-01-15", and "15-Jan-26" in the same column
- Duplicate records: The same customer appears twice with slightly different name spellings
- Outliers and errors: Age = 999, negative revenue, quantity = 0 for a line item that should not exist
- Inconsistent categories: "United States", "US", "USA", "U.S.A." all representing the same country
- Type mismatches: Numeric values stored as strings because one row had a "$" prefix
- Referential integrity violations: Orders referencing customer IDs that do not exist
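To experiment with these failure modes, it helps to have a small dataset that exhibits several of them at once. The following sketch builds an illustrative DataFrame (all column names and values are made up for the example) with a duplicate record, inconsistent country spellings, an impossible age, and a disguised null:

```python
import pandas as pd
import numpy as np

# Tiny synthetic dataset exhibiting several of the problems above
# (illustrative column names and values, not real data)
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],                          # duplicate record
    'country': ['US', 'usa', 'U.S.A.', 'United States'],  # inconsistent categories
    'age': [34, 999, 28, np.nan],                         # outlier + missing value
    'revenue': ['$1,200', '300', 'N/A', '450'],           # type mismatch + disguised null
})

# "N/A" strings are nulls in disguise; make them real NaNs before profiling
df['revenue'] = df['revenue'].replace('N/A', np.nan)
print(df.isnull().sum())
```

Note that the `N/A` string would not show up in `df.isnull()` until it is converted to a real NaN, which is exactly why disguised nulls are dangerous.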
Step 1: Audit Your Data Before Touching It
Profile the dataset before making any changes. This establishes a baseline and prevents "cleaning" that introduces new problems.
```python
import pandas as pd

df = pd.read_csv('raw_data.csv')

# Shape and types
print(df.shape)
print(df.dtypes)

# Null counts and percentages
null_report = pd.DataFrame({
    'null_count': df.isnull().sum(),
    'null_pct': (df.isnull().sum() / len(df) * 100).round(2)
})
print(null_report[null_report['null_count'] > 0])

# Unique values for categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))

# Numeric summary
print(df.describe())
```
The audit tells you where to focus: columns with >50% nulls may need to be dropped entirely. Columns with unexpected value distributions may have encoding issues. Columns with suspiciously round numbers (all multiples of 100) may be imputed defaults rather than real measurements.
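The ">50% nulls" heuristic is easy to turn into a reusable check. A minimal sketch (the threshold and column names are illustrative; tune the cutoff per project):

```python
import pandas as pd
import numpy as np

# Illustrative data: one column is mostly null, one is complete
df = pd.DataFrame({
    'mostly_null': [np.nan, np.nan, np.nan, 1.0],
    'complete':    [1, 2, 3, 4],
})

# Flag columns whose null share exceeds a threshold (50% here)
null_pct = df.isnull().mean()
drop_candidates = null_pct[null_pct > 0.5].index.tolist()
print(drop_candidates)
```

Flagging candidates rather than dropping them immediately keeps the audit step read-only, which matches the advice above: audit first, clean second.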
Handling Missing Values
The appropriate handling for missing values depends on the mechanism of missingness: Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR). The mechanism determines whether dropping or imputing is appropriate.
MCAR: The probability of being missing is unrelated to observed or unobserved values. Safe to drop or impute. Example: random system errors that caused some records to not be logged.
MAR: The probability of being missing depends on observed data but not the missing value itself. Imputation with observed variables is appropriate. Example: older customers are less likely to have email addresses recorded.
MNAR: The probability of being missing depends on the missing value itself. Neither dropping nor simple imputation is appropriate — it introduces bias. Example: high-income customers may not disclose income.
Practical handling strategies:
- Drop the column: If >70-80% of values are missing and the column is not critical, drop it
- Drop the row: If a small percentage of rows are missing the target variable or a critical feature
- Fill with mean/median: For numeric features with random missingness (use median for skewed distributions)
- Fill with mode: For categorical features
- Use a placeholder category: "Unknown" for categorical features where the missingness itself is informative
- KNN or regression imputation: For features where other columns predict the missing value
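The pandas-level strategies above can be sketched as follows (the column names and values are illustrative; KNN/regression imputation would additionally need scikit-learn's `KNNImputer` or a fitted model):

```python
import pandas as pd
import numpy as np

# Illustrative data with numeric and categorical gaps
df = pd.DataFrame({
    'income':  [40_000, np.nan, 55_000, 120_000, np.nan],
    'segment': ['retail', 'retail', np.nan, 'corporate', 'retail'],
})

# Median fill: robust choice for skewed numeric distributions
df['income'] = df['income'].fillna(df['income'].median())

# Mode fill for categoricals -- or a placeholder when missingness is informative
df['segment_mode'] = df['segment'].fillna(df['segment'].mode()[0])
df['segment_flag'] = df['segment'].fillna('Unknown')

print(df)
```

Keeping both the mode-filled and placeholder variants side by side, as here, makes it easy to compare downstream model behavior before committing to one.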
Detecting and Handling Outliers
An outlier is a value that differs significantly from the rest of the distribution. Before handling outliers, understand what they mean in context.
```python
# IQR method: flag values beyond 1.5 * IQR from the quartiles
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['amount'] < Q1 - 1.5 * IQR) | (df['amount'] > Q3 + 1.5 * IQR)]
print(f"Outliers: {len(outliers)} rows ({len(outliers) / len(df) * 100:.1f}%)")

# Z-score method: flag values beyond 3 standard deviations
from scipy import stats

amounts = df['amount'].dropna()  # drop nulls first so z-scores and values stay index-aligned
z_scores = stats.zscore(amounts)
outliers_z = amounts[abs(z_scores) > 3]
```
Handling options:
- Investigate: Look at the specific rows. Are they data entry errors (age=999) or valid extreme values?
- Delete: Only if they are demonstrably errors
- Cap (winsorize): Replace values above the 99th percentile with the 99th percentile value. Preserves the rows but limits their influence.
- Log transform: Log(x+1) compresses skewed distributions with extreme values
- Keep and document: Valid extreme values provide real signal — note them and keep them
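Capping and log-transforming can be sketched in a few lines of pandas/NumPy (the series values are illustrative, with one planted extreme value):

```python
import pandas as pd
import numpy as np

amounts = pd.Series([10, 12, 11, 13, 9, 10_000])  # one planted extreme value

# Cap (winsorize) at the 99th percentile: keep the row, limit its influence
cap = amounts.quantile(0.99)
capped = amounts.clip(upper=cap)

# Log transform: log1p = log(x + 1), compresses heavy right tails and handles zeros
logged = np.log1p(amounts)

print(capped.max(), logged.max())
```

After capping, the extreme row is still present but can no longer dominate a mean or a regression fit; after the log transform, the gap between the extreme value and the bulk of the distribution shrinks dramatically.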
Standardizing Formats and Categories
```python
# Standardize text: lowercase, strip whitespace
df['country'] = df['country'].str.lower().str.strip()

# Fix date formats: parse everything to datetime
# (format='mixed' requires pandas >= 2.0; it replaces the deprecated
# infer_datetime_format argument for columns with several date formats)
df['date'] = pd.to_datetime(df['date'], format='mixed', errors='coerce')

# Map inconsistent categories to standard values
country_map = {
    'us': 'united states',
    'usa': 'united states',
    'u.s.a.': 'united states',
    'united states of america': 'united states'
}
df['country'] = df['country'].replace(country_map)

# Remove currency symbols and thousands separators, then convert to numeric
df['revenue'] = df['revenue'].str.replace('[$,]', '', regex=True).astype(float)

# Validate email format
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
df['valid_email'] = df['email'].str.match(email_pattern)
```
Frequently Asked Questions
How do I handle missing values in a dataset?
First, profile the missingness: how many nulls per column, and is the missing data random or structured? Then choose a strategy: drop the column if >70% missing, drop the row if target variable is missing, fill with mean/median for numeric MCAR data, fill with mode for categoricals, or use KNN imputation for correlated features. Document your decisions.
What is the best way to detect outliers?
The IQR method (values outside 1.5*IQR below Q1 or above Q3) works well for skewed distributions. Z-scores (values beyond 3 standard deviations) work well for normally distributed data. Visualization (box plots, scatter plots) is often the most informative first step. Use isolation forests for multivariate outlier detection.
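The multivariate case can be sketched with scikit-learn's `IsolationForest` (the data here is synthetic: 200 points from a standard normal, plus one planted outlier far from the cluster):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data with one planted multivariate outlier
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
X[-1] = [8.0, 8.0]  # far outside the standard-normal cluster

# fit_predict returns -1 for outliers, 1 for inliers
labels = IsolationForest(random_state=0).fit_predict(X)
print((labels == -1).sum(), "points flagged as outliers")
```

Unlike IQR or z-scores, this flags points that are unusual in combination (e.g. a plausible age with an implausible income for that age), not just extreme on a single axis.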
Should I always remove outliers?
No. Remove outliers only when you have evidence they are data errors (impossible values, transcription errors). Legitimate extreme values contain real signal and removing them biases your analysis. Cap (winsorize) outliers when you want to reduce their influence without removing the observation entirely. Document any outlier handling decisions.
How do I handle duplicate rows?
Use df.duplicated().sum() to count duplicates and df.drop_duplicates() to remove them. For fuzzy duplicates (same customer with slightly different name spelling), use fuzzy string matching (rapidfuzz library) or deduplicate on a reliable key like email or phone number. Always investigate why duplicates exist before removing them — some are legitimate (same customer made two purchases) and some are errors.
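The exact-duplicate workflow looks like this (column names and rows are illustrative; fuzzy matching on the near-duplicate names would additionally need a library such as rapidfuzz):

```python
import pandas as pd

# Illustrative data: two rows share an email but differ slightly in name spelling
df = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@y.com'],
    'name':  ['Ann Lee', 'Ann  Lee', 'Bo Kim'],
})

# Count duplicates on a reliable key, then deduplicate keeping the first occurrence
dup_count = df.duplicated(subset='email').sum()
deduped = df.drop_duplicates(subset='email', keep='first')
print(dup_count, len(deduped))
```

Deduplicating on `email` rather than `name` sidesteps the spelling variation entirely, which is why the answer above recommends a reliable key when one exists.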