Key Takeaways
- Regex is a pattern language for matching, extracting, and transforming text
- Most regex tasks need only about 10 metacharacters: . * + ? [] {} ^ $ | \
- Use raw strings in Python (r'pattern') to avoid backslash conflicts
- Named capture groups make complex patterns readable and maintainable
- Test regex interactively at regex101.com before adding it to code
Regular expressions look terrifying at first glance. A pattern like ^[\w.+-]+@[\w-]+\.[\w.-]+$ seems like someone fell asleep on a keyboard. But regex has simple rules, and once you know about 10 metacharacters, you can write patterns that would otherwise require 50 lines of string manipulation code. This guide takes you from zero to productive in one read.
Regex Basics: The Core Metacharacters
Literal characters match themselves. cat matches the string "cat" anywhere in the input. Metacharacters have special meaning: . matches any single character (except newline). * matches 0 or more of the preceding element. + matches 1 or more. ? matches 0 or 1 (makes the preceding element optional). {n} matches exactly n times. {n,m} matches between n and m times. ^ anchors to start of string. $ anchors to end. | means OR. \ escapes metacharacters (so \. matches a literal dot, not any character).
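The rules above can be checked directly in Python. The following is a minimal sketch; the sample strings and the matches helper are made up for illustration and are not part of the re module.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Each tuple: (pattern, text that should match, text that should not match).
examples = [
    (r"c.t", "cut", "ct"),             # . needs exactly one character
    (r"ab*c", "ac", "adc"),            # * allows zero or more b's
    (r"ab+c", "abbc", "ac"),           # + needs at least one b
    (r"colou?r", "color", "colouur"),  # ? makes the u optional
    (r"^a{2,3}$", "aaa", "aaaa"),      # {n,m} bounds the repeat count
    (r"^cat$", "cat", "concat"),       # ^ and $ anchor both ends
    (r"cat|dog", "dog", "cow"),        # | means either alternative
    (r"3\.14", "3.14", "3514"),        # \. is a literal dot
]

for pattern, hit, miss in examples:
    print(pattern, matches(pattern, hit), matches(pattern, miss))
```

Every row should print True for the first string and False for the second.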
Character Classes and Shorthand
Square brackets define a set of characters to match. [aeiou] matches any lowercase vowel. [a-z] matches any lowercase letter. [0-9] matches any digit. [^abc] matches any character NOT in the set. Shorthand classes: \d matches any digit (equivalent to [0-9] for ASCII text; on Python 3 strings it also matches other Unicode digits unless you pass re.ASCII). \w matches word characters (letters, digits, underscore). \s matches whitespace (space, tab, newline). Uppercase versions invert: \D matches non-digits, \W matches non-word characters, \S matches non-whitespace. These shorthand classes are the workhorses of most regex patterns.
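A short sketch of these classes in action, using made-up sample strings:

```python
import re

text = "Room 42, floor 3"

vowels = re.findall(r"[aeiou]", "regex")   # ['e', 'e']
numbers = re.findall(r"\d+", text)         # ['42', '3']
words = re.findall(r"\w+", text)           # ['Room', '42', 'floor', '3']
non_digits = re.findall(r"\D+", "a1b22c")  # ['a', 'b', 'c']

# A negated set: delete every character that is not a, b, or c.
only_abc = re.sub(r"[^abc]", "", "a1b2c3d")  # 'abc'

# \s+ treats any run of spaces, tabs, or newlines as one separator.
fields = re.split(r"\s+", "a  b\tc\nd")      # ['a', 'b', 'c', 'd']

print(vowels, numbers, words, non_digits, only_abc, fields)
```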
Groups and Capture: Extract What You Need
Parentheses create groups. (\d{4})-(\d{2})-(\d{2}) matches a date like 2026-04-10 and captures year, month, and day separately. In Python: match.group(1) returns the first captured group. Named groups are cleaner for complex patterns:
import re
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, '2026-04-10')
if match:
    print(match.group('year'))   # 2026
    print(match.group('month'))  # 04
Non-capturing groups (?:pattern) group without capturing, which is useful for applying quantifiers without storing the match.
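To see what a non-capturing group changes in practice, here is a small sketch with made-up input strings. The difference matters most with re.findall, which returns the captured groups instead of the whole match whenever the pattern contains any.

```python
import re

# Capturing: (ab)+ stores the last repetition as group 1.
capturing = re.search(r"(ab)+(\d)", "xxababab7")
print(capturing.groups())      # ('ab', '7')

# Non-capturing: (?:ab)+ still repeats "ab" as a unit, but stores nothing.
non_capturing = re.search(r"(?:ab)+(\d)", "xxababab7")
print(non_capturing.groups())  # ('7',)
print(non_capturing.group(0))  # ababab7

# With findall, a capturing group replaces the full match in the result.
line = "host 10.0.0.1 up"
with_group = re.findall(r"\d{1,3}(\.\d{1,3}){3}", line)       # ['.1']
without_group = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", line)  # ['10.0.0.1']
print(with_group, without_group)
```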
Regex in Python: re Module Essentials
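The Python re module has five functions you'll use constantly:
- re.search(pattern, string): find the first match anywhere in the string.
- re.match(pattern, string): match only at the start of the string.
- re.findall(pattern, string): return a list of all non-overlapping matches (if the pattern contains capture groups, the list holds the groups rather than the full matches).
- re.sub(pattern, replacement, string): replace matches.
- re.compile(pattern): compile a pattern once for reuse. The re module caches recently used patterns internally, so the speedup is usually modest, but a compiled object is cleaner when the same pattern is used many times.
Always use raw strings: r'\d+' not '\\d+'. The re.IGNORECASE flag makes matching case-insensitive. re.MULTILINE makes ^ and $ match at the start and end of each line, not just the start and end of the string.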
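The functions and flags above can be tried side by side. This is a minimal sketch; the log text is invented for the example.

```python
import re

log = "INFO start\nerror: disk full\nERROR: retry 3"

# search scans the whole string; match only looks at position 0.
first_number = re.search(r"\d+", log).group()  # '3'
at_start = re.match(r"\d+", log)               # None: the log starts with "INFO"

# Without MULTILINE, ^ only matches at the very start of the string.
no_flags = re.findall(r"^error", log)          # []
with_flags = re.findall(r"^error", log, flags=re.IGNORECASE | re.MULTILINE)
# with_flags == ['error', 'ERROR']

# sub replaces every match.
masked = re.sub(r"\d+", "N", log)

# A compiled pattern exposes the same operations as methods.
upper_words = re.compile(r"[A-Z]+")
levels = upper_words.findall(log)              # ['INFO', 'ERROR']

print(first_number, at_start, no_flags, with_flags, levels)
print(masked)
```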
Common Regex Patterns You Can Use Right Now
- Email (simplified): [\w.+-]+@[\w-]+\.[\w.]+
- US phone number: \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
- URL: https?://[\w./-]+
- IP address: \d{1,3}(\.\d{1,3}){3}
- Hashtag: #\w+
- HTML tag: <[^>]+>
- Credit card (16 digits): \d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}
Don't use these blindly in production: edge cases exist (the IP pattern, for example, also accepts 999.999.999.999). But they're solid starting points for data cleaning and text extraction tasks.
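As a rough illustration, the sketch below runs several of these patterns against one invented sample string. The sample text and the found dictionary are assumptions made for the example; real data will contain cases these patterns miss.

```python
import re

# Made-up sample text for illustration only.
sample = ("Mail jane.doe@example.com or call (555) 123-4567; "
          "see https://example.com/docs from 10.0.0.1 #regex")

patterns = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
    "url": r"https?://[\w./-]+",
    "ip": r"\d{1,3}(\.\d{1,3}){3}",
    "hashtag": r"#\w+",
}

# re.search(...).group() returns the full match even when the pattern
# contains a capture group, as the IP pattern does.
found = {}
for name, pattern in patterns.items():
    m = re.search(pattern, sample)
    found[name] = m.group() if m else None

for name, value in found.items():
    print(f"{name}: {value}")
```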
Lookahead and Lookbehind: Advanced Matching
Lookaheads and lookbehinds let you match based on context without including that context in the match. Positive lookahead (?=pattern): match if followed by pattern. Negative lookahead (?!pattern): match if NOT followed by pattern. Positive lookbehind (?<=pattern): match if preceded by pattern. Negative lookbehind (?<!pattern): match if NOT preceded by pattern. Example: \d+(?= dollars) matches a number only when it is followed by ' dollars', without including ' dollars' in the match, which is useful for extracting prices. (?<=\$)\d+ matches digits preceded by a dollar sign. In Python's re module, a lookbehind pattern must have a fixed width. These are powerful for data extraction from semi-structured text.
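The lookaround examples can be sketched as follows, using a made-up sentence. The negative lookahead line adds \b (a word boundary, not covered above) so the engine cannot backtrack to a partial number such as the "1" in "15".

```python
import re

text = "Paid $40 for shipping and 15 dollars for tax"

# Positive lookahead: the number must be followed by " dollars".
before_dollars = re.findall(r"\d+(?= dollars)", text)          # ['15']

# Positive lookbehind: the number must be preceded by "$".
after_sign = re.findall(r"(?<=\$)\d+", text)                   # ['40']

# Negative lookahead: whole numbers NOT followed by " dollars".
not_before_dollars = re.findall(r"\b\d+\b(?! dollars)", text)  # ['40']

# Negative lookbehind: numbers NOT preceded by "$" (or by another digit,
# which stops a match from starting in the middle of "$40").
not_after_sign = re.findall(r"(?<![$\d])\d+", text)            # ['15']

print(before_dollars, after_sign, not_before_dollars, not_after_sign)
```

In each case the surrounding context decides whether the number matches, but only the digits are returned.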
Frequently Asked Questions
- Is regex the same in Python and JavaScript?
- Mostly yes: the core syntax is the same. Key differences: Python uses re module functions while JavaScript uses string methods like .match() and .replace(). Python raw strings (r'') handle backslashes cleanly. JavaScript regex is written as /pattern/flags literals. Named groups work in both but with slightly different syntax: (?P<name>...) in Python, (?<name>...) in JavaScript.
- When should I use regex vs string methods?
- Use string methods (split, replace, startswith, etc.) for simple, fixed patterns. Use regex when patterns are variable, complex, or need to match multiple possible formats. If your string logic needs more than 3-4 chained method calls, regex is probably cleaner.
- How do I test my regex?
- Use regex101.com with the Python flavor selected: it shows matches in real time, explains what each part of your pattern does, and lets you test against multiple inputs. It can also generate Python code for your pattern.
- What does greedy vs lazy matching mean?
- By default, quantifiers are greedy — they match as much as possible. Add ? after a quantifier to make it lazy (match as little as possible). For example, .* matches the longest possible string, while .*? matches the shortest. This matters when extracting content between HTML tags.
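The greedy versus lazy difference is easiest to see on a small HTML snippet. A minimal sketch with a made-up string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string, so there is one long match.
greedy = re.findall(r"<.*>", html)
# ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first ">" it can, so each tag matches separately.
lazy = re.findall(r"<.*?>", html)
# ['<b>', '</b>', '<i>', '</i>']

print(greedy)
print(lazy)
```

The negated class <[^>]+> from the common patterns list gives the same per-tag result here without relying on lazy matching.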