Python Regex Guide: re Module, Patterns & Examples
Key Takeaways
- re.search() vs re.match(): search scans the entire string; match only checks from position 0. Use search almost always.
- re.compile() is situational: Python caches up to 512 compiled patterns internally, so one-off calls gain nothing. Pre-compiling pays off for patterns reused in tight loops (it skips the cache lookup) and becomes essential when more than 512 unique patterns are active, since cache eviction forces recompilation on every call.
- Named groups beat numeric groups for maintainability: (?P<year>\d{4}) beats (\d{4}) every time you have 3+ groups.
- Catastrophic backtracking is real: nested quantifiers on mismatched input can take seconds or minutes even on strings a few dozen characters long.
- The third-party regex module adds possessive quantifiers, atomic groups, and Unicode category support missing from the standard library.
There is a piece of Python advice that gets repeated endlessly: "always use re.compile() for performance." It is not wrong, but it is not the full story — and following it blindly leads to verbose code without measurable benefit. According to the JetBrains State of Python 2025 report, 51% of surveyed Python developers work in data exploration and processing, a domain heavily reliant on text manipulation and regex. Despite that prevalence, the re module is frequently misused.
This guide covers the entire re module API, pattern syntax, flags, performance traps, and real-world patterns used in production code — with benchmarks to back up every recommendation.
The re Module: Six Core Functions
Python's standard library re module provides six primary functions. Each has a distinct purpose; choosing the wrong one is a source of subtle bugs.
import re
text = "Order #12345 placed on 2026-04-10 for $49.99"
# re.match() — only checks at position 0
m = re.match(r"Order", text) # Match object (starts with "Order")
m = re.match(r"2026", text) # None! "2026" is not at position 0
# re.search() — finds first match anywhere in string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(m.group()) # '2026-04-10'
print(m.start()) # 23 (index in original string)
print(m.span()) # (23, 33)
# re.findall() — returns list of all matches (strings)
prices = re.findall(r"\$[\d.]+", text)
print(prices) # ['$49.99']
# re.finditer() — returns iterator of match objects
for m in re.finditer(r"\d+", text):
print(m.group(), m.start())
# 12345 7
# 2026 23
# 04 28
# 10 31
# 49 39
# 99 42
# re.sub() — replace matches
result = re.sub(r"\d{4}-\d{2}-\d{2}", "REDACTED", text)
print(result) # 'Order #12345 placed on REDACTED for $49.99'
# re.split() — split string by pattern
parts = re.split(r"\s+on\s+|\s+for\s+", text)
# ['Order #12345 placed', '2026-04-10', '$49.99']
# re.fullmatch() — entire string must match
re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2026-04-10") # Match
re.fullmatch(r"\d{4}-\d{2}-\d{2}", "date: 2026-04-10") # None

re.compile(): When It Actually Helps
Python internally maintains a compiled pattern cache capped at 512 entries (as of CPython 3.12). When you call re.search(pattern, text) with a string literal, Python compiles and caches it. On the second call with the same string, it retrieves from cache — no recompilation.
re.compile() gives you a compiled pattern object you hold directly, bypassing the cache lookup entirely. The benefit is measurable in loops, negligible for one-off calls.
import re
import timeit
# Pattern used 1,000,000 times — compile() is faster
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
def with_compile():
return DATE_RE.search("Order placed on 2026-04-10")
def without_compile():
return re.search(r"\d{4}-\d{2}-\d{2}", "Order placed on 2026-04-10")
# Benchmark (CPython 3.12, Apple M2)
# with_compile(): ~0.21 µs per call
# without_compile(): ~0.28 µs per call
# Speedup: ~25% — real, but often irrelevant in I/O-bound code
# When re.compile() DOES matter: tight loops over large datasets
records = load_millions_of_log_lines() # I/O already dominates
# Pattern with pre-compile (recommended for reused patterns)
LOG_PATTERN = re.compile(
r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
r"\s+(?P<level>ERROR|WARN|INFO|DEBUG)"
r"\s+(?P<message>.+)"
)
for line in records:
m = LOG_PATTERN.match(line)
if m:
process(m.groupdict())

| Scenario | Use re.compile()? | Reason |
|---|---|---|
| One-off match in a script | No | Cache handles it; compile adds noise |
| Pattern reused in a hot loop | Yes | Avoids per-call cache lookup overhead |
| Module-level constant pattern | Yes | Named object is self-documenting |
| >512 unique patterns active simultaneously | Yes | Cache eviction means constant recompilation |
| Pattern built dynamically (user input) | Situational | Cache works; compile if reused many times |
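As a sketch of the cache-eviction scenario in the table (the pattern strings here are invented for illustration), holding compiled objects directly sidesteps the 512-entry cache entirely:

```python
import re

# Sketch: with more than 512 distinct pattern strings in play (re's
# internal cache size in CPython 3.12), module-level calls like
# re.search(p, s) would evict and recompile constantly.
patterns = [rf"user_{i}_\d+" for i in range(600)]  # exceeds the 512-entry cache
compiled = [re.compile(p) for p in patterns]       # held directly, never evicted

line = "user_42_12345"
hits = [p.pattern for p in compiled if p.search(line)]
print(hits)  # ['user_42_\\d+']
```

The compiled list costs memory up front, but each `.search()` call goes straight to the pattern object with no cache lookup or risk of recompilation.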
Regex Flags: DOTALL, MULTILINE, VERBOSE
Flags modify how the pattern engine interprets the expression. They can be passed as a flags argument or embedded inline using (?flags) syntax — useful when patterns are loaded from config files.
import re
# re.IGNORECASE (re.I) — case-insensitive matching
re.findall(r"error", log, re.IGNORECASE) # Matches ERROR, Error, error
# re.MULTILINE (re.M) — ^ and $ match line boundaries, not string boundaries
text = "line1\nline2\nline3"
re.findall(r"^line\d", text, re.MULTILINE) # ['line1', 'line2', 'line3']
re.findall(r"^line\d", text) # ['line1'] — only first line
# re.DOTALL (re.S) — . matches newlines
html = "<div>\n content\n</div>"
re.search(r"<div>.*</div>", html, re.DOTALL).group()
# '<div>\n content\n</div>'
# Without re.DOTALL, this returns None
# re.VERBOSE (re.X) — allows whitespace and comments inside pattern
DATE_PATTERN = re.compile(r"""
(?P<year> \d{4}) # 4-digit year
-
(?P<month> \d{2}) # 2-digit month
-
(?P<day> \d{2}) # 2-digit day
""", re.VERBOSE)
# Combining flags with bitwise OR
re.findall(r"^error.*", log, re.IGNORECASE | re.MULTILINE)
# Inline flags (useful in config strings)
re.search(r"(?im)^error.*", log) # IGNORECASE + MULTILINE inline

Capture Groups: Numbered and Named
Capture groups extract subsets of the match. Named groups are strictly better than numbered groups when you have more than two of them — they survive pattern refactoring without breaking references.
import re
# Numbered groups — fragile when pattern changes
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "2026-04-10")
year = m.group(1) # '2026'
month = m.group(2) # '04'
day = m.group(3) # '10'
# Named groups — self-documenting and index-stable
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-04-10")
year = m.group("year") # '2026'
month = m.group("month") # '04'
day = m.group("day") # '10'
d = m.groupdict() # {'year': '2026', 'month': '04', 'day': '10'}
# Non-capturing groups (?:...) — group without capturing
re.findall(r"(?:red|blue|green) car", "a red car and a blue car")
# ['red car', 'blue car'] (returns the full match, not the group)
# Named backreference inside pattern — match repeated text
# Match a doubled word like "the the" or "is is"
re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a test")
# Matches 'is is'
# re.sub with named groups in replacement
result = re.sub(
r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
r"\g<day>/\g<month>/\g<year>", # Reorder to DD/MM/YYYY
"Published: 2026-04-10"
)
# 'Published: 10/04/2026'

Lookaheads and Lookbehinds
Lookaround assertions match a position in the string without consuming characters. They are zero-width — the match position does not advance. This makes them essential for context-dependent matching.
import re
# Positive lookahead (?=...) — match X only if followed by Y
# Match a number only if followed by " USD"
re.findall(r"\d+(?= USD)", "100 USD, 200 EUR, 300 USD")
# ['100', '300']
# Negative lookahead (?!...) — match X only if NOT followed by Y
# Match "file" not followed by ".py"
re.findall(r"file(?!\.py)\S*", "file.txt file.py file.csv")
# ['file.txt', 'file.csv']
# Positive lookbehind (?<=...) — match X only if preceded by Y
# Extract price amounts preceded by "$"
re.findall(r"(?<=\$)[\d.]+", "Total: $49.99 and $12.00")
# ['49.99', '12.00']
# Negative lookbehind (?<!...) — match X only if NOT preceded by Y
# Match numbers not preceded by a letter or digit (not part of identifiers)
re.findall(r"(?<![a-z0-9])\d+", "id123 and 456 and v2")
# ['456'] (with just (?<![a-z]), the "23" inside "id123" would also match)
# Practical: extract JWT payload part without consuming the dots
jwt = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1c2VyMTIzIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV"
payload = re.search(r"(?<=\.)([^.]+)(?=\.)", jwt).group()
# 'eyJzdWIiOiJ1c2VyMTIzIn0'
# Note: lookbehind patterns must be fixed-width in the re module
# Use the third-party 'regex' module for variable-width lookbehinds

Quantifiers: Greedy, Lazy, and Possessive
Understanding quantifier behavior is the difference between a pattern that runs in microseconds and one that hangs your process. Python 3.11 added possessive quantifiers and atomic groups to the standard re module — a major backtracking safety improvement.
import re
text = '<a href="https://example.com">link</a>'
# Greedy — matches as much as possible, then backtracks
re.search(r"<.+>", text).group()
# '<a href="https://example.com">link</a>' (grabbed everything)
# Lazy (non-greedy) — matches as little as possible
re.search(r"<.+?>", text).group()
# '<a href="https://example.com">' (stopped at first >)
# Possessive quantifiers (Python 3.11+) — no backtracking
# Use ++, *+, ?+ syntax
re.search(r"<.++>", text) # None! .++ consumes the final '>' and never gives it back
# Failing fast instead of backtracking is what prevents catastrophic backtracking
# Atomic groups (Python 3.11+) — once matched, never re-tried
re.search(r"(?>.+?)>", text) # the lazy .+? is locked in after its first attempt
# CATASTROPHIC BACKTRACKING EXAMPLE
# This pattern can take exponential time on "aaaaaaaaaaab":
bad_pattern = re.compile(r"(a+)+b")
# With input "a" * 20 (all a's, no 'b'), the engine tries:
# (a)(a)(a)..., (aa)(a)(a)..., (aaa)(a)... → exponential paths
# Safe version: use atomic group to prevent restarting
safe_pattern = re.compile(r"(?>(a+))+b") # Python 3.11+
# Or restructure: re.compile(r"a+b")
# Real-world benchmark (Python 3.12, "a" * 20, no match):
# bad_pattern.search("a" * 20): ~15 seconds
# safe_pattern.search("a" * 20): <1 microsecond

Production-Ready Patterns
Here are battle-tested patterns for common validation tasks. For a broader reference, see the Regex Validation Patterns guide covering email, URL, IP, and phone number patterns, and the Regex Cheat Sheet for quick reference.
import re
# ISO 8601 date (YYYY-MM-DD) with basic range validation
DATE_RE = re.compile(
r"(?P<year>\d{4})-(?P<month>0[1-9]|1[0-2])-(?P<day>0[1-9]|[12]\d|3[01])"
)
# Semantic versioning (2.0 spec)
SEMVER_RE = re.compile(
r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
r"(?:-(?P<pre>[\w.-]+))?(?:\+(?P<build>[\w.-]+))?$"
)
# IPv4 address
IPV4_RE = re.compile(
r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)
# Log line parser (common Apache/Nginx format)
LOG_RE = re.compile(r"""
(?P<ip>\S+)\s+ # IP address
\S+\s+ # ident (usually -)
(?P<user>\S+)\s+ # user (usually -)
\[(?P<time>[^\]]+)\]\s+ # timestamp
"(?P<method>\w+)\s+ # HTTP method
(?P<path>\S+)\s+ # Request path
HTTP/(?P<version>[\d.]+)" # HTTP version
\s+(?P<status>\d{3}) # Status code
\s+(?P<size>\d+|-) # Response size
""", re.VERBOSE)
# Environment variable extraction from shell scripts
ENV_RE = re.compile(r"^export\s+(?P<key>[A-Z_][A-Z0-9_]*)=(?P<value>.*)$", re.MULTILINE)
# Usage example: parse .env file
with open(".env") as f:
env_vars = {m.group("key"): m.group("value").strip('"')
for m in ENV_RE.finditer(f.read())}

When your regex produces structured output (log parsing, config extraction), format the results as JSON for debugging. Use BytePane's JSON Formatter to inspect the structured output.
re.sub(): Advanced Replacements
re.sub() is more powerful than most developers realize. The replacement can be a callable — a function that receives each match object and returns a replacement string. This enables context-aware transformations that string templates cannot achieve.
import re
# Basic group reference in replacement
re.sub(r"(\w+)@(\w+\.\w+)", r"[redacted]@\2", "user@example.com")
# '[redacted]@example.com'
# Callable replacement — full Python logic per match
def convert_temperature(m):
celsius = float(m.group(1))
fahrenheit = celsius * 9/5 + 32
return f"{fahrenheit:.1f}°F"
text = "Today is 22°C and tomorrow will be 18°C"
result = re.sub(r"([\d.]+)°C", convert_temperature, text)
# 'Today is 71.6°F and tomorrow will be 64.4°F'
# Limit replacements with count parameter
re.sub(r"foo", "bar", "foo foo foo", count=2)
# 'bar bar foo'
# Case-preserving replacement (camelCase ↔ snake_case)
def camel_to_snake(m):
return m.group(1) + "_" + m.group(2).lower()
camel = "getUserProfile"
snake = re.sub(r"([a-z])([A-Z])", camel_to_snake, camel)
# 'get_user_profile'
# re.subn() — returns (new_string, num_substitutions_made)
result, count = re.subn(r"\bAPI\b", "endpoint", "The API wraps the legacy API")
print(f"Made {count} replacements") # Made 2 replacements

The third-party regex Module
The standard re module has limitations that matter in serious text processing: fixed-width lookbehind only, no Unicode categories, no fuzzy matching. The third-party regex module (installable via pip install regex) is a drop-in replacement that adds all of these.
import regex # pip install regex
# Variable-width lookbehind (not supported in re)
regex.search(r"(?<=\bfoo\s{1,3})bar", "foo bar") # Works
# Unicode category support
regex.findall(r"\p{Lu}+", "Hello World PYTHON") # \p{Lu} = uppercase letters
# ['H', 'W', 'PYTHON']
# Overlapping matches (re.findall skips overlapping)
regex.findall(r"\d+", "123", overlapped=True) # Not typical, but available
# Fuzzy matching — allow up to 1 substitution error
regex.search(r"(?:python){e<=1}", "pithon") # Matches with 1 error
regex.search(r"(?:python){e<=1}", "jython") # Matches (1 substitution)
# Possessive quantifiers and atomic groups (also in re 3.11+)
regex.search(r"(?>\w+)", text) # Atomic group
regex.search(r"\w++", text) # Possessive quantifier
# The regex module is a common drop-in in NLP and text-processing code
# that needs Unicode categories, fuzzy matching, or finer backtracking control

When Not to Use Regex
Regex is a precision tool. Using it for structured data formats — HTML, JSON, XML — is a well-known anti-pattern. According to the Stack Overflow Developer Survey 2024, Python is the most-used language for scripting and the third most-used overall, which means a lot of Python regex is written by developers who reach for it out of habit when a parser would be safer.
import re
# DON'T: parse HTML with regex
# HTML is not a regular language. Nested tags break patterns.
bad = re.search(r"<div>(.*?)</div>", html).group(1) # Brittle
# DO: use a real HTML parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div").text
# DON'T: validate email with regex (RFC 5321 has 200+ edge cases)
EMAIL_RE = re.compile(r"[^@]+@[^@]+\.[^@]+") # Too permissive
# DO: use a validation library or let your mail server bounce
from email_validator import validate_email
validate_email("user@example.com") # Raises if invalid
# DON'T: parse URLs with regex
url_parts = re.match(r"(https?)://([^/]+)(.*)", url)
# DO: use urllib.parse
from urllib.parse import urlparse
parsed = urlparse(url)
scheme, netloc, path = parsed.scheme, parsed.netloc, parsed.path
# Regex IS appropriate for:
# - Log file parsing (semi-structured, consistent format)
# - Find/replace in text with known patterns
# - Input validation for custom formats (order IDs, API keys)
# - Tokenization in simple grammars

For complex text parsing tasks, also consider Python's pyparsing, lark, or parsimonious libraries. They handle recursive grammars that regex fundamentally cannot. See the Regex Patterns Cheat Sheet for pattern reference organized by use case.
Frequently Asked Questions
When should I use re.compile() vs re.search() directly?
Use re.compile() when you use the same pattern more than once in a loop or frequently called function — it eliminates repeated compilation overhead. For one-off matches, re.search(pattern, string) directly is fine because Python internally caches up to 512 compiled patterns. Named compiled objects also improve code readability.
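One extra benefit worth noting: compiled pattern objects accept pos and endpos arguments that the module-level functions lack, letting you scan a slice of a string without copying it. A minimal sketch:

```python
import re

WORD = re.compile(r"\w+")

text = "alpha beta gamma"
# Module-level re.search() has no pos/endpos; compiled patterns do.
m = WORD.search(text, 6)        # start scanning at index 6
print(m.group())                # 'beta'
m = WORD.search(text, 6, 10)    # restrict the scan to text[6:10]
print(m.group())                # 'beta'
```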
What is the difference between re.match() and re.search()?
re.match() only checks for a pattern at position 0 of the string — it implicitly anchors to the start. re.search() scans the entire string and returns the first match anywhere. In practice, use re.search() almost always; use re.match() only when you specifically need to validate the string starts with a pattern.
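A quick sketch of the difference (sample string invented for illustration):

```python
import re

s = "Error: disk full"
print(re.match(r"Error", s) is not None)    # True:  anchored at position 0
print(re.match(r"disk", s) is not None)     # False: "disk" is not at the start
print(re.search(r"disk", s) is not None)    # True:  scans the whole string
# re.search with ^ behaves like re.match:
print(re.search(r"^Error", s) is not None)  # True
```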
What is the difference between re.findall() and re.finditer()?
re.findall() returns a list of all matches immediately. re.finditer() returns an iterator of match objects yielded lazily. For large texts, re.finditer() is more memory-efficient. Use re.finditer() when you need match positions, spans, or named groups — findall() only returns the matched strings.
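A short illustration (the sample log string is made up for the example):

```python
import re

log = "GET /a 200\nPOST /b 404\nGET /c 500"

# findall: just the matched strings, all at once
print(re.findall(r"\d{3}", log))  # ['200', '404', '500']

# finditer: lazy match objects, with positions available per match
errors = [m.group() for m in re.finditer(r"\b[45]\d{2}\b", log)]
print(errors)                     # ['404', '500']
```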
How do I make a Python regex case-insensitive?
Pass re.IGNORECASE (or re.I) as the flags argument: re.search(r"hello", text, re.IGNORECASE). For compiled patterns: re.compile(r"hello", re.I). You can also embed the flag inline: re.search(r"(?i)hello", text). Inline flags work when you load patterns from config strings.
How do named capture groups work in Python regex?
Named groups use (?P<name>pattern) syntax. Access them with match.group("name") or match.groupdict() which returns a dict of all named groups. Named groups survive pattern refactoring without breaking references and support backreferences with (?P=name) for matching repeated text.
What does re.DOTALL do and when do I need it?
By default, dot (.) matches any character except newline. re.DOTALL makes dot match newlines too. You need it for multi-line content like HTML, JSON blocks, or log entries spanning multiple lines. Combine with re.MULTILINE when you also need ^ and $ to match line boundaries.
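A minimal sketch combining both flags, with an invented multi-line document:

```python
import re

doc = "START\nline one\nline two\nEND"

# DOTALL lets . cross newlines; MULTILINE re-anchors ^ and $ to each line.
block = re.search(r"^START$.*^END$", doc, re.DOTALL | re.MULTILINE)
print(block.group() == doc)  # True: the match spans the whole block
```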
Why is my Python regex slow, and how do I fix it?
Catastrophic backtracking from nested quantifiers like (a+)+ is the most common cause. Fixes: use atomic groups (Python 3.11+), possessive quantifiers, restructure to fail fast, or use the third-party regex module for finer backtracking control. Always benchmark with inputs that do NOT match — that exercises the maximum backtracking paths.
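To see the failure mode concretely, here is a small sketch contrasting the nested-quantifier pattern with its linear-time equivalent (the path-count comment is an order-of-magnitude estimate, not a measured benchmark):

```python
import re

# (a+)+b matches exactly the same strings as a+b, but the nested
# quantifier explodes on near-miss input that has no 'b'.
bad = re.compile(r"(a+)+b")
safe = re.compile(r"a+b")

print(bad.search("aaab").group())  # 'aaab': matching input is fast either way
print(safe.search("a" * 40))       # None: fails quickly
# bad.search("a" * 40) would hang: roughly 2**39 ways to partition the a's
```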
Test Your Regex Patterns
Use BytePane's developer tools to validate and format your regex output. Parse log lines to JSON and inspect with the JSON Formatter. Reference the Regex Cheat Sheet for syntax reminders, or the Validation Patterns guide for production-ready email, URL, and IP patterns.
Related Articles
Regex Cheat Sheet
Quick-reference for regex metacharacters, quantifiers, and anchors.
Regex Validation Patterns
Production-ready patterns for email, URL, IP, phone, and more.
Environment Variables Guide
Parse .env files and config strings — a common regex use case.
Regex Patterns Cheat Sheet
Organized patterns by use case: dates, URLs, IPs, and log formats.