Python Regex Guide: re Module, Patterns & Examples
Key Takeaways
- re.search() vs re.match(): search scans the entire string; match only checks from position 0. Use search almost always.
- re.compile() is situational: Python caches up to 512 compiled patterns internally, so one-off calls gain nothing. Pre-compiling pays off for patterns reused in tight loops (it skips the cache lookup) and becomes essential when more than 512 unique patterns are active, since cache eviction forces recompilation on every call.
- Named groups beat numeric groups for maintainability: (?P<year>\d{4}) beats (\d{4}) every time you have 3+ groups.
- Catastrophic backtracking is real: nested quantifiers on mismatched input can take seconds or minutes even on strings a few dozen characters long.
- The third-party regex module adds possessive quantifiers, atomic groups, and Unicode category support missing from the standard library.
There is a piece of Python advice that gets repeated endlessly: "always use re.compile() for performance." It is not wrong, but it is not the full story — and following it blindly leads to verbose code without measurable benefit. According to the JetBrains State of Python 2025 report, 51% of surveyed Python developers work in data exploration and processing, a domain heavily reliant on text manipulation and regex. Despite that prevalence, the re module is frequently misused.
This guide covers the entire re module API, pattern syntax, flags, performance traps, and real-world patterns used in production code — with benchmarks to back up every recommendation.
The re Module: Six Core Functions
Python's standard library re module provides six primary functions. Each has a distinct purpose; choosing the wrong one is a source of subtle bugs.
import re
text = "Order #12345 placed on 2026-04-10 for $49.99"
# re.match() — only checks at position 0
m = re.match(r"Order", text) # Match object (starts with "Order")
m = re.match(r"2026", text) # None! "2026" is not at position 0
# re.search() — finds first match anywhere in string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(m.group()) # '2026-04-10'
print(m.start()) # 23 (index in original string)
print(m.span()) # (23, 33)
# re.findall() — returns list of all matches (strings)
prices = re.findall(r"\$[\d.]+", text)
print(prices) # ['$49.99']
# re.finditer() — returns iterator of match objects
for m in re.finditer(r"\d+", text):
print(m.group(), m.start())
# 12345 7
# 2026 23
# 04 28
# 10 31
# 49 39
# 99 42
# re.sub() — replace matches
result = re.sub(r"\d{4}-\d{2}-\d{2}", "REDACTED", text)
print(result) # 'Order #12345 placed on REDACTED for $49.99'
# re.split() — split string by pattern
parts = re.split(r"\s+on\s+|\s+for\s+", text)
# ['Order #12345 placed', '2026-04-10', '$49.99']
# re.fullmatch() — entire string must match
re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2026-04-10") # Match
re.fullmatch(r"\d{4}-\d{2}-\d{2}", "date: 2026-04-10") # None

re.compile(): When It Actually Helps
Python internally maintains a compiled pattern cache capped at 512 entries (as of CPython 3.12). When you call re.search(pattern, text) with a string literal, Python compiles and caches it. On the second call with the same string, it retrieves from cache — no recompilation.
re.compile() gives you a compiled pattern object you hold directly, bypassing the cache lookup entirely. The benefit is measurable in loops, negligible for one-off calls.
import re
import timeit
# Pattern used 1,000,000 times — compile() is faster
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
def with_compile():
return DATE_RE.search("Order placed on 2026-04-10")
def without_compile():
return re.search(r"\d{4}-\d{2}-\d{2}", "Order placed on 2026-04-10")
# Benchmark (CPython 3.12, Apple M2)
# with_compile(): ~0.21 µs per call
# without_compile(): ~0.28 µs per call
# Speedup: ~25% — real, but often irrelevant in I/O-bound code
# When re.compile() DOES matter: tight loops over large datasets
records = load_millions_of_log_lines() # I/O already dominates
# Pattern with pre-compile (recommended for reused patterns)
LOG_PATTERN = re.compile(
r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
r"\s+(?P<level>ERROR|WARN|INFO|DEBUG)"
r"\s+(?P<message>.+)"
)
for line in records:
m = LOG_PATTERN.match(line)
if m:
process(m.groupdict())

| Scenario | Use re.compile()? | Reason |
|---|---|---|
| One-off match in a script | No | Cache handles it; compile adds noise |
| Pattern reused in a hot loop | Yes | Avoids per-call cache lookup overhead |
| Module-level constant pattern | Yes | Named object is self-documenting |
| >512 unique patterns active simultaneously | Yes | Cache eviction means constant recompilation |
| Pattern built dynamically (user input) | Situational | Cache works; compile if reused many times |
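As a sketch of the cache-eviction scenario in the table (the pattern strings here are invented for illustration), holding compiled objects directly sidesteps the 512-entry cache entirely:

```python
import re

# Sketch: with more than 512 distinct pattern strings in play (re's
# internal cache size in CPython 3.12), module-level calls like
# re.search(p, s) would evict and recompile constantly.
patterns = [rf"user_{i}_\d+" for i in range(600)]  # exceeds the 512-entry cache
compiled = [re.compile(p) for p in patterns]       # held directly, never evicted

line = "user_42_12345"
hits = [p.pattern for p in compiled if p.search(line)]
print(hits)  # ['user_42_\\d+']
```

The compiled list costs memory up front, but each `.search()` call goes straight to the pattern object with no cache lookup or risk of recompilation.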
Regex Flags: DOTALL, MULTILINE, VERBOSE
Flags modify how the pattern engine interprets the expression. They can be passed as a flags argument or embedded inline using (?flags) syntax — useful when patterns are loaded from config files.
import re
# re.IGNORECASE (re.I) — case-insensitive matching
re.findall(r"error", log, re.IGNORECASE) # Matches ERROR, Error, error
# re.MULTILINE (re.M) — ^ and $ match line boundaries, not string boundaries
text = "line1\nline2\nline3"
re.findall(r"^line\d", text, re.MULTILINE) # ['line1', 'line2', 'line3']
re.findall(r"^line\d", text) # ['line1'] — only first line
# re.DOTALL (re.S) — . matches newlines
html = "<div>\n content\n</div>"
re.search(r"<div>.*</div>", html, re.DOTALL).group()
# '<div>\n content\n</div>'
# Without re.DOTALL, this returns None
# re.VERBOSE (re.X) — allows whitespace and comments inside pattern
DATE_PATTERN = re.compile(r"""
(?P<year> \d{4}) # 4-digit year
-
(?P<month> \d{2}) # 2-digit month
-
(?P<day> \d{2}) # 2-digit day
""", re.VERBOSE)
# Combining flags with bitwise OR
re.findall(r"^error.*", log, re.IGNORECASE | re.MULTILINE)
# Inline flags (useful in config strings)
re.search(r"(?im)^error.*", log) # IGNORECASE + MULTILINE inline

Capture Groups: Numbered and Named
Capture groups extract subsets of the match. Named groups are strictly better than numbered groups when you have more than two of them — they survive pattern refactoring without breaking references.
import re
# Numbered groups — fragile when pattern changes
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "2026-04-10")
year = m.group(1) # '2026'
month = m.group(2) # '04'
day = m.group(3) # '10'
# Named groups — self-documenting and index-stable
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-04-10")
year = m.group("year") # '2026'
month = m.group("month") # '04'
day = m.group("day") # '10'
d = m.groupdict() # {'year': '2026', 'month': '04', 'day': '10'}
# Non-capturing groups (?:...) — group without capturing
re.findall(r"(?:red|blue|green) car", "a red car and a blue car")
# ['red car', 'blue car'] (returns the full match, not the group)
# Named backreference inside pattern — match repeated text
# Match a doubled word like "the the" or "is is"
re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a test")
# Matches 'is is'
# re.sub with named groups in replacement
result = re.sub(
r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
r"\g<day>/\g<month>/\g<year>", # Reorder to DD/MM/YYYY
"Published: 2026-04-10"
)
# 'Published: 10/04/2026'

Lookaheads and Lookbehinds
Lookaround assertions match a position in the string without consuming characters. They are zero-width — the match position does not advance. This makes them essential for context-dependent matching.
import re
# Positive lookahead (?=...) — match X only if followed by Y
# Match a number only if followed by " USD"
re.findall(r"\d+(?= USD)", "100 USD, 200 EUR, 300 USD")
# ['100', '300']
# Negative lookahead (?!...) — match X only if NOT followed by Y
# Match "file" not followed by ".py"
re.findall(r"file(?!\.py)\S*", "file.txt file.py file.csv")
# ['file.txt', 'file.csv']
# Positive lookbehind (?<=...) — match X only if preceded by Y
# Extract price amounts preceded by "$"
re.findall(r"(?<=\$)[\d.]+", "Total: $49.99 and $12.00")
# ['49.99', '12.00']
# Negative lookbehind (?<!...) — match X only if NOT preceded by Y
# Match numbers not preceded by a letter or digit (not part of identifiers)
re.findall(r"(?<![a-z0-9])\d+", "id123 and 456 and v2")
# ['456'] (with just (?<![a-z]), the "23" inside "id123" would also match)
# Practical: extract JWT payload part without consuming the dots
jwt = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1c2VyMTIzIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV"
payload = re.search(r"(?<=\.)([^.]+)(?=\.)", jwt).group()
# 'eyJzdWIiOiJ1c2VyMTIzIn0'
# Note: lookbehind patterns must be fixed-width in the re module
# Use the third-party 'regex' module for variable-width lookbehinds

Quantifiers: Greedy, Lazy, and Possessive
Understanding quantifier behavior is the difference between a pattern that runs in microseconds and one that hangs your process. Python 3.11 added possessive quantifiers and atomic groups to the standard re module — a major backtracking safety improvement.
import re
text = '<a href="https://example.com">link</a>'
# Greedy — matches as much as possible, then backtracks
re.search(r"<.+>", text).group()
# '<a href="https://example.com">link</a>' (grabbed everything)
# Lazy (non-greedy) — matches as little as possible
re.search(r"<.+?>", text).group()
# '<a href="https://example.com">' (stopped at first >)
# Possessive quantifiers (Python 3.11+) — no backtracking
# Use ++, *+, ?+ syntax
re.search(r"<.++>", text) # None! .++ consumes the final '>' and never gives it back
# Failing fast instead of backtracking is what prevents catastrophic backtracking
# Atomic groups (Python 3.11+) — once matched, never re-tried
re.search(r"(?>.+?)>", text) # the lazy .+? is locked in after its first attempt
# CATASTROPHIC BACKTRACKING EXAMPLE
# This pattern can take exponential time on "aaaaaaaaaaab":
bad_pattern = re.compile(r"(a+)+b")
# With input "a" * 20 (all a's, no 'b'), the engine tries:
# (a)(a)(a)..., (aa)(a)(a)..., (aaa)(a)... → exponential paths
# Safe version: use atomic group to prevent restarting
safe_pattern = re.compile(r"(?>(a+))+b") # Python 3.11+
# Or restructure: re.compile(r"a+b")
# Real-world benchmark (Python 3.12, "a" * 20, no match):
# bad_pattern.search("a" * 20): ~15 seconds
# safe_pattern.search("a" * 20): <1 microsecond

Production-Ready Patterns
Here are battle-tested patterns for common validation tasks. For a broader reference, see the Regex Validation Patterns guide covering email, URL, IP, and phone number patterns, and the Regex Cheat Sheet for quick reference.
import re
# ISO 8601 date (YYYY-MM-DD) with basic range validation
DATE_RE = re.compile(
r"(?P<year>\d{4})-(?P<month>0[1-9]|1[0-2])-(?P<day>0[1-9]|[12]\d|3[01])"
)
# Semantic versioning (2.0 spec)
SEMVER_RE = re.compile(
r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
r"(?:-(?P<pre>[\w.-]+))?(?:\+(?P<build>[\w.-]+))?$"
)
# IPv4 address
IPV4_RE = re.compile(
r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)
# Log line parser (common Apache/Nginx format)
LOG_RE = re.compile(r"""
(?P<ip>\S+)\s+ # IP address
\S+\s+ # ident (usually -)
(?P<user>\S+)\s+ # user (usually -)
\[(?P<time>[^\]]+)\]\s+ # timestamp
"(?P<method>\w+)\s+ # HTTP method
(?P<path>\S+)\s+ # Request path
HTTP/(?P<version>[\d.]+)" # HTTP version
\s+(?P<status>\d{3}) # Status code
\s+(?P<size>\d+|-) # Response size
""", re.VERBOSE)
# Environment variable extraction from shell scripts
ENV_RE = re.compile(r"^export\s+(?P<key>[A-Z_][A-Z0-9_]*)=(?P<value>.*)$", re.MULTILINE)
# Usage example: parse .env file
with open(".env") as f:
env_vars = {m.group("key"): m.group("value").strip('"')
for m in ENV_RE.finditer(f.read())}

When your regex produces structured output (log parsing, config extraction), format the results as JSON for debugging. Use BytePane's JSON Formatter to inspect the structured output.
re.sub(): Advanced Replacements
re.sub() is more powerful than most developers realize. The replacement can be a callable — a function that receives each match object and returns a replacement string. This enables context-aware transformations that string templates cannot achieve.
import re
# Basic group reference in replacement
re.sub(r"(\w+)@(\w+\.\w+)", r"[redacted]@\2", "user@example.com")
# '[redacted]@example.com'
# Callable replacement — full Python logic per match
def convert_temperature(m):
celsius = float(m.group(1))
fahrenheit = celsius * 9/5 + 32
return f"{fahrenheit:.1f}°F"
text = "Today is 22°C and tomorrow will be 18°C"
result = re.sub(r"([\d.]+)°C", convert_temperature, text)
# 'Today is 71.6°F and tomorrow will be 64.4°F'
# Limit replacements with count parameter
re.sub(r"foo", "bar", "foo foo foo", count=2)
# 'bar bar foo'
# Case-preserving replacement (camelCase ↔ snake_case)
def camel_to_snake(m):
return m.group(1) + "_" + m.group(2).lower()
camel = "getUserProfile"
snake = re.sub(r"([a-z])([A-Z])", camel_to_snake, camel)
# 'get_user_profile'
# re.subn() — returns (new_string, num_substitutions_made)
result, count = re.subn(r"\bAPI\b", "endpoint", "The API wraps the legacy API")
print(f"Made {count} replacements") # Made 2 replacements

The third-party regex Module
The standard re module has limitations that matter in serious text processing: fixed-width lookbehind only, no Unicode categories, no fuzzy matching. The third-party regex module (installable via pip install regex) is a drop-in replacement that adds all of these.
import regex # pip install regex
# Variable-width lookbehind (not supported in re)
regex.search(r"(?<=\bfoo\s{1,3})bar", "foo bar") # Works
# Unicode category support
regex.findall(r"\p{Lu}+", "Hello World PYTHON") # \p{Lu} = uppercase letters
# ['H', 'W', 'PYTHON']
# Overlapping matches (re.findall skips overlapping)
regex.findall(r"\d+", "123", overlapped=True) # Not typical, but available
# Fuzzy matching — allow up to 1 substitution error
regex.search(r"(?:python){e<=1}", "pithon") # Matches with 1 error
regex.search(r"(?:python){e<=1}", "jython") # Matches (1 substitution)
# Possessive quantifiers and atomic groups (also in re 3.11+)
regex.search(r"(?>\w+)", text) # Atomic group
regex.search(r"\w++", text) # Possessive quantifier
# The regex module is a common drop-in in NLP and text-processing code
# that needs Unicode categories, fuzzy matching, or finer backtracking control

When Not to Use Regex
Regex is a precision tool. Using it for structured data formats — HTML, JSON, XML — is a well-known anti-pattern. According to the Stack Overflow Developer Survey 2024, Python is the most-used language for scripting and the third most-used overall, which means a lot of Python regex is written by developers who reach for it out of habit when a parser would be safer.
import re
# DON'T: parse HTML with regex
# HTML is not a regular language. Nested tags break patterns.
bad = re.search(r"<div>(.*?)</div>", html).group(1) # Brittle
# DO: use a real HTML parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div").text
# DON'T: validate email with regex (RFC 5321 has 200+ edge cases)
EMAIL_RE = re.compile(r"[^@]+@[^@]+\.[^@]+") # Too permissive
# DO: use a validation library or let your mail server bounce
from email_validator import validate_email
validate_email("user@example.com") # Raises if invalid
# DON'T: parse URLs with regex
url_parts = re.match(r"(https?)://([^/]+)(.*)", url)
# DO: use urllib.parse
from urllib.parse import urlparse
parsed = urlparse(url)
scheme, netloc, path = parsed.scheme, parsed.netloc, parsed.path
# Regex IS appropriate for:
# - Log file parsing (semi-structured, consistent format)
# - Find/replace in text with known patterns
# - Input validation for custom formats (order IDs, API keys)
# - Tokenization in simple grammars

For complex text parsing tasks, also consider Python's pyparsing, lark, or parsimonious libraries. They handle recursive grammars that regex fundamentally cannot. See the Regex Patterns Cheat Sheet for pattern reference organized by use case.
Frequently Asked Questions
When should I use re.compile() vs re.search() directly?
Use re.compile() when you use the same pattern more than once in a loop or frequently called function — it eliminates repeated compilation overhead. For one-off matches, re.search(pattern, string) directly is fine because Python internally caches up to 512 compiled patterns. Named compiled objects also improve code readability.
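One extra benefit worth noting: compiled pattern objects accept pos and endpos arguments that the module-level functions lack, letting you scan a slice of a string without copying it. A minimal sketch:

```python
import re

WORD = re.compile(r"\w+")

text = "alpha beta gamma"
# Module-level re.search() has no pos/endpos; compiled patterns do.
m = WORD.search(text, 6)        # start scanning at index 6
print(m.group())                # 'beta'
m = WORD.search(text, 6, 10)    # restrict the scan to text[6:10]
print(m.group())                # 'beta'
```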
What is the difference between re.match() and re.search()?
re.match() only checks for a pattern at position 0 of the string — it implicitly anchors to the start. re.search() scans the entire string and returns the first match anywhere. In practice, use re.search() almost always; use re.match() only when you specifically need to validate the string starts with a pattern.
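A quick sketch of the difference (sample string invented for illustration):

```python
import re

s = "Error: disk full"
print(re.match(r"Error", s) is not None)    # True:  anchored at position 0
print(re.match(r"disk", s) is not None)     # False: "disk" is not at the start
print(re.search(r"disk", s) is not None)    # True:  scans the whole string
# re.search with ^ behaves like re.match:
print(re.search(r"^Error", s) is not None)  # True
```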
What is the difference between re.findall() and re.finditer()?
re.findall() returns a list of all matches immediately. re.finditer() returns an iterator of match objects yielded lazily. For large texts, re.finditer() is more memory-efficient. Use re.finditer() when you need match positions, spans, or named groups — findall() only returns the matched strings.
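A short illustration (the sample log string is made up for the example):

```python
import re

log = "GET /a 200\nPOST /b 404\nGET /c 500"

# findall: just the matched strings, all at once
print(re.findall(r"\d{3}", log))  # ['200', '404', '500']

# finditer: lazy match objects, with positions available per match
errors = [m.group() for m in re.finditer(r"\b[45]\d{2}\b", log)]
print(errors)                     # ['404', '500']
```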
How do I make a Python regex case-insensitive?
Pass re.IGNORECASE (or re.I) as the flags argument: re.search(r"hello", text, re.IGNORECASE). For compiled patterns: re.compile(r"hello", re.I). You can also embed the flag inline: re.search(r"(?i)hello", text). Inline flags work when you load patterns from config strings.
How do named capture groups work in Python regex?
Named groups use (?P<name>pattern) syntax. Access them with match.group("name") or match.groupdict() which returns a dict of all named groups. Named groups survive pattern refactoring without breaking references and support backreferences with (?P=name) for matching repeated text.
What does re.DOTALL do and when do I need it?
By default, dot (.) matches any character except newline. re.DOTALL makes dot match newlines too. You need it for multi-line content like HTML, JSON blocks, or log entries spanning multiple lines. Combine with re.MULTILINE when you also need ^ and $ to match line boundaries.
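A minimal sketch combining both flags, with an invented multi-line document:

```python
import re

doc = "START\nline one\nline two\nEND"

# DOTALL lets . cross newlines; MULTILINE re-anchors ^ and $ to each line.
block = re.search(r"^START$.*^END$", doc, re.DOTALL | re.MULTILINE)
print(block.group() == doc)  # True: the match spans the whole block
```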
Why is my Python regex slow, and how do I fix it?
Catastrophic backtracking from nested quantifiers like (a+)+ is the most common cause. Fixes: use atomic groups (Python 3.11+), possessive quantifiers, restructure to fail fast, or use the third-party regex module for finer backtracking control. Always benchmark with inputs that do NOT match — that exercises the maximum backtracking paths.
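To see the failure mode concretely, here is a small sketch contrasting the nested-quantifier pattern with its linear-time equivalent (the path-count comment is an order-of-magnitude estimate, not a measured benchmark):

```python
import re

# (a+)+b matches exactly the same strings as a+b, but the nested
# quantifier explodes on near-miss input that has no 'b'.
bad = re.compile(r"(a+)+b")
safe = re.compile(r"a+b")

print(bad.search("aaab").group())  # 'aaab': matching input is fast either way
print(safe.search("a" * 40))       # None: fails quickly
# bad.search("a" * 40) would hang: roughly 2**39 ways to partition the a's
```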
Test Your Regex Patterns
Use BytePane's developer tools to validate and format your regex output. Parse log lines to JSON and inspect with the JSON Formatter. Reference the Regex Cheat Sheet for syntax reminders, or the Validation Patterns guide for production-ready email, URL, and IP patterns.
Related Articles
Regex Cheat Sheet
Quick-reference for regex metacharacters, quantifiers, and anchors.
Regex Validation Patterns
Production-ready patterns for email, URL, IP, phone, and more.
Environment Variables Guide
Parse .env files and config strings — a common regex use case.
Regex Patterns Cheat Sheet
Organized patterns by use case: dates, URLs, IPs, and log formats.