BytePane

Regex Tutorial: Learn Regular Expressions Step by Step (2026)

Text Processing · 20 min read

Key Takeaways

  • According to Stack Overflow's developer surveys, approximately 78% of professional developers use regex at least monthly — it is a core competency for backend, data engineering, and DevOps roles.
  • Research published in ASE 2019 (Servant, Davis, et al.) found that 65% of developers found regex daunting when learning — most production regex bugs come from quantifier ambiguity and greedy vs. lazy mismatches.
  • ReDoS (Regular Expression Denial of Service) vulnerabilities exploit catastrophic backtracking in nested quantifiers — real exploits have caused outages at Cloudflare, Node.js, and Ruby on Rails.
  • Go's RE2 engine guarantees O(n) linear time matching by prohibiting backreferences and lookaheads — the trade-off is worth it for user-facing input validation at scale.
  • Named capture groups ((?<name>pattern)) were added to JavaScript in ES2018 and dramatically improve regex readability for complex patterns — use them in any pattern with 2+ groups.

The Log File That Took Down the Monitoring System

In 2019, a misconfigured regular expression in Cloudflare's Web Application Firewall caused a global outage lasting 27 minutes, dropping HTTP traffic by approximately 82%. The root cause: a new WAF rule introduced a regex with a catastrophically backtracking pattern — specifically, a sequence involving a wildcard pattern matching against itself. The CPU on every core handling HTTP traffic pegged at 100%, starving legitimate requests.

This was not an exotic edge case. The same class of regex — one with nested quantifiers that can explore exponentially many paths — appears routinely in production codebases written by developers who understand regex syntax but not the underlying NFA engine. Per research by Davis et al. published in IEEE/ACM ASE 2019, 13% of regexes extracted from npm packages showed potential super-linear worst-case behavior.

This tutorial starts from the basics — literals, metacharacters, quantifiers — and builds to the advanced concepts that separate regex competence from regex expertise: atomic groups, possessive quantifiers, lookaheads, backreferences, and the crucial question of when to use a regex engine vs. a parser. Along the way, we will build the mental model that prevents the Cloudflare class of bugs.

Part 1: Regex Fundamentals

Literal Characters and Metacharacters

The simplest regex is a literal string. The pattern hello matches the sequence “hello” anywhere in the input. What makes regex powerful — and complex — is the set of metacharacters that represent classes of characters or control matching behavior:

Metacharacter  Meaning                           Example   Matches
.              Any character (except newline)    c.t       cat, cut, c8t, c-t
\d             Any digit [0-9]                   \d\d\d    123, 042, 999
\w             Word char [a-zA-Z0-9_]            \w+       hello, foo_bar, ABC123
\s             Whitespace (space, tab, newline)  foo\sbar  "foo bar", "foo\tbar"
^              Start of string (or line in /m)   ^Hello    "Hello world" (not "Say Hello")
$              End of string (or line in /m)     world$    "Hello world" (not "worldwide")
[abc]          Character class: a, b, or c       [aeiou]   Any single vowel
[^abc]         Negated class: NOT a, b, or c     [^0-9]    Any non-digit character
\b             Word boundary                     \bcat\b   "cat" not "cats" or "concatenate"
|              Alternation (OR)                  cat|dog   "cat" or "dog"

The uppercase versions negate the shorthand: \D matches non-digits, \W matches non-word characters, \S matches non-whitespace. These are equivalent to their negated class counterparts: \D = [^0-9].
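
The shorthand classes and their negations are easy to check directly. The snippet below is a minimal sketch; the sample string is made up for illustration and does not come from any real dataset.

```javascript
// Illustrative sample text (hypothetical order message).
const sample = 'Order #42 shipped to cat_lover99 on 2026-04-15'

// \d+ finds each run of digits.
const digitRuns = sample.match(/\d+/g)
// → ['42', '99', '2026', '04', '15']

// \b requires a word boundary on both sides. "cat" inside "cat_lover99"
// is skipped because "_" counts as a word character.
const standaloneCat = /\bcat\b/.test(sample) // false
const anyCat = /cat/.test(sample)            // true

// \D is equivalent to [^0-9]: both strip the same characters.
const viaShorthand = sample.replace(/\D/g, '')
const viaNegatedClass = sample.replace(/[^0-9]/g, '')
// both → '429920260415'
```

Note how the \b result depends on the definition of \w: underscores and digits are word characters, so they never create a boundary next to a letter.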

Quantifiers: Controlling Repetition

Quantifiers attach to the preceding element and specify how many times it can repeat:

Quantifier reference with examples
// * — zero or more (greedy)
/go*gle/.test("ggle")   // true (0 o's)
/go*gle/.test("google") // true (2 o's)
/go*gle/.test("gooogle") // true (3 o's)

// + — one or more (greedy)
/go+gle/.test("ggle")   // false (needs at least 1 o)
/go+gle/.test("google") // true

// ? — zero or one (optional)
/colou?r/.test("color")  // true (u is optional)
/colou?r/.test("colour") // true

// {n} — exactly n times
/\d{4}/.test("2026") // true — exactly 4 digits

// {n,m} — between n and m times (inclusive)
/\w{2,5}/.test("hi")      // true (2 chars)
/\w{2,5}/.test("hello")   // true (5 chars)
/\w{2,5}/.test("a")       // false (1 char)
/\w{2,5}/.test("toolong") // true — matches first 5 chars ("toolo")

// {n,} — n or more times
/\d{3,}/.test("12")    // false (only 2 digits)
/\d{3,}/.test("12345") // true

Part 2: Greedy vs. Lazy Quantifiers

This is where the majority of regex bugs live. By default, all quantifiers are greedy — they match as much as possible while still allowing the overall pattern to succeed. Adding ? after a quantifier makes it lazy (minimal) — it matches as little as possible.

Greedy vs lazy — the HTML tag extraction problem
const html = '<b>bold</b> and <em>italic</em>'

// Greedy: <.*> matches from first < to LAST >
html.match(/<.*>/)
// → ['<b>bold</b> and <em>italic</em>']
// The .* consumed everything between the first and LAST angle bracket

// Lazy: <.*?> matches from < to the NEAREST >
html.match(/<.*?>/g)
// → ['<b>', '</b>', '<em>', '</em>']

// Better solution: character class negation (no backtracking risk)
html.match(/<[^>]+>/g)
// → ['<b>', '</b>', '<em>', '</em>']
// [^>]+ means "one or more chars that are NOT >"
// This is preferred because it has no backtracking ambiguity

// Quantifier lazy variants:
// *?   →  *  with lazy matching
// +?   →  +  with lazy matching
// ??   →  ?  with lazy matching
// {n,m}? → {n,m} with lazy matching

The character class negation approach ([^>]+) is generally preferable to lazy quantifiers for two reasons: it is more explicit about intent (matches everything except the delimiter), and it eliminates backtracking ambiguity — the engine knows immediately when to stop without exploring alternative paths.
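
The difference between the two approaches shows up once something must follow the closing delimiter. The sketch below uses made-up attribute strings: on simple input the lazy and negated-class patterns agree, but when a required suffix fails to match, the lazy quantifier keeps expanding past the delimiter while the negated class cannot.

```javascript
// Illustrative input: quoted attribute values.
const attrs = 'name="Ada" role="admin" note=""'

const lazy = attrs.match(/".*?"/g)
const negated = attrs.match(/"[^"]*"/g)
// both → ['"Ada"', '"admin"', '""']

// Now require a semicolon right after the closing quote.
const statement = 'a="x" b="y";'

// Lazy: .*? grows until '";' finally matches, crossing two quotes on the way.
const lazyWithSuffix = statement.match(/".*?";/)[0]
// → '"x" b="y";'

// Negated class: [^"]* can never cross a quote, so only the last value matches.
const negatedWithSuffix = statement.match(/"[^"]*";/)[0]
// → '"y";'
```

"Lazy" means "try the shortest option first", not "never cross the delimiter". The negated class encodes that second guarantee in the pattern itself.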

Part 3: Capturing Groups, Named Groups, and Non-Capturing Groups

Parentheses in regex serve two purposes simultaneously: they group subpatterns (allowing quantifiers to apply to multiple characters), and they capture the matched substring for later use.

Groups: capturing, non-capturing, and named (ES2018+)
// Capturing group: (pattern)
// Captures are accessible at result[1], result[2], etc.
const dateStr = '2026-04-15'
const match = dateStr.match(/(\d{4})-(\d{2})-(\d{2})/)
// match[0] = '2026-04-15'  (full match)
// match[1] = '2026'        (capture group 1)
// match[2] = '04'          (capture group 2)
// match[3] = '15'          (capture group 3)

// Non-capturing group: (?:pattern)
// Groups without capturing — use for alternation without polluting capture indices
const version = '3.14.159'
const semver = version.match(/(?:\d+\.){2}\d+/)
// Groups without creating match[1], match[2]

// Named capturing group: (?<name>pattern) — ES2018+
const namedMatch = dateStr.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)
const { year, month, day } = namedMatch.groups
// year  = '2026'
// month = '04'
// day   = '15'

// Named groups in replacement strings
'2026-04-15'.replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  '$<month>/$<day>/$<year>'
)
// → '04/15/2026'

// Backreference to named group in the same pattern
// Match repeated words: "the the" or "a a"
/\b(?<word>\w+)\s+\k<word>\b/i.test('the the')  // true
/\b(?<word>\w+)\s+\k<word>\b/i.test('the The')  // true (i flag)
/\b(?<word>\w+)\s+\k<word>\b/i.test('the cat')  // false

Named capture groups were standardized in ECMAScript 2018 (Chrome 64+, Node.js 10+). Per the ES2018 specification (TC39 proposal by Gorkem Yakin and Daniel Ehrenberg), named groups are referenced in patterns via \k<name> and in replacement strings via $<name>. They are also available in Python (via (?P<name>...) syntax), PHP, .NET, Java 7+, and PCRE.

For any pattern with more than two capture groups, named groups are effectively required for maintainability: they eliminate the cognitive load of tracking which index corresponds to which field.
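
As an illustration of the maintainability point, here is a minimal sketch of a parser for a hypothetical access-log format ("<ip> <METHOD> <path> <status>"). The format, the ACCESS_LINE pattern, and the parseAccessLine helper are all assumptions made up for this example.

```javascript
// Hypothetical log format: "<ip> <METHOD> <path> <status>"
const ACCESS_LINE =
  /^(?<ip>\d{1,3}(?:\.\d{1,3}){3}) (?<method>[A-Z]+) (?<path>\S+) (?<status>\d{3})$/

function parseAccessLine(line) {
  const match = ACCESS_LINE.exec(line) // no /g flag, so exec is stateless
  if (match === null) return null
  // Destructure by name: reordering groups in the pattern cannot break this.
  const { ip, method, path, status } = match.groups
  return { ip, method, path, status: Number(status) }
}

const parsed = parseAccessLine('10.0.0.7 GET /health 200')
// → { ip: '10.0.0.7', method: 'GET', path: '/health', status: 200 }
const rejected = parseAccessLine('not a log line')
// → null
```

With numbered groups, adding a field in the middle of the pattern would silently shift every later index. With names, the destructuring line stays correct.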

Part 4: Lookahead and Lookbehind Assertions

Lookaround assertions match a position in the string based on what comes before or after, without consuming characters. They are zero-width — they assert context without advancing the match position.

All four lookaround types with practical examples
// Positive lookahead: (?=pattern)
// "match X only if followed by Y" — Y is not part of the match

// Extract version numbers from "v1.2.3" format
'Release v1.2.3 is ready'.match(/\d+\.\d+\.\d+(?= is ready)/)
// → ['1.2.3']  (the " is ready" is not captured)

// Password validation: at least one digit
/^(?=.*\d).{8,}$/.test('password1')  // true
/^(?=.*\d).{8,}$/.test('password')   // false (no digit)

// Negative lookahead: (?!pattern)
// "match X only if NOT followed by Y"

// Match "foo" not followed by "bar"
'foobar foobaz fooqux'.match(/foo(?!bar)\w*/g)
// → ['foobaz', 'fooqux']  ('foobar' excluded)

// Positive lookbehind: (?<=pattern) — ES2018+
// "match X only if preceded by Y"

// Extract dollar amounts after "$"
'Price: $42.00, Cost: $15'.match(/(?<=\$)\d+\.?\d*/g)
// → ['42.00', '15']

// Negative lookbehind: (?<!pattern) — ES2018+
// "match X only if NOT preceded by Y"

// Match numbers not preceded by "$"
'Buy $5 and save 10 dollars'.match(/(?<!\$)\b\d+\b/g)
// → ['10']  (the '$5' is excluded)

// Note: Lookbehind was added in ES2018 (V8 6.2+, Node.js 9.11.2+)
// Go's RE2 does NOT support lookahead or lookbehind
// Python's re module supports lookahead/lookbehind

Lookbehind assertions were proposed by Gorkem Yakin and Nozomu Katō and accepted into ES2018. They are fully supported in V8 (Chrome, Node.js), SpiderMonkey (Firefox), and JavaScriptCore (Safari 16.4+). The key limitation: Go's RE2 engine does not support any lookaround assertions — this is a deliberate design choice to guarantee O(n) matching time. If your Go application needs lookahead-style behavior, restructure the pattern using alternation or process the match result in code.
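
The "process the match result in code" approach can be sketched in JavaScript so it sits next to the lookahead version it replaces. This is an illustration under one assumption: both patterns start with \b, so each match covers a whole word and the two versions select the same words. The same shape carries over to engines without lookaround.

```javascript
const text = 'foobar foobaz fooqux'

// With a negative lookahead (not available in RE2-style engines).
const withLookahead = text.match(/\bfoo(?!bar)\w*/g)
// → ['foobaz', 'fooqux']

// Without lookaround: match the broader pattern, then reject in code.
const withoutLookaround = [...text.matchAll(/\bfoo\w*/g)]
  .map((m) => m[0])
  .filter((word) => !word.startsWith('foobar'))
// → ['foobaz', 'fooqux']
```

The filter step is ordinary code, so it can be unit-tested and extended with conditions that would be awkward to express as assertions.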

Part 5: Regex Flags — The Behavior Modifiers

Flags are appended after the closing delimiter: /pattern/flags. JavaScript has eight flags as of ES2024:

Flag  Name              Added   Effect
g     Global            ES3     Find all matches, not just the first. Makes exec() and test() stateful via lastIndex.
i     Case-insensitive  ES3     A matches both A and a.
m     Multiline         ES3     ^ and $ match line boundaries, not just string start/end.
s     DotAll            ES2018  . matches newlines. Without /s, . matches any character except line terminators such as \n.
u     Unicode           ES2015  Enables full Unicode matching. Required for \p{...} property escapes such as \p{L}.
y     Sticky            ES2015  Matches only at lastIndex instead of scanning ahead.
d     Indices           ES2022  Adds match.indices array with start/end positions for each group.
v     UnicodeSets       ES2024  Enables set operations in character classes, such as difference ([A--B]) and intersection ([A&&B]).

The /g flag's stateful lastIndex trap: a common bug source
// /g makes exec() stateful — it remembers where it left off via lastIndex
const re = /\d+/g  // stored in a variable — gets reused

re.exec('42 and 100')  // { 0: '42',  index: 0 }, re.lastIndex = 2
re.exec('42 and 100')  // { 0: '100', index: 7 }, re.lastIndex = 10
re.exec('42 and 100')  // null (no more matches), re.lastIndex = 0

// BUG: if you call exec on a different string, lastIndex is still set
const re2 = /\w+/g
re2.exec('hello world')  // { 0: 'hello', index: 0 }, lastIndex = 5
re2.exec('abc def')      // { 0: 'ef', index: 5 } ← WRONG! search started at index 5, skipping 'abc'
// (If lastIndex is past the end of the new string, exec returns null and resets lastIndex to 0)
// This is why regex with /g in module scope causes hard-to-reproduce bugs

// Solution 1: Create a new regex each call (avoid regex literal in module scope)
function findAll(str) {
  return [...str.matchAll(/\d+/g)]  // matchAll always creates fresh iterator
}

// Solution 2: Reset lastIndex manually
const reGlobal = /\d+/g
function safeExec(str) {
  reGlobal.lastIndex = 0  // reset before use
  return reGlobal.exec(str)
}
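
The remaining flags from the table can be exercised in a few lines. This is a minimal sketch with made-up sample strings; the d flag assumes a runtime with ES2022 support (for example Node.js 16 or later).

```javascript
const text = 'first line\nsecond line'

// s (dotAll): the dot crosses the newline only when /s is set.
const dotWithoutS = /first.*second/.test(text)      // false
const dotWithS = /first.*second/s.test(text)        // true

// m (multiline): ^ matches at the start of every line.
const firstWords = text.match(/^\w+/gm)             // ['first', 'second']

// u (unicode): enables \p{...} property escapes.
const letters = '東京 tokyo 2026'.match(/\p{L}+/gu)  // ['東京', 'tokyo']

// y (sticky): match only at lastIndex, never scanning ahead.
const sticky = /\d+/y
sticky.lastIndex = 3
const atDigits = sticky.test('abc123')              // true: digits start at index 3
sticky.lastIndex = 1
const atLetter = sticky.test('abc123')              // false: index 1 is 'b'

// d (indices): start/end offsets for each group.
const indexed = /(?<word>line)/d.exec(text)
const wordSpan = indexed.indices.groups.word        // [6, 10]
```

The sticky flag is the usual building block for hand-written tokenizers, because it guarantees that no input is skipped between tokens.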

Part 6: ReDoS — When Regex Becomes a Security Vulnerability

ReDoS (Regular Expression Denial of Service) is a real attack class, not a theoretical concern. The Cloudflare outage mentioned earlier is one of dozens of documented incidents. Node.js has shipped multiple ReDoS-related CVEs. The root cause is always the same: backtracking NFA engines with nested quantifiers.

Classic ReDoS pattern — exponential worst case
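// (a+)+ is the canonical ReDoS core; it becomes dangerous when followed
// by something that can fail, such as the $ anchor below
// For input "aaaaaaaaac" (n 'a's then 'c'):
// The trailing 'c' makes the match fail, so the engine tries EVERY way
// to partition n 'a's into groups of one or more before giving up
// This is 2^(n-1) partitions, exponential in n

const vulnerable = /^(a+)+$/  // unanchored /(a+)+/ would succeed immediately
const payload = 'a'.repeat(30) + 'c'  // 30 a's then 'c'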

// DON'T RUN THIS — it will hang your process
// console.time(); vulnerable.test(payload); console.timeEnd()

// Patterns with ReDoS risk:
// (a|a)+   — alternation between identical patterns + outer quantifier
// (a*)*    — nested quantifiers
// \w+\s*;  — at line end without anchor (in certain contexts)

// Whitespace handling is a common source, as in CVE-2022-25883 (semver npm package).
// The two patterns below are illustrative, not the package's actual regexes:
// /^\s*(\d+\.\d+\.\d+)\s*$/  (NOT vulnerable)
// /^(\s+|\s*,\s*)*$/         (POTENTIALLY vulnerable)

// Safe alternatives:
// 1. Use possessive quantifiers if your engine supports them (PCRE/Java)
//    (a++)+ — does not backtrack within the group
// 2. Use atomic groups: (?>a+)+
// 3. Use Go's RE2 — linear time, no backtracking
// 4. Validate with the 'safe-regex' npm package
// 5. Set regex timeout limits at the gateway level
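
The exponential growth can be observed safely by keeping n small. The sketch below times the anchored pattern /^(a+)+$/ against rejected inputs of increasing length. The timeRejection helper is hypothetical, the sizes are deliberately capped at 20, and absolute timings depend on the machine and engine, so no expected numbers are shown.

```javascript
// Anchored so the trailing 'c' forces a failed match and full backtracking.
const vulnerable = /^(a+)+$/

// Hypothetical helper: time one rejected match of n 'a's followed by 'c'.
function timeRejection(n) {
  const payload = 'a'.repeat(n) + 'c'
  const start = performance.now()
  const matched = vulnerable.test(payload)
  const elapsedMs = performance.now() - start
  return { n, matched, elapsedMs }
}

// Sizes stay small on purpose: each extra 'a' roughly doubles the work.
const timings = [12, 14, 16, 18, 20].map(timeRejection)
for (const { n, elapsedMs } of timings) {
  console.log(`n=${n}: ${elapsedMs.toFixed(2)} ms`)
}
```

On a backtracking engine the printed times should climb steeply from one size to the next, which is the signature to look for when profiling a suspect pattern.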

Defending against ReDoS in Node.js applications
// Option 1: node-re2 (the 're2' npm package), RE2 engine for Node.js with linear-time matching
import RE2 from 're2'

const re = new RE2('^(a+)+$')   // accepted: RE2 handles nested quantifiers without backtracking
re.test('a'.repeat(30) + 'c')   // false, returned promptly
// RE2 throws a SyntaxError for syntax it cannot match in linear time,
// such as backreferences (\1) and lookahead assertions

// Option 2: safe-regex package — static analysis
import safeRegex from 'safe-regex'
safeRegex(/(a+)+/)    // false — flagged as potentially catastrophic
safeRegex(/\d+/)      // true — simple, linear

// Option 3: put a time limit around the match
// When user-supplied patterns are evaluated (e.g., search feature),
// always use a timeout wrapper or compile in a Worker with a kill timer

// Option 4: Pattern design — avoid nested quantifiers on same chars
// RISKY: (\w+\s*)+
// SAFE:  \w+(\s+\w+)*   (same words-separated-by-whitespace shape, minus trailing whitespace; only one way to partition)
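
The Option 4 rewrite can be sanity-checked by running both patterns over the same inputs. This sketch adds ^ and $ anchors (an assumption, so the comparison covers whole strings) and uses short, made-up inputs that are safe for the risky pattern.

```javascript
const risky = /^(\w+\s*)+$/
const safe = /^\w+(\s+\w+)*$/

// Short illustrative inputs only: the risky pattern is fine at this size.
const inputs = ['foo', 'foo bar baz', 'foo   bar', '', 'foo!', 'foo bar ']

const comparison = inputs.map((input) => ({
  input,
  risky: risky.test(input),
  safe: safe.test(input),
}))

const disagreements = comparison.filter((row) => row.risky !== row.safe)
// → only 'foo bar ' differs: risky accepts the trailing space, safe rejects it
```

If trailing whitespace must be accepted, appending \s* to the safe pattern restores it without reintroducing nested quantifiers over the same characters.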

Per research by Davis et al. (2019), regular expressions with potentially super-linear worst-case behavior appeared in approximately 5.6% of npm packages sampled. The authors specifically called out the npm ecosystem as high-risk because many packages expose regex to user-controlled input (search, routing, validation) without ReDoS analysis.

Part 7: Regex Across Languages — Key Differences

The PCRE (Perl Compatible Regular Expressions) standard is the de facto baseline, but each language's implementation diverges in important ways. Understanding these differences prevents “it works in Python but not Go” debugging sessions.

Feature                 JavaScript (V8)      Python (re)          Go (RE2)                       Rust (regex)
Engine type             NFA (backtracking)   NFA (backtracking)   RE2-style automaton (linear)   Finite automata, RE2-inspired (linear)
Backreferences          Yes (\1, \k<name>)   Yes (\1, (?P=name))  No                             No
Lookahead               Yes (?=) (?!)        Yes                  No                             No
Lookbehind              Yes (ES2018)         Yes (fixed width)    No                             No
Named groups            Yes (?<name>)        Yes (?P<name>)       Yes (?P<name>)                 Yes (?P<name>)
Unicode property \p{L}  Yes (with /u flag)   Via regex lib        Yes                            Yes
Worst-case complexity   Exponential (ReDoS)  Exponential          O(n) guaranteed                O(n) guaranteed

The practical guidance from this table: if you are building a feature where user-supplied input is matched against a pattern (search, routing, form validation), reach for Go or Rust (or node-re2 in Node.js) to get RE2's linear-time guarantee. Reserve Python's re or JavaScript's built-in engine for patterns you control — log processing, internal data transformation, where inputs are trusted.
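
The named-group row of the table is the difference most likely to bite when porting a pattern. As a rough illustration, the hypothetical helper below rewrites Python-style named groups and backreferences into JavaScript syntax. It is a naive sketch: it does not handle escaped parentheses, inline flags, or other Python-only constructs.

```javascript
// Hypothetical helper, for illustration only.
function pythonToJsNamedGroups(source) {
  return source
    .replace(/\(\?P<([A-Za-z_]\w*)>/g, '(?<$1>')    // (?P<name>...) → (?<name>...)
    .replace(/\(\?P=([A-Za-z_]\w*)\)/g, '\\k<$1>')  // (?P=name)     → \k<name>
}

// String.raw keeps the backslashes exactly as they would appear in Python.
const pySource = String.raw`(?P<word>\w+)\s+(?P=word)`
const jsSource = pythonToJsNamedGroups(pySource)
// → '(?<word>\w+)\s+\k<word>'

const repeated = new RegExp(jsSource)
const sawRepeat = repeated.test('the the')  // true
const noRepeat = repeated.test('the cat')   // false
```

A converted pattern still runs on a backtracking engine, so any ReDoS risk in the original carries over unchanged.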

For ready-to-use production patterns in JavaScript and Python, see our regex cheat sheet and Python regex guide. For validated email/URL/phone patterns with real-world trade-off analysis, see regex validation patterns.

Part 8: Practical Regex Patterns You Will Actually Use

The real test of regex knowledge is building patterns for messy, real-world data. Here are production-tested patterns with commentary on the trade-offs:

Production patterns — with commentary
// Semantic version (semver: major.minor.patch, optional pre-release)
const SEMVER = /^(?<major>0|[1-9]\d*)\.(?<minor>0|[1-9]\d*)\.(?<patch>0|[1-9]\d*)(?:-(?<prerelease>[\w.-]+))?$/
SEMVER.test('1.0.0')        // true
SEMVER.test('2.1.0-beta.1') // true
SEMVER.test('1.01.0')       // false (leading zero)

// ISO 8601 date (basic — not exhaustive)
const ISO_DATE = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/
ISO_DATE.test('2026-04-15') // true
ISO_DATE.test('2026-13-01') // false (month 13)
ISO_DATE.test('2026-04-32') // false (day 32)

// IPv4 address (validates 0–255 range)
const IPV4 = /^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$/
IPV4.test('192.168.1.1')  // true
IPV4.test('256.0.0.1')    // false

// Slugify: URL-safe lowercase string
function slugify(str) {
  return str
    .toLowerCase()
    .replace(/[^\w\s-]/g, '')  // remove non-word, non-space, non-hyphen
    .replace(/[\s_]+/g, '-')   // spaces and underscores → hyphens
    .replace(/--+/g, '-')       // collapse multiple hyphens
    .replace(/^-|-$/g, '')      // trim leading/trailing hyphens
}
slugify('Hello, World! How Are You?')  // 'hello-world-how-are-you'

// Extract all log timestamps (format: [2026-04-15 14:30:00.123])
const LOG_TIMESTAMP = /\[(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]/g
const logLine = '[2026-04-15 14:30:00.123] ERROR: connection refused'
const [tsMatch] = logLine.matchAll(LOG_TIMESTAMP)  // matchAll avoids the shared lastIndex state of a /g regex
// tsMatch.groups.ts = '2026-04-15 14:30:00.123'

// Remove HTML tags (safe for simple stripping — use DOMParser for complex HTML)
const stripTags = (html) => html.replace(/<[^>]+>/g, '')
stripTags('<p>Hello <strong>world</strong></p>')  // 'Hello world'
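
The ISO date pattern above is labelled "not exhaustive" for a reason: it accepts shapes like 2026-02-31. One way to close that gap is to let the regex check the shape and let Date check the calendar. The sketch below is an assumption-laden illustration: isRealIsoDate is a hypothetical helper, and because Date.UTC maps years 0 to 99 onto 1900 to 1999, it rejects dates in those years.

```javascript
const ISO_DATE = /^(?<year>\d{4})-(?<month>0[1-9]|1[0-2])-(?<day>0[1-9]|[12]\d|3[01])$/

// Hypothetical helper: regex for the shape, Date for the calendar.
function isRealIsoDate(str) {
  const match = ISO_DATE.exec(str)
  if (match === null) return false
  const year = Number(match.groups.year)
  const month = Number(match.groups.month)
  const day = Number(match.groups.day)
  // Date.UTC rolls impossible days into the next month,
  // so build the date and compare it with the parsed parts.
  const date = new Date(Date.UTC(year, month - 1, day))
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  )
}

const valid = isRealIsoDate('2026-04-15')      // true
const feb31 = isRealIsoDate('2026-02-31')      // false: shape is fine, date is not
const leapDay = isRealIsoDate('2024-02-29')    // true: 2024 is a leap year
const notLeap = isRealIsoDate('2026-02-29')    // false
```

Splitting the job this way keeps the pattern short and leaves leap-year rules to the runtime.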

Frequently Asked Questions

What does .* mean in regex?

.* means “match any character (except newline) zero or more times.” The dot matches any single character except \n by default. The asterisk is a greedy quantifier that matches 0 or more repetitions. Combined, .* matches any sequence of characters on a line. Use .*? (lazy) to match as few characters as possible. With the /s flag, dot also matches newlines.

What is the difference between .* and .+?

.* matches zero or more characters (can match an empty string). .+ matches one or more characters (must match at least one). Use .+ when you need to guarantee the pattern captured something non-empty. The same distinction applies to all quantifiers: * vs +, {0,5} vs {1,5}.
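
A short sketch of the distinction, using hypothetical "key=value" config lines:

```javascript
// Illustrative patterns for "key=value" lines.
const allowEmpty = /^(?<key>\w+)=(?<value>.*)$/
const requireValue = /^(?<key>\w+)=(?<value>.+)$/

const emptyAllowed = allowEmpty.test('timeout=')     // true: .* accepts an empty value
const emptyRejected = requireValue.test('timeout=')  // false: .+ needs at least one character
const filled = requireValue.exec('timeout=30').groups.value
// → '30'
```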

What is catastrophic backtracking in regex?

Catastrophic backtracking occurs when a regex with nested quantifiers (like (a+)+) must explore exponentially many match paths to reject a string. A pattern like (a+)+b takes microseconds on valid input but exponential time on a string of 'a' characters without a trailing 'b'. This is the ReDoS (Regular Expression Denial of Service) vulnerability class that caused the 2019 Cloudflare global outage.

What does ^ and $ mean in regex?

^ is the start-of-string anchor. $ is the end-of-string anchor. Together, ^pattern$ forces the entire string to match — not just a substring. Without anchors, /\d+/ matches the digits in “abc123def”. With anchors, /^\d+$/ requires the entire string to be digits. In multiline mode (/m flag), they match line boundaries.
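
The effect of the anchors and of the /m flag can be seen side by side in this small sketch (sample strings are made up):

```javascript
// Without anchors, a substring match is enough.
const hasDigits = /\d+/.test('abc123def')      // true
// With anchors, the whole string must be digits.
const allDigits = /^\d+$/.test('abc123def')    // false
const pureDigits = /^\d+$/.test('123')         // true

// In multiline mode, ^ and $ also match at line boundaries.
const lines = 'alpha\n42\nbeta'
const wholeString = /^\d+$/.test(lines)        // false: the string as a whole is not digits
const anyLine = /^\d+$/m.test(lines)           // true: the second line is all digits
```

For input validation the non-multiline form is usually the one you want, since /m lets a single valid line hide inside otherwise invalid input.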

What is a named capture group in regex?

Named capture groups use (?<name>pattern) syntax. In JavaScript (ES2018+): const { year } = /(?<year>\d{4})/.exec('2026').groups. Supported in Python ((?P<name>)), PHP, .NET, Java 7+, and PCRE. Named groups eliminate magic index numbers in complex patterns and are essential for maintainable regex with multiple captures.

How is regex different across JavaScript, Python, and Go?

JavaScript and Python use backtracking NFA engines, which are powerful (backreferences, lookaheads) but vulnerable to ReDoS. Go's regexp package follows RE2, a linear-time automaton-based design that prohibits backreferences and lookarounds for guaranteed O(n) performance. Rust's regex crate is inspired by RE2 and makes the same linear-time guarantee. For user-facing input validation at scale, prefer RE2-style engines. For data transformation on trusted input, any engine works.
