Advanced Regular Expressions
Regular expressions (regex) are a powerful tool for describing string patterns. ES2018 added several major features, including named capture groups, lookbehind assertions, the dotAll (s) flag, and Unicode property escapes.
Regular Expression Basics Review
// Creation methods
const re1 = /pattern/flags;
const re2 = new RegExp('pattern', 'flags'); // useful for dynamic patterns
// Key methods
const str = 'Hello, World! Hello, JavaScript!';
// test: returns boolean
/Hello/.test(str); // true
// match: returns array of matches (only first match without g flag)
str.match(/Hello/); // ['Hello', index: 0, ...]
str.match(/Hello/g); // ['Hello', 'Hello']
// matchAll: iterator of all matches (g flag required)
[...str.matchAll(/Hello/g)];
// search: index of first match
str.search(/World/); // 7
// replace/replaceAll
str.replace(/Hello/g, 'Hi'); // replace all
str.replaceAll('Hello', 'Hi'); // replaceAll (ES2021)
// split
'a,b,,c'.split(/,+/); // ['a', 'b', 'c']
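The `new RegExp(...)` form above is mostly used when part of the pattern comes from a variable. A minimal sketch of that case, assuming two hypothetical helpers (`escapeRegExp` and `countOccurrences`, neither from the original) that escape metacharacters before embedding the input:

```javascript
// Hypothetical helper: escape every character that is special inside a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: count literal occurrences of `needle` in `haystack`.
function countOccurrences(haystack, needle) {
  const re = new RegExp(escapeRegExp(needle), 'g');
  return (haystack.match(re) ?? []).length;
}

countOccurrences('1+1=2, 1+1=2', '1+1'); // 2
countOccurrences('a.b axb', 'a.b');      // 1 (the dot is matched literally)
```

Without the escaping step, `+` and `.` in the input would be interpreted as regex syntax rather than as literal characters.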
Basic Patterns
// Character classes
/[abc]/ // one of a, b, or c
/[^abc]/ // excluding a, b, c
/[a-z]/ // lowercase letters
/[A-Z]/ // uppercase letters
/[0-9]/ // digits = \d
/[a-zA-Z0-9_]/ // word character = \w
// Meta characters
/./ // any character except newline
/\d/ // [0-9]
/\D/ // [^0-9]
/\w/ // [a-zA-Z0-9_]
/\W/ // [^\w]
/\s/ // whitespace (space, tab, newline)
/\S/ // non-whitespace
// Quantifiers
/a?/ // 0 or 1
/a*/ // 0 or more
/a+/ // 1 or more
/a{3}/ // exactly 3
/a{2,4}/ // 2 to 4
/a{2,}/ // 2 or more
// Anchors
/^hello/ // start
/world$/ // end
/\bhello\b/ // word boundary
// Groups
/(abc)/ // capture group
/(?:abc)/ // non-capture group
Named Capture Groups (ES2018)
// Old way: number-based capture groups
const dateStr = '2024-03-15';
const match = dateStr.match(/(\d{4})-(\d{2})-(\d{2})/);
// match[1] = '2024', match[2] = '03', match[3] = '15'
// Order-dependent, making maintenance difficult
// Named Capture Groups: (?<name>pattern)
const namedMatch = dateStr.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const { year, month, day } = namedMatch.groups;
// year = '2024', month = '03', day = '15'
// Intuitive and self-documenting
const TIME_REGEX = /(?<hour>\d{2}):(?<minute>\d{2})(?::(?<second>\d{2}))?/;
const timeMatch = '14:30:45'.match(TIME_REGEX);
const { hour, minute, second = '00' } = timeMatch.groups;
// Referencing Named Groups in replace
const formatted = '2024-03-15'.replace(
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
'$<month>/$<day>/$<year>' // MM/DD/YYYY format
);
// '03/15/2024'
// Replace with function
const kebabToCamel = str => str.replace(
/-(?<char>[a-z])/g,
(_, char) => char.toUpperCase()
);
kebabToCamel('hello-world-foo'); // 'helloWorldFoo'
// matchAll combined with Named Groups
const text = 'John Doe (30), Jane Smith (25)';
const PERSON_REGEX = /(?<name>[A-Z][a-z]+ [A-Z][a-z]+) \((?<age>\d+)\)/g;
for (const match of text.matchAll(PERSON_REGEX)) {
const { name, age } = match.groups;
console.log(`${name}: ${age} years old`);
}
// John Doe: 30 years old
// Jane Smith: 25 years old
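Named groups can also be referenced inside the pattern itself with `\k<name>`, and a replacer function receives the `groups` object as its last argument. A short sketch of both; the sample strings and variable names are illustrative only:

```javascript
// \k<quote> must match the same text that the group "quote" captured,
// so a string opened with " can only be closed with ".
const QUOTED = /(?<quote>["'])(?<body>.*?)\k<quote>/g;
const sample = `say "it's fine" or 'a "b" c'`;
const bodies = [...sample.matchAll(QUOTED)].map(m => m.groups.body);
// ["it's fine", 'a "b" c']

// In a replacer function, the groups object is the last argument.
const swapped = 'Doe, John'.replace(
  /(?<last>\w+), (?<first>\w+)/,
  (...args) => {
    const { first, last } = args[args.length - 1];
    return `${first} ${last}`;
  }
);
// 'John Doe'
```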
Lookbehind Assertions (ES2018)
A lookbehind matches a pattern based on what precedes it; the preceding text itself is not included in the match.
// Existing Lookahead
// (?=pattern) : pattern follows
// (?!pattern) : pattern does not follow
// \d+(?=px) : digits immediately followed by px
'font-size: 16px'.match(/\d+(?=px)/)?.[0]; // '16'
// ES2018: Lookbehind
// (?<=pattern) : pattern precedes (positive lookbehind)
// (?<!pattern) : pattern does not precede (negative lookbehind)
// (?<=\$)\d+ : digits after $ (the dollar sign must be escaped and is excluded from the match)
const prices = 'Apple: $1.50, Banana: $0.75, Cherry: $2.00';
const dollarAmounts = [...prices.matchAll(/(?<=\$)\d+\.\d+/g)].map(m => m[0]);
// ['1.50', '0.75', '2.00']
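// (?<![-\d])\d+ : digits not preceded by a minus sign
// (\d in the lookbehind stops a match from restarting inside a negative number, e.g. at the '0' of '-50')
const numbers = '100 -50 200 -30 400';
const positives = [...numbers.matchAll(/(?<![-\d])\d+/g)].map(m => m[0]);
// ['100', '200', '400']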
// Positive lookbehind to strip prefix
const cssValues = ['padding: 16px', 'margin: 8px', 'font-size: 14px'];
cssValues.map(s => s.match(/(?<=: )\d+/)?.[0]);
// ['16', '8', '14']
// Positive lookbehind: keep only URLs whose host is preceded by https://
const urls = ['https://example.com', 'http://test.com', 'ftp://files.com'];
const httpsOnly = urls.filter(url => /(?<=https:\/\/)\w+/.test(url));
// ['https://example.com']
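Lookarounds are zero-width, so they can also mark positions rather than extract text. A minimal sketch, assuming a hypothetical helper `groupThousands` that only receives plain integer digit strings (decimals are out of scope here):

```javascript
// Insert a comma at every position that is followed by one or more groups of
// exactly three digits up to the end of the digit run.
// \B prevents a comma at the very start of the number.
function groupThousands(digits) {
  return digits.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
}

groupThousands('1234567'); // '1,234,567'
groupThousands('999');     // '999'

// Lookbehind and lookahead together: digits inside parentheses, without the parentheses.
const inParens = '(42) [7] (108)'.match(/(?<=\()\d+(?=\))/g);
// ['42', '108']
```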
dotAll Flag s (ES2018)
// By default, . does not match newlines (\n)
const multiline = `Hello
World`;
/Hello.World/.test(multiline); // false
/Hello[\s\S]World/.test(multiline); // true (old workaround)
// s flag: . also matches newlines
/Hello.World/s.test(multiline); // true
// Real-world: extract content from HTML tags
const html = `<div>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>`;
// Difficult to match multiple lines without s flag
const match = html.match(/<div>(.*?)<\/div>/s);
// match[1]: '\n <p>...</p>\n <p>...</p>\n'
// Extract all paragraphs
const paragraphs = [...html.matchAll(/<p>(.*?)<\/p>/gs)].map(m => m[1]);
// ['First paragraph', 'Second paragraph']
// Remove comments
const code = `
/* This is
a multi-line
comment */
const x = 1; /* inline comment */
`;
code.replace(/\/\*.*?\*\//gs, '').trim();
// 'const x = 1;'
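The `s` flag is easy to confuse with the `m` (multiline) flag. A short sketch of the difference, using an illustrative two-line string: `m` changes what the anchors match, while `s` changes what the dot matches.

```javascript
const lines = 'first\nsecond';

// m changes the anchors: ^ and $ also match at line breaks.
const anchoredPlain = /^second$/.test(lines);  // false
const anchoredMulti = /^second$/m.test(lines); // true

// s changes the dot: . also matches line terminators.
const dotPlain = /first.second/.test(lines);   // false
const dotAll = /first.second/s.test(lines);    // true

// Both flags can be combined and inspected on the RegExp object.
const both = /^first.second$/ms;
both.dotAll;    // true
both.multiline; // true
both.flags;     // 'ms'
```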
Unicode Flag u (ES2015+)
// u flag: full Unicode support
// handles emoji, special characters, characters beyond BMP
// Without u, surrogate pairs are treated as 2 characters
/^.$/.test('😀'); // false (emoji is 2 code units)
/^.$/u.test('😀'); // true
// Unicode escapes
/\u{1F600}/u.test('😀'); // true
// Unicode property escapes (\p{...}), added in ES2018
// require the u flag
/\p{Emoji}/u.test('😀'); // true
/\p{Letter}/u.test('A'); // true
/\p{Decimal_Number}/u.test('5'); // true
/\p{Script=Hangul}/u.test('한'); // true (scripts need the Script= prefix; \p{Korean} is a SyntaxError)
// \P: opposite (uppercase P)
/\P{Emoji}/u.test('A'); // true (not an emoji)
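// Real-world: remove emojis
// \p{Emoji} also matches digits, '#', and '*', so \p{Extended_Pictographic} is the safer choice here
const withEmojis = 'Hello 😀 World 🌍!';
const withoutEmojis = withEmojis.replace(/\p{Extended_Pictographic}/gu, '').trim();
// 'Hello  World !' (the spaces around each removed emoji remain)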
// Extract Korean only
const mixed = 'Hello 안녕 World 세상';
const korean = mixed.match(/\p{Script=Hangul}+/gu) ?? [];
// ['안녕', '세상']
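Property escapes also help when `\w` is too narrow, since `\w` covers only ASCII letters, digits, and the underscore. A brief sketch with an illustrative mixed-script string:

```javascript
const greeting = 'Привет мир, hello 世界';

// \w+ finds only the ASCII word.
const asciiWords = greeting.match(/\w+/g);
// ['hello']

// \p{L} (Letter) and \p{N} (Number) cover every script; they require the u flag.
const allWords = greeting.match(/[\p{L}\p{N}_]+/gu);
// ['Привет', 'мир', 'hello', '世界']

// With the u flag the dot advances by code point rather than by UTF-16 code unit.
const codeUnits = '😀'.length;                      // 2
const codePoints = [...'😀'.matchAll(/./gu)].length; // 1
```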
Sticky Flag y
// y flag: matches only at the lastIndex position
const str = 'aababab';
const re = /a/y;
re.lastIndex = 0;
re.test(str); // true (index 0: 'a')
re.lastIndex; // 1
re.test(str); // true (index 1: 'a')
re.lastIndex; // 2
// g vs y difference
const reG = /\d+/g;
const reY = /\d+/y;
const numStr = '123 456 789';
reG.lastIndex = 4;
reG.exec(numStr)?.[0]; // '456' (anywhere after position 4)
reY.lastIndex = 4;
reY.exec(numStr)?.[0]; // '456' (exactly from position 4)
reY.lastIndex = 3;
reY.exec(numStr); // null (position 3 is a space, match fails)
// Real-world: tokenizer (useful for parser implementation)
function tokenize(input) {
const tokens = [];
const patterns = {
NUMBER: /\d+/y,
STRING: /"[^"]*"/y,
KEYWORD: /\b(if|else|while|for)\b/y,
IDENTIFIER: /[a-zA-Z_]\w*/y,
WHITESPACE: /\s+/y,
};
let pos = 0;
while (pos < input.length) {
let matched = false;
for (const [type, pattern] of Object.entries(patterns)) {
pattern.lastIndex = pos;
const match = pattern.exec(input);
if (match) {
if (type !== 'WHITESPACE') {
tokens.push({ type, value: match[0] });
}
pos += match[0].length;
matched = true;
break;
}
}
if (!matched) throw new SyntaxError(`Unexpected character: ${input[pos]}`);
}
return tokens;
}
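An alternative to keeping one sticky regex per token type is a single sticky regex whose alternatives are named groups. A minimal sketch of that design; `TOKEN_RE` and `lex` are hypothetical names, and the token set (numbers, identifiers, a few operators) is chosen only for illustration:

```javascript
// One sticky regex; exactly one named group is defined per successful match.
const TOKEN_RE = /(?<number>\d+(?:\.\d+)?)|(?<ident>[A-Za-z_]\w*)|(?<op>[-+*\/()=])|(?<ws>\s+)/y;

function lex(input) {
  const tokens = [];
  TOKEN_RE.lastIndex = 0;
  while (TOKEN_RE.lastIndex < input.length) {
    // Remember the position first: a failed sticky exec resets lastIndex to 0.
    const start = TOKEN_RE.lastIndex;
    const m = TOKEN_RE.exec(input);
    if (!m) {
      throw new SyntaxError(`Unexpected character at ${start}: ${input[start]}`);
    }
    // Find the single group that participated in the match.
    const [type, value] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    if (type !== 'ws') tokens.push({ type, value });
  }
  return tokens;
}

lex('x = 3.14 * (y + 2)');
// [{ type: 'ident', value: 'x' }, { type: 'op', value: '=' }, { type: 'number', value: '3.14' }, ...]
```

Because a successful `exec` advances `lastIndex` itself, no separate position counter is needed; the trade-off is that token priority is now encoded in the order of the alternatives.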
String.matchAll()
// Iterator of all matches for a regex with g flag
const text = '2024-01-15, 2024-06-20, 2024-12-31';
// exec loop approach (cumbersome)
const regex1 = /(\d{4})-(\d{2})-(\d{2})/g;
const dates1 = [];
let match;
while ((match = regex1.exec(text)) !== null) {
dates1.push(match);
}
// matchAll approach (concise)
const regex2 = /(\d{4})-(\d{2})-(\d{2})/g;
const dates2 = [...text.matchAll(regex2)];
// Named Groups with matchAll
// [^,\n] keeps each field on one line, so the street does not pick up the preceding newline
const ADDR_RE = /(?<street>[^,\n]+),\s*(?<city>[^,\n]+),\s*(?<state>[A-Z]{2})/g;
const addresses = `
123 Main St, Springfield, IL
456 Oak Ave, Chicago, IL
789 Pine Rd, Naperville, IL
`;
for (const { groups } of addresses.matchAll(ADDR_RE)) {
console.log(`${groups.street} → ${groups.city}, ${groups.state}`);
}
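Two details of `matchAll` differ from the `exec` loop: it works on an internal copy of the regex, so the original's `lastIndex` is not modified, and it throws a `TypeError` when given a regex without the `g` flag. A short sketch with illustrative names:

```javascript
const DIGITS = /\d+/g;
const found = [...'10 20 30'.matchAll(DIGITS)].map(m => [m[0], m.index]);
// [['10', 0], ['20', 3], ['30', 6]]
DIGITS.lastIndex; // 0: the original regex is left untouched

// A regex without the g flag is rejected up front.
let rejected = false;
try {
  'a1b2'.matchAll(/\d/);
} catch (err) {
  rejected = err instanceof TypeError;
}
// rejected === true
```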
Real-world: URL Parsing
// URL parsing regex
const URL_REGEX = /^(?<protocol>https?):\/\/(?<host>[^/:?#]+)(?::(?<port>\d+))?(?<path>\/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;
function parseUrl(url) {
const match = url.match(URL_REGEX);
if (!match) throw new Error('Invalid URL');
const { protocol, host, port, path = '/', query = '', fragment = '' } = match.groups;
return {
protocol,
host,
port: port ? parseInt(port, 10) : (protocol === 'https' ? 443 : 80),
path,
query: parseQuery(query),
fragment,
href: url
};
}
function parseQuery(queryStr) {
if (!queryStr) return {};
return Object.fromEntries(
queryStr.split('&').map(pair => {
const [key, value = ''] = pair.split('=');
return [decodeURIComponent(key), decodeURIComponent(value)];
})
);
}
const parsed = parseUrl('https://api.example.com:8080/users?page=1&limit=10#results');
console.log(parsed);
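For comparison, the built-in `URL` class (available in browsers and Node.js) parses the same input without a hand-written regex. A brief sketch; `tryParseUrl` is a hypothetical wrapper, not part of the original:

```javascript
const u = new URL('https://api.example.com:8080/users?page=1&limit=10#results');
u.protocol;                 // 'https:' (includes the colon)
u.hostname;                 // 'api.example.com'
u.port;                     // '8080' (a string; empty when the default port is used)
u.pathname;                 // '/users'
u.searchParams.get('page'); // '1'
u.hash;                     // '#results' (includes the #)
const queryObject = Object.fromEntries(u.searchParams);
// { page: '1', limit: '10' }

// Hypothetical wrapper: new URL() throws on invalid input, so return null instead.
function tryParseUrl(input) {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}
tryParseUrl('not a url'); // null
```

The regex version remains useful as a named-groups exercise, but the field formats differ slightly (for example `'https:'` versus `'https'`), so the two are not drop-in replacements for each other.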
Real-world: Email Validation
// Simplified check, essentially the pattern the HTML specification uses for <input type=email>; full RFC 5322 validation is far more complex
const EMAIL_REGEX = /^(?<local>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)@(?<domain>[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)$/;
function validateEmail(email) {
const match = email.match(EMAIL_REGEX);
if (!match) return { valid: false, reason: 'Format error' };
const { local, domain } = match.groups;
if (local.length > 64) return { valid: false, reason: 'Local part too long' };
if (domain.length > 255) return { valid: false, reason: 'Domain too long' };
if (!domain.includes('.')) return { valid: false, reason: 'No TLD' };
return { valid: true, local, domain };
}
validateEmail('alice@example.com'); // { valid: true, ... }
validateEmail('not-an-email'); // { valid: false, reason: 'Format error' }
Real-world: Date Extraction
// Parse various date formats
const DATE_PATTERNS = {
ISO: /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g,
US: /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/g,
WRITTEN: /(?<year>\d{4})\s*year\s*(?<month>\d{1,2})\s*month\s*(?<day>\d{1,2})\s*day/g,
};
function extractDates(text) {
const results = [];
for (const [format, regex] of Object.entries(DATE_PATTERNS)) {
for (const match of text.matchAll(regex)) {
const { year, month, day } = match.groups;
results.push({
format,
original: match[0],
date: new Date(Number(year), Number(month) - 1, Number(day)),
index: match.index
});
}
}
return results.sort((a, b) => a.index - b.index);
}
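// Named sampleText rather than document: a top-level `const document` clashes with the browser global
const sampleText = 'Meeting starts on 2024-03-15, continues until 3/20/2024. Final report due 2024 year 4 month 1 day.';
const dates = extractDates(sampleText);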
dates.forEach(d => console.log(`${d.format}: ${d.original} → ${d.date.toLocaleDateString()}`));
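The patterns above accept any digits in the right shape, and `new Date(...)` silently rolls impossible dates over (February 30 becomes March 1). One way to catch this, sketched with a hypothetical helper `isRealDate` and assuming four-digit years, is to build the date and check that its parts round-trip:

```javascript
// Hypothetical helper: true only if year/month/day name a real calendar date.
// Assumes a four-digit year (years 0-99 are remapped by the Date constructor).
function isRealDate(year, month, day) {
  const d = new Date(year, month - 1, day);
  return (
    d.getFullYear() === year &&
    d.getMonth() === month - 1 &&
    d.getDate() === day
  );
}

isRealDate(2024, 2, 29); // true  (leap year)
isRealDate(2023, 2, 29); // false (rolls over to March 1)
isRealDate(2024, 13, 1); // false (rolls over to January of the next year)
```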
Expert Tips
1. Caching compiled regular expressions
// Pre-compile regular expressions for repeated use
const PATTERNS = {
EMAIL: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
PHONE: /^\+?[\d\s-]{10,}$/,
URL: /^https?:\/\/.+/
};
// Avoid rebuilding the same pattern with new RegExp(...) on every loop iteration or call
function validateField(value, type) {
return PATTERNS[type]?.test(value) ?? false;
}
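One caveat when caching: a regex with the `g` (or `y`) flag is stateful, because `test` and `exec` read and update `lastIndex`. A short sketch of the effect, with illustrative names:

```javascript
// A shared regex with the g flag remembers where the last match ended.
const HAS_DIGIT_G = /\d/g;
const first = HAS_DIGIT_G.test('a1');  // true,  lastIndex is now 2
const second = HAS_DIGIT_G.test('a1'); // false, search started at index 2; lastIndex resets to 0
const third = HAS_DIGIT_G.test('a1');  // true again

// Without g, test() always starts from the beginning.
const HAS_DIGIT = /\d/;
const stable = [HAS_DIGIT.test('a1'), HAS_DIGIT.test('a1')]; // [true, true]
```

The `PATTERNS` entries above are safe in this respect because none of them carries the `g` flag.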
2. Commenting complex regular expressions
// Improve readability without verbose mode in JS
const PASSWORD_REGEX = new RegExp([
'^',
'(?=.*[a-z])', // contains lowercase
'(?=.*[A-Z])', // contains uppercase
'(?=.*\\d)', // contains digit
'(?=.*[!@#$%^&*])', // contains special character
'.{8,}', // at least 8 characters
'$'
].join(''));
PASSWORD_REGEX.test('Passw0rd!'); // true
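The doubled backslash in `'(?=.*\\d)'` is needed because the pieces are ordinary strings. A possible variation, sketched here with the hypothetical name `STRONG_PASSWORD`, uses `String.raw` so that each piece reads exactly like regex source:

```javascript
const raw = String.raw;

const STRONG_PASSWORD = new RegExp([
  raw`^`,
  raw`(?=.*[a-z])`,      // contains lowercase
  raw`(?=.*[A-Z])`,      // contains uppercase
  raw`(?=.*\d)`,         // contains digit (single backslash, as in a regex literal)
  raw`(?=.*[!@#$%^&*])`, // contains special character
  raw`.{8,}`,            // at least 8 characters
  raw`$`,
].join(''));

STRONG_PASSWORD.test('Passw0rd!'); // true
STRONG_PASSWORD.test('password');  // false
```

One thing to watch with this approach: `${` still starts an interpolation inside a template literal, so a piece that needs a literal `$` directly followed by `{` has to be written another way.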
3. Regular expression performance optimization
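// Prevent catastrophic backtracking
// Bad example: nested quantifiers give exponential complexity on a failing input
/^(a+)+$/.test('aaaaaab'); // work roughly doubles with each extra 'a'; long runs can hang
// Good example: same set of matches, no nested quantifier (JavaScript has no atomic groups)
/^a+$/.test('aaaaaab'); // fast, fails in linear time
// Anchors restrict where a match may start or end
/^prefix/.test(str); // can only match at the start of the input
/suffix$/.test(str); // must end at the end of the input (the engine still scans forward; str.endsWith('suffix') is cheaper for literals)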
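When a nested quantifier cannot simply be flattened, a known workaround is to emulate an atomic group with a lookahead plus a backreference, relying on the fact that the engine does not backtrack into a lookahead once it has succeeded. A minimal sketch of the technique; `ATOMIC_LIKE` is an illustrative name:

```javascript
// (?=(a+))\1 captures the longest run of 'a' inside the lookahead, then
// consumes exactly that text. The run cannot be shortened afterwards,
// which is the behaviour an atomic group would provide.
const ATOMIC_LIKE = /^(?:(?=(a+))\1)+$/;

const matchesRun = ATOMIC_LIKE.test('aaaa');               // true
const failsFast = ATOMIC_LIKE.test('a'.repeat(50) + 'b');  // false, without exponential backtracking
```

By contrast, `/^(a+)+$/` on the same 50-character failing input would explore an exponential number of ways to split the run.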