
Advanced Regular Expressions

Regular expressions (regex) are a powerful tool for describing string patterns. ES2018 added several major features to JavaScript regular expressions, including named capture groups, lookbehind assertions, the dotAll (s) flag, and Unicode property escapes.


Regular Expression Basics Review

// Creation methods
const re1 = /pattern/flags;
const re2 = new RegExp('pattern', 'flags'); // useful for dynamic patterns

// Key methods
const str = 'Hello, World! Hello, JavaScript!';

// test: returns boolean
/Hello/.test(str); // true

// match: returns array of matches (only first match without g flag)
str.match(/Hello/); // ['Hello', index: 0, ...]
str.match(/Hello/g); // ['Hello', 'Hello']

// matchAll: iterator of all matches (g flag required)
[...str.matchAll(/Hello/g)];

// search: index of first match
str.search(/World/); // 7

// replace/replaceAll
str.replace(/Hello/g, 'Hi'); // replace all
str.replaceAll('Hello', 'Hi'); // replaceAll (ES2021)

// split
'a,b,,c'.split(/,+/); // ['a', 'b', 'c']
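
The `new RegExp(...)` form above is described as useful for dynamic patterns. One caveat is that text supplied at runtime may contain regex metacharacters. The following is a minimal sketch, assuming two hypothetical helpers (`escapeRegExp` and `countOccurrences`, neither is built in), of escaping input before building a pattern:

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: count literal occurrences of `needle` in `haystack`.
function countOccurrences(haystack, needle) {
  const re = new RegExp(escapeRegExp(needle), 'g');
  return [...haystack.matchAll(re)].length;
}

console.log(escapeRegExp('1+1'));                      // 1\+1
console.log(countOccurrences('1+1=2, 1+1=2', '1+1')); // 2
```

Without the escaping step, the `+` in `'1+1'` would act as a quantifier and the pattern would match `'11'` instead of the literal text.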

Basic Patterns

// Character classes
/[abc]/ // one of a, b, or c
/[^abc]/ // excluding a, b, c
/[a-z]/ // lowercase letters
/[A-Z]/ // uppercase letters
/[0-9]/ // digits = \d
/[a-zA-Z0-9_]/ // word character = \w

// Meta characters
/./ // any character except newline
/\d/ // [0-9]
/\D/ // [^0-9]
/\w/ // [a-zA-Z0-9_]
/\W/ // [^\w]
/\s/ // whitespace (space, tab, newline)
/\S/ // non-whitespace

// Quantifiers
/a?/ // 0 or 1
/a*/ // 0 or more
/a+/ // 1 or more
/a{3}/ // exactly 3
/a{2,4}/ // 2 to 4
/a{2,}/ // 2 or more

// Anchors
/^hello/ // start
/world$/ // end
/\bhello\b/ // word boundary

// Groups
/(abc)/ // capture group
/(?:abc)/ // non-capture group
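
As an illustration of how these building blocks combine, here is a small sketch that uses anchors, a character class, quantifiers, and a non-capture group together. The `HEX_COLOR` pattern and the `isHexColor` function are hypothetical names chosen for this example:

```javascript
// Hypothetical validator for CSS hex colors such as #fff or #1a2B3c.
// ^ and $ anchor the whole string; (?:...|...) groups the two allowed lengths.
const HEX_COLOR = /^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/;

function isHexColor(value) {
  return HEX_COLOR.test(value);
}

console.log(isHexColor('#fff'));    // true
console.log(isHexColor('#1a2B3c')); // true
console.log(isHexColor('#ffff'));   // false (4 hex digits)
console.log(isHexColor('fff'));     // false (missing #)
```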

Named Capture Groups (ES2018)

// Old way: number-based capture groups
const dateStr = '2024-03-15';
const match = dateStr.match(/(\d{4})-(\d{2})-(\d{2})/);
// match[1] = '2024', match[2] = '03', match[3] = '15'
// Order-dependent, making maintenance difficult

// Named Capture Groups: (?<name>pattern)
const namedMatch = dateStr.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const { year, month, day } = namedMatch.groups;
// year = '2024', month = '03', day = '15'

// Intuitive and self-documenting
const TIME_REGEX = /(?<hour>\d{2}):(?<minute>\d{2})(?::(?<second>\d{2}))?/;
const timeMatch = '14:30:45'.match(TIME_REGEX);
const { hour, minute, second = '00' } = timeMatch.groups;

// Referencing Named Groups in replace
const formatted = '2024-03-15'.replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  '$<month>/$<day>/$<year>' // MM/DD/YYYY format
);
// '03/15/2024'

// Replace with function
const kebabToCamel = str => str.replace(
  /-(?<char>[a-z])/g,
  (_, char) => char.toUpperCase()
);
kebabToCamel('hello-world-foo'); // 'helloWorldFoo'

// matchAll combined with Named Groups
const text = 'John Doe (30), Jane Smith (25)';
const PERSON_REGEX = /(?<name>[A-Z][a-z]+ [A-Z][a-z]+) \((?<age>\d+)\)/g;

for (const match of text.matchAll(PERSON_REGEX)) {
  const { name, age } = match.groups;
  console.log(`${name}: ${age} years old`);
}
// John Doe: 30 years old
// Jane Smith: 25 years old
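
Named groups can also be referenced inside the pattern itself with `\k<name>`. The following sketch, using a hypothetical `findRepeatedWords` helper, looks for a word that is immediately repeated:

```javascript
// \k<word> must match the same text that the (?<word>...) group captured.
// The i flag makes the comparison case-insensitive.
const REPEATED_WORD = /\b(?<word>\w+)\s+\k<word>\b/gi;

// Hypothetical helper: return the first word of every doubled pair.
function findRepeatedWords(text) {
  return Array.from(text.matchAll(REPEATED_WORD), m => m.groups.word);
}

console.log(findRepeatedWords('This is is a test test case')); // ['is', 'test']
console.log(findRepeatedWords('the theory'));                  // []
```

The trailing `\b` keeps `'the theory'` from being reported, because the second `the` is only a prefix of a longer word.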

Lookbehind Assertions (ES2018)

A lookbehind assertion matches based on what precedes the current position; the preceding text itself is not included in the match.

// Existing Lookahead
// (?=pattern) : pattern follows
// (?!pattern) : pattern does not follow

// \d+(?=px) : digits followed by px
'font-size: 16px'.match(/\d+(?=px)/)?.[0]; // '16'

// ES2018: Lookbehind
// (?<=pattern) : pattern precedes (positive lookbehind)
// (?<!pattern) : pattern does not precede (negative lookbehind)

// (?<=\$)\d+\.\d+ : amount after $ (excluding the dollar sign)
const prices = 'Apple: $1.50, Banana: $0.75, Cherry: $2.00';
const dollarAmounts = [...prices.matchAll(/(?<=\$)\d+\.\d+/g)].map(m => m[0]);
// ['1.50', '0.75', '2.00']

// (?<![-\d])\d+ : digits without a negative sign
// (\d in the lookbehind stops a match from starting in the middle of '-50')
const numbers = '100 -50 200 -30 400';
const positives = [...numbers.matchAll(/(?<![-\d])\d+/g)].map(m => m[0]);
// ['100', '200', '400']

// Positive lookbehind to strip prefix
const cssValues = ['padding: 16px', 'margin: 8px', 'font-size: 14px'];
cssValues.map(s => s.match(/(?<=: )\d+/)?.[0]);
// ['16', '8', '14']

// Positive lookbehind: keep only the HTTPS URLs
const urls = ['https://example.com', 'http://test.com', 'ftp://files.com'];
const httpsOnly = urls.filter(url => /(?<=https:\/\/)\w+/.test(url));
// ['https://example.com']
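
Lookbehind and lookahead can be combined in one pattern. As a sketch (the function name `addThousandsSeparators` is an assumption for this example), the following inserts thousands separators into the integer part of a number while leaving the decimal part alone. It relies on JavaScript allowing variable-length patterns such as `\d*` inside a lookbehind:

```javascript
// (?<!\.\d*)            : not preceded by a decimal point plus digits
// \B                    : not at a word boundary (so never before the first digit)
// (?=(?:\d{3})+(?!\d))  : followed by complete groups of three digits
function addThousandsSeparators(numberText) {
  return numberText.replace(/(?<!\.\d*)\B(?=(?:\d{3})+(?!\d))/g, ',');
}

console.log(addThousandsSeparators('1234567.891234')); // '1,234,567.891234'
console.log(addThousandsSeparators('999'));            // '999'
```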

dotAll Flag s (ES2018)

// By default, . does not match newlines (\n)
const multiline = `Hello
World`;

/Hello.World/.test(multiline); // false
/Hello[\s\S]World/.test(multiline); // true (old workaround)

// s flag: . also matches newlines
/Hello.World/s.test(multiline); // true

// Real-world: extract content from HTML tags
const html = `<div>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>`;

// Difficult to match multiple lines without s flag
const match = html.match(/<div>(.*?)<\/div>/s);
// match[1]: '\n <p>...</p>\n <p>...</p>\n'

// Extract all paragraphs
const paragraphs = [...html.matchAll(/<p>(.*?)<\/p>/gs)].map(m => m[1]);
// ['First paragraph', 'Second paragraph']

// Remove comments
const code = `
/* This is
a multi-line
comment */
const x = 1; /* inline comment */
`;
code.replace(/\/\*.*?\*\//gs, '').trim();
// 'const x = 1;'
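
The s flag pairs well with named groups and a lazy quantifier when extracting a multi-line block between two markers. A minimal sketch, assuming a hypothetical front-matter format delimited by `---` lines:

```javascript
// With the s flag, .*? can cross line breaks; the lazy quantifier stops
// at the first closing '---' marker.
const FRONT_MATTER = /^---\n(?<body>.*?)\n---/s;

// Hypothetical helper: return the text between the markers, or null.
function extractFrontMatter(source) {
  return source.match(FRONT_MATTER)?.groups.body ?? null;
}

const page = '---\ntitle: Hi\ntags: a\n---\nContent';
console.log(JSON.stringify(extractFrontMatter(page))); // "title: Hi\ntags: a"
console.log(extractFrontMatter('Content'));            // null
console.log(FRONT_MATTER.dotAll);                      // true
```

The `dotAll` property reports whether a regex was created with the s flag.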

Unicode Flag u (ES2015+)

// u flag: full Unicode support
// handles emoji, special characters, characters beyond BMP

// Without u, surrogate pairs are treated as 2 characters
/^.$/.test('😀'); // false (emoji is 2 code units)
/^.$/u.test('😀'); // true

// Unicode escapes
/\u{1F600}/u.test('😀'); // true

// Unicode property escapes (\p{...}, ES2018)
// require the u flag
/\p{Emoji}/u.test('😀'); // true
/\p{Letter}/u.test('A'); // true
/\p{Decimal_Number}/u.test('5'); // true
/\p{Script=Hangul}/u.test('한'); // true

// \P: opposite (uppercase P)
/\P{Emoji}/u.test('A'); // true (not an emoji)

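// Real-world: remove emojis
// \p{Emoji} also matches digits, '#', and '*', so Extended_Pictographic is used here
const withEmojis = 'Hello 😀 World 🌍!';
const withoutEmojis = withEmojis.replace(/\p{Extended_Pictographic}/gu, '').trim();
// 'Hello  World !' (the spaces around each emoji remain)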

// Extract Korean only
const mixed = 'Hello 안녕 World 세상';
const korean = mixed.match(/\p{Script=Hangul}+/gu) ?? [];
// ['안녕', '세상']
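
Property escapes are also useful for splitting text into words regardless of script, since `\w` only covers ASCII letters, digits, and underscore. A short comparison sketch (the sample string is made up for this example, with accented letters written as escapes):

```javascript
// 'naïve café 안녕 world'
const sample = 'na\u00EFve caf\u00E9 안녕 world';

const asciiWords = sample.match(/\w+/g);       // \w stops at accented letters
const unicodeWords = sample.match(/\p{L}+/gu); // \p{L} matches any letter

console.log(asciiWords);   // ['na', 've', 'caf', 'world']
console.log(unicodeWords); // ['naïve', 'café', '안녕', 'world']
```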

Sticky Flag y

// y flag: matches only at the lastIndex position
const str = 'aababab';
const re = /a/y;

re.lastIndex = 0;
re.test(str); // true (index 0: 'a')
re.lastIndex; // 1

re.test(str); // true (index 1: 'a')
re.lastIndex; // 2

// g vs y difference
const reG = /\d+/g;
const reY = /\d+/y;
const numStr = '123 456 789';

reG.lastIndex = 4;
reG.exec(numStr)?.[0]; // '456' (first match at or after position 4)

reY.lastIndex = 4;
reY.exec(numStr)?.[0]; // '456' (exactly from position 4)

reY.lastIndex = 3;
reY.exec(numStr); // null (position 3 is a space, match fails)

// Real-world: tokenizer (useful for parser implementation)
function tokenize(input) {
  const tokens = [];
  const patterns = {
    NUMBER: /\d+/y,
    STRING: /"[^"]*"/y,
    KEYWORD: /\b(if|else|while|for)\b/y,
    IDENTIFIER: /[a-zA-Z_]\w*/y,
    WHITESPACE: /\s+/y,
  };

  let pos = 0;
  while (pos < input.length) {
    let matched = false;

    for (const [type, pattern] of Object.entries(patterns)) {
      pattern.lastIndex = pos;
      const match = pattern.exec(input);

      if (match) {
        if (type !== 'WHITESPACE') {
          tokens.push({ type, value: match[0] });
        }
        pos += match[0].length;
        matched = true;
        break;
      }
    }

    if (!matched) throw new SyntaxError(`Unexpected character: ${input[pos]}`);
  }

  return tokens;
}
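
A variation on the tokenizer above, offered as a sketch rather than a replacement: a single sticky regex with one named group per token type, so each step needs only one `exec` call. The token names and the `lex` function are assumptions for this example.

```javascript
// One alternative per token type; the y flag pins each match to lastIndex.
const TOKEN_RE = /(?<number>\d+)|(?<ident>[A-Za-z_]\w*)|(?<op>[+\-*\/=()])|(?<ws>\s+)/y;

function lex(input) {
  const tokens = [];
  TOKEN_RE.lastIndex = 0;
  while (TOKEN_RE.lastIndex < input.length) {
    const pos = TOKEN_RE.lastIndex; // a failed exec resets lastIndex, so save it
    const m = TOKEN_RE.exec(input);
    if (m === null) {
      throw new SyntaxError(`Unexpected character at ${pos}: ${input[pos]}`);
    }
    // Exactly one named group is defined for any successful match.
    const [type, value] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    if (type !== 'ws') tokens.push({ type, value });
  }
  return tokens;
}

console.log(lex('x = 42 + y1').map(t => `${t.type}:${t.value}`).join(' '));
// ident:x op:= number:42 op:+ ident:y1
```

Because every alternative consumes at least one character, `lastIndex` always advances on success and the loop terminates.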

String.prototype.matchAll() (ES2020)

// Iterator of all matches for a regex with g flag
const text = '2024-01-15, 2024-06-20, 2024-12-31';

// exec loop approach (cumbersome)
const regex1 = /(\d{4})-(\d{2})-(\d{2})/g;
const dates1 = [];
let match;
while ((match = regex1.exec(text)) !== null) {
  dates1.push(match);
}

// matchAll approach (concise)
const regex2 = /(\d{4})-(\d{2})-(\d{2})/g;
const dates2 = [...text.matchAll(regex2)];

// Named Groups with matchAll
// [^,\n] keeps the street group from swallowing the preceding line break
const ADDR_RE = /(?<street>[^,\n]+),\s*(?<city>[^,\n]+),\s*(?<state>[A-Z]{2})/g;
const addresses = `
123 Main St, Springfield, IL
456 Oak Ave, Chicago, IL
789 Pine Rd, Naperville, IL
`;

for (const { groups } of addresses.matchAll(ADDR_RE)) {
  console.log(`${groups.street}, ${groups.city}, ${groups.state}`);
}
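
Two details of `matchAll` are easy to miss: it throws a `TypeError` when the regex lacks the g flag, and each match object carries its `index`. The sketch below wraps both in a hypothetical `findAll` helper that adds the g flag when it is missing:

```javascript
// Hypothetical helper: return every match with its position,
// cloning the regex with the g flag if the caller did not set it.
function findAll(text, re) {
  const flags = re.flags.includes('g') ? re.flags : re.flags + 'g';
  const globalRe = new RegExp(re.source, flags);
  return Array.from(text.matchAll(globalRe), m => ({ value: m[0], index: m.index }));
}

console.log(findAll('cat bat rat', /[a-z]at/));
// [{ value: 'cat', index: 0 }, { value: 'bat', index: 4 }, { value: 'rat', index: 8 }]

// Calling matchAll directly with a non-global regex throws.
let threw = false;
try {
  'cat'.matchAll(/a/);
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true
```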

Real-world: URL Parsing

// URL parsing regex
const URL_REGEX = /^(?<protocol>https?):\/\/(?<host>[^/:?#]+)(?::(?<port>\d+))?(?<path>\/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

function parseUrl(url) {
  const match = url.match(URL_REGEX);
  if (!match) throw new Error('Invalid URL');

  // Optional groups that did not participate are undefined, so defaults apply
  const { protocol, host, port, path = '/', query = '', fragment = '' } = match.groups;

  return {
    protocol,
    host,
    port: port ? parseInt(port, 10) : (protocol === 'https' ? 443 : 80),
    path,
    query: parseQuery(query),
    fragment,
    href: url
  };
}

function parseQuery(queryStr) {
  if (!queryStr) return {};
  return Object.fromEntries(
    queryStr.split('&').map(pair => {
      const [key, value = ''] = pair.split('=');
      return [decodeURIComponent(key), decodeURIComponent(value)];
    })
  );
}

const parsed = parseUrl('https://api.example.com:8080/users?page=1&limit=10#results');
console.log(parsed);
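
To see why the destructuring defaults in `parseUrl` work, it may help to inspect the raw groups. In this sketch the pattern is copied from above so the snippet runs on its own; the variable names `full` and `bare` are only for illustration:

```javascript
const URL_REGEX = /^(?<protocol>https?):\/\/(?<host>[^/:?#]+)(?::(?<port>\d+))?(?<path>\/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

// Every optional part is present.
const full = 'https://api.example.com:8080/users?page=1&limit=10#results'.match(URL_REGEX).groups;
console.log(full.host, full.port, full.path); // api.example.com 8080 /users
console.log(full.query, full.fragment);       // page=1&limit=10 results

// Optional groups that did not participate in the match are undefined.
const bare = 'http://example.com'.match(URL_REGEX).groups;
console.log(bare.port, bare.path);            // undefined undefined

// Destructuring defaults only replace undefined, which is exactly this case.
const { path = '/', query = '' } = bare;
console.log(path, JSON.stringify(query));     // / ""
```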

Real-world: Email Validation

// Simplified pattern, close to the one the HTML spec uses for <input type="email">;
// full RFC 5322 validation is far more complex
const EMAIL_REGEX = /^(?<local>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)@(?<domain>[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)$/;

function validateEmail(email) {
  const match = email.match(EMAIL_REGEX);
  if (!match) return { valid: false, reason: 'Format error' };

  const { local, domain } = match.groups;
  if (local.length > 64) return { valid: false, reason: 'Local part too long' };
  if (domain.length > 255) return { valid: false, reason: 'Domain too long' };
  if (!domain.includes('.')) return { valid: false, reason: 'No TLD' };

  return { valid: true, local, domain };
}

validateEmail('alice@example.com'); // { valid: true, ... }
validateEmail('not-an-email'); // { valid: false, reason: 'Format error' }

Real-world: Date Extraction

// Parse various date formats
const DATE_PATTERNS = {
  ISO: /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g,
  US: /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/g,
  WRITTEN: /(?<year>\d{4})\s*year\s*(?<month>\d{1,2})\s*month\s*(?<day>\d{1,2})\s*day/g,
};

function extractDates(text) {
  const results = [];

  for (const [format, regex] of Object.entries(DATE_PATTERNS)) {
    for (const match of text.matchAll(regex)) {
      const { year, month, day } = match.groups;
      results.push({
        format,
        original: match[0],
        date: new Date(Number(year), Number(month) - 1, Number(day)),
        index: match.index
      });
    }
  }

  return results.sort((a, b) => a.index - b.index);
}

const sampleText = 'Meeting starts on 2024-03-15, continues until 3/20/2024. Final report due 2024 year 4 month 1 day.';
const dates = extractDates(sampleText);
dates.forEach(d => console.log(`${d.format}: ${d.original} -> ${d.date.toLocaleDateString()}`));
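
A regex can confirm the shape of a date but not that the date exists: `new Date(2024, 1, 31)` silently rolls over into March. The following sketch, with a hypothetical `parseIsoDate` helper, rejects such values by checking that the constructed date still has the same components:

```javascript
const ISO_DATE = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/;

// Hypothetical helper: return a Date for a real calendar day, otherwise null.
function parseIsoDate(text) {
  const match = text.match(ISO_DATE);
  if (!match) return null;

  const year = Number(match.groups.year);
  const month = Number(match.groups.month);
  const day = Number(match.groups.day);

  const date = new Date(year, month - 1, day);
  // If the Date constructor rolled over, at least one component differs.
  const unchanged =
    date.getFullYear() === year &&
    date.getMonth() === month - 1 &&
    date.getDate() === day;
  return unchanged ? date : null;
}

console.log(parseIsoDate('2024-02-29') !== null); // true (leap day)
console.log(parseIsoDate('2024-02-31'));          // null
console.log(parseIsoDate('2024-13-01'));          // null
```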

Expert Tips

1. Caching compiled regular expressions

// Pre-compile regular expressions for repeated use
const PATTERNS = {
  EMAIL: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
  PHONE: /^\+?[\d\s-]{10,}$/,
  URL: /^https?:\/\/.+/
};

// Avoid rebuilding the same RegExp on every call or loop iteration
function validateField(value, type) {
  return PATTERNS[type]?.test(value) ?? false;
}
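
One caveat when caching: a regex with the g or y flag keeps state in `lastIndex`, so reusing it with `test()` can give alternating results. The cached patterns above avoid this because they have no such flags. A small sketch of the pitfall (variable names are illustrative):

```javascript
const DIGITS_GLOBAL = /\d+/g;
const input = 'abc 123';

const first = DIGITS_GLOBAL.test(input);        // true, match ends at index 7
const indexAfterFirst = DIGITS_GLOBAL.lastIndex; // 7
const second = DIGITS_GLOBAL.test(input);       // false, search resumed at index 7
const indexAfterSecond = DIGITS_GLOBAL.lastIndex; // 0, reset after the failed match

// Without the g flag the regex is stateless and safe to share.
const DIGITS = /\d+/;
const stable = [DIGITS.test(input), DIGITS.test(input)]; // [true, true]

console.log(first, indexAfterFirst, second, indexAfterSecond, stable);
```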

2. Commenting complex regular expressions

// JavaScript has no verbose (x) flag, so build the pattern from commented parts
const PASSWORD_REGEX = new RegExp([
  '^',
  '(?=.*[a-z])',      // contains lowercase
  '(?=.*[A-Z])',      // contains uppercase
  '(?=.*\\d)',        // contains digit
  '(?=.*[!@#$%^&*])', // contains special character
  '.{8,}',            // at least 8 characters
  '$'
].join(''));

PASSWORD_REGEX.test('Passw0rd!'); // true
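
Another way to keep a long pattern readable, shown here as a sketch, is to define small regex literals and join their `source` strings. This avoids the double backslashes that string parts require. The names `YEAR`, `MONTH`, `DAY`, and `ISO_DATE` are assumptions for this example:

```javascript
// Each part is a normal regex literal, so no extra escaping is needed.
const YEAR = /(?<year>\d{4})/;
const MONTH = /(?<month>0[1-9]|1[0-2])/;   // 01 to 12
const DAY = /(?<day>0[1-9]|[12]\d|3[01])/; // 01 to 31

// .source returns the pattern text without the surrounding slashes.
const ISO_DATE = new RegExp(`^${YEAR.source}-${MONTH.source}-${DAY.source}$`);

const { year, month, day } = '2024-03-15'.match(ISO_DATE).groups;
console.log(year, month, day);            // 2024 03 15
console.log(ISO_DATE.test('2024-13-01')); // false (month out of range)
```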

3. Regular expression performance optimization

// Prevent catastrophic backtracking
// Bad example: nested quantifiers take exponential time on a failing match
/^(a+)+$/.test('a'.repeat(30) + 'b'); // very slow!

// Good example: same matches without the nested quantifier
// (JavaScript has no atomic groups, so the pattern is simplified instead)
/^a+$/.test('a'.repeat(30) + 'b'); // fast

// Anchors restrict where a match may begin or end
/^prefix/.test(str); // matches only at the start
/suffix$/.test(str); // matches only at the end
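
The same idea applies to less obvious patterns. In the sketch below, `RISKY` and `SAFER` are hypothetical names for two patterns that accept the same strings (words separated by single whitespace characters), but only the first has a nested quantifier whose inner part can split a run of word characters in many ways:

```javascript
const RISKY = /^(\w+\s?)+$/;          // nested quantifier, ambiguous on long words
const SAFER = /^\w+(?:\s\w+)*\s?$/;   // each character can be consumed one way only

// On short inputs both patterns agree.
const samples = ['foo bar', 'foo bar ', 'foobar', 'foo  bar', ' foo', ''];
const agreement = samples.map(s => RISKY.test(s) === SAFER.test(s));
console.log(agreement.every(Boolean)); // true

// A long non-matching input is rejected quickly by the unambiguous pattern.
// (RISKY is deliberately not run on this input.)
const longInput = 'a'.repeat(5000) + '!';
console.log(SAFER.test(longInput)); // false
```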