Advanced Regular Expressions

Regular expressions (regex) are a powerful tool for describing string patterns. ES2018 added several major features, including named capture groups, lookbehind assertions, the dotAll (s) flag, and Unicode property escapes.

Regular Expression Basics Review

// Creation methods
const re1 = /pattern/flags;
const re2 = new RegExp('pattern', 'flags');  // useful for dynamic patterns

// Key methods
const str = 'Hello, World! Hello, JavaScript!';

// test: returns boolean
/Hello/.test(str);  // true

// match: returns array of matches (only first match without g flag)
str.match(/Hello/);   // ['Hello', index: 0, ...]
str.match(/Hello/g);  // ['Hello', 'Hello']

// matchAll: iterator of all matches (g flag required)
[...str.matchAll(/Hello/g)];

// search: index of first match
str.search(/World/);  // 7

// replace/replaceAll
str.replace(/Hello/g, 'Hi');   // replace all
str.replaceAll('Hello', 'Hi'); // replaceAll (ES2021)

// split
'a,b,,c'.split(/,+/);  // ['a', 'b', 'c']
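
When a pattern is built dynamically with new RegExp, any user-supplied text has to be escaped first, otherwise characters such as . or $ are read as regex syntax. The following is a minimal sketch; escapeRegExp and countOccurrences are hypothetical helper names, not built-ins.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: count occurrences of a literal search term.
function countOccurrences(haystack, term) {
  const re = new RegExp(escapeRegExp(term), 'g');
  return [...haystack.matchAll(re)].length;
}

const price = 'Total: $5.00 (was $5.00, not $5x00)';
console.log(escapeRegExp('$5.00'));            // \$5\.00
console.log(countOccurrences(price, '$5.00')); // 2
```

Without the escaping step, the unescaped $ would act as an end-of-input anchor and the search term would never match.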

Basic Patterns

// Character classes
/[abc]/     // one of a, b, or c
/[^abc]/    // excluding a, b, c
/[a-z]/     // lowercase letters
/[A-Z]/     // uppercase letters
/[0-9]/     // digits = \d
/[a-zA-Z0-9_]/ // word character = \w

// Meta characters
/./         // any character except newline
/\d/        // [0-9]
/\D/        // [^0-9]
/\w/        // [a-zA-Z0-9_]
/\W/        // [^\w]
/\s/        // whitespace (space, tab, newline)
/\S/        // non-whitespace

// Quantifiers
/a?/    // 0 or 1
/a*/    // 0 or more
/a+/    // 1 or more
/a{3}/  // exactly 3
/a{2,4}/ // 2 to 4
/a{2,}/  // 2 or more

// Anchors
/^hello/  // start
/world$/  // end
/\bhello\b/  // word boundary

// Groups
/(abc)/    // capture group
/(?:abc)/  // non-capture group

Named Capture Groups (ES2018)

// Old way: number-based capture groups
const dateStr = '2024-03-15';
const match = dateStr.match(/(\d{4})-(\d{2})-(\d{2})/);
// match[1] = '2024', match[2] = '03', match[3] = '15'
// Order-dependent, making maintenance difficult

// Named Capture Groups: (?<name>pattern)
const namedMatch = dateStr.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const { year, month, day } = namedMatch.groups;
// year = '2024', month = '03', day = '15'

// Intuitive and self-documenting
const TIME_REGEX = /(?<hour>\d{2}):(?<minute>\d{2})(?::(?<second>\d{2}))?/;
const timeMatch = '14:30:45'.match(TIME_REGEX);
const { hour, minute, second = '00' } = timeMatch.groups;

// Referencing Named Groups in replace
const formatted = '2024-03-15'.replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  '$<month>/$<day>/$<year>'  // MM/DD/YYYY format
);
// '03/15/2024'

// Replace with function
const kebabToCamel = str => str.replace(
  /-(?<char>[a-z])/g,
  (_, char) => char.toUpperCase()
);
kebabToCamel('hello-world-foo');  // 'helloWorldFoo'

// matchAll combined with Named Groups
const text = 'John Doe (30), Jane Smith (25)';
const PERSON_REGEX = /(?<name>[A-Z][a-z]+ [A-Z][a-z]+) \((?<age>\d+)\)/g;

for (const match of text.matchAll(PERSON_REGEX)) {
  const { name, age } = match.groups;
  console.log(`${name}: ${age} years old`);
}
// John Doe: 30 years old
// Jane Smith: 25 years old
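
Two further uses of named groups are sketched here, as an illustration rather than a complete reference: a replacer function receives the groups object as its last argument, and \k<name> refers back to a named group inside the pattern itself. The names toUsDates and REPEATED_WORD are hypothetical.

```javascript
const DATE_RE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;

function toUsDates(text) {
  // The replacer is called with (match, ...captures, offset, input, groups)
  return text.replace(DATE_RE, (...args) => {
    const { year, month, day } = args[args.length - 1];
    return `${month}/${day}/${year}`;
  });
}

// \k<word> must match the same text that the group "word" captured
const REPEATED_WORD = /\b(?<word>\w+)\s+\k<word>\b/;

console.log(toUsDates('From 2024-03-15 to 2024-04-01')); // 'From 03/15/2024 to 04/01/2024'
console.log(REPEATED_WORD.test('this is is a typo'));     // true
console.log(REPEATED_WORD.test('this is a test'));        // false
```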

Lookbehind Assertions (ES2018)

Lookbehind assertions match a pattern based on what precedes it; the preceding text itself is not included in the match.

// Existing Lookahead
// (?=pattern) : pattern follows
// (?!pattern) : pattern does not follow

// \d+(?=px) : digits followed by px
'font-size: 16px'.match(/\d+(?=px)/)?.[0];  // '16'

// ES2018: Lookbehind
// (?<=pattern) : pattern precedes (positive lookbehind)
// (?<!pattern) : pattern does not precede (negative lookbehind)

// (?<=\$)\d+\.\d+ : decimal number after $ (excluding the dollar sign)
const prices = 'Apple: $1.50, Banana: $0.75, Cherry: $2.00';
const dollarAmounts = [...prices.matchAll(/(?<=\$)\d+\.\d+/g)].map(m => m[0]);
// ['1.50', '0.75', '2.00']

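// (?<![-\d])\d+ : digits without a negative sign
// (\d in the lookbehind stops the match from starting inside a negative number, e.g. at the '0' of '-50')
const numbers = '100 -50 200 -30 400';
const positives = [...numbers.matchAll(/(?<![-\d])\d+/g)].map(m => m[0]);
// ['100', '200', '400']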

// Positive lookbehind to strip prefix
const cssValues = ['padding: 16px', 'margin: 8px', 'font-size: 14px'];
cssValues.map(s => s.match(/(?<=: )\d+/)?.[0]);
// ['16', '8', '14']

// Positive lookbehind: keep only the HTTPS URLs
const urls = ['https://example.com', 'http://test.com', 'ftp://files.com'];
const httpsOnly = urls.filter(url => /(?<=https:\/\/)\w+/.test(url));
// ['https://example.com']
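
To pull out the host itself rather than just filter the list, the lookbehind can assert the scheme while the main pattern consumes only the host. This is a small sketch; the links array and the HTTPS_HOST name are made up for illustration.

```javascript
const links = [
  'https://example.com/docs',
  'http://test.com',
  'https://api.example.org:8443/v1',
];

// (?<=^https:\/\/) asserts the scheme; [^/:?#]+ consumes only the host
const HTTPS_HOST = /(?<=^https:\/\/)[^/:?#]+/;

const httpsHosts = links
  .map(url => url.match(HTTPS_HOST)?.[0])
  .filter(host => host !== undefined);

console.log(httpsHosts); // ['example.com', 'api.example.org']
```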

dotAll Flag s (ES2018)

// By default, . does not match newlines (\n)
const multiline = `Hello
World`;

/Hello.World/.test(multiline);    // false
/Hello[\s\S]World/.test(multiline); // true (old workaround)

// s flag: . also matches newlines
/Hello.World/s.test(multiline);   // true

// Real-world: extract content from HTML tags
const html = `<div>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</div>`;

// Without the s flag, .*? cannot cross the line breaks and the match fails
const divMatch = html.match(/<div>(.*?)<\/div>/s);
// divMatch[1]: '\n  <p>...</p>\n  <p>...</p>\n'

// Extract all paragraphs
const paragraphs = [...html.matchAll(/<p>(.*?)<\/p>/gs)].map(m => m[1]);
// ['First paragraph', 'Second paragraph']

// Remove comments
const code = `
/* This is
   a multi-line
   comment */
const x = 1; /* inline comment */
`;
code.replace(/\/\*.*?\*\//gs, '').trim();
// 'const x = 1;'
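
The s flag is easy to confuse with the m (multiline) flag. As a rough illustration, m changes what the anchors ^ and $ match, while s changes what the dot matches; the sample string below is made up.

```javascript
const lines = 'first line\nsecond line';

// m changes the anchors: ^ and $ also match at line breaks
const startsWithSecond = /^second/m.test(lines);  // true
const withoutM = /^second/.test(lines);           // false

// s changes the dot: . also matches line terminators
const spansLines = /first.*second/s.test(lines);  // true
const withoutS = /first.*second/.test(lines);     // false

// The two flags are independent and can be combined
const lastLine = lines.match(/^second.*$/ms)?.[0];
console.log(lastLine); // 'second line'
```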

Unicode Flag u (ES2015+)

// u flag: full Unicode support
// handles emoji, special characters, characters beyond BMP

// Without u, surrogate pairs are treated as 2 characters
/^.$/.test('😀');  // false (emoji is 2 code units)
/^.$/u.test('😀'); // true

// Unicode escapes
/\u{1F600}/u.test('😀');  // true

// Unicode property escapes (\p{...}, added in ES2018)
// require the u flag
/\p{Emoji}/u.test('😀');         // true
/\p{Letter}/u.test('A');          // true
/\p{Decimal_Number}/u.test('5'); // true
/\p{Script=Hangul}/u.test('한');   // true (scripts are written as Script=Name)

// \P: opposite (uppercase P)
/\P{Emoji}/u.test('A');  // true (not an emoji)

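// Real-world: remove emojis
// \p{Emoji} also matches digits, '#' and '*', so \p{Extended_Pictographic} is the safer choice here
const withEmojis = 'Hello 😀 World 🌍!';
const withoutEmojis = withEmojis.replace(/\p{Extended_Pictographic}/gu, '').trim();
// 'Hello  World !'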

// Extract Korean only
const mixed = 'Hello 안녕 World 세상';
const korean = mixed.match(/\p{Script=Hangul}+/gu) ?? [];
// ['안녕', '세상']
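
Property escapes also help where \w falls short, since \w only covers ASCII letters, digits, and underscore. The sketch below extracts words across scripts; WORD_RE and extractWords are hypothetical names.

```javascript
// \p{L} = any letter, \p{M} = combining marks, \p{N} = any number
const WORD_RE = /[\p{L}\p{M}\p{N}]+/gu;

function extractWords(text) {
  return text.match(WORD_RE) ?? [];
}

const sample = 'Hello 안녕, caf\u00e9 42!'; // \u00e9 is 'é'
console.log(extractWords(sample));      // ['Hello', '안녕', 'café', '42']
console.log('caf\u00e9'.match(/\w+/g)); // ['caf'] because \w is ASCII-only
```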

Sticky Flag y

// y flag: matches only at the lastIndex position
const str = 'aababab';
const re = /a/y;

re.lastIndex = 0;
re.test(str);  // true (index 0: 'a')
re.lastIndex;  // 1

re.test(str);  // true (index 1: 'a')
re.lastIndex;  // 2

// g vs y difference
const reG = /\d+/g;
const reY = /\d+/y;
const numStr = '123 456 789';

reG.lastIndex = 4;
reG.exec(numStr)?.[0];  // '456' (anywhere after position 4)

reY.lastIndex = 4;
reY.exec(numStr)?.[0];  // '456' (exactly from position 4)

reY.lastIndex = 3;
reY.exec(numStr);  // null (position 3 is a space, match fails)

// Real-world: tokenizer (useful for parser implementation)
function tokenize(input) {
  const tokens = [];
  const patterns = {
    NUMBER: /\d+/y,
    STRING: /"[^"]*"/y,
    KEYWORD: /\b(if|else|while|for)\b/y,
    IDENTIFIER: /[a-zA-Z_]\w*/y,
    WHITESPACE: /\s+/y,
  };

  let pos = 0;
  while (pos < input.length) {
    let matched = false;

    for (const [type, pattern] of Object.entries(patterns)) {
      pattern.lastIndex = pos;
      const match = pattern.exec(input);

      if (match) {
        if (type !== 'WHITESPACE') {
          tokens.push({ type, value: match[0] });
        }
        pos += match[0].length;
        matched = true;
        break;
      }
    }

    if (!matched) throw new SyntaxError(`Unexpected character: ${input[pos]}`);
  }

  return tokens;
}
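
An alternative design, sketched under the assumption of a tiny made-up token set, combines the sticky flag with named groups: one regex holds every token type as an alternative, and the name of the group that matched gives the token type. TOKEN_RE and lex are hypothetical names.

```javascript
// One sticky regex; alternatives are tried in order at lastIndex
const TOKEN_RE = /(?<number>\d+)|(?<ident>[A-Za-z_]\w*)|(?<op>[+\-*/()=])|(?<ws>\s+)/y;

function lex(input) {
  const tokens = [];
  TOKEN_RE.lastIndex = 0;
  while (TOKEN_RE.lastIndex < input.length) {
    const pos = TOKEN_RE.lastIndex; // saved because a failed exec resets lastIndex to 0
    const m = TOKEN_RE.exec(input);
    if (!m) {
      throw new SyntaxError(`Unexpected character at ${pos}: ${input[pos]}`);
    }
    // Exactly one named group is defined for each successful match
    const [type, value] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    if (type !== 'ws') tokens.push({ type, value });
  }
  return tokens;
}

console.log(lex('x = 42 + y1'));
// [ident x, op =, number 42, op +, ident y1]
```

A successful sticky match advances lastIndex by itself, so no separate position counter is needed.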

String.matchAll()

// Iterator of all matches for a regex with g flag
const text = '2024-01-15, 2024-06-20, 2024-12-31';

// exec loop approach (cumbersome)
const regex1 = /(\d{4})-(\d{2})-(\d{2})/g;
const dates1 = [];
let match;
while ((match = regex1.exec(text)) !== null) {
  dates1.push(match);
}

// matchAll approach (concise)
const regex2 = /(\d{4})-(\d{2})-(\d{2})/g;
const dates2 = [...text.matchAll(regex2)];

// Named Groups with matchAll
// \n is excluded so that a match cannot swallow the line break before the street
const ADDR_RE = /(?<street>[^,\n]+),\s*(?<city>[^,\n]+),\s*(?<state>[A-Z]{2})/g;
const addresses = `
123 Main St, Springfield, IL
456 Oak Ave, Chicago, IL
789 Pine Rd, Naperville, IL
`;

for (const { groups } of addresses.matchAll(ADDR_RE)) {
  console.log(`${groups.street} → ${groups.city}, ${groups.state}`);
}
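
Two details of matchAll are easy to miss, shown here in a short sketch with made-up names: it iterates over an internal copy of the regex, so the original's lastIndex is left alone, and it rejects a regex that lacks the g flag.

```javascript
const NUM_RE = /\d+/g;
const input = '10 20 30';

const found = Array.from(input.matchAll(NUM_RE), m => Number(m[0]));
console.log(found);            // [10, 20, 30]
console.log(NUM_RE.lastIndex); // 0 (the original regex is not advanced)

// Each result carries its position, like an exec() result
const positions = Array.from(input.matchAll(NUM_RE), m => m.index);
console.log(positions);        // [0, 3, 6]

// A regex without the g flag is rejected
let error = null;
try {
  input.matchAll(/\d+/);
} catch (e) {
  error = e;
}
console.log(error instanceof TypeError); // true
```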

Real-world: URL Parsing

// URL parsing regex
const URL_REGEX = /^(?<protocol>https?):\/\/(?<host>[^/:?#]+)(?::(?<port>\d+))?(?<path>\/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

function parseUrl(url) {
  const match = url.match(URL_REGEX);
  if (!match) throw new Error('Invalid URL');

  const { protocol, host, port, path = '/', query = '', fragment = '' } = match.groups;

  return {
    protocol,
    host,
    port: port ? parseInt(port, 10) : (protocol === 'https' ? 443 : 80),
    path,
    query: parseQuery(query),
    fragment,
    href: url
  };
}

function parseQuery(queryStr) {
  if (!queryStr) return {};
  return Object.fromEntries(
    queryStr.split('&').map(pair => {
      const [key, value = ''] = pair.split('=');
      return [decodeURIComponent(key), decodeURIComponent(value)];
    })
  );
}

const parsed = parseUrl('https://api.example.com:8080/users?page=1&limit=10#results');
console.log(parsed);
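
For comparison, the built-in URL class (available in browsers and Node.js) parses the same string without a hand-written regex. This sketch maps its properties onto the field names used above; the summary object is only an illustration.

```javascript
const url = new URL('https://api.example.com:8080/users?page=1&limit=10#results');

const summary = {
  protocol: url.protocol,  // 'https:' (includes the trailing colon)
  host: url.hostname,      // 'api.example.com'
  port: url.port,          // '8080' (a string; empty for the default port)
  path: url.pathname,      // '/users'
  query: Object.fromEntries(url.searchParams), // { page: '1', limit: '10' }
  fragment: url.hash,      // '#results' (includes the leading '#')
};

console.log(summary);
```

The regex version remains useful as a named-group exercise, or when only part of a URL-like string needs to be recognized.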

Real-world: Email Validation

// A pragmatic subset of RFC 5322, close to the pattern the HTML standard uses for email inputs
// (full RFC validation is far more complex)
const EMAIL_REGEX = /^(?<local>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)@(?<domain>[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)$/;

function validateEmail(email) {
  const match = email.match(EMAIL_REGEX);
  if (!match) return { valid: false, reason: 'Format error' };

  const { local, domain } = match.groups;
  if (local.length > 64) return { valid: false, reason: 'Local part too long' };
  if (domain.length > 255) return { valid: false, reason: 'Domain too long' };
  if (!domain.includes('.')) return { valid: false, reason: 'No TLD' };

  return { valid: true, local, domain };
}

validateEmail('alice@example.com');    // { valid: true, ... }
validateEmail('not-an-email');         // { valid: false, reason: 'Format error' }

Real-world: Date Extraction

// Parse various date formats
const DATE_PATTERNS = {
  ISO: /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g,
  US: /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/g,
  WRITTEN: /(?<year>\d{4})\s*year\s*(?<month>\d{1,2})\s*month\s*(?<day>\d{1,2})\s*day/g,
};

function extractDates(text) {
  const results = [];

  for (const [format, regex] of Object.entries(DATE_PATTERNS)) {
    for (const match of text.matchAll(regex)) {
      const { year, month, day } = match.groups;
      results.push({
        format,
        original: match[0],
        date: new Date(Number(year), Number(month) - 1, Number(day)),
        index: match.index
      });
    }
  }

  return results.sort((a, b) => a.index - b.index);
}

// Named reportText rather than document to avoid clashing with the browser's global document
const reportText = 'Meeting starts on 2024-03-15, continues until 3/20/2024. Final report due 2024 year 4 month 1 day.';
const dates = extractDates(reportText);
dates.forEach(d => console.log(`${d.format}: ${d.original} → ${d.date.toLocaleDateString()}`));
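
A date regex checks the shape of the text, not whether the day exists: 2024-02-30 matches the ISO pattern, and the Date constructor silently rolls it over into March. One possible guard, sketched with a hypothetical isRealDate helper, is to build the date and compare the parts.

```javascript
// Hypothetical helper: true only if year/month/day name a real calendar day.
// Assumes a four-digit year (the Date constructor treats years 0-99 specially).
function isRealDate(year, month, day) {
  const d = new Date(year, month - 1, day);
  return (
    d.getFullYear() === year &&
    d.getMonth() === month - 1 &&
    d.getDate() === day
  );
}

console.log(isRealDate(2024, 2, 29)); // true (leap year)
console.log(isRealDate(2023, 2, 29)); // false
console.log(isRealDate(2024, 2, 30)); // false
console.log(isRealDate(2024, 13, 1)); // false
```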

Expert Tips

1. Caching compiled regular expressions

// Pre-compile regular expressions for repeated use
const PATTERNS = {
  EMAIL: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
  PHONE: /^\+?[\d\s-]{10,}$/,
  URL: /^https?:\/\/.+/
};

// Avoid constructing a new RegExp on every call or loop iteration
function validateField(value, type) {
  return PATTERNS[type]?.test(value) ?? false;
}
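
One caveat when sharing regex objects: a regex with the g (or y) flag keeps state in lastIndex, so repeated test() calls on a shared instance can alternate between true and false. The names in this short sketch are made up for illustration.

```javascript
const DIGITS_G = /\d+/g; // shared regex WITH the g flag
const DIGITS = /\d+/;    // shared regex without it

const inputs = ['abc 123', 'abc 123', 'abc 123'];

const resultsG = inputs.map(s => DIGITS_G.test(s));
const results = inputs.map(s => DIGITS.test(s));

console.log(resultsG); // [true, false, true]
console.log(results);  // [true, true, true]
```

The second call starts searching at lastIndex 7, finds nothing, and resets lastIndex to 0. For plain validation, leaving out the g flag sidesteps the problem.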

2. Commenting complex regular expressions

// JS has no verbose (x) flag, so build the pattern from commented string parts
const PASSWORD_REGEX = new RegExp([
  '^',
  '(?=.*[a-z])',     // contains lowercase
  '(?=.*[A-Z])',     // contains uppercase
  '(?=.*\\d)',       // contains digit
  '(?=.*[!@#$%^&*])', // contains special character
  '.{8,}',           // at least 8 characters
  '$'
].join(''));

PASSWORD_REGEX.test('Passw0rd!');  // true
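
Building a pattern from strings means every backslash has to be doubled, as in '(?=.*\\d)' above. String.raw avoids that. The decimal-number pattern below is only a made-up example of the technique.

```javascript
const DECIMAL_RE = new RegExp([
  String.raw`^`,
  String.raw`[+-]?`,      // optional sign
  String.raw`\d+`,        // integer part
  String.raw`(?:\.\d+)?`, // optional fraction
  String.raw`$`,
].join(''));

console.log(DECIMAL_RE.source);        // ^[+-]?\d+(?:\.\d+)?$
console.log(DECIMAL_RE.test('-3.14')); // true
console.log(DECIMAL_RE.test('3.'));    // false
```

Backticks and ${ still need care inside a String.raw template, since they keep their template-literal meaning.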

3. Regular expression performance optimization

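// Prevent catastrophic backtracking
// Bad example: nested quantifiers take exponential time on a near-miss input
/^(a+)+$/.test('aaaaaab');  // work roughly doubles with each extra 'a'; long inputs can hang

// Good example: same accepted strings, no nested quantifier (JS has no atomic groups)
/^a+$/.test('aaaaaab');  // fails in linear time

// Anchors narrow down where a match may occur
/^prefix/.test(str);  // can only match at the start of the string
/suffix$/.test(str);  // the match must end at the end of the string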
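
When a nested quantifier is flattened, it is worth checking that both patterns still accept the same inputs. This is a small sanity check on deliberately short, made-up samples, so that the slow pattern stays fast.

```javascript
const NESTED = /^(a+)+$/; // many ways to split a run of 'a' between the two quantifiers
const FLAT = /^a+$/;      // one way to match the same run

const samples = ['a', 'aaaa', '', 'aab', 'baa', 'aaaaaaaaab'];

const agree = samples.every(s => NESTED.test(s) === FLAT.test(s));
console.log(agree); // true
```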
