How to Use Unicode in JavaScript Regular Expressions

JavaScript strings are encoded in UTF-16, which means many characters that users encounter daily, such as most emoji, rarer Chinese characters, mathematical alphanumeric symbols, and characters from dozens of historic and minority writing systems, are represented as two code units (a surrogate pair) rather than one. Without proper Unicode handling, regular expressions treat these multi-unit characters as two separate characters, leading to broken matches, incorrect character counts, and patterns that silently fail on international text.

The u and v flags fix this by enabling full Unicode awareness in the regex engine. Beyond correct character handling, they unlock Unicode property escapes, a powerful feature that lets you match characters by their Unicode category: letters from any script, digits from any numeral system, punctuation, emoji, and much more. This guide covers why Unicode mode matters, how to use Unicode property escapes effectively, and what the newer v flag adds with set operations and string properties.

The `u` Flag: Unicode Mode

The u flag switches the JavaScript regex engine into Unicode mode. This changes several fundamental behaviors: how characters are counted, how escape sequences are interpreted, and what features are available.

The Problem Without `u`

Without the u flag, JavaScript regex operates in a legacy mode where strings are treated as sequences of 16-bit code units. Characters outside the Basic Multilingual Plane (BMP), which include most emoji, rarer CJK characters, musical symbols, mathematical alphanumeric symbols, and historic scripts, are encoded as two code units called a surrogate pair.

const emoji = '😀';

// JavaScript string sees two code units
console.log(emoji.length); // 2

// Without u: the dot matches ONE code unit (half the emoji)
console.log(emoji.match(/./));
// ["\ud83d"] - matches only the first surrogate, not the full emoji

// Without u: the regex sees two "characters"
console.log(emoji.match(/../));
// ["😀"] - needed TWO dots to match one emoji

// Without u: an anchored single dot cannot match the emoji
console.log(/^.$/.test(emoji));  // false - emoji is "two characters"
console.log(/^..$/.test(emoji)); // true - two code units

This is clearly wrong. The emoji 😀 is a single character to the user, but the regex engine treats it as two.
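To see where the two code units come from, the standard UTF-16 rule for code points above U+FFFF can be sketched as follows. The helper name `toSurrogatePair` is hypothetical and used only for illustration; JavaScript performs this encoding internally.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a code point between U+10000 and U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600); // U+1F600 = 😀
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}'); // true
console.log('\u{1F600}'.charCodeAt(0) === high);             // true
console.log('\u{1F600}'.codePointAt(0) === 0x1F600);         // true
```

The first value, d83d, is exactly the lone surrogate that the legacy dot matched in the example above.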

Enabling Unicode Mode

Adding the u flag tells the regex engine to treat the input as a sequence of Unicode code points rather than code units. Surrogate pairs are handled as single characters:

const emoji = '😀';

// With u: the dot matches the full code point
console.log(emoji.match(/./u));
// ["😀"] - correctly matches the entire emoji

// With u: one dot = one character
console.log(/^.$/u.test(emoji));  // true - emoji is one character
console.log(/^..$/u.test(emoji)); // false - it's not two characters

Impact on Character Counting

const text = 'Hello 🌍 World';

// Without u: counts code units
console.log(text.match(/./g).length);  // 14 (🌍 counted as 2)

// With u: counts code points
console.log(text.match(/./gu).length); // 13 (🌍 counted as 1)

Impact on Character Ranges in Sets

Without u, a character class range whose endpoints lie outside the BMP is read as individual surrogates. The range then runs from a low surrogate to a high surrogate, which is out of order, so the pattern is rejected:

// Without u: [😀-😟] is parsed as \ud83d, the range \ude00-\ud83d, and \ude1f
// new RegExp is used so that the error is raised at runtime and can be caught
try {
  const regex = new RegExp('[😀-😟]');
} catch (e) {
  console.log('Error:', e.message);
  // e.g. in V8: Invalid regular expression: /[😀-😟]/: Range out of order in character class
}

// With u: ranges work correctly with Unicode characters
const emojiRange = /[😀-😟]/u; // U+1F600 to U+1F61F
console.log(emojiRange.test('😀')); // true
console.log(emojiRange.test('😃')); // true
console.log(emojiRange.test('😜')); // true
console.log(emojiRange.test('😡')); // false (outside range)

Strict Escape Handling

The u flag makes the regex engine strict about escape sequences. Invalid escapes that would be silently accepted in legacy mode throw a SyntaxError:

// Without u: \a is not a valid escape, silently treated as literal "a"
console.log(/\a/.test('a')); // true (sloppy behavior)

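// With u: \a throws because it's not a recognized escape
// new RegExp is used here: an invalid regex literal fails when the
// script is parsed, before the try block can run
try {
  const regex = new RegExp('\\a', 'u'); // SyntaxError: Invalid escape
} catch (e) {
  console.log(e.message); // e.g. Invalid regular expression: /\a/u: Invalid escape
}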

This strictness helps catch typos and mistakes in patterns:

// Without u: these silently work but may not do what you expect
/\p/.test('p');   // true - \p is treated as literal "p"

// With u: \p without braces is an error (it expects \p{...})
// /\p/u - SyntaxError

// Without u: escaped characters that don't need escaping
/\:/;   // Works (unnecessary escape, treated as literal ":")

// With u: escaping a character that has no special meaning is an error
// /\:/u - SyntaxError: Invalid escape

tip

The strict escape handling of the u flag is a feature, not a limitation. It catches mistakes that would otherwise produce subtle bugs. Always use the u flag unless you have a specific reason not to, especially when working with text that may contain non-ASCII characters.
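When a pattern is only known at runtime (for example, one assembled from configuration), the same strictness can be used as a validity check, because the RegExp constructor throws a SyntaxError for patterns that are invalid under the given flags. A minimal sketch; `isValidPattern` is a hypothetical helper name:

```javascript
// Hypothetical helper: returns true if `source` compiles with the given flags.
function isValidPattern(source, flags = 'u') {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

console.log(isValidPattern('\\a', ''));          // true  (legacy mode accepts \a)
console.log(isValidPattern('\\a', 'u'));         // false (invalid escape in Unicode mode)
console.log(isValidPattern('\\:', 'u'));         // false (":" needs no escape)
console.log(isValidPattern('\\p{Letter}', 'u')); // true
```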

Unicode Code Point Escapes

The u flag enables the \u{XXXX} syntax for specifying Unicode code points by their hex value. Without u, only the four-digit \uXXXX syntax works, which cannot represent characters above U+FFFF:

// Without u: only 4-digit hex escapes work
console.log(/\u0041/.test('A'));     // true (U+0041 = A)
// Cannot express characters above U+FFFF with \uXXXX

// With u: extended \u{XXXXX} syntax available
console.log(/\u{41}/u.test('A'));       // true
console.log(/\u{1F600}/u.test('😀'));   // true (U+1F600 = 😀)
console.log(/\u{1F30D}/u.test('🌍'));   // true (U+1F30D = 🌍)
console.log(/\u{1F4A9}/u.test('💩'));   // true

// Matching specific characters by code point
const checkmark = /\u{2713}/u;
console.log(checkmark.test('✓'));       // true

// Matching a range of code points
const mathSymbols = /[\u{2200}-\u{22FF}]/u;
console.log(mathSymbols.test('∀'));     // true (U+2200 FOR ALL)
console.log(mathSymbols.test('∑'));     // true (U+2211 N-ARY SUMMATION)
console.log(mathSymbols.test('A'));     // false
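Code point escapes are also convenient when a pattern is generated from data, since the resulting source stays pure ASCII. The sketch below assumes a hypothetical helper, `toCodePointEscapes`, that converts each code point of a string into \u{...} form; it relies on Array.from iterating strings by code point rather than by code unit.

```javascript
// Hypothetical helper: turn every code point of a string into a \u{...} escape.
function toCodePointEscapes(text) {
  return Array.from(text, (char) =>
    `\\u{${char.codePointAt(0).toString(16).toUpperCase()}}`
  ).join('');
}

const source = toCodePointEscapes('\u{1F600}\u2713'); // 😀 followed by ✓
console.log(source); // \u{1F600}\u{2713}

// The escapes are only understood in Unicode mode, so the u flag is required
const pattern = new RegExp(source, 'u');
console.log(pattern.test('\u{1F600}\u2713')); // true
```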

Quantifiers and Unicode Characters

With the u flag, quantifiers correctly apply to entire Unicode characters rather than individual code units:

const text = '😀😀😀';

// Without u: + applies only to the second surrogate of the emoji
console.log(text.match(/😀+/));
// ["😀"] - only the first emoji: the pattern means \ud83d followed by one or more \ude00

// With u: + applies to the full emoji character
console.log(text.match(/😀+/u));
// ["😀😀😀"]

// Counting emoji
console.log(text.match(/😀/gu).length);   // 3

// Matching exactly 2 emoji
console.log(/^😀{2}$/u.test('😀😀'));   // true
console.log(/^😀{2}$/u.test('😀😀😀')); // false

Complete Comparison: With and Without `u`

Behavior	Without `u`	With `u`
`.` matches emoji	No (matches half)	Yes (full character)
Surrogate pairs	Treated as two chars	Treated as one char
`\u{XXXXX}` syntax	Not available	Available
Invalid escapes	Silently accepted	Throw SyntaxError
`\p{...}` properties	Not available	Available
Character ranges with non-BMP	SyntaxError or wrong matches	Correct
Strictness	Sloppy	Strict

Unicode Properties: `\p{...}` and `\P{...}`

Unicode property escapes are the most powerful feature unlocked by the u flag. Every character in Unicode has a set of properties that describe what it is: a letter, a digit, a punctuation mark, a symbol, which script it belongs to, and more. The \p{...} syntax lets you match characters based on these properties instead of listing specific character ranges.

Basic Syntax

\p{PropertyName} matches any character that has the specified property (or property value)
\P{PropertyName} matches any character that does not have the property (inverse)

Both require the u or v flag.

// Match any Unicode letter
console.log('café'.match(/\p{Letter}+/gu));
// ["café"] - includes the accented é

// Compare with \w which misses non-ASCII letters
console.log('café'.match(/\w+/g));
// ["caf"] - é is not matched by \w

General Categories

Unicode assigns every character a General Category. These are the most commonly used property values:

Letters:

// \p{Letter} or \p{L} - any letter from any script
const text = 'Hello Привет 你好 مرحبا';
console.log(text.match(/\p{Letter}+/gu));
// ["Hello", "Привет", "你好", "مرحبا"]

// Subcategories of Letter:
// \p{Lowercase_Letter} or \p{Ll} - lowercase letters
// \p{Uppercase_Letter} or \p{Lu} - uppercase letters
// \p{Titlecase_Letter} or \p{Lt} - titlecase letters (e.g., ǅ)
// \p{Modifier_Letter} or \p{Lm}  - modifier letters
// \p{Other_Letter} or \p{Lo}     - letters without case (CJK, Arabic, etc.)

console.log('Hello World'.match(/\p{Lowercase_Letter}+/gu));
// ["ello", "orld"]

console.log('Hello World'.match(/\p{Uppercase_Letter}/gu));
// ["H", "W"]

Numbers:

// \p{Number} or \p{N} - any numeric character
const mixed = 'Price: 42 or ٤٢ or ⅓';
console.log(mixed.match(/\p{Number}+/gu));
// ["42", "٤٢", "⅓"]

// Subcategories:
// \p{Decimal_Number} or \p{Nd}   - decimal digits (0-9, ٠-٩, etc.)
// \p{Letter_Number} or \p{Nl}    - letter-like numbers (Ⅰ, Ⅱ, Ⅲ, etc.)
// \p{Other_Number} or \p{No}     - other numeric (fractions, superscripts, etc.)

console.log('Test ① ② ③'.match(/\p{Other_Number}/gu));
// ["①", "②", "③"]

Punctuation:

// \p{Punctuation} or \p{P} - any punctuation
const text = 'Hello, world! How are you? (Fine.)';
console.log(text.match(/\p{Punctuation}/gu));
// [",", "!", "?", "(", ".", ")"]

// Subcategories:
// \p{Dash_Punctuation} or \p{Pd}       - dashes (-, –, —)
// \p{Open_Punctuation} or \p{Ps}       - opening brackets ((, [, {)
// \p{Close_Punctuation} or \p{Pe}      - closing brackets (), ], })
// \p{Connector_Punctuation} or \p{Pc}  - connector (_)
// \p{Other_Punctuation} or \p{Po}      - other (!, ?, #, etc.)

const dashes = 'word-hyphen or en–dash or em—dash';
console.log(dashes.match(/\p{Dash_Punctuation}/gu));
// ["-", "–", "—"]

Symbols:

// \p{Symbol} or \p{S} - any symbol
const text = 'Price: $100 + €50 = ¥20000 ™';
console.log(text.match(/\p{Symbol}/gu));
// ["$", "+", "€", "=", "¥", "™"]

// Subcategories:
// \p{Currency_Symbol} or \p{Sc} - currency symbols
// \p{Math_Symbol} or \p{Sm} - math symbols
// \p{Modifier_Symbol} or \p{Sk} - modifier symbols
// \p{Other_Symbol} or \p{So} - other symbols

console.log(text.match(/\p{Currency_Symbol}/gu));
// ["$", "€", "¥"]

console.log('2 + 3 = 5 × 10 ÷ 2'.match(/\p{Math_Symbol}/gu));
// ["+", "=", "×", "÷"]

Whitespace and Separators:

// \p{Separator} or \p{Z} - any separator
// \p{Space_Separator} or \p{Zs} - space characters
// \p{Line_Separator} or \p{Zl} - line separators
// \p{Paragraph_Separator} or \p{Zp} - paragraph separators

// Includes non-breaking space, em space, etc.
const text = 'hello\u00A0world'; // non-breaking space
console.log(text.match(/\p{Space_Separator}/gu));
// ["\u00A0"] - matches the non-breaking space

Binary Properties

Binary properties are yes/no characteristics of a character. They are used without a value:

// \p{Alphabetic} - all alphabetic characters (broader than \p{Letter})
console.log('abc123'.match(/\p{Alphabetic}+/gu)); // ["abc"]

// \p{ASCII} - ASCII characters only (U+0000 to U+007F)
console.log('Hello café 🌍'.match(/\p{ASCII}+/gu)); // ["Hello caf", " "]

// \p{Emoji} - emoji characters
const text = 'Hello 👋 World 🌍 JavaScript 🚀';
console.log(text.match(/\p{Emoji}/gu));
// ["👋", "🌍", "🚀"]
// Note: the ASCII digits, # and * also have the Emoji property

// \p{White_Space} - all whitespace characters
console.log('hello\tworld\n'.match(/\p{White_Space}/gu));
// ["\t", "\n"]

// \p{Hex_Digit} - valid hexadecimal digits
console.log('0123456789abcdefGHIJ'.match(/\p{Hex_Digit}+/gu));
// ["0123456789abcdef"]
// Note: uppercase A-F also match

// \p{ASCII_Hex_Digit} - same but limited to ASCII
console.log('0x1F600'.match(/\p{ASCII_Hex_Digit}+/gu));
// ["0", "1F600"]
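The note about \p{Emoji} deserves a concrete check, because it is a common source of false positives. The following sketch compares \p{Emoji} with \p{Emoji_Presentation} on a few sample characters; the variable names are illustrative only.

```javascript
// 5, #, *, copyright sign, grinning face
const samples = ['5', '#', '*', '\u00A9', '\u{1F600}'];

const report = samples.map((char) => ({
  char,
  emoji: /\p{Emoji}/u.test(char),
  emojiPresentation: /\p{Emoji_Presentation}/u.test(char),
}));
console.log(report);
// emoji is true for all five samples;
// emojiPresentation is true only for the grinning face

// Consequence: \p{Emoji} reports "emoji" in plain numeric text
console.log('Room 101'.match(/\p{Emoji}/gu)); // ["1", "0", "1"]
```

For "does this text contain something that looks like an emoji" checks, \p{Emoji_Presentation} is usually the safer of the two.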

Script Properties

Every character belongs to a Unicode Script (the writing system it is used in). You can match characters from specific scripts:

// \p{Script=Latin} - Latin script characters
const mixed = 'Hello Привет 你好 こんにちは مرحبا';

console.log(mixed.match(/\p{Script=Latin}+/gu));
// ["Hello"]

console.log(mixed.match(/\p{Script=Cyrillic}+/gu));
// ["Привет"]

console.log(mixed.match(/\p{Script=Han}+/gu));
// ["你好"] - Chinese (Han) characters

console.log(mixed.match(/\p{Script=Hiragana}+/gu));
// ["こんにちは"]

console.log(mixed.match(/\p{Script=Arabic}+/gu));
// ["مرحبا"]

The short form \p{sc=Latin} also works:

console.log('café'.match(/\p{sc=Latin}+/gu));
// ["café"]

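Script assigns each character to exactly one script, so characters shared between writing systems end up in the generic Common or Inherited scripts. Script_Extensions is broader: it lists every script a character is actually used in, so shared characters match each of those scripts: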

// Script_Extensions (scx) includes characters shared between scripts
// For example, common punctuation is shared across scripts

console.log('Ω'.match(/\p{Script=Greek}/u));           // ["Ω"]
console.log('Ω'.match(/\p{Script_Extensions=Greek}/u)); // ["Ω"]
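The two properties only differ for characters that are shared between scripts. As an illustration (chosen here, not taken from the examples above), U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK has Script=Common but lists both kana scripts in its Script_Extensions:

```javascript
// U+30FC is used in both Hiragana and Katakana text
const prolongedSoundMark = '\u30FC'; // ー

console.log(/\p{Script=Hiragana}/u.test(prolongedSoundMark));            // false
console.log(/\p{Script=Katakana}/u.test(prolongedSoundMark));            // false
console.log(/\p{Script=Common}/u.test(prolongedSoundMark));              // true
console.log(/\p{Script_Extensions=Hiragana}/u.test(prolongedSoundMark)); // true
console.log(/\p{scx=Katakana}/u.test(prolongedSoundMark));               // true

// Practical effect on a Katakana word that contains the mark
console.log('コーヒー'.match(/\p{Script=Katakana}+/gu)); // ["コ", "ヒ"]
console.log('コーヒー'.match(/\p{scx=Katakana}+/gu));    // ["コーヒー"]
```

For matching whole words in a given script, Script_Extensions is therefore often the better fit.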

Common Script Property Values

Value	Writing System	Example Characters
`Latin`	Latin alphabet	A, é, ñ, ü
`Cyrillic`	Russian, Ukrainian, etc.	Д, Ж, Щ
`Greek`	Greek	Ω, Σ, π
`Arabic`	Arabic, Farsi, Urdu	ع, ب, ت
`Hebrew`	Hebrew	א, ב, ג
`Han`	Chinese characters (also used in Japanese, Korean)	中, 文, 字
`Hiragana`	Japanese hiragana	あ, い, う
`Katakana`	Japanese katakana	ア, イ, ウ
`Hangul`	Korean	가, 나, 다
`Devanagari`	Hindi, Sanskrit, etc.	अ, आ, इ
`Thai`	Thai	ก, ข, ค
`Georgian`	Georgian	ა, ბ, გ
`Armenian`	Armenian	Ա, Բ, Գ
`Ethiopic`	Amharic, Tigrinya, etc.	ሀ, ለ, ሐ

The Inverse: `\P{...}`

The uppercase \P matches any character that does not have the specified property:

// \P{Letter} - anything that is NOT a letter
const text = 'Hello, World! 123';
console.log(text.match(/\P{Letter}+/gu));
// [", ", "! 123"]

// \P{Number} - anything that is NOT a number
console.log(text.match(/\P{Number}+/gu));
// ["Hello, World! "]

// \P{ASCII} - non-ASCII characters
const international = 'Hello café 世界 🌍';
console.log(international.match(/\P{ASCII}+/gu));
// ["é", "世界", "🌍"]
// Note: spaces are ASCII, so they end each run and are not part of the matches

// Without +: each non-ASCII character is matched separately
console.log(international.match(/\P{ASCII}/gu));
// ["é", "世", "界", "🌍"]

Practical Examples with Unicode Properties

Language-Aware Word Matching:

// \w only matches ASCII word characters
const text = 'The café serves naïve crème brûlée';

// ❌ \w breaks on accented characters
console.log(text.match(/\w+/g));
// ['The', 'caf', 'serves', 'na', 've', 'cr', 'me', 'br', 'l', 'e']

// ✅ \p{Letter} handles all Unicode letters
console.log(text.match(/[\p{Letter}\p{Mark}]+/gu));
// ['The', 'café', 'serves', 'naïve', 'crème', 'brûlée']

note

The \p{Mark} category matches combining marks (accents, diacritics) that modify the preceding letter. Including it alongside \p{Letter} ensures that characters like é (which can be composed of e + combining acute accent) are matched as part of the word.
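The difference only shows up when the text is in decomposed form, which is easy to miss because both forms look identical on screen. A small sketch with the two encodings of "café" written out explicitly (variable names are illustrative):

```javascript
const composed = 'caf\u00E9';    // é as one precomposed code point (NFC)
const decomposed = 'cafe\u0301'; // e followed by U+0301 COMBINING ACUTE ACCENT (NFD)

// Precomposed é is itself a letter, so \p{Letter}+ matches the whole word
console.log(composed.match(/\p{Letter}+/gu)[0] === composed);     // true

// In decomposed text the accent is a Mark, so \p{Letter}+ stops before it
console.log(decomposed.match(/\p{Letter}+/gu)[0]);                // "cafe"

// Adding \p{Mark} keeps the accent attached to the word
console.log(decomposed.match(/[\p{Letter}\p{Mark}]+/gu)[0] === decomposed); // true
```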

Detecting the Script of Text:

function detectScript(text) {
  const scripts = [
    { name: 'Latin', regex: /\p{Script=Latin}/u },
    { name: 'Cyrillic', regex: /\p{Script=Cyrillic}/u },
    { name: 'Arabic', regex: /\p{Script=Arabic}/u },
    { name: 'Han', regex: /\p{Script=Han}/u },
    { name: 'Hiragana', regex: /\p{Script=Hiragana}/u },
    { name: 'Katakana', regex: /\p{Script=Katakana}/u },
    { name: 'Hangul', regex: /\p{Script=Hangul}/u },
    { name: 'Devanagari', regex: /\p{Script=Devanagari}/u },
    { name: 'Thai', regex: /\p{Script=Thai}/u },
    { name: 'Greek', regex: /\p{Script=Greek}/u }
  ];

  const detected = scripts.filter(s => s.regex.test(text)).map(s => s.name);
  return detected.length > 0 ? detected : ['Unknown'];
}

console.log(detectScript('Hello World'));    // ["Latin"]
console.log(detectScript('Привет'));         // ["Cyrillic"]
console.log(detectScript('こんにちは世界'));  // ["Han", "Hiragana"]
console.log(detectScript('Hello Привет'));   // ["Latin", "Cyrillic"]

Extracting Emoji from Text:

function extractEmoji(text) {
  // Match single code points that are displayed as emoji by default
  return text.match(/\p{Emoji_Presentation}/gu) || [];
}

console.log(extractEmoji('Having a great day! 😀🎉🚀'));
// ["😀", "🎉", "🚀"]

console.log(extractEmoji('No emoji here'));
// []

// Count emoji in text
function emojiCount(text) {
  return (text.match(/\p{Emoji_Presentation}/gu) || []).length;
}

console.log(emojiCount('Hello 👋 World 🌍')); // 2

warning

Emoji handling is more complex than a single property can cover. Many emoji are sequences of multiple code points (family emoji, skin tone modifiers, flag sequences). \p{Emoji_Presentation} catches individual emoji characters but may not match all compound emoji sequences. For comprehensive emoji matching, the v flag with \p{RGI_Emoji} (covered below) is more reliable.
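The limitation is easy to observe by counting matches inside a single compound emoji. In the sketch below the sequences are written with escapes so that their parts are visible; `countParts` is a hypothetical helper name.

```javascript
// Waving hand + medium skin tone modifier (one visible emoji)
const waveWithTone = '\u{1F44B}\u{1F3FD}';
// Man, woman, girl, boy joined by U+200D ZERO WIDTH JOINER (one visible emoji)
const familyZwj = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

const countParts = (text) =>
  (text.match(/\p{Emoji_Presentation}/gu) || []).length;

console.log(countParts(waveWithTone)); // 2 - hand and modifier are matched separately
console.log(countParts(familyZwj));    // 4 - each person is matched, the joiners are skipped
```

A replace based on this pattern would therefore leave stray joiners behind and insert one replacement per part rather than one per visible emoji.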

International Username Validation:

// Allow letters from any script, digits, underscores, and hyphens
function isValidInternationalUsername(username) {
  return /^[\p{Letter}\p{Number}_-]{3,30}$/u.test(username);
}

console.log(isValidInternationalUsername('alice'));         // true
console.log(isValidInternationalUsername('用户名'));         // true (Chinese)
console.log(isValidInternationalUsername('Пользователь'));  // true (Russian)
console.log(isValidInternationalUsername('ab'));            // false (too short)
console.log(isValidInternationalUsername('user name'));     // false (space)

Removing Diacritics (Accent Marks):

function removeDiacritics(text) {
  // Normalize to NFD (decomposed form), then remove combining marks
  return text.normalize('NFD').replace(/\p{Mark}/gu, '');
}

console.log(removeDiacritics('café'));          // "cafe"
console.log(removeDiacritics('résumé'));        // "resume"
console.log(removeDiacritics('naïve'));         // "naive"
console.log(removeDiacritics('über'));          // "uber"
console.log(removeDiacritics('crème brûlée'));  // "creme brulee"

Sanitizing Input While Preserving International Characters:

// Allow letters, numbers, spaces, and basic punctuation from any language
function sanitizeInternational(input) {
  return input.replace(/[^\p{Letter}\p{Number}\p{Punctuation}\p{Space_Separator}]/gu, '');
}

console.log(sanitizeInternational('Hello, 世界! 🌍'));
// "Hello, 世界! " - emoji removed (the space before it remains), CJK preserved

console.log(sanitizeInternational('Привет\x00мир'));
// "Приветмир" - control character removed, Cyrillic preserved

The `v` Flag: Unicode Sets (ES2024)

The v flag is a more powerful evolution of the u flag, introduced in ES2024. It is not just an incremental improvement. It adds entirely new capabilities: set operations inside character classes, properties of strings (matching multi-code-point sequences), and string literals in character classes. The v flag builds on u: surrogate pair handling, \u{XXXXX} escapes, and \p{...} properties all carry over. The one area where it is not a drop-in replacement is character class syntax, which becomes stricter, so a few classes that are valid with u must be rewritten for v.

`v` Replaces `u`

You cannot use u and v together. The v flag is intended as the successor:

// ❌ Cannot combine u and v
// /pattern/uv - SyntaxError

// ✅ Use v for new code when targeting modern environments
const regex = /\p{Letter}+/gv;

Everything from the u flag (correct surrogate pair handling, \u{XXXXX} escapes, \p{...} properties, strict escaping) works identically with v.
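Because the two flags are mutually exclusive, code that has to run on both old and new engines needs to pick one at runtime. One possible approach, sketched here with the hypothetical helper `supportsUnicodeSets`, is to probe the RegExp constructor and build patterns from strings:

```javascript
// Hypothetical helper: detect v-flag support without using a regex literal,
// so the check itself still parses on older engines.
function supportsUnicodeSets() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

// Pick the flag once and reuse it for patterns that are valid in both modes
const unicodeFlag = supportsUnicodeSets() ? 'v' : 'u';
const letters = new RegExp('\\p{Letter}+', 'g' + unicodeFlag);

console.log('Hello Привет'.match(letters)); // ["Hello", "Привет"]
```

This only helps for patterns that are valid under both flags; patterns that use v-only syntax still need a separate fallback.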

Set Operations in Character Classes

The most significant addition in the v flag is the ability to perform set operations inside character classes: intersection, subtraction, and union. This lets you combine or exclude character categories with precision that was previously impossible.

Intersection (&&): Match characters that belong to both sets:

// Match characters that are BOTH Greek AND letters
const greekLetters = /[\p{Script=Greek}&&\p{Letter}]/gv;

console.log('Ω π 42 Σ + ='.match(greekLetters));
// ["Ω", "π", "Σ"] - Greek symbols that are also letters

// Match characters that are BOTH ASCII AND digits
const asciiDigits = /[\p{ASCII}&&\p{Number}]/gv;
console.log('123 ٤٥٦ 789'.match(asciiDigits));
// ["1", "2", "3", "7", "8", "9"] - only ASCII digits, not Arabic

Intersection is powerful for narrowing down broad categories:

// Lowercase Latin letters only (not Cyrillic, Greek, etc.)
const lowercaseLatin = /[\p{Lowercase_Letter}&&\p{Script=Latin}]+/gv;
console.log('hello WORLD Привет café'.match(lowercaseLatin));
//  ['hello', 'café']

const text = 'helloWorldcafé';
console.log(text.match(/[\p{Lowercase_Letter}&&\p{Script=Latin}]+/gv));
// ['hello', 'orldcafé'] - uppercase W breaks the match

Subtraction (--): Match characters in the first set but not in the second:

// All letters EXCEPT ASCII letters
const nonAsciiLetters = /[\p{Letter}--\p{ASCII}]+/gv;
console.log('Hello café Привет 世界'.match(nonAsciiLetters));
// ["é", "Привет", "世界"]

// All digits EXCEPT ASCII digits (non-Western numerals)
const nonAsciiDigits = /[\p{Number}--[0-9]]+/gv;
console.log('123 ٤٥٦ ⅓ ①②③'.match(nonAsciiDigits));
// ["٤٥٦", "⅓", "①②③"]

// All whitespace EXCEPT regular space (find "unusual" whitespace)
const unusualWhitespace = /[\p{White_Space}--[ ]]/gv;
const text = 'hello\tworld\u00A0test\nend';
console.log([...text.matchAll(unusualWhitespace)].map(m => 
  `U+${m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}`
));
// ["U+0009", "U+00A0", "U+000A"] - tab, non-breaking space, newline

Nested operations: Set operations can be combined:

// Latin letters that are NOT vowels
const latinConsonants = /[[\p{Script=Latin}&&\p{Letter}]--[aeiouAEIOU]]+/gv;
console.log('Hello World'.match(latinConsonants));
// ["H", "ll", "W", "rld"] - each vowel ends the current run

// Punctuation that is NOT a dash
const nonDashPunctuation = /[\p{Punctuation}--\p{Dash_Punctuation}]/gv;
console.log('hello-world! foo-bar? (test)'.match(nonDashPunctuation));
// ["!", "?", "(", ")"]

String Properties (Properties of Strings)

The v flag introduces support for Unicode properties that match multi-code-point sequences, not just individual characters. This is especially important for emoji, where many emoji are composed of multiple code points joined together.

// \p{RGI_Emoji} - Recommended for General Interchange emoji
// Matches complete emoji sequences, including compound emoji

const text = 'Hello 👋🏽 Family: 👨‍👩‍👧‍👦 Flag: 🇯🇵';

// With v: \p{RGI_Emoji} matches complete emoji sequences
const emoji = text.match(/\p{RGI_Emoji}/gv);
console.log(emoji);
// Matches complete emoji including skin tone modifiers, 
// family sequences, and flag sequences
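To make the result concrete, the same three emoji can be written with escapes and counted. This is a sketch under two assumptions: the engine's Unicode data treats these sequences as RGI (they are long-established ones), and the hypothetical helper `matchRgiEmoji` returns null where the v flag is unavailable.

```javascript
// Built with the RegExp constructor so the code still parses where v is unsupported
function matchRgiEmoji(text) {
  let pattern;
  try {
    pattern = new RegExp('\\p{RGI_Emoji}', 'gv');
  } catch {
    return null; // engine without v-flag support
  }
  return text.match(pattern) || [];
}

const waving = '\u{1F44B}\u{1F3FD}';                                          // hand + skin tone
const familyOfFour = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // ZWJ sequence
const japanFlag = '\u{1F1EF}\u{1F1F5}';                                       // regional indicators J + P

const found = matchRgiEmoji(`Hello ${waving} Family: ${familyOfFour} Flag: ${japanFlag}`);
console.log(found && found.length);              // 3 where v is supported
console.log(found && found[1] === familyOfFour); // true: the whole sequence is one match
```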

Available string properties include:

Property	Description
`Basic_Emoji`	Basic single emoji characters
`Emoji_Keycap_Sequence`	Keycap emoji (1️⃣, 2️⃣, etc.)
`RGI_Emoji`	All recommended emoji (comprehensive)
`RGI_Emoji_Flag_Sequence`	Flag emoji (🇺🇸, 🇯🇵, etc.)
`RGI_Emoji_Modifier_Sequence`	Emoji with skin tone modifiers (👋🏽)
`RGI_Emoji_Tag_Sequence`	Tag-based sequences (🏴󠁧󠁢󠁥󠁮󠁧󠁿)
`RGI_Emoji_ZWJ_Sequence`	ZWJ sequences (👨‍👩‍👧‍👦)

// Match flag emoji specifically
const flags = '🇺🇸 🇯🇵 🇫🇷 🇩🇪 hello';
console.log(flags.match(/\p{RGI_Emoji_Flag_Sequence}/gv));
// ["🇺🇸", "🇯🇵", "🇫🇷", "🇩🇪"]

// Match emoji with skin tone modifiers
const text = '👋 👋🏻 👋🏽 👋🏿';
console.log(text.match(/\p{RGI_Emoji_Modifier_Sequence}/gv));
// ["👋🏻", "👋🏽", "👋🏿"] - only the modified versions

note

String properties require the v flag. They work both standalone (\p{RGI_Emoji}) and inside character classes ([\p{RGI_Emoji}\p{Letter}]). Because they can match more than one code point, they cannot be negated: \P{RGI_Emoji} and [^\p{RGI_Emoji}] are both syntax errors.

Improved Character Class Syntax

The v flag also changes how character classes handle certain characters. String literals (multi-character strings) can be included in character classes:

// With v: string alternatives inside character classes
const regex = /[\q{abc|def|ghi}]/v;
// Matches "abc", "def", or "ghi" as complete strings

// This is useful for matching specific multi-character sequences
// within a broader character class context

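The \q{...} syntax is part of the v flag, so it is available wherever v is supported, and it is only valid inside a character class. When several alternatives could match at the same position, the longest one is tried first. In practice, the most common source of multi-character matches is a Unicode string property such as \p{RGI_Emoji}.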
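A short illustration of mixing string alternatives with ordinary class members follows. The unit example and the helper name `buildUnitPattern` are made up for this sketch; the pattern is built with the RegExp constructor and returns null where the v flag is not supported.

```javascript
// Hypothetical helper: a class that mixes multi-character strings with a single character
function buildUnitPattern() {
  try {
    return new RegExp('^\\d+[\\q{km|cm|mm}m]$', 'v');
  } catch {
    return null; // engine without v-flag support
  }
}

const unit = buildUnitPattern();
if (unit) {
  console.log(unit.test('12km')); // true  - string alternative "km"
  console.log(unit.test('12mm')); // true  - string alternative "mm"
  console.log(unit.test('12m'));  // true  - single character "m"
  console.log(unit.test('12k'));  // false - "k" alone is not in the class
}
```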

Stricter Syntax in `v` Mode

The v flag is even stricter than u about certain syntax patterns in character classes:

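// With u: these character classes are accepted
/[a-z_]/u;  // OK (also valid with v)
/[-az]/u;   // OK: a leading hyphen is a literal
/[(|)]/u;   // OK: ( | ) are literals inside a class

// With v: ( ) [ ] { } / - | must be escaped inside a character class,
// even at the start or end of the class
// /[-az]/v  - SyntaxError
// /[az-]/v  - SyntaxError
// /[(|)]/v  - SyntaxError
/[a\-z]/v;    // Escaped hyphen: matches "a", "-", or "z"
/[\-az]/v;    // Same set, hyphen listed first
/[\(\|\)]/v;  // Matches "(", "|", or ")"

// With v: doubled punctuators are operators (&&, --) or reserved (!!, ##, ...)
// A literal & next to another & must be escaped: /[a\&\&b]/v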

When to Use `v` vs. `u`

// Use u when:
// - You need Unicode support
// - You need broad browser compatibility
// - You don't need set operations or string properties
const basic = /\p{Letter}+/gu;

// Use v when:
// - You need set operations (intersection, subtraction)
// - You need string properties (compound emoji matching)
// - You're targeting modern environments only
const advanced = /[\p{Letter}--\p{ASCII}]+/gv;

Browser support for v (as of 2024):

Chrome 112+
Firefox 116+
Safari 17+
Node.js 20+

If you need to support older environments, use u and work around the limitations. For new projects targeting modern browsers, v is the better choice.
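For single-code-point sets, the most common workaround is a lookahead: intersection [A&&B] can be emulated as (?=A)B and subtraction [A--B] as (?!B)A. The sketch below reproduces two of the earlier set-operation examples with the u flag only; it does not cover string properties such as \p{RGI_Emoji}, which have no u-mode equivalent.

```javascript
// Intersection: Greek script AND letter, written as lookahead + property
const greekLetters = /(?=\p{Script=Greek})\p{Letter}/gu;
console.log('Ω π 42 Σ + ='.match(greekLetters));
// ["Ω", "π", "Σ"]

// Subtraction: letters EXCEPT ASCII, written as negative lookahead + property
const nonAsciiLetters = /(?:(?!\p{ASCII})\p{Letter})+/gu;
console.log('Hello café Привет 世界'.match(nonAsciiLetters));
// ["é", "Привет", "世界"]
```

The lookahead form is more verbose and has to be repeated for every character position, which is why the grouped (?:...)+ is needed in the second pattern.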

Practical Example: Comprehensive Emoji Replacement

// Replace all emoji with a text placeholder
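function replaceEmoji(text, replacement = '[emoji]') {
  let pattern;
  try {
    // Preferred: v flag with RGI_Emoji catches compound emoji.
    // new RegExp is required here: a /.../v literal is a parse-time error
    // in engines without v support, which try/catch cannot intercept
    pattern = new RegExp('\\p{RGI_Emoji}', 'gv');
  } catch {
    // Fallback: u flag with simpler emoji matching
    pattern = /\p{Emoji_Presentation}/gu;
  }
  return text.replace(pattern, replacement);
}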

console.log(replaceEmoji('Hello 👋🏽 World 🌍'));
// "Hello [emoji] World [emoji]"

console.log(replaceEmoji('Family: 👨‍👩‍👧‍👦'));
// "Family: [emoji]" (v flag matches the entire family as one emoji)

Practical Example: Filtering by Script While Excluding Certain Characters

// Accept only Latin letters, common punctuation, and digits
// But exclude certain symbols and control characters
function validateLatinInput(input) {
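  // With v flag: operators cannot be mixed at one level of a class, so the
  // union is nested in its own brackets before the subtraction is applied.
  // Currency symbols already fall outside the union; the subtraction makes
  // the exclusion explicit.
  const allowed = /^[[\p{Script=Latin}\p{Number}\p{Space_Separator}\p{Punctuation}]--\p{Currency_Symbol}]+$/v;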
  return allowed.test(input);
}

console.log(validateLatinInput('Hello, World!'));     // true
console.log(validateLatinInput('café résumé'));       // true
console.log(validateLatinInput('Price: $100'));       // false ($ is currency)
console.log(validateLatinInput('Привет'));            // false (Cyrillic)

Practical Example: International Search

// Search that handles diacritics and case for any script
function createFlexibleSearch(searchTerm) {
  // Normalize the search term
  const normalized = searchTerm.normalize('NFD').replace(/\p{Mark}/gu, '');

  // Build a pattern that matches with or without diacritics
  let pattern = '';
  for (const char of normalized) {
    // Each character can optionally be followed by combining marks
    pattern += escapeRegExp(char) + '\\p{Mark}*';
  }

  return new RegExp(pattern, 'giu');
}

function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

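const searchRegex = createFlexibleSearch('cafe');
// The searched text must be decomposed as well: a precomposed é (U+00E9)
// is a single code point, not "e" followed by a combining mark
const text = 'Visit the café or the CAFÉ or the Cafe!'.normalize('NFD');
console.log(text.match(searchRegex).map(m => m.normalize('NFC')));
// ["café", "CAFÉ", "Cafe"]
// Matches regardless of accents and case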

Summary

Unicode support in JavaScript regular expressions transforms the regex engine from an ASCII-centric tool into one that handles the full range of human writing systems:

The u flag enables Unicode mode: surrogate pairs are treated as single characters, \u{XXXXX} code point escapes become available, \p{...} property escapes are enabled, and invalid escapes throw errors instead of being silently accepted. Use it whenever working with text that may contain non-ASCII characters.
Unicode property escapes (\p{...}) match characters by their Unicode properties: \p{Letter} for any letter, \p{Number} for any numeric character, \p{Script=Latin} for Latin script, \p{Emoji} for emoji, and many more. The inverse \P{...} matches characters without the property.
The v flag (ES2024) is the successor to u, adding set operations in character classes (&& for intersection, -- for subtraction) and string properties (\p{RGI_Emoji}) that match multi-code-point sequences like compound emoji, flag sequences, and skin-tone-modified emoji.
For international applications, prefer \p{Letter} over \w, \p{Number} (or \p{Decimal_Number} for digits only) over \d, and \p{Script=...} for script-specific matching. The traditional \w and \d classes only cover ASCII, which is insufficient for global text processing.
Use v in modern projects for the most powerful and precise Unicode matching. Fall back to u for broader browser compatibility.
