
How to Use Unicode in JavaScript Regular Expressions

JavaScript strings are encoded in UTF-16, which means many characters that users encounter daily, such as most emoji, rarer CJK ideographs, mathematical alphanumeric symbols, and characters from many historic and minority writing systems, are represented as two code units (a surrogate pair) rather than one. Without proper Unicode handling, regular expressions treat these multi-unit characters as two separate characters, leading to broken matches, miscounted characters, and patterns that silently fail on international text.

The u and v flags fix this by enabling full Unicode awareness in the regex engine. Beyond correct character handling, they unlock Unicode property escapes, a powerful feature that lets you match characters by their Unicode category: letters from any script, digits from any numeral system, punctuation, emoji, and much more. This guide covers why Unicode mode matters, how to use Unicode property escapes effectively, and what the newer v flag adds with set operations and string properties.

The u Flag: Unicode Mode

The u flag switches the JavaScript regex engine into Unicode mode. This changes several fundamental behaviors: how characters are counted, how escape sequences are interpreted, and what features are available.

The Problem Without u

Without the u flag, JavaScript regex operates in a legacy mode where strings are treated as sequences of 16-bit code units. Characters outside the Basic Multilingual Plane (BMP), which include most emoji, rarer CJK ideographs, musical symbols, mathematical alphanumeric symbols, and historic scripts, are encoded as two code units called a surrogate pair.

const emoji = '😀';

// JavaScript string sees two code units
console.log(emoji.length); // 2

// Without u: the dot matches ONE code unit (half the emoji)
console.log(emoji.match(/./));
// ["\ud83d"] - matches only the first surrogate, not the full emoji

// Without u: the regex sees two "characters"
console.log(emoji.match(/../));
// ["😀"] - needed TWO dots to match one emoji

// Without u: anchored patterns count code units, not characters
console.log(/^.$/.test(emoji)); // false - emoji is "two characters"
console.log(/^..$/.test(emoji)); // true - two code units

This is clearly wrong. The emoji 😀 is a single character to the user, but the regex engine treats it as two.
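
To see the mismatch directly, the following sketch lists both views of a string: the UTF-16 code units it is stored as, and the code points it reads as. The helper name describeString is made up for this illustration.

```javascript
// Hypothetical helper: show a string's code units next to its code points.
function describeString(str) {
  const codeUnits = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt reads one 16-bit code unit at a time
    codeUnits.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0'));
  }
  // String iteration (spread, for...of) walks code points, not code units
  const codePoints = [...str].map(
    ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
  return { codeUnits, codePoints };
}

console.log(describeString('😀'));
// { codeUnits: [ 'D83D', 'DE00' ], codePoints: [ 'U+1F600' ] }
console.log(describeString('A'));
// { codeUnits: [ '0041' ], codePoints: [ 'U+0041' ] }
```

The two code units D83D and DE00 are the surrogate pair that a regex without the u flag treats as two separate characters.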

Enabling Unicode Mode

Adding the u flag tells the regex engine to treat the input as a sequence of Unicode code points rather than code units. Surrogate pairs are handled as single characters:

const emoji = '😀';

// With u: the dot matches the full code point
console.log(emoji.match(/./u));
// ["😀"] - correctly matches the entire emoji

// With u: one dot = one character
console.log(/^.$/u.test(emoji)); // true - emoji is one character
console.log(/^..$/u.test(emoji)); // false - it's not two characters

Impact on Character Counting

const text = 'Hello 🌍 World';

// Without u: counts code units
console.log(text.match(/./g).length); // 14 (🌍 counted as 2)

// With u: counts code points
console.log(text.match(/./gu).length); // 13 (🌍 counted as 1)
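
Counting by code point matters most when text is cut to a maximum length. The sketch below, with the hypothetical helper truncateCodePoints, truncates on code point boundaries so a surrogate pair is never split. It adds the s flag so that the dot also matches line breaks.

```javascript
// Hypothetical helper: keep at most maxCodePoints code points of text.
function truncateCodePoints(text, maxCodePoints) {
  // /./gsu yields one match per code point; s lets . match newlines too
  const codePoints = text.match(/./gsu) || [];
  return codePoints.slice(0, maxCodePoints).join('');
}

console.log(truncateCodePoints('Hi 🌍!', 4)); // "Hi 🌍"

// String.prototype.slice counts code units and cuts the emoji in half
console.log('Hi 🌍!'.slice(0, 4) === 'Hi \uD83C'); // true - ends in a lone surrogate
```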

Impact on Character Ranges in Sets

Without u, using characters outside the BMP in character class ranges can produce errors or unexpected results:

// Without u: each emoji is two code units, so the "range" actually runs
// from a low surrogate to a high surrogate, which is out of order.
// new RegExp is used here because an invalid regex literal fails when
// the script is parsed, before try/catch can run.
try {
  const regex = new RegExp('[😀-😜]');
} catch (e) {
  console.log('Error:', e.message); // SyntaxError (range out of order)
}

// With u: ranges work correctly with Unicode characters
const emojiRange = /[😀-😜]/u;
console.log(emojiRange.test('😀')); // true
console.log(emojiRange.test('😃')); // true
console.log(emojiRange.test('😜')); // true
console.log(emojiRange.test('😡')); // false (outside range)

Strict Escape Handling

The u flag makes the regex engine strict about escape sequences. Invalid escapes that would be silently accepted in legacy mode throw a SyntaxError:

// Without u: \a is not a valid escape, silently treated as literal "a"
console.log(/\a/.test('a')); // true (sloppy behavior)

// With u: \a throws because it's not a recognized escape.
// A literal /\a/u would be rejected when the script is parsed, so the
// pattern is built with new RegExp to make the error catchable.
try {
  const regex = new RegExp('\\a', 'u');
} catch (e) {
  console.log(e.message); // e.g. Invalid regular expression: /\a/u: Invalid escape
}

This strictness helps catch typos and mistakes in patterns:

// Without u: these silently work but may not do what you expect
/\p/.test('p'); // true - \p is treated as literal "p"

// With u: \p without braces is an error (it expects \p{...})
// /\p/u - SyntaxError

// Without u: escaped characters that don't need escaping
/\:/; // Works (unnecessary escape, treated as literal ":")

// With u: escaping a character that has no special meaning is an error
// /\:/u - SyntaxError: Invalid escape

Tip: The strict escape handling of the u flag is a feature, not a limitation. It catches mistakes that would otherwise produce subtle bugs. Always use the u flag unless you have a specific reason not to, especially when working with text that may contain non-ASCII characters.
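
Because an invalid regex literal is rejected when the script is parsed, strictness is easiest to observe with patterns built at runtime. A minimal sketch, assuming a helper named checkPattern (not a built-in), that reports whether a pattern source compiles under given flags:

```javascript
// Hypothetical helper: report whether a pattern compiles with the given flags.
function checkPattern(source, flags = 'u') {
  try {
    new RegExp(source, flags);
    return { valid: true, error: null };
  } catch (err) {
    // Invalid patterns throw a SyntaxError from the RegExp constructor
    return { valid: false, error: err.message };
  }
}

console.log(checkPattern('\\a').valid);          // false - invalid escape in Unicode mode
console.log(checkPattern('\\a', '').valid);      // true  - legacy mode accepts it
console.log(checkPattern('\\p{Letter}+').valid); // true
```

A check like this can be useful for validating user-supplied or configuration-supplied patterns before they are used.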

Unicode Code Point Escapes

The u flag enables the \u{XXXX} syntax for specifying Unicode code points by their hex value. Without u, only the four-digit \uXXXX syntax works, which cannot represent a character above U+FFFF in a single escape:

// Without u: only 4-digit hex escapes work
console.log(/\u0041/.test('A')); // true (U+0041 = A)
// Cannot express characters above U+FFFF with \uXXXX

// With u: extended \u{XXXXX} syntax available
console.log(/\u{41}/u.test('A')); // true
console.log(/\u{1F600}/u.test('😀')); // true (U+1F600 = 😀)
console.log(/\u{1F30D}/u.test('🌍')); // true (U+1F30D = 🌍)
console.log(/\u{1F4A9}/u.test('💩')); // true

// Matching specific characters by code point
const checkmark = /\u{2713}/u;
console.log(checkmark.test('✓')); // true

// Matching a range of code points
const mathSymbols = /[\u{2200}-\u{22FF}]/u;
console.log(mathSymbols.test('∀')); // true (U+2200 FOR ALL)
console.log(mathSymbols.test('∑')); // true (U+2211 N-ARY SUMMATION)
console.log(mathSymbols.test('A')); // false
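
When patterns are generated from data, it can help to produce these escapes programmatically instead of pasting raw characters into the source. The following sketch assumes two illustrative helpers, toCodePointEscape and codePointClass, and uses the Emoticons block (U+1F600 to U+1F64F) as sample input.

```javascript
// Hypothetical helper: turn a character into its \u{...} escape text.
function toCodePointEscape(char) {
  return '\\u{' + char.codePointAt(0).toString(16).toUpperCase() + '}';
}

// Hypothetical helper: build a Unicode-mode class covering [from-to].
function codePointClass(from, to) {
  const source = '[' + toCodePointEscape(from) + '-' + toCodePointEscape(to) + ']';
  return new RegExp(source, 'u'); // the u flag is required for \u{...}
}

console.log(toCodePointEscape('😀')); // \u{1F600}

const emoticons = codePointClass('\u{1F600}', '\u{1F64F}');
console.log(emoticons.test('😃')); // true
console.log(emoticons.test('A'));  // false
```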

Quantifiers and Unicode Characters

With the u flag, quantifiers correctly apply to entire Unicode characters rather than individual code units:

const text = '😀😀😀';

// Without u: + applies only to the last code unit of the emoji
// (its low surrogate), so only the first emoji is matched
console.log(text.match(/😀+/));
// ["😀"]

// With u: + applies to the full emoji character
console.log(text.match(/😀+/u));
// ["😀😀😀"]

// Counting emoji
console.log(text.match(/😀/gu).length); // 3

// Matching exactly 2 emoji
console.log(/^😀{2}$/u.test('😀😀')); // true
console.log(/^😀{2}$/u.test('😀😀😀')); // false
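
If a pattern has to run without the u flag for some reason, one possible workaround is to wrap the character in a non-capturing group so the quantifier covers both code units. This is only a sketch of that workaround; the u flag remains the cleaner fix.

```javascript
const grin = '😀';
const run = grin.repeat(3);

// Ungrouped, legacy mode: + binds to the low surrogate only
const ungrouped = run.match(/😀+/)[0];
console.log(ungrouped === grin); // true - just the first emoji

// Grouped, legacy mode: + repeats the whole surrogate pair
const grouped = run.match(/(?:😀)+/)[0];
console.log(grouped === run); // true - all three emoji
```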

Complete Comparison: With and Without u

| Behavior | Without u | With u |
|---|---|---|
| . matches emoji | No (matches half) | Yes (full character) |
| Surrogate pairs | Treated as two chars | Treated as one char |
| \u{XXXXX} syntax | Not available | Available |
| Invalid escapes | Silently accepted | Throw SyntaxError |
| \p{...} properties | Not available | Available |
| Character ranges with non-BMP | Broken/unpredictable | Correct |
| Strictness | Sloppy | Strict |

Unicode Properties: \p{...} and \P{...}

Unicode property escapes are the most powerful feature unlocked by the u flag. Every character in Unicode has a set of properties that describe what it is: a letter, a digit, a punctuation mark, a symbol, which script it belongs to, and more. The \p{...} syntax lets you match characters based on these properties instead of listing specific character ranges.

Basic Syntax

  • \p{PropertyName} matches any character that has the specified property (or property value)
  • \P{PropertyName} matches any character that does not have the property (inverse)

Both require the u or v flag.

// Match any Unicode letter
console.log('café'.match(/\p{Letter}+/gu));
// ["café"] - includes the accented é

// Compare with \w which misses non-ASCII letters
console.log('café'.match(/\w+/g));
// ["caf"] - é is not matched by \w

General Categories

Unicode assigns every character a General Category. These are the most commonly used property values:

Letters:

// \p{Letter} or \p{L} - any letter from any script
const text = 'Hello Привет 你好 مرحبا';
console.log(text.match(/\p{Letter}+/gu));
// ["Hello", "Привет", "你好", "مرحبا"]

// Subcategories of Letter:
// \p{Lowercase_Letter} or \p{Ll} - lowercase letters
// \p{Uppercase_Letter} or \p{Lu} - uppercase letters
// \p{Titlecase_Letter} or \p{Lt} - titlecase letters (e.g., ǅ)
// \p{Modifier_Letter} or \p{Lm} - modifier letters
// \p{Other_Letter} or \p{Lo} - letters without case (CJK, Arabic, etc.)

console.log('Hello World'.match(/\p{Lowercase_Letter}+/gu));
// ["ello", "orld"]

console.log('Hello World'.match(/\p{Uppercase_Letter}/gu));
// ["H", "W"]

Numbers:

// \p{Number} or \p{N} - any numeric character
const mixed = 'Price: 42 or ٤٢ or ⅓';
console.log(mixed.match(/\p{Number}+/gu));
// ["42", "٤٢", "⅓"]

// Subcategories:
// \p{Decimal_Number} or \p{Nd} - decimal digits (0-9, ٠-٩, etc.)
// \p{Letter_Number} or \p{Nl} - letter-like numbers (Ⅰ, Ⅱ, Ⅲ, etc.)
// \p{Other_Number} or \p{No} - other numeric (fractions, superscripts, etc.)

console.log('Test ① ② ③'.match(/\p{Other_Number}/gu));
// ["①", "②", "③"]

Punctuation:

// \p{Punctuation} or \p{P} - any punctuation
const text = 'Hello, world! How are you? (Fine.)';
console.log(text.match(/\p{Punctuation}/gu));
// [",", "!", "?", "(", ".", ")"]

// Subcategories:
// \p{Dash_Punctuation} or \p{Pd} - dashes (-, –, —)
// \p{Open_Punctuation} or \p{Ps} - opening brackets ((, [, {)
// \p{Close_Punctuation} or \p{Pe} - closing brackets (), ], })
// \p{Connector_Punctuation} or \p{Pc} - connector (_)
// \p{Other_Punctuation} or \p{Po} - other (!, ?, #, etc.)

const dashes = 'word-hyphen or en–dash or em—dash';
console.log(dashes.match(/\p{Dash_Punctuation}/gu));
// ["-", "–", "—"]

Symbols:

// \p{Symbol} or \p{S} - any symbol
const text = 'Price: $100 + €50 = ¥20000 ™';
console.log(text.match(/\p{Symbol}/gu));
// ["$", "+", "€", "=", "¥", "™"]

// Subcategories:
// \p{Currency_Symbol} or \p{Sc} - currency symbols
// \p{Math_Symbol} or \p{Sm} - math symbols
// \p{Modifier_Symbol} or \p{Sk} - modifier symbols
// \p{Other_Symbol} or \p{So} - other symbols

console.log(text.match(/\p{Currency_Symbol}/gu));
// ["$", "€", "¥"]

console.log('2 + 3 = 5 × 10 ÷ 2'.match(/\p{Math_Symbol}/gu));
// ["+", "=", "×", "÷"]

Whitespace and Separators:

// \p{Separator} or \p{Z} - any separator
// \p{Space_Separator} or \p{Zs} - space characters
// \p{Line_Separator} or \p{Zl} - line separators
// \p{Paragraph_Separator} or \p{Zp} - paragraph separators

// Includes non-breaking space, em space, etc.
const text = 'hello\u00A0world'; // non-breaking space
console.log(text.match(/\p{Space_Separator}/gu));
// ["\u00A0"] - matches the non-breaking space
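
The major categories above can be combined into a small classifier. The sketch below is illustrative only: the table CATEGORIES and the function majorCategory are names invented here, and anything that falls outside the six listed groups (control characters, unassigned code points, and so on) is reported as "Other".

```javascript
// Hypothetical lookup table: one anchored pattern per major General Category.
const CATEGORIES = [
  ['Letter', /^\p{Letter}$/u],
  ['Mark', /^\p{Mark}$/u],
  ['Number', /^\p{Number}$/u],
  ['Punctuation', /^\p{Punctuation}$/u],
  ['Symbol', /^\p{Symbol}$/u],
  ['Separator', /^\p{Separator}$/u],
];

// Return the major category name for a single code point.
function majorCategory(char) {
  const found = CATEGORIES.find(([, regex]) => regex.test(char));
  return found ? found[0] : 'Other';
}

console.log(majorCategory('A'));      // "Letter"
console.log(majorCategory('٤'));      // "Number" (Arabic-Indic digit four)
console.log(majorCategory('€'));      // "Symbol"
console.log(majorCategory('!'));      // "Punctuation"
console.log(majorCategory('\u00A0')); // "Separator" (non-breaking space)
console.log(majorCategory('\t'));     // "Other" (control character)
```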

Binary Properties

Binary properties are yes/no characteristics of a character. They are used without a value:

// \p{Alphabetic} - all alphabetic characters (broader than \p{Letter})
console.log('abc123'.match(/\p{Alphabetic}+/gu)); // ["abc"]

// \p{ASCII} - ASCII characters only (U+0000 to U+007F)
console.log('Hello café 🌍'.match(/\p{ASCII}+/gu)); // ["Hello caf", " "]

// \p{Emoji} - emoji characters
const text = 'Hello 👋 World 🌍 JavaScript 🚀';
console.log(text.match(/\p{Emoji}/gu));
// ["👋", "🌍", "🚀"]
// Note: some digits and characters like # also have Emoji property

// \p{White_Space} - all whitespace characters
console.log('hello\tworld\n'.match(/\p{White_Space}/gu));
// ["\t", "\n"]

// \p{Hex_Digit} - valid hexadecimal digits
console.log('0123456789abcdefGHIJ'.match(/\p{Hex_Digit}+/gu));
// ["0123456789abcdef"]
// Note: uppercase A-F also match

// \p{ASCII_Hex_Digit} - same but limited to ASCII
console.log('0x1F600'.match(/\p{ASCII_Hex_Digit}+/gu));
// ["0", "1F600"]
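
The caveat about \p{Emoji} is easy to trip over, so here is a small sketch comparing it with \p{Emoji_Presentation}, which only covers characters that are displayed as emoji by default. The helper name emojiProperties is an assumption made for this example.

```javascript
// Hypothetical helper: report two emoji-related binary properties.
function emojiProperties(char) {
  return {
    emoji: /^\p{Emoji}$/u.test(char),
    emojiPresentation: /^\p{Emoji_Presentation}$/u.test(char),
  };
}

console.log(emojiProperties('1'));  // { emoji: true, emojiPresentation: false }
console.log(emojiProperties('#'));  // { emoji: true, emojiPresentation: false }
console.log(emojiProperties('😀')); // { emoji: true, emojiPresentation: true }
console.log(emojiProperties('a'));  // { emoji: false, emojiPresentation: false }
```

Digits and # carry the Emoji property because they can start keycap sequences, which is why \p{Emoji} alone is a poor test for "does this text contain emoji".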

Script Properties

Every character belongs to a Unicode Script (the writing system it is used in). You can match characters from specific scripts:

// \p{Script=Latin} - Latin script characters
const mixed = 'Hello Привет 你好 こんにちは مرحبا';

console.log(mixed.match(/\p{Script=Latin}+/gu));
// ["Hello"]

console.log(mixed.match(/\p{Script=Cyrillic}+/gu));
// ["Привет"]

console.log(mixed.match(/\p{Script=Han}+/gu));
// ["你好"] - Chinese (Han) characters

console.log(mixed.match(/\p{Script=Hiragana}+/gu));
// ["こんにちは"]

console.log(mixed.match(/\p{Script=Arabic}+/gu));
// ["مرحبا"]

The short form \p{sc=Latin} also works:

console.log('café'.match(/\p{sc=Latin}+/gu));
// ["café"]

Script_Extensions is a broader property that includes characters used in multiple scripts:

// Script_Extensions (scx) includes characters shared between scripts
// For example, common punctuation is shared across scripts

console.log('Ω'.match(/\p{Script=Greek}/u)); // ["Ω"]
console.log('Ω'.match(/\p{Script_Extensions=Greek}/u)); // ["Ω"]
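
Since Ω gives the same answer for both properties, a character where they differ may make the distinction clearer. The Arabic-Indic digit two (U+0662) has Arabic as its Script, but it is also used with the Thaana script, which only Script_Extensions records. A minimal sketch:

```javascript
const arabicIndicTwo = '\u0662'; // ٢

// Script: exactly one value per character
console.log(/\p{Script=Arabic}/u.test(arabicIndicTwo)); // true
console.log(/\p{Script=Thaana}/u.test(arabicIndicTwo)); // false

// Script_Extensions: every script the character is used with
console.log(/\p{Script_Extensions=Thaana}/u.test(arabicIndicTwo)); // true
console.log(/\p{scx=Arabic}/u.test(arabicIndicTwo));               // true
```

For "is this text written in script X" checks, Script_Extensions is usually the more forgiving choice because it accepts shared digits and marks.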

Common Script Property Values

| Value | Writing System | Example Characters |
|---|---|---|
| Latin | Latin alphabet | A, é, ñ, ü |
| Cyrillic | Russian, Ukrainian, etc. | Д, Ж, Щ |
| Greek | Greek | Ω, Σ, π |
| Arabic | Arabic, Farsi, Urdu | ع, ب, ت |
| Hebrew | Hebrew | א, ב, ג |
| Han | Chinese characters (also used in Japanese, Korean) | 中, 文, 字 |
| Hiragana | Japanese hiragana | あ, い, う |
| Katakana | Japanese katakana | ア, イ, ウ |
| Hangul | Korean | 가, 나, 다 |
| Devanagari | Hindi, Sanskrit, etc. | अ, आ, इ |
| Thai | Thai | ก, ข, ค |
| Georgian | Georgian | ა, ბ, გ |
| Armenian | Armenian | Ա, Բ, Գ |
| Ethiopic | Amharic, Tigrinya, etc. | ሀ, ለ, ሐ |

The Inverse: \P{...}

The uppercase \P matches any character that does not have the specified property:

// \P{Letter} - anything that is NOT a letter
const text = 'Hello, World! 123';
console.log(text.match(/\P{Letter}+/gu));
// [", ", "! 123"]

// \P{Number} - anything that is NOT a number
console.log(text.match(/\P{Number}+/gu));
// ["Hello, World! "]

// \P{ASCII} - non-ASCII characters
const international = 'Hello café 世界 🌍';
console.log(international.match(/\P{ASCII}+/gu));
// ["é", "世界", "🌍"]
// Note: spaces are ASCII, so each space ends the current run
// of non-ASCII characters

// One match per non-ASCII code point:
console.log(international.match(/\P{ASCII}/gu));
// ["é", "世", "界", "🌍"]

Practical Examples with Unicode Properties

Language-Aware Word Matching:

// \w only matches ASCII word characters
const text = 'The café serves naïve crème brûlée';

// ❌ \w breaks on accented characters
console.log(text.match(/\w+/g));
// ['The', 'caf', 'serves', 'na', 've', 'cr', 'me', 'br', 'l', 'e']

// ✅ \p{Letter} handles all Unicode letters
console.log(text.match(/[\p{Letter}\p{Mark}]+/gu));
// ['The', 'café', 'serves', 'naïve', 'crème', 'brûlée']

Note: The \p{Mark} category matches combining marks (accents, diacritics) that modify the preceding letter. Including it alongside \p{Letter} ensures that characters like é (which can be composed of e + combining acute accent) are matched as part of the word.
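
The effect of \p{Mark} only shows up when the text is in decomposed form, so the sketch below spells the accent out as a separate code point (U+0301). The function names letterRuns and wordRuns are illustrative.

```javascript
// Hypothetical helpers: the same tokenizer with and without \p{Mark}.
function letterRuns(text) {
  return text.match(/\p{Letter}+/gu) || [];
}
function wordRuns(text) {
  return text.match(/[\p{Letter}\p{Mark}]+/gu) || [];
}

const precomposed = 'caf\u00E9'; // é as one code point
const decomposed = 'cafe\u0301'; // e followed by a combining acute accent

console.log(letterRuns(precomposed)); // ["café"]
console.log(letterRuns(decomposed));  // ["cafe"] - the accent is dropped
console.log(wordRuns(decomposed)[0] === decomposed); // true - accent kept
```

Both strings look identical on screen, which is why the accent-dropping variant can go unnoticed until decomposed input arrives.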

Detecting the Script of Text:

function detectScript(text) {
  const scripts = [
    { name: 'Latin', regex: /\p{Script=Latin}/u },
    { name: 'Cyrillic', regex: /\p{Script=Cyrillic}/u },
    { name: 'Arabic', regex: /\p{Script=Arabic}/u },
    { name: 'Han', regex: /\p{Script=Han}/u },
    { name: 'Hiragana', regex: /\p{Script=Hiragana}/u },
    { name: 'Katakana', regex: /\p{Script=Katakana}/u },
    { name: 'Hangul', regex: /\p{Script=Hangul}/u },
    { name: 'Devanagari', regex: /\p{Script=Devanagari}/u },
    { name: 'Thai', regex: /\p{Script=Thai}/u },
    { name: 'Greek', regex: /\p{Script=Greek}/u }
  ];

  const detected = scripts.filter(s => s.regex.test(text)).map(s => s.name);
  return detected.length > 0 ? detected : ['Unknown'];
}

console.log(detectScript('Hello World')); // ["Latin"]
console.log(detectScript('Привет')); // ["Cyrillic"]
console.log(detectScript('こんにちは世界')); // ["Han", "Hiragana"]
console.log(detectScript('Hello Привет')); // ["Latin", "Cyrillic"]
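
A presence check treats one borrowed word the same as a whole paragraph. One possible refinement, sketched here with the invented names SCRIPT_NAMES, countByScript, and dominantScript, is to count code points per script and report the most frequent one.

```javascript
// Hypothetical list of scripts to consider; extend as needed.
const SCRIPT_NAMES = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Han', 'Hiragana', 'Katakana', 'Hangul'];

// Count how many code points of text belong to each listed script.
function countByScript(text) {
  const counts = {};
  for (const name of SCRIPT_NAMES) {
    // Names come from the fixed list above, so interpolating them is safe
    const matches = text.match(new RegExp(`\\p{Script=${name}}`, 'gu'));
    if (matches) counts[name] = matches.length;
  }
  return counts;
}

// Return the script with the most code points, or 'Unknown' if none match.
function dominantScript(text) {
  const entries = Object.entries(countByScript(text));
  if (entries.length === 0) return 'Unknown';
  entries.sort((a, b) => b[1] - a[1]);
  return entries[0][0];
}

console.log(countByScript('Hello Привет мир')); // { Latin: 5, Cyrillic: 9 }
console.log(dominantScript('Hello Привет мир')); // "Cyrillic"
console.log(dominantScript('123 !?'));           // "Unknown"
```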

Extracting Emoji from Text:

function extractEmoji(text) {
  // Match single code points that are displayed as emoji by default
  // (multi-code-point sequences are not joined)
  return text.match(/\p{Emoji_Presentation}/gu) || [];
}

console.log(extractEmoji('Having a great day! 😀🎉🚀'));
// ["😀", "🎉", "🚀"]

console.log(extractEmoji('No emoji here'));
// []

// Count emoji in text
function emojiCount(text) {
  return (text.match(/\p{Emoji_Presentation}/gu) || []).length;
}

console.log(emojiCount('Hello 👋 World 🌍')); // 2

Warning: Emoji handling is more complex than a single property can cover. Many emoji are sequences of multiple code points (family emoji, skin tone modifiers, flag sequences). \p{Emoji_Presentation} catches individual emoji characters but may not match all compound emoji sequences. For comprehensive emoji matching, the v flag with \p{RGI_Emoji} (covered below) is more reliable.
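
To make the limitation concrete, this sketch counts what \p{Emoji_Presentation} reports for two compound emoji. The sequences are written as escapes so that the individual code points are visible; the helper name emojiParts is hypothetical.

```javascript
// Hypothetical helper: every code point with default emoji presentation.
function emojiParts(text) {
  return text.match(/\p{Emoji_Presentation}/gu) || [];
}

// Waving hand + medium skin tone modifier (one visible emoji)
const wave = '\u{1F44B}\u{1F3FD}';
console.log(emojiParts(wave).length); // 2 - hand and modifier are separate matches

// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy (one visible emoji)
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(emojiParts(family).length); // 4 - one match per person, joiners skipped
```

A counter built on this property would report six emoji for what a reader sees as two.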

International Username Validation:

// Allow letters from any script, digits, underscores, and hyphens
function isValidInternationalUsername(username) {
  return /^[\p{Letter}\p{Number}_-]{3,30}$/u.test(username);
}

console.log(isValidInternationalUsername('alice')); // true
console.log(isValidInternationalUsername('用户名')); // true (Chinese)
console.log(isValidInternationalUsername('Пользователь')); // true (Russian)
console.log(isValidInternationalUsername('ab')); // false (too short)
console.log(isValidInternationalUsername('user name')); // false (space)

Removing Diacritics (Accent Marks):

function removeDiacritics(text) {
  // Normalize to NFD (decomposed form), then remove combining marks
  return text.normalize('NFD').replace(/\p{Mark}/gu, '');
}

console.log(removeDiacritics('café')); // "cafe"
console.log(removeDiacritics('résumé')); // "resume"
console.log(removeDiacritics('naïve')); // "naive"
console.log(removeDiacritics('über')); // "uber"
console.log(removeDiacritics('crème brûlée')); // "creme brulee"
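
One caveat with the NFD approach: it only works for letters that Unicode decomposes into a base letter plus a combining mark. Letters such as ø, ł, and đ have no such decomposition and pass through unchanged. The sketch below shows this and adds a small fold table as one possible supplement; stripMarks, EXTRA_FOLDS, and foldToBasicLatin are names invented here, and the table is deliberately incomplete.

```javascript
// Same technique as above, under a different name to keep this block standalone.
function stripMarks(text) {
  return text.normalize('NFD').replace(/\p{Mark}/gu, '');
}

console.log(stripMarks('Łódź'));  // "Łodz" - ó and ź fold, Ł does not
console.log(stripMarks('Søren')); // "Søren" - ø has no decomposition

// Hypothetical, incomplete fold table for letters without a decomposition.
const EXTRA_FOLDS = { 'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L', 'đ': 'd', 'Đ': 'D' };

function foldToBasicLatin(text) {
  return stripMarks(text).replace(/[øØłŁđĐ]/gu, ch => EXTRA_FOLDS[ch]);
}

console.log(foldToBasicLatin('Łódź'));  // "Lodz"
console.log(foldToBasicLatin('Søren')); // "Soren"
```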

Sanitizing Input While Preserving International Characters:

// Allow letters, numbers, spaces, and basic punctuation from any language
function sanitizeInternational(input) {
  return input.replace(/[^\p{Letter}\p{Number}\p{Punctuation}\p{Space_Separator}]/gu, '');
}

console.log(sanitizeInternational('Hello, 世界! 🌍'));
// "Hello, 世界! " - emoji removed (the space before it remains), CJK preserved

console.log(sanitizeInternational('Привет\x00мир'));
// "Приветмир" - control character removed, Cyrillic preserved

The v Flag: Unicode Sets (ES2024)

The v flag is a more powerful evolution of the u flag, introduced in ES2024. It is not just an incremental improvement. It adds entirely new capabilities: set operations inside character classes, properties of strings (matching multi-code-point sequences), and \q{...} string literals in character classes. The v flag covers nearly everything u does, so patterns written for u generally work with v. The main exception is that v requires a few more characters to be escaped inside character classes.

v Replaces u

You cannot use u and v together. The v flag is intended as the successor:

// ❌ Cannot combine u and v
// /pattern/uv - SyntaxError

// ✅ Use v for new code when targeting modern environments
const regex = /\p{Letter}+/gv;

The core features of the u flag (correct surrogate pair handling, \u{XXXXX} escapes, \p{...} properties, strict escaping) carry over to v. Character class syntax is the main difference: it gains set operations and becomes stricter about unescaped punctuation.
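
Because a /.../v literal is a parse-time error in engines that do not know the flag, support is best probed through the RegExp constructor. The following is a minimal sketch; supportsVFlag and unicodeFlags are illustrative names, not built-ins.

```javascript
// Hypothetical helper: true if this engine accepts the v flag.
function supportsVFlag() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false; // unknown flags make the constructor throw a SyntaxError
  }
}

// Hypothetical helper: append v when available, otherwise u.
function unicodeFlags(extra = '') {
  return extra + (supportsVFlag() ? 'v' : 'u');
}

// This pattern uses only syntax shared by both modes
const letters = new RegExp('\\p{Letter}+', unicodeFlags('g'));
console.log('Hello мир'.match(letters)); // ["Hello", "мир"]
```

This only helps for patterns whose syntax is valid in both modes. Patterns that rely on set operations or string properties still need a separate fallback.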

Set Operations in Character Classes

The most significant addition in the v flag is the ability to perform set operations inside character classes: intersection, subtraction, and union. This lets you combine or exclude character categories directly, where u-mode patterns had to fall back on lookahead workarounds.

Intersection (&&): Match characters that belong to both sets:

// Match characters that are BOTH Greek AND letters
const greekLetters = /[\p{Script=Greek}&&\p{Letter}]/gv;

console.log('Ω π 42 Σ + ='.match(greekLetters));
// ["Ω", "π", "Σ"] - Greek characters that are also letters

// Match characters that are BOTH ASCII AND digits
const asciiDigits = /[\p{ASCII}&&\p{Number}]/gv;
console.log('123 ٤٥٦ 789'.match(asciiDigits));
// ["1", "2", "3", "7", "8", "9"] - only ASCII digits, not Arabic-Indic

Intersection is powerful for narrowing down broad categories:

// Lowercase Latin letters only (not Cyrillic, Greek, etc.)
const lowercaseLatin = /[\p{Lowercase_Letter}&&\p{Script=Latin}]+/gv;
console.log('hello WORLD Привет café'.match(lowercaseLatin));
// ['hello', 'café']

const text = 'helloWorldcafé';
console.log(text.match(/[\p{Lowercase_Letter}&&\p{Script=Latin}]+/gv));
// ['hello', 'orldcafé'] - uppercase W breaks the match

Subtraction (--): Match characters in the first set but not in the second:

// All letters EXCEPT ASCII letters
const nonAsciiLetters = /[\p{Letter}--\p{ASCII}]+/gv;
console.log('Hello café Привет 世界'.match(nonAsciiLetters));
// ["é", "Привет", "世界"]

// All digits EXCEPT ASCII digits (non-Western numerals)
const nonAsciiDigits = /[\p{Number}--[0-9]]+/gv;
console.log('123 ٤٥٦ ⅓ ①②③'.match(nonAsciiDigits));
// ["٤٥٦", "⅓", "①②③"]

// All whitespace EXCEPT regular space (find "unusual" whitespace)
const unusualWhitespace = /[\p{White_Space}--[ ]]/gv;
const text = 'hello\tworld\u00A0test\nend';
console.log([...text.matchAll(unusualWhitespace)].map(m =>
`U+${m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}`
));
// ["U+0009", "U+00A0", "U+000A"] - tab, non-breaking space, newline

Nested operations: Set operations can be combined:

// Latin letters that are NOT vowels
const latinConsonants = /[[\p{Script=Latin}&&\p{Letter}]--[aeiouAEIOU]]+/gv;
console.log('Hello World'.match(latinConsonants));
// ["H", "ll", "W", "rld"] - each vowel ends the current run

// Punctuation that is NOT a dash
const nonDashPunctuation = /[\p{Punctuation}--\p{Dash_Punctuation}]/gv;
console.log('hello-world! foo-bar? (test)'.match(nonDashPunctuation));
// ["!", "?", "(", ")"]
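
Where the v flag is not available, simple intersections and subtractions can often be approximated under the u flag with lookaheads. This is a sketch of that workaround, not an exact equivalent for every case: a negative lookahead plays the role of subtraction, and a positive lookahead plays the role of intersection.

```javascript
// Subtraction: letters that are NOT ASCII
// (roughly [\p{Letter}--\p{ASCII}]+ in v mode)
const nonAsciiLettersU = /(?:(?!\p{ASCII})\p{Letter})+/gu;
console.log('Hello café Привет 世界'.match(nonAsciiLettersU));
// ["é", "Привет", "世界"]

// Intersection: letters that are ALSO Greek
// (roughly [\p{Script=Greek}&&\p{Letter}] in v mode)
const greekLettersU = /(?=\p{Script=Greek})\p{Letter}/gu;
console.log('Ω π 42 Σ + ='.match(greekLettersU));
// ["Ω", "π", "Σ"]
```

The lookahead form is harder to read and cannot express nested operations cleanly, which is the gap the v-mode syntax fills.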

String Properties (Properties of Strings)

The v flag introduces support for Unicode properties that match multi-code-point sequences, not just individual characters. This is especially important for emoji, where many emoji are composed of multiple code points joined together.

// \p{RGI_Emoji} - Recommended for General Interchange emoji
// Matches complete emoji sequences, including compound emoji

const text = 'Hello 👋🏽 Family: 👨‍👩‍👧‍👦 Flag: 🇯🇵';

// With v: \p{RGI_Emoji} matches complete emoji sequences
const emoji = text.match(/\p{RGI_Emoji}/gv);
console.log(emoji);
// ["👋🏽", "👨‍👩‍👧‍👦", "🇯🇵"]
// Matches complete emoji including skin tone modifiers,
// family sequences, and flag sequences
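
To show that each match really spans several code points, the sketch below reports the code point count of every RGI_Emoji match. It builds the pattern with new RegExp so that engines without the v flag return null instead of failing to parse; the function name rgiEmojiWithLengths is an assumption for this example.

```javascript
// Hypothetical helper: list RGI emoji with their code point counts,
// or return null when the engine does not support the v flag.
function rgiEmojiWithLengths(text) {
  let pattern;
  try {
    pattern = new RegExp('\\p{RGI_Emoji}', 'gv');
  } catch {
    return null;
  }
  return [...text.matchAll(pattern)].map(m => ({
    emoji: m[0],
    codePoints: [...m[0]].length, // spread iterates by code point
  }));
}

// Waving hand with skin tone, family ZWJ sequence, flag of Japan
const sample =
  'Hi \u{1F44B}\u{1F3FD} ' +
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466} ' +
  '\u{1F1EF}\u{1F1F5}';

console.log(rgiEmojiWithLengths(sample));
// In an engine with v support: three entries with codePoints 2, 7, and 2
```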

Available string properties include:

| Property | Description |
|---|---|
| Basic_Emoji | Basic single emoji characters |
| Emoji_Keycap_Sequence | Keycap emoji (1️⃣, 2️⃣, etc.) |
| RGI_Emoji | All recommended emoji (comprehensive) |
| RGI_Emoji_Flag_Sequence | Flag emoji (🇺🇸, 🇯🇵, etc.) |
| RGI_Emoji_Modifier_Sequence | Emoji with skin tone modifiers (👋🏽) |
| RGI_Emoji_Tag_Sequence | Tag-based sequences (🏴󠁧󠁢󠁥󠁮󠁧󠁿) |
| RGI_Emoji_ZWJ_Sequence | ZWJ sequences (👨‍👩‍👧‍👦) |

// Match flag emoji specifically
const flags = '🇺🇸 🇯🇵 🇫🇷 🇩🇪 hello';
console.log(flags.match(/\p{RGI_Emoji_Flag_Sequence}/gv));
// ["🇺🇸", "🇯🇵", "🇫🇷", "🇩🇪"]

// Match emoji with skin tone modifiers
const text = '👋 👋🏻 👋🏽 👋🏿';
console.log(text.match(/\p{RGI_Emoji_Modifier_Sequence}/gv));
// ["👋🏻", "👋🏽", "👋🏿"] - only the modified versions

Note: String properties require the v flag and cannot be negated. Both \P{RGI_Emoji} and [^\p{RGI_Emoji}] are syntax errors, because the complement of a set of strings is not well defined. They can be used on their own, as in the examples above, or inside a character class, where they can be combined with set operations.

Improved Character Class Syntax

The v flag also changes how character classes handle certain characters. String literals (multi-character strings) can be included in character classes:

// With v: string alternatives inside character classes
const regex = /[\q{abc|def|ghi}]/v;
// Matches "abc", "def", or "ghi" as complete strings

// This is useful for matching specific multi-character sequences
// within a broader character class context

The \q{...} syntax is part of v mode, so it is available wherever the v flag itself is supported. In practice, strings in character classes most often appear through Unicode string properties like \p{RGI_Emoji}.

Stricter Syntax in v Mode

The v flag is even stricter than u about certain syntax patterns in character classes:

// With u: an unescaped literal hyphen is allowed at the start or end of a class
/[-az]/u // OK: matches "-", "a", or "z"
/[az-]/u // OK: matches "a", "z", or "-"

// With v: inside a character class, the characters ( ) [ ] { } / - | must be
// escaped when meant literally, as must doubled punctuators such as && or --
/[a\-z]/v // Escaped hyphen: matches "a", "-", or "z"
/[\-az]/v // Escaped hyphen at start: matches "-", "a", or "z"
// /[-az]/v - SyntaxError: unescaped hyphen is not a range
// /[az-]/v - SyntaxError: unescaped hyphen is not a range

When to Use v vs. u

// Use u when:
// - You need Unicode support
// - You need broad browser compatibility
// - You don't need set operations or string properties
const basic = /\p{Letter}+/gu;

// Use v when:
// - You need set operations (intersection, subtraction)
// - You need string properties (compound emoji matching)
// - You're targeting modern environments only
const advanced = /[\p{Letter}--\p{ASCII}]+/gv;

Browser support for v (as of 2024):

  • Chrome 112+
  • Firefox 116+
  • Safari 17+
  • Node.js 20+

If you need to support older environments, use u and work around the limitations. For new projects targeting modern browsers, v is the better choice.

Practical Example: Comprehensive Emoji Replacement

// Build the pattern once. new RegExp is used because a /.../v literal is a
// parse-time SyntaxError in engines without v support, which try/catch
// cannot intercept.
let emojiPattern;
try {
  // Preferred: v flag with RGI_Emoji catches compound emoji
  emojiPattern = new RegExp('\\p{RGI_Emoji}', 'gv');
} catch {
  // Fallback: u flag with simpler emoji matching
  emojiPattern = /\p{Emoji_Presentation}/gu;
}

// Replace all emoji with a text placeholder
function replaceEmoji(text, replacement = '[emoji]') {
  return text.replace(emojiPattern, replacement);
}

console.log(replaceEmoji('Hello 👋🏽 World 🌍'));
// "Hello [emoji] World [emoji]"

console.log(replaceEmoji('Family: 👨‍👩‍👧‍👦'));
// "Family: [emoji]" (v flag matches the entire family as one emoji)

Practical Example: Filtering by Script While Excluding Certain Characters

// Accept only Latin letters, common punctuation, and digits
// But exclude certain symbols and control characters
function validateLatinInput(input) {
  // With v flag: set subtraction makes this clean. Union and subtraction
  // cannot be mixed at one level, so the union gets its own nested class.
  const allowed = /^[[\p{Script=Latin}\p{Number}\p{Space_Separator}\p{Punctuation}]--\p{Currency_Symbol}]+$/v;
  return allowed.test(input);
}

console.log(validateLatinInput('Hello, World!')); // true
console.log(validateLatinInput('café résumé')); // true
console.log(validateLatinInput('Price: $100')); // false ($ is currency)
console.log(validateLatinInput('Привет')); // false (Cyrillic)

// Search that handles diacritics and case for any script
function createFlexibleSearch(searchTerm) {
  // Normalize the search term
  const normalized = searchTerm.normalize('NFD').replace(/\p{Mark}/gu, '');

  // Build a pattern that matches with or without diacritics
  let pattern = '';
  for (const char of normalized) {
    // Each character can optionally be followed by combining marks
    pattern += escapeRegExp(char) + '\\p{Mark}*';
  }

  return new RegExp(pattern, 'giu');
}

function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const searchRegex = createFlexibleSearch('cafe');
const text = 'Visit the café or the CAFÉ or the Cafe!';
// Decompose the text as well, so that é becomes e + combining accent;
// a precomposed é would not match the pattern's plain "e"
const matches = text.normalize('NFD').match(searchRegex) || [];
console.log(matches.map(m => m.normalize('NFC')));
// ["café", "CAFÉ", "Cafe"]
// Matches regardless of accents and case

Summary

Unicode support in JavaScript regular expressions transforms the regex engine from an ASCII-centric tool into one that handles the full range of human writing systems:

  • The u flag enables Unicode mode: surrogate pairs are treated as single characters, \u{XXXXX} code point escapes become available, \p{...} property escapes are enabled, and invalid escapes throw errors instead of being silently accepted. Use it whenever working with text that may contain non-ASCII characters.
  • Unicode property escapes (\p{...}) match characters by their Unicode properties: \p{Letter} for any letter, \p{Number} for any numeric character, \p{Script=Latin} for Latin script, \p{Emoji} for emoji, and many more. The inverse \P{...} matches characters without the property.
  • The v flag (ES2024) is the successor to u, adding set operations in character classes (&& for intersection, -- for subtraction) and string properties (\p{RGI_Emoji}) that match multi-code-point sequences like compound emoji, flag sequences, and skin-tone-modified emoji.
  • For international applications, prefer \p{Letter} over \w, \p{Number} over \d, and \p{Script=...} for script-specific matching. The traditional \w and \d classes only cover ASCII, which is insufficient for global text processing.
  • Use v in modern projects for the most powerful and precise Unicode matching. Fall back to u for broader browser compatibility.