How to Use Unicode in JavaScript Regular Expressions
JavaScript strings are encoded in UTF-16, which means many characters that users encounter daily, such as most emoji, less common Chinese characters, mathematical alphanumeric symbols, and characters from dozens of writing systems, are represented as two code units (a surrogate pair) rather than one. Without proper Unicode handling, regular expressions treat these multi-unit characters as two separate characters, leading to broken matches, incorrect string lengths, and patterns that silently fail on international text.
The u and v flags fix this by enabling full Unicode awareness in the regex engine. Beyond correct character handling, they unlock Unicode property escapes, a powerful feature that lets you match characters by their Unicode category: letters from any script, digits from any numeral system, punctuation, emoji, and much more. This guide covers why Unicode mode matters, how to use Unicode property escapes effectively, and what the newer v flag adds with set operations and string properties.
The u Flag: Unicode Mode
The u flag switches the JavaScript regex engine into Unicode mode. This changes several fundamental behaviors: how characters are counted, how escape sequences are interpreted, and what features are available.
The Problem Without u
Without the u flag, JavaScript regex operates in a legacy mode where strings are treated as sequences of 16-bit code units. Characters outside the Basic Multilingual Plane (BMP), which include most emoji, the rarer CJK ideographs, musical symbols, mathematical alphanumeric symbols, and historic scripts, are encoded as two code units called a surrogate pair.
const emoji = '😀';
// JavaScript string sees two code units
console.log(emoji.length); // 2
// Without u: the dot matches ONE code unit (half the emoji)
console.log(emoji.match(/./));
// ["\ud83d"] - matches only the first surrogate, not the full emoji
// Without u: the regex sees two "characters"
console.log(emoji.match(/../));
// ["😀"] - needed TWO dots to match one emoji
// Without u: an anchored single-character pattern fails
console.log(/^.$/.test(emoji)); // false - emoji is "two characters"
console.log(/^..$/.test(emoji)); // true - two code units
This is clearly wrong. The emoji 😀 is a single character to the user, but the regex engine treats it as two.
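The surrogate pair itself follows from simple arithmetic on the code point. The sketch below shows that calculation for U+1F600; the helper name toSurrogatePair is an assumption made for illustration and is not a built-in.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name used only for this illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a code point between U+10000 and U+10FFFF');
  }
  const offset = codePoint - 0x10000;   // 20-bit value
  const high = 0xD800 + (offset >> 10); // top 10 bits select the high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits select the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16)); // "d83d" "de00"
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1F600)); // true
```

The high surrogate d83d is the same half-character that the legacy dot matched in the example above.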
Enabling Unicode Mode
Adding the u flag tells the regex engine to treat the input as a sequence of Unicode code points rather than code units. Surrogate pairs are handled as single characters:
const emoji = '😀';
// With u: the dot matches the full code point
console.log(emoji.match(/./u));
// ["😀"] - correctly matches the entire emoji
// With u: one dot = one character
console.log(/^.$/u.test(emoji)); // true - emoji is one character
console.log(/^..$/u.test(emoji)); // false - it's not two characters
Impact on Character Counting
const text = 'Hello 😀 World';
// Without u: counts code units
console.log(text.match(/./g).length); // 14 (😀 counted as 2)
// With u: counts code points
console.log(text.match(/./gu).length); // 13 (😀 counted as 1)
Impact on Character Ranges in Sets
Without u, a character class range whose endpoints lie outside the BMP is read as a range between surrogate halves, which usually makes the pattern invalid:
// Without u: [😀-😏] is parsed as the code units \ud83d, \ude00-\ud83d, \ude0f.
// The middle range runs backwards, so the pattern is rejected.
// new RegExp() is used because an invalid regex literal is a parse-time
// error that try...catch cannot intercept.
try {
  const regex = new RegExp('[😀-😏]');
} catch (e) {
  console.log('Error:', e.message); // e.g. "Range out of order in character class"
}
// With u: ranges work correctly with Unicode characters
const emojiRange = /[😀-😏]/u;
console.log(emojiRange.test('😀')); // true
console.log(emojiRange.test('😊')); // true
console.log(emojiRange.test('😏')); // true
console.log(emojiRange.test('😡')); // false (outside range)
Strict Escape Handling
The u flag makes the regex engine strict about escape sequences. Invalid escapes that would be silently accepted in legacy mode throw a SyntaxError:
// Without u: \a is not a valid escape, silently treated as literal "a"
console.log(/\a/.test('a')); // true (sloppy behavior)
// With u: \a throws because it's not a recognized escape.
// A literal /\a/u would stop the whole script from parsing, so the pattern
// is built with new RegExp() to make the error catchable.
try {
  const regex = new RegExp('\\a', 'u'); // SyntaxError: Invalid escape
} catch (e) {
  console.log(e.message); // e.g. Invalid regular expression: /\a/u: Invalid escape
}
This strictness helps catch typos and mistakes in patterns:
// Without u: these silently work but may not do what you expect
/\p/.test('p'); // true - \p is treated as literal "p"
// With u: \p without braces is an error (it expects \p{...})
// /\p/u - SyntaxError
// Without u: escaped characters that don't need escaping
/\:/; // Works (unnecessary escape, treated as literal ":")
// With u: unnecessary escapes are errors
// /\:/u - SyntaxError: Invalid escape
The strict escape handling of the u flag is a feature, not a limitation. It catches mistakes that would otherwise produce subtle bugs. Always use the u flag unless you have a specific reason not to, especially when working with text that may contain non-ASCII characters.
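Because invalid regex literals fail when the script is parsed, strictness is easiest to observe on patterns built at runtime. The following is a minimal sketch of such a check; isValidUnicodePattern is a hypothetical helper name, not a standard API.

```javascript
// Report whether a pattern source compiles in Unicode mode.
// isValidUnicodePattern is a hypothetical helper name for this sketch.
function isValidUnicodePattern(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

console.log(isValidUnicodePattern('\\p{Letter}+')); // true
console.log(isValidUnicodePattern('\\a'));          // false - unknown escape
console.log(isValidUnicodePattern('\\:'));          // false - unnecessary escape
console.log(isValidUnicodePattern('\\.'));          // true - escaping a syntax character is fine
```

A check like this can be useful when patterns come from configuration or user input, where a typo should be reported rather than silently matched as a literal.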
Unicode Code Point Escapes
The u flag enables the \u{XXXX} syntax for specifying Unicode code points by their hex value. Without u, only the four-digit \uXXXX syntax works, which cannot represent characters above U+FFFF:
// Without u: only 4-digit hex escapes work
console.log(/\u0041/.test('A')); // true (U+0041 = A)
// Cannot express characters above U+FFFF with \uXXXX
// With u: extended \u{XXXXX} syntax available
console.log(/\u{41}/u.test('A')); // true
console.log(/\u{1F600}/u.test('😀')); // true (U+1F600 = 😀)
console.log(/\u{1F30D}/u.test('🌍')); // true (U+1F30D = 🌍)
console.log(/\u{1F4A9}/u.test('💩')); // true
// Matching specific characters by code point
const checkmark = /\u{2713}/u;
console.log(checkmark.test('✓')); // true
// Matching a range of code points
const mathSymbols = /[\u{2200}-\u{22FF}]/u;
console.log(mathSymbols.test('∀')); // true (U+2200 FOR ALL)
console.log(mathSymbols.test('∑')); // true (U+2211 N-ARY SUMMATION)
console.log(mathSymbols.test('A')); // false
Quantifiers and Unicode Characters
With the u flag, quantifiers correctly apply to entire Unicode characters rather than individual code units:
const text = '😀😀😀';
// Without u: + applies only to the second surrogate of the emoji
console.log(text.match(/😀+/));
// ["😀"] - only the first emoji, because the low surrogate alone never repeats
// With u: + applies to the full emoji character
console.log(text.match(/😀+/u));
// ["😀😀😀"]
// Counting emoji
console.log(text.match(/😀/gu).length); // 3
// Matching exactly 2 emoji
console.log(/^😀{2}$/u.test('😀😀')); // true
console.log(/^😀{2}$/u.test('😀😀😀')); // false
Complete Comparison: With and Without u
| Behavior | Without u | With u |
|---|---|---|
| . matches emoji | No (matches half) | Yes (full character) |
| Surrogate pairs | Treated as two chars | Treated as one char |
| \u{XXXXX} syntax | Not available | Available |
| Invalid escapes | Silently accepted | Throw SyntaxError |
| \p{...} properties | Not available | Available |
| Character ranges with non-BMP | Broken/unpredictable | Correct |
| Strictness | Sloppy | Strict |
Unicode Properties: \p{...} and \P{...}
Unicode property escapes are the most powerful feature unlocked by the u flag. Every character in Unicode has a set of properties that describe what it is: a letter, a digit, a punctuation mark, a symbol, which script it belongs to, and more. The \p{...} syntax lets you match characters based on these properties instead of listing specific character ranges.
Basic Syntax
- \p{PropertyName} matches any character that has the specified property (or property value)
- \P{PropertyName} matches any character that does not have the property (inverse)
Both require the u or v flag.
// Match any Unicode letter
console.log('café'.match(/\p{Letter}+/gu));
// ["café"] - includes the accented é
// Compare with \w which misses non-ASCII letters
console.log('café'.match(/\w+/g));
// ["caf"] - é is not matched by \w
General Categories
Unicode assigns every character a General Category. These are the most commonly used property values:
Letters:
// \p{Letter} or \p{L} - any letter from any script
const text = 'Hello Привет 你好 مرحبا';
console.log(text.match(/\p{Letter}+/gu));
// ["Hello", "Привет", "你好", "مرحبا"]
// Subcategories of Letter:
// \p{Lowercase_Letter} or \p{Ll} - lowercase letters
// \p{Uppercase_Letter} or \p{Lu} - uppercase letters
// \p{Titlecase_Letter} or \p{Lt} - titlecase letters (e.g., ǅ)
// \p{Modifier_Letter} or \p{Lm} - modifier letters
// \p{Other_Letter} or \p{Lo} - letters without case (CJK, Arabic, etc.)
console.log('Hello World'.match(/\p{Lowercase_Letter}+/gu));
// ["ello", "orld"]
console.log('Hello World'.match(/\p{Uppercase_Letter}/gu));
// ["H", "W"]
Numbers:
// \p{Number} or \p{N} - any numeric character
const mixed = 'Price: 42 or ٤٢ or Ⅰ';
console.log(mixed.match(/\p{Number}+/gu));
// ["42", "٤٢", "Ⅰ"]
// Subcategories:
// \p{Decimal_Number} or \p{Nd} - decimal digits (0-9, ٠-٩, etc.)
// \p{Letter_Number} or \p{Nl} - letter-like numbers (Ⅰ, Ⅱ, Ⅲ, etc.)
// \p{Other_Number} or \p{No} - other numeric (fractions, superscripts, etc.)
console.log('Test ① ② ③'.match(/\p{Other_Number}/gu));
// ["①", "②", "③"]
Punctuation:
// \p{Punctuation} or \p{P} - any punctuation
const text = 'Hello, world! How are you? (Fine.)';
console.log(text.match(/\p{Punctuation}/gu));
// [",", "!", "?", "(", ".", ")"]
// Subcategories:
// \p{Dash_Punctuation} or \p{Pd} - dashes (-, –, —)
// \p{Open_Punctuation} or \p{Ps} - opening brackets ((, [, {)
// \p{Close_Punctuation} or \p{Pe} - closing brackets (), ], })
// \p{Connector_Punctuation} or \p{Pc} - connector (_)
// \p{Other_Punctuation} or \p{Po} - other (!, ?, #, etc.)
const dashes = 'word-hyphen or en–dash or em—dash';
console.log(dashes.match(/\p{Dash_Punctuation}/gu));
// ["-", "–", "—"]
Symbols:
// \p{Symbol} or \p{S} - any symbol
const text = 'Price: $100 + €50 = ¥20000 ™';
console.log(text.match(/\p{Symbol}/gu));
// ["$", "+", "€", "=", "¥", "™"]
// Subcategories:
// \p{Currency_Symbol} or \p{Sc} - currency symbols
// \p{Math_Symbol} or \p{Sm} - math symbols
// \p{Modifier_Symbol} or \p{Sk} - modifier symbols
// \p{Other_Symbol} or \p{So} - other symbols
console.log(text.match(/\p{Currency_Symbol}/gu));
// ["$", "€", "¥"]
console.log('2 + 3 = 5 × 10 ÷ 2'.match(/\p{Math_Symbol}/gu));
// ["+", "=", "×", "÷"]
Whitespace and Separators:
// \p{Separator} or \p{Z} - any separator
// \p{Space_Separator} or \p{Zs} - space characters
// \p{Line_Separator} or \p{Zl} - line separators
// \p{Paragraph_Separator} or \p{Zp} - paragraph separators
// Includes non-breaking space, em space, etc.
const text = 'hello\u00A0world'; // non-breaking space
console.log(text.match(/\p{Space_Separator}/gu));
// ["\u00A0"] - matches the non-breaking space
Binary Properties
Binary properties are yes/no characteristics of a character. They are used without a value:
// \p{Alphabetic} - all alphabetic characters (broader than \p{Letter})
console.log('abc123'.match(/\p{Alphabetic}+/gu)); // ["abc"]
// \p{ASCII} - ASCII characters only (U+0000 to U+007F)
console.log('Hello café 😀'.match(/\p{ASCII}+/gu)); // ["Hello caf", " "]
// \p{Emoji} - emoji characters
const text = 'Hello 😀 World 🌍 JavaScript 🚀';
console.log(text.match(/\p{Emoji}/gu));
// ["😀", "🌍", "🚀"]
// Note: some digits and characters like # also have Emoji property
// \p{White_Space} - all whitespace characters
console.log('hello\tworld\n'.match(/\p{White_Space}/gu));
// ["\t", "\n"]
// \p{Hex_Digit} - valid hexadecimal digits
console.log('0123456789abcdefGHIJ'.match(/\p{Hex_Digit}+/gu));
// ["0123456789abcdef"]
// Note: uppercase A-F also match
// \p{ASCII_Hex_Digit} - same but limited to ASCII
console.log('0x1F600'.match(/\p{ASCII_Hex_Digit}+/gu));
// ["0", "1F600"]
Script Properties
Every character belongs to a Unicode Script (the writing system it is used in). You can match characters from specific scripts:
// \p{Script=Latin} - Latin script characters
const mixed = 'Hello Привет 你好 こんにちは مرحبا';
console.log(mixed.match(/\p{Script=Latin}+/gu));
// ["Hello"]
console.log(mixed.match(/\p{Script=Cyrillic}+/gu));
// ["Привет"]
console.log(mixed.match(/\p{Script=Han}+/gu));
// ["你好"] - Chinese (Han) characters
console.log(mixed.match(/\p{Script=Hiragana}+/gu));
// ["こんにちは"]
console.log(mixed.match(/\p{Script=Arabic}+/gu));
// ["مرحبا"]
The short form \p{sc=Latin} also works:
console.log('café'.match(/\p{sc=Latin}+/gu));
// ["café"]
Script_Extensions is a broader property that includes characters used in multiple scripts:
// Script_Extensions (scx) includes characters shared between scripts
// For example, common punctuation is shared across scripts
console.log('Ω'.match(/\p{Script=Greek}/u)); // ["Ω"]
console.log('Ω'.match(/\p{Script_Extensions=Greek}/u)); // ["Ω"]
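The two properties only diverge for characters that are shared between scripts. As a hedged illustration, the sketch below uses the prolonged sound mark ー (U+30FC), whose Script value is Common but whose Script_Extensions include Hiragana and Katakana; the sample word is chosen for this example only.

```javascript
// コーヒー ("coffee"): Katakana letters with U+30FC between and after them.
const word = '\u30B3\u30FC\u30D2\u30FC';

// Script=Katakana skips U+30FC because its Script value is Common.
const byScript = word.match(/\p{Script=Katakana}+/gu);
console.log(byScript); // ["コ", "ヒ"]

// Script_Extensions=Katakana includes U+30FC, so the word stays in one piece.
const byExtensions = word.match(/\p{Script_Extensions=Katakana}+/gu);
console.log(byExtensions); // ["コーヒー"]
```

For matching whole words in a given writing system, Script_Extensions is therefore often the safer choice.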
Common Script Property Values
| Value | Writing System | Example Characters |
|---|---|---|
| Latin | Latin alphabet | A, é, ñ, ü |
| Cyrillic | Russian, Ukrainian, etc. | Д, Ж, Щ |
| Greek | Greek | Ω, Σ, π |
| Arabic | Arabic, Farsi, Urdu | ع, ب, ت |
| Hebrew | Hebrew | א, ב, ג |
| Han | Chinese characters (also used in Japanese, Korean) | 中, 文, 字 |
| Hiragana | Japanese hiragana | あ, い, う |
| Katakana | Japanese katakana | ア, イ, ウ |
| Hangul | Korean | 가, 나, 다 |
| Devanagari | Hindi, Sanskrit, etc. | अ, आ, इ |
| Thai | Thai | ก, ข, ค |
| Georgian | Georgian | ა, ბ, გ |
| Armenian | Armenian | Ա, Բ, Գ |
| Ethiopic | Amharic, Tigrinya, etc. | ሀ, ለ, ሐ |
The Inverse: \P{...}
The uppercase \P matches any character that does not have the specified property:
// \P{Letter} - anything that is NOT a letter
const text = 'Hello, World! 123';
console.log(text.match(/\P{Letter}+/gu));
// [", ", "! 123"]
// \P{Number} - anything that is NOT a number
console.log(text.match(/\P{Number}+/gu));
// ["Hello, World! "]
// \P{ASCII} - non-ASCII characters
const international = 'Hello café 世界 🌍';
console.log(international.match(/\P{ASCII}+/gu));
// ["é", "世界", "🌍"]
// Note: spaces are ASCII, so each space ends a run;
// only adjacent non-ASCII characters are grouped together
// One match per non-ASCII character:
console.log(international.match(/\P{ASCII}/gu));
// ["é", "世", "界", "🌍"]
Practical Examples with Unicode Properties
Language-Aware Word Matching:
// \w only matches ASCII word characters
const text = 'The café serves naïve crème brûlée';
// ❌ \w breaks on accented characters
console.log(text.match(/\w+/g));
// ['The', 'caf', 'serves', 'na', 've', 'cr', 'me', 'br', 'l', 'e']
// ✅ \p{Letter} handles all Unicode letters
console.log(text.match(/[\p{Letter}\p{Mark}]+/gu));
// ['The', 'café', 'serves', 'naïve', 'crème', 'brûlée']
The \p{Mark} category matches combining marks (accents, diacritics) that modify the preceding letter. Including it alongside \p{Letter} ensures that characters like é (which can be composed of e + combining acute accent) are matched as part of the word.
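The difference only shows up when the text is in decomposed form. The sketch below builds both forms of the same word with explicit escapes so that the two encodings are visible; the variable names are illustrative.

```javascript
const composed = 'caf\u00E9';    // é as a single code point (NFC)
const decomposed = 'cafe\u0301'; // e followed by U+0301 COMBINING ACUTE ACCENT (NFD)

// Precomposed é is itself a letter, so \p{Letter} covers the whole word.
const composedMatch = composed.match(/\p{Letter}+/gu);
console.log(composedMatch[0].length); // 4

// In decomposed text the accent is a separate Mark and is left behind.
const lettersOnly = decomposed.match(/\p{Letter}+/gu);
console.log(lettersOnly[0]); // "cafe"

// Adding \p{Mark} keeps the accent attached to the word.
const lettersAndMarks = decomposed.match(/[\p{Letter}\p{Mark}]+/gu);
console.log(lettersAndMarks[0] === decomposed); // true
```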
Detecting the Script of Text:
function detectScript(text) {
const scripts = [
{ name: 'Latin', regex: /\p{Script=Latin}/u },
{ name: 'Cyrillic', regex: /\p{Script=Cyrillic}/u },
{ name: 'Arabic', regex: /\p{Script=Arabic}/u },
{ name: 'Han', regex: /\p{Script=Han}/u },
{ name: 'Hiragana', regex: /\p{Script=Hiragana}/u },
{ name: 'Katakana', regex: /\p{Script=Katakana}/u },
{ name: 'Hangul', regex: /\p{Script=Hangul}/u },
{ name: 'Devanagari', regex: /\p{Script=Devanagari}/u },
{ name: 'Thai', regex: /\p{Script=Thai}/u },
{ name: 'Greek', regex: /\p{Script=Greek}/u }
];
const detected = scripts.filter(s => s.regex.test(text)).map(s => s.name);
return detected.length > 0 ? detected : ['Unknown'];
}
console.log(detectScript('Hello World')); // ["Latin"]
console.log(detectScript('Привет')); // ["Cyrillic"]
console.log(detectScript('こんにちは世界')); // ["Han", "Hiragana"]
console.log(detectScript('Hello Привет')); // ["Latin", "Cyrillic"]
Extracting Emoji from Text:
function extractEmoji(text) {
  // Match individual code points that render as emoji by default
  return text.match(/\p{Emoji_Presentation}/gu) || [];
}
console.log(extractEmoji('Having a great day! 😀🎉🚀'));
// ["😀", "🎉", "🚀"]
console.log(extractEmoji('No emoji here'));
// []
// Count emoji in text
function emojiCount(text) {
  return (text.match(/\p{Emoji_Presentation}/gu) || []).length;
}
console.log(emojiCount('Hello 🌍 World 😀')); // 2
Emoji handling is more complex than a single property can cover. Many emoji are sequences of multiple code points (family emoji, skin tone modifiers, flag sequences). \p{Emoji_Presentation} catches individual emoji characters but may not match all compound emoji sequences. For comprehensive emoji matching, the v flag with \p{RGI_Emoji} (covered below) is more reliable.
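To make the limitation concrete, the sketch below applies the same property to a family emoji, written with explicit escapes so the individual code points are visible. It is an illustration of the splitting behavior, not a recommended way to count emoji.

```javascript
// Family emoji: man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

// Spreading a string iterates by code point: 4 people + 3 zero-width joiners.
const codePoints = [...family];
console.log(codePoints.length); // 7

// Emoji_Presentation matches each person separately and skips the joiners,
// so one visible emoji is reported as four matches.
const parts = family.match(/\p{Emoji_Presentation}/gu) || [];
console.log(parts.length); // 4
```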
International Username Validation:
// Allow letters from any script, digits, underscores, and hyphens
function isValidInternationalUsername(username) {
return /^[\p{Letter}\p{Number}_-]{3,30}$/u.test(username);
}
console.log(isValidInternationalUsername('alice')); // true
console.log(isValidInternationalUsername('用户名')); // true (Chinese)
console.log(isValidInternationalUsername('Пользователь')); // true (Russian)
console.log(isValidInternationalUsername('ab')); // false (too short)
console.log(isValidInternationalUsername('user name')); // false (space)
Removing Diacritics (Accent Marks):
function removeDiacritics(text) {
// Normalize to NFD (decomposed form), then remove combining marks
return text.normalize('NFD').replace(/\p{Mark}/gu, '');
}
console.log(removeDiacritics('café')); // "cafe"
console.log(removeDiacritics('résumé')); // "resume"
console.log(removeDiacritics('naïve')); // "naive"
console.log(removeDiacritics('über')); // "uber"
console.log(removeDiacritics('crème brûlée')); // "creme brulee"
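One caveat worth checking before relying on this approach: it only affects characters that Unicode decomposes into a base letter plus combining marks. The sketch below repeats the function so it runs on its own and shows two letters, Ł and ø, that have no such decomposition and therefore pass through unchanged.

```javascript
function removeDiacritics(text) {
  // Normalize to NFD (decomposed form), then remove combining marks
  return text.normalize('NFD').replace(/\p{Mark}/gu, '');
}

// ó and ź decompose into letter + accent, but Ł has no decomposition.
const city = removeDiacritics('\u0141\u00F3d\u017A'); // "Łódź"
console.log(city); // "Łodz"

// ø is a distinct letter, not o + combining mark, so nothing is removed.
const dish = removeDiacritics('sm\u00F8rrebr\u00F8d'); // "smørrebrød"
console.log(dish); // "smørrebrød"
```

If such letters also need to be folded to ASCII, an explicit mapping table would be required in addition to normalization.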
Sanitizing Input While Preserving International Characters:
// Allow letters, numbers, spaces, and basic punctuation from any language
function sanitizeInternational(input) {
return input.replace(/[^\p{Letter}\p{Number}\p{Punctuation}\p{Space_Separator}]/gu, '');
}
console.log(sanitizeInternational('Hello, 世界! 🌍'));
// "Hello, 世界! " - emoji removed (the space before it remains), CJK preserved
console.log(sanitizeInternational('Привет\x00мир'));
// "Приветмир" - control character removed, Cyrillic preserved
The v Flag: Unicode Sets (ES2024)
The v flag is a more powerful evolution of the u flag, introduced in ES2024. It is not just an incremental improvement. It adds entirely new capabilities: set operations inside character classes, properties of strings (matching multi-code-point sequences), and improved syntax consistency. The v flag is a superset of u, meaning everything that works with u also works with v, plus more.
v Replaces u
You cannot use u and v together. The v flag is intended as the successor:
// ❌ Cannot combine u and v
// /pattern/uv - SyntaxError
// ✅ Use v for new code when targeting modern environments
const regex = /\p{Letter}+/gv;
Everything from the u flag (correct surrogate pair handling, \u{XXXXX} escapes, \p{...} properties, strict escaping) works identically with v.
Set Operations in Character Classes
The most significant addition in the v flag is the ability to perform set operations inside character classes: intersection, subtraction, and union. This lets you combine or exclude character categories with precision that was previously impossible.
Intersection (&&): Match characters that belong to both sets:
// Match characters that are BOTH Greek AND letters
const greekLetters = /[\p{Script=Greek}&&\p{Letter}]/gv;
console.log('Ω π 42 Σ + ='.match(greekLetters));
// ["Ω", "π", "Σ"] - Greek symbols that are also letters
// Match characters that are BOTH ASCII AND digits
const asciiDigits = /[\p{ASCII}&&\p{Number}]/gv;
console.log('123 ٤٥٦ 789'.match(asciiDigits));
// ["1", "2", "3", "7", "8", "9"] - only ASCII digits, not Arabic
Intersection is powerful for narrowing down broad categories:
// Lowercase Latin letters only (not Cyrillic, Greek, etc.)
const lowercaseLatin = /[\p{Lowercase_Letter}&&\p{Script=Latin}]+/gv;
console.log('hello WORLD Привет café'.match(lowercaseLatin));
// ['hello', 'café']
const text = 'helloWorldcafé';
console.log(text.match(/[\p{Lowercase_Letter}&&\p{Script=Latin}]+/gv));
// ['hello', 'orldcafé'] - uppercase W breaks the match
Subtraction (--): Match characters in the first set but not in the second:
// All letters EXCEPT ASCII letters
const nonAsciiLetters = /[\p{Letter}--\p{ASCII}]+/gv;
console.log('Hello café Привет 世界'.match(nonAsciiLetters));
// ["é", "Привет", "世界"]
// All digits EXCEPT ASCII digits (non-Western numerals)
const nonAsciiDigits = /[\p{Number}--[0-9]]+/gv;
console.log('123 ٤٥٦ Ⅰ ①②③'.match(nonAsciiDigits));
// ["٤٥٦", "Ⅰ", "①②③"]
// All whitespace EXCEPT regular space (find "unusual" whitespace)
const unusualWhitespace = /[\p{White_Space}--[ ]]/gv;
const text = 'hello\tworld\u00A0test\nend';
console.log([...text.matchAll(unusualWhitespace)].map(m =>
`U+${m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}`
));
// ["U+0009", "U+00A0", "U+000A"] - tab, non-breaking space, newline
Nested operations: Set operations can be combined:
// Latin letters that are NOT vowels
const latinConsonants = /[[\p{Script=Latin}&&\p{Letter}]--[aeiouAEIOU]]+/gv;
console.log('Hello World'.match(latinConsonants));
// ["Hll", "Wrld"]
// Punctuation that is NOT a dash
const nonDashPunctuation = /[\p{Punctuation}--\p{Dash_Punctuation}]/gv;
console.log('hello-world! foo-bar? (test)'.match(nonDashPunctuation));
// ["!", "?", "(", ")"]
String Properties (Properties of Strings)
The v flag introduces support for Unicode properties that match multi-code-point sequences, not just individual characters. This is especially important for emoji, where many emoji are composed of multiple code points joined together.
// \p{RGI_Emoji} - Recommended for General Interchange emoji
// Matches complete emoji sequences, including compound emoji
const text = 'Hello 👋🏽 Family: 👨‍👩‍👧‍👦 Flag: 🇯🇵';
// With v: \p{RGI_Emoji} matches complete emoji sequences
const emoji = text.match(/\p{RGI_Emoji}/gv);
console.log(emoji);
// Matches complete emoji including skin tone modifiers,
// family sequences, and flag sequences
Available string properties include:
| Property | Description |
|---|---|
| Basic_Emoji | Basic single emoji characters |
| Emoji_Keycap_Sequence | Keycap emoji (1️⃣, 2️⃣, etc.) |
| RGI_Emoji | All recommended emoji (comprehensive) |
| RGI_Emoji_Flag_Sequence | Flag emoji (🇺🇸, 🇯🇵, etc.) |
| RGI_Emoji_Modifier_Sequence | Emoji with skin tone modifiers (👋🏽) |
| RGI_Emoji_Tag_Sequence | Tag-based sequences (🏴󠁧󠁢󠁥󠁮󠁧󠁿) |
| RGI_Emoji_ZWJ_Sequence | ZWJ sequences (👨‍👩‍👧‍👦) |
// Match flag emoji specifically
const flags = '🇺🇸 🇯🇵 🇫🇷 🇩🇪 hello';
console.log(flags.match(/\p{RGI_Emoji_Flag_Sequence}/gv));
// ["🇺🇸", "🇯🇵", "🇫🇷", "🇩🇪"]
// Match emoji with skin tone modifiers
const text = '👋 👋🏻 👋🏽 👋🏿';
console.log(text.match(/\p{RGI_Emoji_Modifier_Sequence}/gv));
// ["👋🏻", "👋🏽", "👋🏿"] - only the modified versions
String properties require the v flag; with u, \p{RGI_Emoji} is a SyntaxError. They can be used standalone, as in the examples above, or inside a character class such as [\p{RGI_Emoji}\p{Letter}]. Because they can match more than one code point, they cannot be negated: both \P{RGI_Emoji} and [^\p{RGI_Emoji}] throw a SyntaxError.
Improved Character Class Syntax
The v flag also changes how character classes handle certain characters. String literals (multi-character strings) can be included in character classes:
// With v: string alternatives inside character classes
const regex = /[\q{abc|def|ghi}]/v;
// Matches "abc", "def", or "ghi" as complete strings
// This is useful for matching specific multi-character sequences
// within a broader character class context
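The \q{...} syntax is part of the v flag, so it is available wherever v is supported. It allows string alternatives inside character classes, and the longest alternative is tried first. In practice, multi-character strings in classes are most often reached through Unicode string properties like \p{RGI_Emoji}.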
Stricter Syntax in v Mode
The v flag is even stricter than u about certain syntax patterns in character classes:
// With u: a literal hyphen is allowed at the start or end of a class
/[a-z_-]/u // OK
// With v: these characters must be escaped inside a character class:
// ( ) [ ] { } / - |
// && and -- are set operators; other doubled punctuators (!!, ##, ...) are reserved
// With v: a literal hyphen must always be escaped, even at a boundary
/[a\-z]/v // Escaped hyphen: matches "a", "-", or "z"
/[\-az]/v // Escaped hyphen at start: matches "-", "a", or "z"
// /[-az]/v - SyntaxError (unescaped hyphen at start)
// /[az-]/v - SyntaxError (unescaped hyphen at end)
When to Use v vs. u
// Use u when:
// - You need Unicode support
// - You need broad browser compatibility
// - You don't need set operations or string properties
const basic = /\p{Letter}+/gu;
// Use v when:
// - You need set operations (intersection, subtraction)
// - You need string properties (compound emoji matching)
// - You're targeting modern environments only
const advanced = /[\p{Letter}--\p{ASCII}]+/gv;
Browser support for v (as of 2024):
- Chrome 112+
- Firefox 116+
- Safari 17+
- Node.js 20+
If you need to support older environments, use u and work around the limitations. For new projects targeting modern browsers, v is the better choice.
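One way to support both kinds of environment is to detect the flag at runtime and choose a pattern accordingly. The sketch below is a minimal version of that idea; supportsVFlag and nonAsciiLetterPattern are hypothetical helper names, and the u-mode fallback uses a negative lookahead to emulate the subtraction.

```javascript
// Detect v flag support. new RegExp() is used because a /.../v literal would
// be a parse-time error on engines that do not know the flag.
function supportsVFlag() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

// Letters that are not ASCII, expressed for whichever mode is available.
function nonAsciiLetterPattern() {
  return supportsVFlag()
    ? new RegExp('[\\p{Letter}--\\p{ASCII}]+', 'gv')
    : new RegExp('(?:(?!\\p{ASCII})\\p{Letter})+', 'gu');
}

const found = 'Hello caf\u00E9 \u041F\u0440\u0438\u0432\u0435\u0442'.match(nonAsciiLetterPattern());
console.log(found); // ["é", "Привет"]
```

Both branches produce the same matches here, so callers do not need to know which flag was used.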
Practical Example: Comprehensive Emoji Replacement
// Replace all emoji with a text placeholder
function replaceEmoji(text, replacement = '[emoji]') {
  let pattern;
  try {
    // Preferred: v flag with RGI_Emoji catches compound emoji.
    // new RegExp() is required here: a /.../v literal fails at parse time
    // on engines without the v flag, before the try block can run.
    pattern = new RegExp('\\p{RGI_Emoji}', 'gv');
  } catch {
    // Fallback: u flag with simpler emoji matching
    pattern = /\p{Emoji_Presentation}/gu;
  }
  return text.replace(pattern, replacement);
}
console.log(replaceEmoji('Hello 👋🏽 World 🌍'));
// "Hello [emoji] World [emoji]"
console.log(replaceEmoji('Family: 👨‍👩‍👧‍👦'));
// "Family: [emoji]" (v flag matches the entire family as one emoji)
Practical Example: Filtering by Script While Excluding Certain Characters
// Accept only Latin letters, common punctuation, and digits
// But exclude certain symbols and control characters
function validateLatinInput(input) {
  // With v flag: set subtraction makes this clean.
  // The union is wrapped in its own brackets because v mode does not allow
  // mixing a union and -- at the same nesting level.
  const allowed = /^[[\p{Script=Latin}\p{Number}\p{Space_Separator}\p{Punctuation}]--[\p{Currency_Symbol}]]+$/v;
  return allowed.test(input);
}
console.log(validateLatinInput('Hello, World!')); // true
console.log(validateLatinInput('café résumé')); // true
console.log(validateLatinInput('Price: $100')); // false ($ is currency)
console.log(validateLatinInput('Привет')); // false (Cyrillic)
Practical Example: International Search
// Search that handles diacritics and case for any script
function createFlexibleSearch(searchTerm) {
  // Normalize the search term
  const normalized = searchTerm.normalize('NFD').replace(/\p{Mark}/gu, '');
  // Build a pattern that matches with or without diacritics
  let pattern = '';
  for (const char of normalized) {
    // Each character can optionally be followed by combining marks
    pattern += escapeRegExp(char) + '\\p{Mark}*';
  }
  return new RegExp(pattern, 'giu');
}
function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
const searchRegex = createFlexibleSearch('cafe');
const text = 'Visit the café or the CAFÉ or the Cafe!';
// The pattern expects base letters followed by combining marks, so the text
// must be decomposed (NFD) first; a precomposed "é" would not match "e".
const matches = text.normalize('NFD').match(searchRegex) || [];
console.log(matches.map(m => m.normalize('NFC')));
// ["café", "CAFÉ", "Cafe"]
// Matches regardless of accents and case
Summary
Unicode support in JavaScript regular expressions transforms the regex engine from an ASCII-centric tool into one that handles the full range of human writing systems:
- The u flag enables Unicode mode: surrogate pairs are treated as single characters, \u{XXXXX} code point escapes become available, \p{...} property escapes are enabled, and invalid escapes throw errors instead of being silently accepted. Use it whenever working with text that may contain non-ASCII characters.
- Unicode property escapes (\p{...}) match characters by their Unicode properties: \p{Letter} for any letter, \p{Number} for any numeric character, \p{Script=Latin} for Latin script, \p{Emoji} for emoji, and many more. The inverse \P{...} matches characters without the property.
- The v flag (ES2024) is the successor to u, adding set operations in character classes (&& for intersection, -- for subtraction) and string properties (\p{RGI_Emoji}) that match multi-code-point sequences like compound emoji, flag sequences, and skin-tone-modified emoji.
- For international applications, prefer \p{Letter} over \w, \p{Number} over \d, and \p{Script=...} for script-specific matching. The traditional \w and \d classes only cover ASCII, which is insufficient for global text processing.
- Use v in modern projects for the most powerful and precise Unicode matching. Fall back to u for broader browser compatibility.