Unicode and String Internals in JavaScript
When you write "hello".length, JavaScript tells you 5. Intuitive and correct. But when you write "😀".length, JavaScript tells you 2. And "é".length might return 1 or 2 depending on how the character was encoded. Understanding why requires diving into how JavaScript actually stores and processes strings internally.
JavaScript strings are sequences of UTF-16 code units, not characters in the way humans think of them. This design decision, made in the mid-1990s when Unicode was much smaller, creates real-world problems with emoji, accented characters, and scripts from many languages. This guide explains how Unicode works, how JavaScript represents it internally, and the tools available to handle strings correctly in the modern, emoji-filled world.
Unicode Basics: Code Points, UTF-16, Surrogate Pairs
What Is Unicode?
Unicode is a universal standard that assigns a unique number to every character in every writing system. These numbers are called code points. Unicode 15.0 (released in 2022) defines 149,186 characters covering 161 scripts, and later versions keep adding more.
A code point is written as U+ followed by a hexadecimal number:
U+0041 → A (Latin capital letter A)
U+0042 → B
U+00E9 → é (Latin small letter e with acute)
U+4E16 → 世 (Chinese character "world")
U+1F600 → 😀 (Grinning face emoji)
U+1F1FA U+1F1F8 → 🇺🇸 (US flag - two code points!)
Code points range from U+0000 to U+10FFFF, organized into 17 "planes" of 65,536 code points each:
- Plane 0 (U+0000 to U+FFFF): Basic Multilingual Plane (BMP). Contains most common characters: Latin, Greek, Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, and many symbols.
- Planes 1-16 (U+10000 to U+10FFFF): Supplementary planes. Contains emoji, historic scripts, musical notation, mathematical symbols, and rare CJK characters.
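As a small illustration of the plane layout, the plane number of a code point can be read off with an integer division by 0x10000. The helper name planeOf is made up for this sketch and is not a built-in.

```javascript
// Hypothetical helper: which of the 17 planes does a code point fall in?
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a valid Unicode code point: " + codePoint);
  }
  // Each plane holds 0x10000 (65,536) code points.
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x0041));   // 0  (BMP)
console.log(planeOf(0x4E16));   // 0  (BMP)
console.log(planeOf(0x1F600));  // 1  (supplementary)
console.log(planeOf(0x10FFFF)); // 16 (last plane)
```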
What Is UTF-16?
UTF-16 is an encoding that transforms Unicode code points into a sequence of 16-bit values called code units. JavaScript strings are internally stored as sequences of UTF-16 code units.
For code points in the BMP (U+0000 to U+FFFF), each code point maps directly to a single 16-bit code unit:
U+0041 (A) → one code unit: 0x0041
U+00E9 (é) → one code unit: 0x00E9
U+4E16 (世) → one code unit: 0x4E16
For code points above U+FFFF (supplementary planes), a single 16-bit value is not enough. These code points are encoded as two code units called a surrogate pair:
U+1F600 (😀) → two code units: 0xD83D 0xDE00
U+1F4A9 (💩) → two code units: 0xD83D 0xDCA9
U+1D11E (𝄞) → two code units: 0xD834 0xDD1E
How Surrogate Pairs Work
The BMP reserves a range of code points specifically for surrogate pairs:
- High surrogates: U+D800 to U+DBFF (1024 values)
- Low surrogates: U+DC00 to U+DFFF (1024 values)
A high surrogate followed by a low surrogate encodes one supplementary code point. The formula:
codePoint = (highSurrogate - 0xD800) × 0x400 + (lowSurrogate - 0xDC00) + 0x10000
// 😀 is U+1F600
const high = 0xD83D;
const low = 0xDE00;
const codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
console.log(codePoint.toString(16)); // "1f600"
console.log(String.fromCodePoint(codePoint)); // "😀"
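The formula can also be run in reverse to split a supplementary code point into its two surrogates. The following is a minimal sketch; toSurrogatePair is a hypothetical helper name, and in real code String.fromCodePoint() does this work for you.

```javascript
// Hypothetical helper: split a supplementary code point into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Only U+10000 to U+10FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000;               // a 20-bit value
  const high = 0xD800 + Math.floor(offset / 0x400); // top 10 bits
  const low = 0xDC00 + (offset % 0x400);            // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1F600);
console.log(hi.toString(16), lo.toString(16));            // "d83d de00"
console.log(String.fromCharCode(hi, lo) === "\u{1F600}"); // true
```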
Why This Matters for JavaScript
The core string operations in JavaScript (length, bracket indexing, charAt(), charCodeAt(), slice(), split("")) operate on code units, not code points or visual characters:
// BMP characters: one code unit each
console.log("A".length); // 1 (one code unit)
console.log("é".length); // 1 (one code unit, if using the precomposed form)
console.log("世".length); // 1 (one code unit)
// Supplementary characters: TWO code units (surrogate pair)
console.log("😀".length); // 2 (two code units!)
console.log("💩".length); // 2
console.log("𝄞".length); // 2 (musical symbol G clef)
// Indexing operates on code units too
console.log("😀"[0]); // "\uD83D" (the high surrogate, not a valid character by itself!)
console.log("😀"[1]); // "\uDE00" (the low surrogate, not a valid character either!)
// This means string operations can break emoji:
const text = "Hello 😀 World";
console.log(text.length); // 14 (not 13! The emoji counts as two code units)
console.log(text.slice(0, 8)); // "Hello 😀" (both halves of the surrogate pair are included)
console.log(text.slice(0, 6)); // "Hello " (before emoji start)
console.log(text.slice(6, 7)); // "\uD83D" (BROKEN! Half of the emoji)
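A slice that cuts through a surrogate pair leaves a lone surrogate behind. One way to detect that, sketched here under the assumption that a plain code unit scan is acceptable, is to check that every high surrogate is followed by a low surrogate. The name hasLoneSurrogate is made up for this example. Recent engines also provide String.prototype.isWellFormed(), which answers the same question where it is available.

```javascript
// Hypothetical helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

const sample = "Hello \u{1F600} World";
console.log(hasLoneSurrogate(sample));             // false
console.log(hasLoneSurrogate(sample.slice(0, 7))); // true (cut inside the pair)
console.log(hasLoneSurrogate(sample.slice(0, 8))); // false
```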
String.fromCodePoint() and codePointAt()
ES2015 introduced methods that work with full Unicode code points instead of individual code units.
String.fromCodePoint()
Creates a string from one or more Unicode code points:
// BMP characters
console.log(String.fromCodePoint(65)); // "A" (U+0041)
console.log(String.fromCodePoint(233)); // "é" (U+00E9)
console.log(String.fromCodePoint(0x4E16)); // "世"
// Supplementary characters (above U+FFFF)
console.log(String.fromCodePoint(0x1F600)); // "😀"
console.log(String.fromCodePoint(0x1F4A9)); // "💩"
console.log(String.fromCodePoint(0x1D11E)); // "𝄞"
// Multiple code points at once
console.log(String.fromCodePoint(72, 101, 108, 108, 111)); // "Hello"
console.log(String.fromCodePoint(0x1F600, 0x1F601, 0x1F602)); // "😀😁😂"
The Older String.fromCharCode() (Limited)
The older String.fromCharCode() works with code units, not code points. It cannot handle supplementary characters directly:
// fromCharCode works for BMP
console.log(String.fromCharCode(65)); // "A"
console.log(String.fromCharCode(0x4E16)); // "世"
// fromCharCode FAILS for supplementary characters
console.log(String.fromCharCode(0x1F600)); // "\uF600" (wrong! Truncated to 16 bits, which yields a private-use character)
// You would need to manually calculate the surrogate pair:
console.log(String.fromCharCode(0xD83D, 0xDE00)); // "😀" (works but awkward)
// fromCodePoint handles this automatically:
console.log(String.fromCodePoint(0x1F600)); // "😀" (clean and correct)
String.prototype.codePointAt()
Returns the full Unicode code point at a given code unit position:
// BMP character
console.log("A".codePointAt(0)); // 65 (U+0041)
console.log("é".codePointAt(0)); // 233 (U+00E9)
// Supplementary character
console.log("😀".codePointAt(0)); // 128512 (U+1F600, the full code point!)
// Compare with the older charCodeAt (which returns code UNITS):
console.log("😀".charCodeAt(0)); // 55357 (0xD83D, just the high surrogate)
console.log("😀".charCodeAt(1)); // 56832 (0xDE00, just the low surrogate)
// codePointAt at position 1 returns the low surrogate's code point
console.log("😀".codePointAt(1)); // 56832 (0xDE00, the low surrogate)
The important caveat: codePointAt() takes a code unit index, not a code point index. For supplementary characters that occupy two code units, calling codePointAt(1) returns the low surrogate, not the next character.
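To walk a string with codePointAt() without landing on a low surrogate, the index has to advance by two whenever the code point is above U+FFFF. A minimal sketch (codePointsWithOffsets is a hypothetical name chosen for illustration):

```javascript
// Hypothetical helper: list each code point together with its code unit index.
function codePointsWithOffsets(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push({ index: i, codePoint: cp });
    // Supplementary code points occupy two code units, so skip both.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

console.log(codePointsWithOffsets("a\u{1F600}b"));
// [ { index: 0, codePoint: 97 },
//   { index: 1, codePoint: 128512 },
//   { index: 3, codePoint: 98 } ]
```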
Iterating Over Code Points Correctly
The for...of loop iterates over code points, not code units. This is the easiest way to handle Unicode correctly:
const text = "Hello 😀 World 🌍";
// WRONG: for loop iterates over code units
for (let i = 0; i < text.length; i++) {
// text[i] might be half of a surrogate pair
}
// CORRECT: for...of iterates over code points
for (const char of text) {
console.log(char, char.codePointAt(0).toString(16));
}
// H 48
// e 65
// l 6c
// l 6c
// o 6f
// 20
// 😀 1f600 ← one iteration for the full emoji
// 20
// W 57
// o 6f
// r 72
// l 6c
// d 64
// 20
// 🌍 1f30d ← one iteration for the full emoji
// Spread operator also works with code points
const chars = [...text];
console.log(chars.length); // 15 (not 17!)
console.log(chars[6]); // "😀" (the complete emoji, not half of it)
Code Point-Aware String Length
function codePointLength(str) {
return [...str].length;
}
console.log("Hello".length); // 5
console.log(codePointLength("Hello")); // 5
console.log("Hello 😀".length); // 8 (counts code units)
console.log(codePointLength("Hello 😀")); // 7 (counts code points)
// Using Array.from
console.log(Array.from("😀😁😂").length); // 3 (correct!)
console.log("😀😁😂".length); // 6 (code units, doubled)
Unicode Normalization: normalize()
A single visual character can sometimes be represented by different sequences of code points. For example, the letter "é" can be:
- Precomposed (NFC): A single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
- Decomposed (NFD): Two code points: U+0065 (LATIN SMALL LETTER E) + U+0301 (COMBINING ACUTE ACCENT)
Both look identical on screen, but they are different sequences of code points, so JavaScript treats them as different strings.
The Problem
const precomposed = "\u00E9"; // é (one code point)
const decomposed = "\u0065\u0301"; // é (two code points: e + combining accent)
// They LOOK identical
console.log(precomposed); // é
console.log(decomposed); // é
// But they are NOT equal
console.log(precomposed === decomposed); // false!
console.log(precomposed.length); // 1
console.log(decomposed.length); // 2
// This breaks searches, comparisons, and sorting
const text = "caf\u00E9"; // "café" with precomposed é
console.log(text.includes("caf\u0065\u0301")); // false! Even though both are "café"
normalize() to the Rescue
The normalize() method converts a string to a standard normalization form:
const precomposed = "\u00E9";
const decomposed = "\u0065\u0301";
// NFC: Canonical Decomposition, followed by Canonical Composition
// → Converts to the shortest (precomposed) form
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(decomposed.normalize("NFC").length); // 1
// NFD: Canonical Decomposition
// → Converts to the longest (decomposed) form
console.log(precomposed.normalize("NFD") === decomposed.normalize("NFD")); // true
console.log(precomposed.normalize("NFD").length); // 2
// Default is NFC
console.log(decomposed.normalize().length); // 1
console.log(decomposed.normalize() === precomposed); // true
Normalization Forms
| Form | Name | Description | Use Case |
|---|---|---|---|
| NFC | Canonical Composition | Composes characters where possible | Default choice for most uses: comparison, storage |
| NFD | Canonical Decomposition | Decomposes into base + combining marks | Text processing, accent removal |
| NFKC | Compatibility Composition | Like NFC but also normalizes compatibility chars | Search, identifier comparison |
| NFKD | Compatibility Decomposition | Like NFD but also normalizes compatibility chars | Full-text search, folding |
Practical Examples
// Safe string comparison
function unicodeEqual(a, b) {
return a.normalize("NFC") === b.normalize("NFC");
}
console.log(unicodeEqual("café", "cafe\u0301")); // true
// Removing accents (using NFD decomposition)
function removeAccents(str) {
return str
.normalize("NFD")
.replace(/[\u0300-\u036f]/g, ""); // Remove combining diacritical marks
}
console.log(removeAccents("café")); // "cafe"
console.log(removeAccents("naïve")); // "naive"
console.log(removeAccents("résumé")); // "resume"
console.log(removeAccents("Ñoño")); // "Nono"
console.log(removeAccents("über")); // "uber"
// Compatibility normalization
console.log("\uFB01".normalize("NFKC")); // "fi" (U+FB01, the fi ligature → two separate letters)
console.log("⑤".normalize("NFKC")); // "5" (circled digit → plain digit)
console.log("Ⅳ".normalize("NFKC")); // "IV" (Roman numeral → letters)
When comparing user-supplied strings (usernames, search queries, form input), always normalize first. Users may input the same visual character in different ways depending on their keyboard, OS, or input method. string.normalize("NFC") before comparison prevents invisible mismatches.
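The same idea applies to substring search. The sketch below normalizes both sides before calling includes(); the helper name includesNormalized and its optional form parameter are assumptions made for this example, not part of any standard API.

```javascript
// Hypothetical helper: substring search that ignores normalization differences.
function includesNormalized(haystack, needle, form = "NFC") {
  return haystack.normalize(form).includes(needle.normalize(form));
}

const menu = "caf\u00E9 au lait"; // precomposed é
const query = "cafe\u0301";       // e + combining acute accent

console.log(menu.includes(query));            // false
console.log(includesNormalized(menu, query)); // true
```

Normalizing on every call is the simplest approach. For large collections it may be cheaper to normalize once when the data is stored.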
\u{...} Escape Syntax
ES2015 introduced the \u{...} escape syntax, which can represent any Unicode code point directly, including supplementary characters.
The Old Syntax: \uXXXX (Limited to BMP)
The traditional \uXXXX escape handles only 4 hex digits, limiting it to the BMP (U+0000 to U+FFFF):
console.log("\u0041"); // "A" (U+0041)
console.log("\u00E9"); // "é" (U+00E9)
console.log("\u4E16"); // "世" (U+4E16)
// Cannot represent supplementary characters with a single escape:
console.log("\u1F600"); // "ὠ0" (wrong! Interpreted as U+1F60 followed by the character "0")
// You need a surrogate pair:
console.log("\uD83D\uDE00"); // "😀" (correct but ugly)
The New Syntax: \u{XXXXX} (Any Code Point)
The curly brace syntax accepts one or more hex digits, as long as the value does not exceed 10FFFF (so in practice 1 to 6 significant digits):
// BMP characters (same result as \uXXXX)
console.log("\u{41}"); // "A"
console.log("\u{E9}"); // "é"
console.log("\u{4E16}"); // "世"
// Supplementary characters (impossible with \uXXXX alone)
console.log("\u{1F600}"); // "😀"
console.log("\u{1F4A9}"); // "💩"
console.log("\u{1D11E}"); // "𝄞"
console.log("\u{10FFFF}"); // The maximum code point
// Much cleaner than surrogate pairs
// Old way: "\uD83D\uDE00"
// New way: "\u{1F600}"
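Going the other way, a string can be turned into its \u{...} escapes by iterating over code points. This is a small debugging sketch; toUnicodeEscapes is a hypothetical helper name.

```javascript
// Hypothetical helper: render each code point of a string as a \u{...} escape.
function toUnicodeEscapes(str) {
  return [...str]
    .map(ch => "\\u{" + ch.codePointAt(0).toString(16).toUpperCase() + "}")
    .join("");
}

console.log(toUnicodeEscapes("A"));         // \u{41}
console.log(toUnicodeEscapes("\u{1F600}")); // \u{1F600}
console.log(toUnicodeEscapes("e\u0301"));   // \u{65}\u{301}
```

Printing escapes like this makes invisible differences visible, such as combining marks or zero-width characters hiding inside user input.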
Using in Regular Expressions
The \u{...} syntax works in regular expressions when the u (unicode) flag is set:
// Without u flag: \u{...} doesn't work
// /\u{1F600}/.test("😀"); // false (parsed incorrectly)
// With u flag: \u{...} works correctly
console.log(/\u{1F600}/u.test("😀")); // true
// Match any emoji in a range
const emojiPattern = /[\u{1F600}-\u{1F64F}]/u;
console.log(emojiPattern.test("😀")); // true
console.log(emojiPattern.test("🙏")); // true
console.log(emojiPattern.test("A")); // false
// The u flag also makes . match supplementary characters
console.log(/^.$/u.test("😀")); // true (with u flag, . matches full code point)
console.log(/^.$/.test("😀")); // false (without u flag, . matches one code unit)
Handling Emoji and Multi-Code-Point Characters
Emoji present unique challenges because many emoji consist of multiple code points combined together.
Single Code Point Emoji
The simplest emoji are single code points from the supplementary planes:
console.log("😀".length); // 2 (one code point, two code units)
console.log([..."😀"].length); // 1 (one code point)
console.log("😀".codePointAt(0)); // 128512 (U+1F600)
Emoji with Variation Selectors
Some characters can be rendered as text or as emoji, controlled by a variation selector:
// U+2764 is "Heavy Black Heart"
console.log("\u2764"); // ❤ (text presentation)
console.log("\u2764\uFE0F"); // ❤️ (emoji presentation, with variation selector VS16)
console.log("\u2764".length); // 1
console.log("\u2764\uFE0F".length); // 2 (base + variation selector)
Emoji Modifiers (Skin Tones)
Skin tone modifiers are additional code points (U+1F3FB to U+1F3FF) that follow a base emoji:
console.log("👋"); // Default yellow hand
console.log("👋🏻"); // Light skin tone
console.log("👋🏽"); // Medium skin tone
console.log("👋🏿"); // Dark skin tone
console.log("👋".length); // 2 (one supplementary code point)
console.log("👋🏽".length); // 4 (base + modifier, each is a supplementary code point)
console.log([..."👋🏽"]); // ["👋", "🏽"] (two code points, but ONE visual character!)
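Because the modifiers occupy a contiguous range, they can be removed with a single character class, provided the u flag is set. A minimal sketch, with stripSkinTone as a made-up helper name:

```javascript
// Hypothetical helper: drop skin tone modifiers (U+1F3FB to U+1F3FF).
function stripSkinTone(str) {
  return str.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, "");
}

const waving = "\u{1F44B}\u{1F3FD}"; // waving hand + medium skin tone

console.log(waving.length);                         // 4
console.log(stripSkinTone(waving) === "\u{1F44B}"); // true
console.log(stripSkinTone(waving).length);          // 2
```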
Zero-Width Joiner (ZWJ) Sequences
Some emoji are composed by joining multiple emoji with a Zero-Width Joiner (U+200D):
// Family emoji: composed of multiple people joined by ZWJ
const family = "👨\u200D👩\u200D👧\u200D👦"; // Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy (ZWJ written as \u200D so it stays visible in source)
console.log(family.length); // 11 code units!
console.log([...family].length); // 7 code points!
// But visually it is ONE character (one glyph)
// Breaking down the family emoji:
const parts = [...family];
console.log(parts);
// ["👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦"] (the ZWJ entries are invisible when printed)
// Man ZWJ Woman ZWJ Girl ZWJ Boy
// More ZWJ examples:
console.log("👩\u200D💻".length); // 5 (Woman + ZWJ + Laptop = Woman Technologist)
console.log("🏳\uFE0F\u200D🌈".length); // 6 (White Flag + VS16 + ZWJ + Rainbow = Pride Flag)
console.log("👩\u200D❤\uFE0F\u200D👨".length); // 8 (Woman + ZWJ + Heart + VS16 + ZWJ + Man = Couple with Heart)
Flag Emoji (Regional Indicator Sequences)
Country flags are composed of two Regional Indicator characters:
// US flag: Regional Indicator U + Regional Indicator S
const usFlag = "🇺🇸";
console.log(usFlag.length); // 4 code units
console.log([...usFlag].length); // 2 code points
// Two code points, but ONE visual character
// Each regional indicator is a supplementary character
console.log("🇺".codePointAt(0).toString(16)); // 1f1fa (Regional Indicator U)
console.log("🇸".codePointAt(0).toString(16)); // 1f1f8 (Regional Indicator S)
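Since the 26 regional indicators run in alphabetical order starting at U+1F1E6 (letter A), a flag can be built from a two-letter code with simple arithmetic. The helper below is a sketch under that assumption; flagFromRegionCode is a hypothetical name, and whether a given pair actually renders as a flag depends on the platform's fonts.

```javascript
// Hypothetical helper: build a flag emoji from a two-letter region code.
function flagFromRegionCode(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new TypeError("Expected a two-letter code such as \"US\"");
  }
  const REGIONAL_INDICATOR_A = 0x1F1E6; // U+1F1E6, regional indicator for "A"
  const points = [...code.toUpperCase()].map(
    letter => REGIONAL_INDICATOR_A + (letter.charCodeAt(0) - 0x41)
  );
  return String.fromCodePoint(...points);
}

console.log(flagFromRegionCode("US") === "\u{1F1FA}\u{1F1F8}"); // true
console.log(flagFromRegionCode("fr").length);                    // 4 code units
```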
The Core Problem: What Is a "Character"?
These examples reveal that the concept of "character" has multiple levels:
| Level | Name | "👨👩👧👦" | Method to Count |
|---|---|---|---|
| Code units | UTF-16 units | 11 | .length |
| Code points | Unicode code points | 7 | [...str].length |
| Grapheme clusters | Visual characters | 1 | Intl.Segmenter |
.length counts code units. Spreading with [...] gives code points. But neither gives you what humans consider "characters." For that, you need grapheme cluster segmentation.
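The three levels can be put side by side with a small helper. This is only a sketch: the name measure and the shape of its return value are assumptions for illustration, and it relies on Intl.Segmenter, which the next section covers.

```javascript
// Hypothetical helper: report a string's length at all three levels.
function measure(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return {
    codeUnits: str.length,
    codePoints: [...str].length,
    graphemes: [...segmenter.segment(str)].length,
  };
}

// Family emoji written with explicit ZWJ (U+200D) escapes.
const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(measure(familyEmoji)); // { codeUnits: 11, codePoints: 7, graphemes: 1 }
console.log(measure("e\u0301"));   // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```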
Grapheme Clusters and Intl.Segmenter
A grapheme cluster is what a human perceives as a single character. It may consist of one or many code points. The Intl.Segmenter API (supported in all modern browsers) can segment strings by grapheme clusters, words, or sentences.
Basic Usage
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const text = "Hello 😀 World 👨\u200D👩\u200D👧\u200D👦 🇺🇸"; // ZWJ written as \u200D so it stays visible in source
const segments = [...segmenter.segment(text)];
console.log(segments.map(s => s.segment));
// ["H", "e", "l", "l", "o", " ", "😀", " ", "W", "o", "r", "l", "d", " ", "👨👩👧👦", " ", "🇺🇸"]
console.log(segments.length); // 17 visual characters
console.log(text.length); // 31 code units, very different!
Accurate String Length
function graphemeLength(str) {
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
return [...segmenter.segment(str)].length;
}
console.log(graphemeLength("Hello")); // 5
console.log(graphemeLength("😀😁😂")); // 3
console.log(graphemeLength("👨\u200D👩\u200D👧\u200D👦")); // 1 (one family emoji)
console.log(graphemeLength("🇺🇸🇬🇧🇫🇷")); // 3 (three flags)
console.log(graphemeLength("café")); // 4 (regardless of normalization form)
console.log(graphemeLength("👋🏽")); // 1 (hand with skin tone = one grapheme)
console.log(graphemeLength("नमस्ते")); // 3 or 4, depending on the Unicode version the engine implements (Hindi "Namaste", 6 code points)
Safe String Slicing
function graphemeSlice(str, start, end) {
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const segments = [...segmenter.segment(str)];
return segments.slice(start, end).map(s => s.segment).join("");
}
const text = "Hello 😀 World 👨\u200D👩\u200D👧\u200D👦";
// Safe slicing that respects grapheme boundaries
console.log(graphemeSlice(text, 0, 7)); // "Hello 😀" (emoji is intact)
console.log(graphemeSlice(text, 6, 7)); // "😀" (complete emoji)
console.log(graphemeSlice(text, -1)); // "👨👩👧👦" (complete family emoji)
// Compare with naive slicing
console.log(text.slice(6, 8)); // "😀" (works by coincidence)
console.log(text.slice(6, 7)); // "\uD83D" (BROKEN! Half an emoji)
Safe String Reversal
Reversing a string naively breaks surrogate pairs and grapheme clusters:
// BROKEN: Naive reversal
function naiveReverse(str) {
return str.split("").reverse().join("");
}
console.log(naiveReverse("Hello 😀"));
// "\uDE00\uD83D olleH" (BROKEN! The two surrogates are swapped, so the emoji is destroyed)
// BETTER: Code-point-aware reversal
function codePointReverse(str) {
return [...str].reverse().join("");
}
console.log(codePointReverse("Hello 😀")); // "😀 olleH" (emoji is intact)
// But still breaks multi-code-point characters!
console.log(codePointReverse("👨\u200D👩\u200D👧\u200D👦")); // "👦👧👩👨" (reversed family members, no longer a single glyph!)
console.log(codePointReverse("🇺🇸")); // "🇸🇺" (SU flag instead of US!)
// CORRECT: Grapheme-aware reversal
function graphemeReverse(str) {
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
return [...segmenter.segment(str)].map(s => s.segment).reverse().join("");
}
console.log(graphemeReverse("Hello 😀")); // "😀 olleH" (correct)
console.log(graphemeReverse("👨\u200D👩\u200D👧\u200D👦 hi")); // "ih 👨👩👧👦" (family emoji intact)
console.log(graphemeReverse("🇺🇸🇬🇧")); // "🇬🇧🇺🇸" (flags intact and correctly reversed)
Word and Sentence Segmentation
Intl.Segmenter also handles word and sentence boundaries, which vary by language:
// Word segmentation
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const words = [...wordSegmenter.segment("Hello, world! How are you?")];
const actualWords = words.filter(s => s.isWordLike).map(s => s.segment);
console.log(actualWords); // ["Hello", "world", "How", "are", "you"]
// Sentence segmentation
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const sentences = [...sentenceSegmenter.segment("Hello! How are you? I'm fine.")];
console.log(sentences.map(s => s.segment));
// ["Hello! ", "How are you? ", "I'm fine."]
Truncating User Input Safely
A practical example combining everything: truncating a user's display name to a maximum number of visual characters:
function truncateGraphemes(str, maxGraphemes, suffix = "...") {
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const segments = [...segmenter.segment(str)];
if (segments.length <= maxGraphemes) {
return str;
}
const truncated = segments
.slice(0, maxGraphemes)
.map(s => s.segment)
.join("");
return truncated + suffix;
}
console.log(truncateGraphemes("Hello World", 5)); // "Hello..."
console.log(truncateGraphemes("Hello 😀🌍🎉", 7)); // "Hello 😀..."
console.log(truncateGraphemes("👨\u200D👩\u200D👧\u200D👦".repeat(3), 2)); // "👨👩👧👦👨👩👧👦..." (two complete family emoji)
console.log(truncateGraphemes("Hi", 5)); // "Hi" (no truncation needed)
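The version above builds a new segmenter and materializes every segment on each call. For long inputs, one possible refinement is to share a single Intl.Segmenter instance and stop iterating as soon as the limit is reached. The sketch below assumes that design; the names graphemeSegmenter and truncateGraphemesLazy are hypothetical.

```javascript
// Hypothetical variant: one shared segmenter, and iteration stops early.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function truncateGraphemesLazy(str, maxGraphemes, suffix = "...") {
  let count = 0;
  for (const { index } of graphemeSegmenter.segment(str)) {
    if (count === maxGraphemes) {
      // `index` is the code unit offset where the next grapheme starts,
      // so slicing here never splits a cluster.
      return str.slice(0, index) + suffix;
    }
    count++;
  }
  return str; // maxGraphemes or fewer graphemes: nothing to cut
}

console.log(truncateGraphemesLazy("Hello World", 5)); // "Hello..."
console.log(truncateGraphemesLazy("Hi", 5));          // "Hi"
```

Each segment object carries the index of its first code unit in the original string, which is what makes the early exit possible without joining segments back together.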
Regular Expressions and Unicode
The u and v flags enable proper Unicode handling in regular expressions:
// Without u flag: . matches one code UNIT
console.log(/^.$/.test("😀")); // false (😀 is two code units)
// With u flag: . matches one code POINT
console.log(/^.$/u.test("😀")); // true
// Unicode property escapes (requires u or v flag)
// Match any letter from any script
console.log(/\p{Letter}/u.test("A")); // true
console.log(/\p{Letter}/u.test("é")); // true
console.log(/\p{Letter}/u.test("世")); // true
console.log(/\p{Letter}/u.test("😀")); // false (emoji is not a letter)
console.log(/\p{Letter}/u.test("5")); // false (digit is not a letter)
// Match emoji
console.log(/\p{Emoji}/u.test("😀")); // true
console.log(/\p{Emoji}/u.test("A")); // false
// Match specific scripts
console.log(/\p{Script=Greek}/u.test("Ω")); // true
console.log(/\p{Script=Cyrillic}/u.test("Д")); // true
console.log(/\p{Script=Han}/u.test("世")); // true
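One caveat worth illustrating: the Emoji property is broader than its name suggests, because digits, "#", and "*" also carry it (they can start keycap sequences). For detecting pictographic symbols, the Extended_Pictographic property is often a closer fit. The helper name containsPictograph below is made up for this sketch.

```javascript
// \p{Emoji} also matches plain digits, which is rarely what you want.
console.log(/\p{Emoji}/u.test("5"));                 // true (perhaps surprising)
console.log(/\p{Extended_Pictographic}/u.test("5")); // false

// Hypothetical helper: does the string contain any pictographic symbol?
function containsPictograph(str) {
  return /\p{Extended_Pictographic}/u.test(str);
}

console.log(containsPictograph("Room 5"));           // false
console.log(containsPictograph("Hello \u{1F600}"));  // true
```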
Summary
| Concept | Key Takeaway |
|---|---|
| Unicode code point | A unique number assigned to each character (U+0000 to U+10FFFF) |
| UTF-16 | JavaScript's internal encoding; uses 16-bit code units |
| Surrogate pairs | Two code units encoding one supplementary code point (above U+FFFF) |
| .length | Counts code units, not characters; emoji and supplementary chars count as 2+ |
| codePointAt() | Returns the full code point at a code unit index |
| String.fromCodePoint() | Creates a string from code point values (handles any code point) |
| [...str] | Spreads by code points, not code units; handles surrogate pairs correctly |
| for...of | Iterates by code points; safe for supplementary characters |
| normalize() | Converts to a standard form (NFC/NFD); essential for correct comparisons |
| \u{XXXXX} | Escape syntax for any code point; cleaner than surrogate pair escapes |
| Grapheme clusters | What humans see as one character; may be multiple code points |
| Intl.Segmenter | Segments strings by graphemes, words, or sentences; the correct way to count visual characters |
| u flag in regex | Enables full Unicode support: . matches code points, \p{} property escapes |
Understanding JavaScript's string internals is not academic trivia. It is essential for building applications that correctly handle names from every language, display emoji properly, implement accurate character counters, perform reliable string searching and comparison, and avoid subtle bugs that only appear when users write in non-Latin scripts or use emoji. The tools exist: for...of, spread syntax, normalize(), and Intl.Segmenter. The key is knowing when .length and [] indexing are not enough.