What Is UTF-16? (Java, JavaScript, Windows Use It)
UTF-16 encodes Unicode using 16-bit code units (2 or 4 bytes per character). Native to JavaScript strings, Java, and Windows APIs. Plain explanation of surrogate pairs.
Short answer
UTF-16 is a Unicode encoding that uses 16-bit code units. Characters in the Basic Multilingual Plane (Latin letters, most CJK characters, and most other scripts in everyday use) take one 16-bit unit (2 bytes); characters outside it (most emoji, some historical scripts) take two units paired together (4 bytes total), called a "surrogate pair." JavaScript strings, Java strings, and Windows APIs all use UTF-16 internally.
UTF-16 vs UTF-8
| | UTF-8 | UTF-16 |
|---|---|---|
| Code unit | 8 bits (1 byte) | 16 bits (2 bytes) |
| Bytes per char (ASCII) | 1 | 2 |
| Bytes per char (CJK) | 3 | 2 |
| Bytes per char (emoji) | 4 | 4 (surrogate pair) |
| ASCII-compatible | Yes | No |
| Endianness matters? | No | Yes (BE vs LE; needs a BOM or an explicit label) |
| Used by | Web, Linux, macOS files | JS strings, Java, Windows API |
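The byte counts in the table can be checked with a short sketch. It assumes a runtime that provides the standard TextEncoder (modern browsers, Node.js); the helper names utf8Bytes and utf16Bytes are invented for this illustration and are not part of any library.

```javascript
// Hypothetical helpers for comparing encoded sizes.

// UTF-8 size: TextEncoder always produces UTF-8 bytes.
function utf8Bytes(s) {
  return new TextEncoder().encode(s).length;
}

// UTF-16 size without a BOM: every code unit counted by .length is 2 bytes.
function utf16Bytes(s) {
  return s.length * 2;
}

for (const sample of ["A", "中", "😀"]) {
  console.log(sample, "UTF-8:", utf8Bytes(sample), "UTF-16:", utf16Bytes(sample));
}
// A UTF-8: 1 UTF-16: 2
// 中 UTF-8: 3 UTF-16: 2
// 😀 UTF-8: 4 UTF-16: 4
```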
The "JavaScript .length lies" problem
JavaScript strings are sequences of UTF-16 code units, not characters. For most characters this is fine: "Hello".length is 5. But:
"😀".length // 2 (surrogate pair, but "1 character")
"👨‍👩‍👧‍👦".length // 11 (multiple emoji + zero-width joiners)
"Café".length // 4 if é is a single code point, 5 if é is "e" + combining accent
For visible-character counts (grapheme clusters), use Intl.Segmenter or our character counter, which handles this correctly.
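As a rough sketch of the Intl.Segmenter approach, the snippet below compares three ways of counting the same string. The helper name countGraphemes is hypothetical, and the grapheme result for the family emoji assumes a runtime with reasonably current Unicode data.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(s, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
}

// Family emoji written with escapes so the zero-width joiners (U+200D) are visible.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);                // 11 UTF-16 code units
console.log([...family].length);           // 7 code points
console.log(countGraphemes(family));       // 1 grapheme cluster
console.log(countGraphemes("Cafe\u0301")); // 4 ("e" + combining accent counts once)
```

Which count is "right" depends on the goal: storage limits usually care about code units or bytes, while UI limits usually care about grapheme clusters.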
Surrogate pairs explained
UTF-16's 16-bit code units can represent values 0-65,535, which covers the "Basic Multilingual Plane" (U+0000–U+FFFF). Unicode, however, has 1,114,112 code points (U+0000–U+10FFFF). To encode characters above U+FFFF, UTF-16 uses two "surrogate" code units as a pair:
- High surrogate: in range U+D800–U+DBFF
- Low surrogate: in range U+DC00–U+DFFF
A pair encodes one character. Example: 😀 (U+1F600) becomes 0xD83D 0xDE00 in UTF-16.
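The arithmetic behind the example can be sketched as follows: subtract 0x10000 to get a 20-bit value, then put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate. The function names toSurrogatePair and fromSurrogatePair are made up for this illustration.

```javascript
// Hypothetical helper: split a supplementary code point into a surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: the inverse operation.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600
```

In everyday code, the built-ins String.fromCodePoint() and String.prototype.codePointAt() do this work for you.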
Endianness and BOM
UTF-16 has two byte orderings:
- UTF-16BE (big-endian), most-significant byte first: 0xD83D = D8 3D
- UTF-16LE (little-endian), least-significant byte first: 0xD83D = 3D D8
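A BOM (Byte Order Mark, U+FEFF) at the start of a file declares which one: the bytes FE FF mean big-endian, and FF FE mean little-endian. Without a BOM, the byte order has to come from an external label (such as a declared charset), or you have to guess. UTF-8 has no endianness issue.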
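To make the two byte orders concrete, here is a minimal sketch that serializes a string's code units in either order and checks the first two bytes for a BOM. The helpers encodeUtf16, detectBom, and hex are hypothetical names; DataView and Uint8Array are standard.

```javascript
// Hypothetical helper: write each UTF-16 code unit as 2 bytes in the chosen order.
function encodeUtf16(s, littleEndian) {
  const bytes = new Uint8Array(s.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < s.length; i++) {
    view.setUint16(i * 2, s.charCodeAt(i), littleEndian);
  }
  return bytes;
}

// Hypothetical helper: report the byte order announced by a leading BOM, if any.
function detectBom(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  }
  return null; // no BOM: the byte order must come from somewhere else
}

const hex = (bytes) =>
  [...bytes].map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(encodeUtf16("\u{1F600}", false)));     // d8 3d de 00
console.log(hex(encodeUtf16("\u{1F600}", true)));      // 3d d8 00 de
console.log(detectBom(encodeUtf16("\uFEFFhi", true))); // UTF-16LE
```

This check is deliberately simplistic: a real decoder would also need to rule out other encodings whose BOM starts with the same bytes.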
When you'll encounter UTF-16
- JavaScript string operations — substring, slice, indexOf operate on UTF-16 code units
- Java APIs — String.charAt() returns a 16-bit char
- Windows file paths — internally UTF-16
- Older XML / .NET file formats — sometimes UTF-16-encoded by default
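For the JavaScript case in the list above, the following sketch shows how code-unit indexing surfaces in practice. takeCodePoints is a hypothetical helper, shown as one possible way to truncate without cutting a surrogate pair in half.

```javascript
const s = "a\u{1F600}b"; // "a😀b": 3 code points, 4 UTF-16 code units

console.log(s.length);                    // 4
console.log(s.indexOf("b"));              // 3 (a code unit index, not a character index)
console.log(s.slice(0, 2) === "a\uD83D"); // true: the slice ends on a lone high surrogate

// Hypothetical helper: keep the first n code points, never splitting a pair.
function takeCodePoints(str, n) {
  return Array.from(str).slice(0, n).join("");
}

console.log(takeCodePoints(s, 2) === "a\u{1F600}"); // true
```

Note that this works at the code point level only; it can still separate a base letter from a combining accent or break up a joined emoji sequence.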
Conversion gotchas
- btoa() in JS only accepts Latin-1 (code units 0-255) and throws on anything else. To Base64-encode a Unicode string, encode to UTF-8 bytes first: btoa(unescape(encodeURIComponent(s))), or use TextEncoder (unescape is deprecated).
- String length ≠ file size. A string whose .length is 100 is 200 bytes in UTF-16, but could be anywhere from 100 to 300 bytes in UTF-8 depending on content.
- Iterating by character requires care. for (const c of "😀") in JavaScript iterates by code point, so it keeps surrogate pairs together; indexed loops with charAt() can split a surrogate pair.
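A minimal sketch of the first and last gotchas, assuming a runtime where TextEncoder and btoa are available as globals (browsers, recent Node.js). The name toBase64Utf8 is hypothetical.

```javascript
// Hypothetical helper: Base64-encode a string via its UTF-8 bytes.
function toBase64Utf8(s) {
  const bytes = new TextEncoder().encode(s);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b); // each byte becomes one Latin-1 character
  }
  return btoa(binary);
}

console.log(toBase64Utf8("é")); // w6k= (UTF-8 bytes C3 A9)

// Iteration: for...of walks code points, an indexed loop walks code units.
const text = "a\u{1F600}";
const codePoints = [];
for (const c of text) {
  codePoints.push(c.codePointAt(0).toString(16));
}
const codeUnits = [];
for (let i = 0; i < text.length; i++) {
  codeUnits.push(text.charCodeAt(i).toString(16));
}
console.log(codePoints.join(" ")); // 61 1f600
console.log(codeUnits.join(" "));  // 61 d83d de00
```

Building the intermediate string one byte at a time is fine for short inputs; for large payloads a chunked approach would likely be preferable.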
Related tools
Convert text to binary or hex to inspect raw byte representations: text to binary, text to hex. Count characters with proper Unicode awareness: character counter.