What Is UTF-16? (Java, JavaScript, Windows Use It)

Short answer

UTF-16 is a Unicode encoding that uses 16-bit code units. Characters in the Basic Multilingual Plane (Latin letters, most everyday CJK characters) take one 16-bit unit (2 bytes); characters outside it (most emoji, some historical scripts) take two units (4 bytes total), called a "surrogate pair." JavaScript strings, Java strings, and Windows APIs all expose text as UTF-16 code units.

UTF-16 vs UTF-8

	UTF-8	UTF-16
Code unit	8 bits (1 byte)	16 bits (2 bytes)
Bytes per char (ASCII)	1	2
Bytes per char (CJK)	3	2
Bytes per char (emoji)	4	4 (surrogate pair)
ASCII-compatible	Yes	No
Endianness matters?	No	Yes (BE vs LE; signalled by a BOM or an explicit label)
Used by	Web, Linux, macOS files	JS strings, Java, Windows API
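
The byte counts in the table can be checked directly. The following is a minimal sketch, assuming a runtime with a global TextEncoder (modern browsers, recent Node.js); byteSizes is a hypothetical helper name, not a built-in.

```javascript
// Compare how many bytes the same text needs in UTF-8 and UTF-16.
// byteSizes is an illustrative helper, not part of any standard API.
function byteSizes(text) {
  return {
    // TextEncoder always produces UTF-8 bytes.
    utf8: new TextEncoder().encode(text).length,
    // Each JavaScript string index is one 16-bit code unit (2 bytes, no BOM counted).
    utf16: text.length * 2,
  };
}

console.log(byteSizes("A"));         // { utf8: 1, utf16: 2 }  ASCII
console.log(byteSizes("\u4E2D"));    // { utf8: 3, utf16: 2 }  CJK (中)
console.log(byteSizes("\u{1F600}")); // { utf8: 4, utf16: 4 }  emoji (😀)
```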

The "JavaScript .length lies" problem

JavaScript strings are sequences of UTF-16 code units, not characters. For most characters this is fine — "Hello".length is 5. But:

"😀".length            // 2 (surrogate pair, but "1 character")
"👨‍👩‍👧‍👦".length      // 11 (multiple emoji + zero-width joiners)
"Café".length          // 4 if é is a single code point, 5 if é is "e" + combining accent

For visible-character counts, use Intl.Segmenter with grapheme granularity, or our character counter, which handles this correctly.
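
As a rough illustration of the Intl.Segmenter approach, the sketch below counts grapheme clusters instead of code units. It assumes a runtime that ships Intl.Segmenter; countGraphemes and the "en" default locale are illustrative choices, not part of any API.

```javascript
// Count user-perceived characters (grapheme clusters) rather than UTF-16 code units.
// countGraphemes is a hypothetical helper name.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count++;
  return count;
}

// Family emoji: four emoji joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);                // 11 (UTF-16 code units)
console.log([...family].length);           // 7  (code points)
console.log(countGraphemes(family));       // 1  (grapheme cluster)
console.log(countGraphemes("Cafe\u0301")); // 4  ("e" + combining accent is one cluster)
```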

Surrogate pairs explained

A single 16-bit code unit can hold values 0-65,535, which covers the "Basic Multilingual Plane" (U+0000–U+FFFF). Unicode, however, defines 1,114,112 code points (U+0000–U+10FFFF). To encode characters above U+FFFF, UTF-16 uses two "surrogate" code units paired:

High surrogate: in range U+D800–U+DBFF
Low surrogate: in range U+DC00–U+DFFF

A pair encodes one character. Example: 😀 (U+1F600) becomes 0xD83D 0xDE00 in UTF-16.
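
The arithmetic behind that example can be sketched as follows: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. toSurrogatePair is a hypothetical helper written for illustration; in real code, String.fromCodePoint does this for you.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair is an illustrative helper, not a built-in.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// Cross-check against what the JavaScript engine stores for the same character.
console.log("\u{1F600}".charCodeAt(0) === high);  // true
console.log("\u{1F600}".charCodeAt(1) === low);   // true
```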

Endianness and BOM

UTF-16 has two byte orderings:

UTF-16BE (big-endian) — most-significant byte first: 0xD83D = D8 3D
UTF-16LE (little-endian) — least-significant byte first: 0xD83D = 3D D8

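A BOM (Byte Order Mark, U+FEFF) at the start of a file declares which one: it appears as the bytes FE FF in big-endian and FF FE in little-endian. Without a BOM or an external declaration (such as a charset label), the reader has to guess. UTF-8 has no endianness issue, because its code units are single bytes.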
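
To make the two byte orders concrete, here is a small sketch that serializes a string's code units with DataView and checks the first two bytes for a BOM. encodeUtf16, detectUtf16Bom, and toHex are hypothetical helpers, not library functions.

```javascript
// Write each UTF-16 code unit of a string as two bytes in the chosen byte order.
function encodeUtf16(text, littleEndian) {
  const bytes = new Uint8Array(text.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), littleEndian);
  }
  return bytes;
}

// Look at the first two bytes for a UTF-16 BOM; returns null when none is present.
function detectUtf16Bom(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  }
  return null;
}

const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

const withBom = "\uFEFF\u{1F600}"; // BOM followed by 😀
console.log(toHex(encodeUtf16(withBom, false))); // fe ff d8 3d de 00
console.log(toHex(encodeUtf16(withBom, true)));  // ff fe 3d d8 00 de
console.log(detectUtf16Bom(encodeUtf16(withBom, true))); // UTF-16LE
```

One caveat with this simple check: the bytes FF FE followed by 00 00 are also the UTF-32LE BOM, so a more careful detector would look at four bytes before deciding.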

When you'll encounter UTF-16

JavaScript string operations — substring, slice, indexOf operate on UTF-16 code units
Java APIs — String.charAt() returns a 16-bit char
Windows file paths — internally UTF-16
Older XML / .NET file formats — sometimes UTF-16-encoded by default
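
The first item is the one that most often causes bugs in practice. The sketch below shows slice cutting a surrogate pair in half, alongside one possible code-point-aware alternative; sliceCodePoints is a hypothetical helper, and it still does not keep multi-code-point grapheme clusters together.

```javascript
const text = "a\u{1F600}b"; // "a😀b": 4 code units, 3 code points

// slice() counts code units, so index 2 falls in the middle of the emoji.
const broken = text.slice(0, 2);
console.log(broken.length);                     // 2
console.log(broken.charCodeAt(1).toString(16)); // d83d (a lone high surrogate)

// indexOf() also reports code-unit positions: "b" is the 3rd character but sits at index 3.
console.log(text.indexOf("b"));                 // 3

// Illustrative alternative: slice by code points instead of code units.
function sliceCodePoints(s, start, end) {
  return Array.from(s).slice(start, end).join("");
}

console.log(sliceCodePoints(text, 0, 2) === "a\u{1F600}"); // true
```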

Conversion gotchas

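btoa() in JS only accepts characters in the Latin-1 range (U+0000–U+00FF) and throws on anything else. To Base64-encode a Unicode string, encode it to UTF-8 bytes first. The older idiom btoa(unescape(encodeURIComponent(s))) works but relies on the deprecated unescape(); TextEncoder is the modern route.
String length ≠ file size. A string whose .length is 100 occupies 200 bytes in UTF-16 (plus 2 if a BOM is written), but anywhere from 100 to 300 bytes in UTF-8 depending on content.
Iterating by character requires care. for (const c of "😀") in JavaScript iterates by code point, so it keeps surrogate pairs together; indexed loops with charAt() can split a surrogate pair. Neither approach groups combining marks or ZWJ sequences into a single visible character.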
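
For the Base64 gotcha, one way to go through UTF-8 bytes with TextEncoder is sketched below. It assumes global btoa, atob, TextEncoder, and TextDecoder (browsers and recent Node.js); base64EncodeUtf8 and base64DecodeUtf8 are hypothetical helper names.

```javascript
// Base64-encode any Unicode string by first converting it to UTF-8 bytes.
function base64EncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text);
  // btoa() expects a "binary string": one character per byte, each in 0x00..0xFF.
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Reverse the process: Base64 -> bytes -> UTF-8 decode.
function base64DecodeUtf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(base64EncodeUtf8("\u00E9")); // w6k=  (é is C3 A9 in UTF-8)
console.log(base64DecodeUtf8(base64EncodeUtf8("Caf\u00E9 \u{1F600}")) === "Caf\u00E9 \u{1F600}"); // true
```

Building the binary string one character at a time is simple but slow for large inputs; chunked conversion would be a reasonable refinement.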

Related tools

Convert text to binary or hex to inspect raw byte representations: text to binary, text to hex. Count characters with proper Unicode awareness: character counter.
