What Is UTF-16? (Java, JavaScript, Windows Use It)

UTF-16 encodes Unicode using 16-bit code units (2 or 4 bytes per character). Native to JavaScript strings, Java, and Windows APIs. Plain explanation of surrogate pairs.

Short answer

UTF-16 is a Unicode encoding that uses 16-bit code units. Characters in the Basic Multilingual Plane (Latin letters, the CJK characters in everyday use) take one 16-bit unit (2 bytes); characters outside it (most emoji, some historical scripts) take two units, 4 bytes in total, called a "surrogate pair." JavaScript strings, Java strings, and Windows APIs all expose text as UTF-16 code units.

UTF-16 vs UTF-8

                          UTF-8                     UTF-16
Code unit                 8 bits (1 byte)           16 bits (2 bytes)
Bytes per char (ASCII)    1                         2
Bytes per char (CJK)      3                         2
Bytes per char (emoji)    4                         4 (surrogate pair)
ASCII-compatible          Yes                       No
Endianness matters?       No                        Yes (BE vs LE; needs BOM)
Used by                   Web, Linux, macOS files   JS strings, Java, Windows API
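
The byte counts in the comparison can be checked directly. The following is a minimal sketch, assuming a runtime with the standard TextEncoder (modern browsers, Node.js); the helper names utf8ByteLength and utf16ByteLength are made up for this illustration.

```javascript
// Hypothetical helpers for illustration; not part of any library.

// Bytes needed in UTF-8, measured with the standard TextEncoder.
function utf8ByteLength(s) {
  return new TextEncoder().encode(s).length;
}

// Bytes needed in UTF-16 without a BOM: every code unit is 2 bytes,
// and String.prototype.length already counts code units.
function utf16ByteLength(s) {
  return s.length * 2;
}

for (const sample of ["A", "中", "😀"]) {
  console.log(sample, utf8ByteLength(sample), utf16ByteLength(sample));
}
// A 1 2
// 中 3 2
// 😀 4 4
```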

The "JavaScript .length lies" problem

JavaScript strings are sequences of UTF-16 code units, not characters. For most characters this is fine — "Hello".length is 5. But:

"😀".length            // 2 (surrogate pair, but "1 character")
"👨‍👩‍👧‍👦".length      // 11 (multiple emoji + zero-width joiners)
"Café".length          // 4 if é is a single code point, 5 if é is "e" + combining accent

For visible-character counts, use Intl.Segmenter, or our character counter, which handles this correctly.
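
As a rough sketch of the Intl.Segmenter approach, assuming a runtime that ships it (current browsers, Node.js 16+): the function name countGraphemes and the default "en" locale are choices made for this example only.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters)
// rather than UTF-16 code units.
function countGraphemes(s, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(s)) count++;
  return count;
}

// Family emoji written with escapes so the zero-width joiners are visible.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
// "Café" spelled with "e" + U+0301 (combining acute accent).
const cafe = "Cafe\u0301";

console.log("😀".length, countGraphemes("😀"));   // 2 1
console.log(family.length, countGraphemes(family)); // 11 1
console.log(cafe.length, countGraphemes(cafe));     // 5 4
```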

Surrogate pairs explained

A single 16-bit code unit can hold values 0 to 65,535, which covers the "Basic Multilingual Plane" (U+0000 to U+FFFF). Unicode, however, defines 1,114,112 code points (U+0000 to U+10FFFF). To encode characters above U+FFFF, UTF-16 uses two "surrogate" code units as a pair:

  • High surrogate: in range U+D800–U+DBFF
  • Low surrogate: in range U+DC00–U+DFFF

A pair encodes one character. Example: 😀 (U+1F600) becomes 0xD83D 0xDE00 in UTF-16.
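
The arithmetic behind that example can be sketched as follows: subtract 0x10000 from the code point, put the top 10 bits of the result in the high surrogate and the bottom 10 bits in the low surrogate. The function names toSurrogatePair and fromSurrogatePair are hypothetical, chosen for this illustration.

```javascript
// Hypothetical helpers illustrating the UTF-16 surrogate-pair arithmetic.

// Split a code point above U+FFFF into [high, low] surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));      // d83d de00
console.log(fromSurrogatePair(high, low).toString(16)); // 1f600
```

In everyday code the built-ins String.fromCodePoint() and codePointAt() do this work; the manual version is shown only to make the bit layout visible.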

Endianness and BOM

UTF-16 has two byte orderings:

  • UTF-16BE (big-endian) — most-significant byte first: 0xD83D = D8 3D
  • UTF-16LE (little-endian) — least-significant byte first: 0xD83D = 3D D8

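A BOM (Byte Order Mark, U+FEFF) at the start of a file declares which one: the bytes FE FF indicate big-endian, FF FE little-endian. Without a BOM or an external label such as a charset declaration, you have to guess. UTF-8 has no endianness issue.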
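
To see both byte orders side by side, here is a small sketch built on the standard DataView API, which takes an explicit little-endian flag. The helpers encodeUtf16, detectUtf16Bom, and hex are hypothetical names for this example, and the BOM check is deliberately simplistic (it does not try to tell UTF-16LE apart from UTF-32LE, whose BOM starts with the same two bytes).

```javascript
// Hypothetical helpers for illustration; not a library API.

// Serialize a JS string to UTF-16 bytes in the requested byte order.
function encodeUtf16(s, littleEndian, withBom = false) {
  const units = withBom ? "\uFEFF" + s : s;
  const bytes = new Uint8Array(units.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < units.length; i++) {
    view.setUint16(i * 2, units.charCodeAt(i), littleEndian);
  }
  return bytes;
}

// Inspect the first two bytes for a UTF-16 BOM; null means "no BOM found".
function detectUtf16Bom(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  }
  return null;
}

// Format bytes as space-separated hex for printing.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(encodeUtf16("😀", false)));             // d8 3d de 00
console.log(hex(encodeUtf16("😀", true)));              // 3d d8 00 de
console.log(detectUtf16Bom(encodeUtf16("😀", true, true))); // UTF-16LE
```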

When you'll encounter UTF-16

  • JavaScript string operations — substring, slice, indexOf operate on UTF-16 code units
  • Java APIs — String.charAt() returns a 16-bit char
  • Windows file paths — internally UTF-16
  • Older XML / .NET file formats — sometimes UTF-16-encoded by default
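
The first item in this list is the one that most often causes bugs in practice. The sketch below shows slice() cutting a surrogate pair in half, followed by one possible workaround that indexes by code point instead; sliceByCodePoint is a hypothetical helper, and it still does not account for combining marks or ZWJ sequences.

```javascript
const text = "😀abc";

// slice() counts UTF-16 code units, so index 1 falls inside the emoji.
const broken = text.slice(0, 1);
console.log(broken.length);                     // 1
console.log(broken.charCodeAt(0).toString(16)); // d83d (a lone high surrogate)

// indexOf() also reports positions in code units: "a" is at 2, not 1.
console.log(text.indexOf("a")); // 2

// Hypothetical helper: slice by code point rather than by code unit.
// Array.from() uses the string iterator, which keeps surrogate pairs together.
function sliceByCodePoint(s, start, end) {
  return Array.from(s).slice(start, end).join("");
}

console.log(sliceByCodePoint(text, 0, 1)); // 😀
console.log(sliceByCodePoint(text, 1, 3)); // ab
```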

Conversion gotchas

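  • btoa() in JS only accepts characters in the Latin-1 range (U+0000 to U+00FF) and throws on anything else. To Base64-encode a Unicode string, encode it to UTF-8 bytes first, either with TextEncoder or with the older btoa(unescape(encodeURIComponent(s))) idiom (unescape is deprecated).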
  • String length ≠ file size. A string of 100 code points takes 200 to 400 bytes in UTF-16 (2 bytes each, 4 for characters outside the BMP) and 100 to 400 bytes in UTF-8, depending on content.
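  • Iterating by character requires care. for (const c of "😀") in JavaScript iterates by code point, so it keeps surrogate pairs intact; indexed loops with charAt() can split a surrogate pair. Neither approach groups combining marks or zero-width-joiner sequences into one visible character.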
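
The TextEncoder route for Base64 can be sketched like this, assuming a runtime with global btoa, atob, TextEncoder, and TextDecoder (browsers, Node.js 16+). The function names utf8ToBase64 and base64ToUtf8 are invented for this example.

```javascript
// Hypothetical helpers; one possible way to Base64-encode Unicode text.

// Encode the string as UTF-8 bytes, then hand btoa() a "binary string"
// in which every character is a single byte value (0..255).
function utf8ToBase64(s) {
  const bytes = new TextEncoder().encode(s);
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Reverse the process: Base64 -> byte values -> UTF-8 decode.
function base64ToUtf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64("é"));                     // w6k=
console.log(base64ToUtf8(utf8ToBase64("Café 😀"))); // Café 😀
```

Building the binary string byte by byte avoids the argument-count limits that String.fromCharCode(...bytes) can hit on large inputs.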

Related tools

Convert text to binary or hex to inspect raw byte representations: text to binary, text to hex. Count characters with proper Unicode awareness: character counter.
