UTF-8 vs ASCII vs Unicode: What's the Actual Difference?
Unicode is the character set. UTF-8 is the most common encoding. ASCII is a 7-bit subset. Clear explanation of what each term actually means, with code examples.
Three terms, three levels
The confusion comes from treating these as interchangeable when they operate on different levels:
- Unicode is a catalog. It assigns a number (code point) to every character — A is U+0041, 你 is U+4F60, 🙂 is U+1F642.
- UTF-8 is an encoding. It defines how to turn those code points into bytes on disk or over the wire.
- ASCII is an older, smaller catalog (128 characters) that happens to be a strict subset of Unicode AND encodes identically in UTF-8 for those 128 characters.
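The catalog/encoding split is easy to see in Python, where ord() returns the Unicode code point and str.encode() returns the UTF-8 bytes. A minimal sketch using the three characters above:

```python
# Code point (Unicode catalog) vs. bytes (UTF-8 encoding).
for ch in ["A", "你", "🙂"]:
    code_point = ord(ch)             # position in the Unicode catalog
    utf8_bytes = ch.encode("utf-8")  # how that position is serialized
    print(f"{ch!r}: U+{code_point:04X} -> {utf8_bytes.hex(' ')} ({len(utf8_bytes)} bytes)")
```

Note that the code point and the encoded bytes only coincide in the ASCII range: "A" is U+0041 and also the single byte 0x41, while 🙂 is U+1F642 but serializes as four different bytes.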
Side-by-side
| | ASCII | Unicode | UTF-8 |
|---|---|---|---|
| Is a... | Character set + encoding | Character set | Encoding |
| Characters | 128 | ~150K assigned (1,114,112 possible code points) | Encodes all of Unicode |
| Bytes per char | 1 | N/A (it's a catalog) | 1-4 |
| Year | 1963 | 1991 | 1993 |
| Covers English | Yes | Yes | Yes |
| Covers emoji, Chinese, Arabic | No | Yes | Yes |
Why UTF-8 won
UTF-8 is the dominant encoding on the web (98%+ of pages) for three reasons:
- ASCII-compatible. The first 128 code points encode to the exact same byte as ASCII. Any ASCII file is already a valid UTF-8 file.
- Variable width. Common characters (English) use 1 byte; rarer characters use 2-4. Files stay small for Latin-alphabet content.
- No byte order ambiguity. Unlike UTF-16, UTF-8 has no endianness — the same bytes mean the same thing on every machine.
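The first two points can be checked directly. A small Python sketch (the sample strings are arbitrary):

```python
# 1. ASCII compatibility: pure-ASCII bytes decode identically as ASCII or UTF-8.
ascii_bytes = b"plain old text"
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# 2. Variable width: the byte count grows only where non-ASCII characters appear.
for text in ["hello", "héllo", "h🙂llo"]:
    print(text, len(text), "chars ->", len(text.encode("utf-8")), "bytes")
```

All three strings are five characters, but they encode to 5, 6, and 8 bytes respectively — only the é and the emoji pay the multi-byte cost.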
UTF-16 vs UTF-8
JavaScript strings, Java strings, and Windows APIs use UTF-16 internally; UTF-8 is what files and HTTP bodies use. The conversion between them is cheap but not free. UTF-16's surrogate pairs are also why string.length in JavaScript gives surprising results for emoji: most emoji are two UTF-16 code units but one user-perceived character. The character counter handles this correctly.
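The mismatch is easy to reproduce. Python counts code points, so the UTF-16 code-unit count that JavaScript's string.length would report has to be recovered from the encoded byte length — a sketch:

```python
s = "🙂"
print(len(s))                                   # 1 — Python counts code points
utf16_units = len(s.encode("utf-16-le")) // 2   # each UTF-16 code unit is 2 bytes
print(utf16_units)                              # 2 — what JS string.length reports
print(len(s.encode("utf-8")))                   # 4 — bytes on the wire
```

Same character, three different "lengths" depending on which level you measure at.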
Common encoding bugs
- Mojibake — an "é" turning into "Ã©" is UTF-8 bytes interpreted as Latin-1. Almost always means someone read the file in the wrong encoding.
- BOM issues — the optional byte-order-mark at the start of a UTF-8 file breaks some tools that assume ASCII.
- Truncation — splitting a UTF-8 string by byte count can cut a multi-byte character in half, producing invalid UTF-8. Always split on code point (or, better, grapheme cluster) boundaries.
- Base64 — Base64 encodes bytes; for Unicode strings, encode to UTF-8 bytes first.
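Three of these bugs can be reproduced in a few lines of Python (the sample strings are arbitrary):

```python
import base64

# Mojibake: UTF-8 bytes wrongly decoded as Latin-1.
assert "é".encode("utf-8").decode("latin-1") == "Ã©"

# Truncation: cutting at a byte count can land mid-character.
data = "café".encode("utf-8")  # 5 bytes: the é takes two
try:
    data[:4].decode("utf-8")   # splits the é in half
except UnicodeDecodeError:
    print("truncated mid-character")

# Base64 operates on bytes, so encode the string to UTF-8 first.
assert base64.b64encode("é".encode("utf-8")) == b"w6k="
```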
Inspecting encoded bytes
To see the actual bytes a string uses, paste it into text-to-binary or text-to-hex. You'll see "A" produces one byte (0x41), while "é" produces two (0xC3 0xA9), and "🙂" produces four (0xF0 0x9F 0x99 0x82).
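The same byte sequences fall out of Python's bytes.hex(), if you'd rather inspect them locally — a sketch of what those tools display:

```python
# Dump the UTF-8 bytes of each character as hex.
for ch in ["A", "é", "🙂"]:
    print(repr(ch), "->", ch.encode("utf-8").hex(" "))
# 'A' -> 41, 'é' -> c3 a9, '🙂' -> f0 9f 99 82
```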
Rule of thumb
Always use UTF-8 for files, databases, APIs, URLs, and HTTP bodies. Always declare it: <meta charset="utf-8">, Content-Type: application/json; charset=utf-8. When you do that, Unicode "just works" for almost all real cases.
Featured Tools
Try these free related tools directly in your browser — no sign-up required.
Text to Binary
Convert plain text to binary code (0s and 1s) instantly. Each character is translated to its 8-bit ASCII binary representation.
Base64 Encoder / Decoder
Encode text or decode Base64 strings instantly online. Convert between plain text and Base64 encoding for data URLs, authentication headers, and API tokens.
URL Encoder / Decoder
Encode or decode URLs and query strings instantly. Convert special characters to percent-encoding and back for safe URL transmission and debugging.
Text to Hex
Convert plain text to hexadecimal encoding instantly. Each character is converted to its hex equivalent. Useful for debugging, encoding, and data analysis.