Unicode and UTF-8

This is part 3 of a 5-part series on character encodings in international data journalism. Check out the other pieces.

Review from The story of ASCII and character encodings

Once upon a time we only had ASCII, which was a 7-bit character set that supported 128 characters (code points 0-127). It wasn’t quite enough code points for all the world’s languages, though.

Then came Extended ASCII, which was 8 bits and 256 characters. Still not enough, oddly! It did provide a chance for everyone to make a thousand national variants, though, which still cause problems to this day.

Hexadecimal

Before we start talking about Unicode, we need to take a second to talk about hexadecimal! Hexadecimal (a.k.a. hex) is another numbering system like binary or decimal, except instead of using the digits 0-1 or 0-9, it uses 0-F.

Binary     Decimal   Hex
0000       0         0
0001       1         1
0010       2         2
0011       3         3
...
1000       8         8
1001       9         9
1010       10        A
1011       11        B
1100       12        C
1101       13        D
1110       14        E
1111       15        F
10000      16        10
10001      17        11
10010      18        12
...
11111110   254       FE
11111111   255       FF

A lot of the time when you’re talking about Unicode, you refer to code points with hex instead of decimal, just because the numbers are so big.

That means you’ll usually see something like 1F4B8 instead of 128184.

This makes sense because 1) it takes up a lot less space, and 2) every group of four binary digits maps to exactly one hex digit, so it’s easy to convert something like 1111 0000 1111 into F0F, even if you don’t know what the decimal number is.

NOTE: When people talk about different numbering systems, they sometimes signal which one they’re using with a little prefix: 0b means binary, and 0x means hex. So 0b1110110 would be binary, and 0xFF03A would be hexadecimal.
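
Want to play with these conversions yourself? Here’s a minimal sketch in Python (assuming Python 3), where the 0b and 0x prefixes work just like they do on paper:

    # Python understands the 0b (binary) and 0x (hex) prefixes directly
    print(0b1111)        # 15
    print(0xF0F)         # 3855
    print(0x1F4B8)       # 128184

    # bin() and hex() go the other way, from a plain number to the prefixed form
    print(bin(15))       # 0b1111
    print(hex(128184))   # 0x1f4b8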

Unicode

After the world got tired of 7- and 8-bit character encodings, they created a character set called Unicode. They didn’t just hop up to 9 bits, either. They put things into overdrive, my friends, and went right to thirty-two bits. Let’s examine the difference:

Number of bits   Largest Number (Binary)                     Range (Decimal)
7-bit            111 1111                                    0-127
8-bit            1111 1111                                   0-255
32-bit           1111 1111 1111 1111 1111 1111 1111 1111     0-4,294,967,295
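
If you’d like to double-check those ranges, a quick Python sketch (again assuming Python 3) does the arithmetic:

    # n bits can hold values from 0 up to 2**n - 1
    for bits in (7, 8, 32):
        print(f"{bits}-bit: 0 to {2 ** bits - 1:,}")
    # 7-bit: 0 to 127
    # 8-bit: 0 to 255
    # 32-bit: 0 to 4,294,967,295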

That’s a lot, right? Due to magic technical specifications, Unicode can’t actually use that entire range, but it does a pretty good job:

Encoding         Number of bits   Supported Code Points
ASCII            7                128
Extended ASCII   8                256
Unicode          32               1,114,112

Do we have a million different symbols and characters across the planet? No, not yet. Even the big hitter of Chinese only has around fifty thousand.

Currently, Unicode has code point mappings for over 120,000 characters, which means it’s at only about 10% of its capacity. I guess they learned their lesson and left some room to grow!

Along with characters from languages, emoji also have their own place in Unicode. 128,126 is “alien monster” 👾, and 128,540 is “face with stuck-out tongue and winking eye” 😜.

Emoji   Unicode Code Point   Name                      Code Point in Hex
😱      128561               face screaming in fear    U+1F631
🐘      128024               elephant                  U+1F418
💸      128184               money with wings          U+1F4B8

When you talk about Unicode code points with hex, you put a U+ in front of it so you know you’re talking Unicode. You can see the entire list over at Unicode.org’s Emoji List.
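
If you want to try those mappings out, Python’s built-in chr() and ord() hop between code points and characters. A small sketch, assuming Python 3:

    # chr() turns a code point into a character, ord() goes the other way
    print(chr(0x1F631))    # 😱  (U+1F631, decimal 128561)
    print(chr(128024))     # 🐘  (decimal works too)
    print(ord("💸"))       # 128184
    print(hex(ord("💸")))  # 0x1f4b8, a.k.a. U+1F4B8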

All of this extra space for more characters should really make our issues go away, but one question remains: if every symbol needs to have a number associated with it, how can we count that high?

Introducing: UTF-32, UTF-16, and UTF-8!

The UTF Crew

Lots of people still use ASCII because it’s nice and small, and was the default for many years. 32-bit numbers take up, well, 32 bits, while an 8-bit number can’t count as high, but only takes up 8 bits.

So: if you use an 8-bit encoding instead of a 32-bit encoding, your files will take up one-quarter as much space. That means a 5MB 8-bit file would be 20MB if it was saved as a 32-bit file.

UTF stands for the easy-to-remember “Universal Coded Character Set + Transformation Format,” and (sometimes) tries to solve that problem.

What’s the difference between Unicode and UTF-xx? Unicode is the system of code points that maps a number to a symbol. These UTF things are the format in which you store those numbers (it’ll hopefully make more sense when you read through).

UTF-32

When we wanted to get an ASCII A, we’d just use the binary for 65: 100 0001. If we used extended ASCII, it’s still 65, it just has an extra bit, so the binary number is a little longer: 0100 0001.

Unicode, though, can count really, really high. Instead of 7 or 8 bits, it supports 32, so 65 would look like 0000 0000 0000 0000 0000 0000 0100 0001. If we want a capital B, we’d need to count to 66, which is the equally-absurd 0000 0000 0000 0000 0000 0000 0100 0010. Just the word “Cat” would look something like:

UTF-32                                      Number   Character
0000 0000 0000 0000 0000 0000 0100 0011    67       C
0000 0000 0000 0000 0000 0000 0110 0001    97       a
0000 0000 0000 0000 0000 0000 0111 0100    116      t

Look at all those 0s lined up! This is called UTF-32, because you’re using all 32 bits to represent the number. It’s also a heck of a lot of extra space.
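
You can see all that padding for yourself. Here’s a small sketch (assuming Python 3.8 or newer for the spaced-out hex output) that encodes “Cat” as big-endian UTF-32 so the bytes match the table above:

    # Every character gets a full 4 bytes, most of them zero
    encoded = "Cat".encode("utf-32-be")
    print(len(encoded))       # 12 bytes for 3 characters
    print(encoded.hex(" "))   # 00 00 00 43 00 00 00 61 00 00 00 74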

UTF-16

If you’re a little worried about all of those seemingly-useless 0s at the beginning of every number, you might use UTF-16 instead.

UTF-16 is kind of a misnomer, because it doesn’t only use 16 bits; it uses something called variable-length encoding. Variable-length encoding lets you sometimes use 16 bits to represent a character, and sometimes combine two 16-bit numbers into one big 32-bit number. This means it’s usually just using 16 bits, so you’re saving all of the extra space that UTF-32 takes up.

UTF-32                                      UTF-16
0000 0000 0000 0000 0000 0000 0100 0011    0000 0000 0100 0011
0000 0000 0000 0000 0000 0000 0110 0001    0000 0000 0110 0001
0000 0000 0000 0000 0000 0000 0111 0100    0000 0000 0111 0100

I can’t for the life of me describe how the variable-length encoding of UTF-16 works, but you can read the Wikipedia page if you’d like to know more. There’s also a thing called UCS-2 that was a simpler version of UTF-16; the difference is that it wasn’t variable-length encoding (UCS-2 was stuck at 16 bits always and forever).
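
To make it a little more concrete, here’s a minimal sketch (assuming Python 3.8 or newer for the spaced-out hex output): “Cat” only needs one 16-bit unit per character, while an emoji that doesn’t fit in 16 bits gets split across two of them:

    # Plain Latin text: 2 bytes per character
    print("Cat".encode("utf-16-be").hex(" "))   # 00 43 00 61 00 74

    # An emoji above 16 bits: one character, two 16-bit units (a "surrogate pair")
    print("💸".encode("utf-16-be").hex(" "))    # d8 3d dc b8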

UTF-8

But the one you’ve probably heard of is UTF-8, which works in 8-bit chunks. It seems like that wouldn’t let you represent much at all, but you’d be wrong! Variable encoding to the rescue again.

NOTE: You probably don’t need to know all this! The takeaway is that usually UTF-8 is great unless you’re writing in Chinese, Korean or Japanese, in which case UTF-16 will actually take up less space.

Remember how ASCII was only 7 bits? Those seven bits of 100 0011 are 67, which translates to C. It wasn’t until extended ASCII that an eighth bit came into use.

UTF-8 is perfectly compatible with ASCII values, and it handles them beautifully. As long as the first bit (the 128s place) is zero, UTF-8 knows you’re only looking at ASCII values. It’s like having a 7-bit value with a 1-bit placeholder.

UTF-16                UTF-8       Number   ASCII
0000 0000 0100 0011   0100 0011   67       C
0000 0000 0110 0001   0110 0001   97       a
0000 0000 0111 0100   0111 0100   116      t
0000 0000 1111 0100   1111 0100   244      HOLD EVERYTHING

Once you add a 1 in that 128s place at the beginning of a UTF-8 number, things get crazy. UTF-8 goes into variable encoding mode and can suddenly combine 8-bit sections to create numbers larger than 8 bits.

So as long as you’re writing in ASCII you’ll never need to use those variable-length-encoding codes, and as a result your file stays at an 8-bit size. To get the details of how the variable-length encoding works, read this great piece (although it doesn’t agree that UTF-16 is variable-length).

If you wind up using a few emoji here or there, UTF-8 will add the magic 1 at the beginning of an 8-bit section and voilà, it understands you’re after something 16- or 32-bit, and saves the number appropriately. More special codes explain how many more bits you’re looking for.
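
Here’s roughly what that looks like in practice, as a small sketch (assuming Python 3.8 or newer): the further a character gets from plain ASCII, the more 8-bit sections UTF-8 spends on it:

    # How many bytes UTF-8 spends on characters of increasing "height"
    for ch in ("C", "é", "中", "💸"):
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), encoded.hex(" "))
    # C 1 43
    # é 2 c3 a9
    # 中 3 e4 b8 ad
    # 💸 4 f0 9f 92 b8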

When you’re saving something more than 7 bits (code points greater than 127) with UTF-8, there’s a little bit of extra room taken up, i.e. it takes more than 16 bits to store a 16-bit character. That room is 1) the signal that you’re about to store a bigger number, and 2) the metadata about how big that bigger number is.

Usually this isn’t a big deal. If you splash an emoji in here or there in what’s normally un-accented English (standard ASCII), most characters only take up 8 bits.

But! Things get complicated if you’re using a CJK language (Chinese, Japanese, Korean). None of those characters are in the ASCII/7-bit space, so you’ll need to use the variable-length signal 1 again and again and again, along with adding in “how many 8-bit sections” codes. This winds up taking up a lot of space!

Because there’s no reason to waste your time yelling “this next piece is going to be 16-bits” over and over again, in those sorts of situations it’s usually better to use UTF-16. It ends up saving a lot of space that is taken up by those variable-length codes.
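
Here’s a rough sketch of that trade-off (assuming Python 3, with a couple of made-up sample strings):

    # Chinese characters: 3 bytes each in UTF-8, only 2 each in UTF-16
    chinese = "汉字" * 1000
    print(len(chinese.encode("utf-8")))      # 6000
    print(len(chinese.encode("utf-16-le")))  # 4000

    # Plain ASCII text flips the comparison
    english = "data" * 1000
    print(len(english.encode("utf-8")))      # 4000
    print(len(english.encode("utf-16-le")))  # 8000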

Generally speaking, though, UTF-8 is gr8.

TL;DR

Once upon a time, ASCII was invented. It was fine, but caused a lot of trouble.

Then one day Unicode showed up, which currently represents about 120,000 characters but could theoretically represent over a million. Instead of 7 or 8 bits, it uses 32 bits to represent each of those characters. Most of the time you don’t need to count to the higher numbers, though, so it creates a lot of wasted space.

To combat the size of 32-bit UTF-32 files, UTF-8 cheats and lets you abbreviate away some of the 0s for smaller numbers. UTF-16 does this to a smaller degree, but is probably better for non-Latin languages.

Comprehension Quiz

Think you’ve got it? Here’s a tiny quiz:

  1. What standard does Unicode easily replace?
  2. How many code points does Unicode support? For extra credit, how many does it actually use?
  3. DEADBEEF isn’t just a creepy word, it’s also the number 3,735,928,559. What numbering system is DEADBEEF in? That numbering system goes from 0 to what?
  4. If I’m generally writing in English, what encoding system should I use to save my Unicode files?
  5. When might you use UTF-16?

Answers are below. I’ve typed out the numbers so it’s a little harder to accidentally cheat.

  1. ASCII
  2. It supports over one million, but only uses about 120,000.
  3. Hexadecimal, which goes from 0 to F.
  4. UTF-8
  5. If you’re writing in Chinese, Japanese or Korean

Next steps

Head back to UTF-8, Unicode, Character Encoding, and International Data Journalism → to grab the next section!

Want to hear when I release new things?
My infrequent and sporadic newsletter can help with that.