This is part 2 of a 5-part series on character encodings in international data journalism. Check out the other pieces.
Now that you know the basics of bits and binary, we can move on to representing characters on a screen.
Binary is a way of counting that only uses `0` and `1`! It looks kind of like this:
Binary number | No. of 2s | No. of 1s | Represents | Decimal |
---|---|---|---|---|
`00` | 0 | 0 | 0 + 0 | 0 |
`01` | 0 | 1 | 0 + 1 | 1 |
`10` | 1 | 0 | 2 + 0 | 2 |
`11` | 1 | 1 | 2 + 1 | 3 |
Each of those binary digits is called a bit.
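If you want to play along at home, Python's built-in binary conversions can double-check that table for us - a quick sketch, nothing beyond the standard library:

```python
# Check the table above: format each number as two binary digits.
for n in range(4):
    print(f"{n:02b} represents {n}")   # 00, 01, 10, 11

# And go the other way: parse a binary string back into a number.
print(int("11", 2))                    # 3
```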
Except for very boring theoretical physicists, humans generally communicate using characters, not numbers. But since computers only understand numbers, we’re going to need a system to convert numbers to typed characters. Luckily, this has already been done for us.
Once upon a time a bunch of Americans got together and decided on a system that represented all of the important stuff.
The people in charge were very excited to have created this system, which is odd because they had just copied it right from telegraph codes. Everything’s got to come from somewhere, I guess? This system they created was called ASCII (ASS-kee), the American Standard Code for Information Interchange, and here are two true facts about it:
This mapping of numbers to letters is called a character set, and you’ll see that ASCII caused the creation of quite a handful. Let’s see what happened!
ASCII is very simple because it only uses the numbers 0-127 to represent its entire character set (that’s 128 different numbers, remember to count the 0). Here are a few sample sections:
Binary | Decimal | ASCII character |
---|---|---|
`010 1001` | 41 | ) |
`010 1010` | 42 | * |
`010 1011` | 43 | + |
… | … | … |
`100 0001` | 65 | A |
`100 0010` | 66 | B |
`100 0011` | 67 | C |
… | … | … |
`110 0111` | 103 | g |
`110 1000` | 104 | h |
`110 1001` | 105 | i |
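You can poke at this mapping yourself: Python's built-in `ord` and `chr` translate between code points and characters, no table lookup required.

```python
print(ord("A"))         # 65: the code point for A
print(chr(65))          # A: the character at code point 65
print(chr(0b1101001))   # i: binary 110 1001 is decimal 105
```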
Why 128? In binary, 0 through 127 is `000 0000` to `111 1111`, a nice set of 7 bits. It seemed to fit all the characters you'd ever need - lowercase, uppercase, numbers, punctuation - and didn't waste any space in the process. The more bits you take up, the larger files and programs would be, so keeping it at 7 bits meant everything was nice and small.
ASCII was, above all else, efficient.
NOTE: Why does ASCII wait until 65 to get to `A`? The first 32 code points are "non-printable control characters," things like "this is a new line" that need a code point of their own just as much as any member of the alphabet does.
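A quick demonstration of those control characters, for the curious (the `repr` calls are just there so the invisible characters show up as escape codes):

```python
print(repr(chr(10)))   # '\n': code point 10 is "this is a new line"
print(repr(chr(9)))    # '\t': code point 9 is a tab
print(repr(chr(65)))   # 'A': the printable characters live further up
```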
Great, fantastic, problem solved! What could ever go wrong? Who on earth would ever use anything other than just those English letters? Let’s just all get married to ASCII and be done with it.
Unfortunately the rest of the world existed (still does), and it hadn’t planned on leaving anytime soon (still hasn’t). This was to prove problematic.
The big issue was that ASCII couldn't add any new characters. It had its non-accented English alphabet, and that was all it could fit. Each of those 128 numbers was spoken for, assigned to uppercase letters or punctuation or spaces or whatever else. Nothing new. No way, no how. Not possible.
But how would the French wish you bon appétit? Could Icelanders ever talk about hákarl again? Would the Japanese just have to type out “konnichiwa” instead of the delightful-looking こんにちは? Only having plain English letters was trouble.
And if — god forbid — you wanted to count above 127 to make a little room for more letters, you'd have to upgrade from a maximum of `111 1111` (127, 7 bits) to `1111 1111` (255, 8 bits). That meant adding a whole extra bit, and changing the standard from 7 bits to 8 bits. And that was out of the question, since standards just can't change.
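The arithmetic behind that upgrade, if you want to see it on screen:

```python
print(0b1111111)    # 127: the biggest number 7 bits can hold
print(0b11111111)   # 255: the biggest number 8 bits can hold
print(2 ** 7)       # 128 possible code points with 7 bits
print(2 ** 8)       # 256 possible code points with 8 bits
```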
European nations made it work for them by personalizing ASCII, replacing a few symbols here and there with the characters they’d need.
Every one of these numbers, 0-127, is called a code point. Code points are the map between numbers and symbols, and Europe was about to shuffle them all around.
In ASCII, the code point 92 was `\`, which Sweden didn't think they needed, so they changed 92 to mean the `Ö` character. Germans threw out the `{` in favor of an `ä`, France replaced the `@` with an `à`, and eventually there were dozens of variants. Canada had two to itself!
Russia, which uses the Cyrillic alphabet instead of the Latin alphabet, went ahead and replaced all of the code points for lowercase Latin characters with uppercase Cyrillic†. I’d like to think this permanent capslock was a major factor in the Cold War (“ЗДРАВСТВУЙТЕ, KENNEDY!”).
Let’s take an example from Wikipedia: say you typed “No, I have sandwiches” in Swedish, but your message was read by people using other regional variants of ASCII.
Official Character Set Code | Name | Text |
---|---|---|
ISO646-SE | Swedish ASCII | Nä jag har smörgåsar |
ISO646 | US ASCII | N{ jag har sm\|rg}sar |
ISO646-ES | Spanish ASCII | N° jag har smñrgçsar |
ISO646-NO | Norwegian ASCII | Næ jag har smørgåsar |
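Python doesn't ship codecs for these old ISO 646 national variants, but we can fake the Swedish one with a hand-built substitution table. The six remapped code points below come from the variant described above; the helper function is just something I made up for illustration:

```python
# The six code points Sweden remapped (everything else matches US-ASCII).
SWEDISH_REMAP = {0x5B: "Ä", 0x5C: "Ö", 0x5D: "Å",
                 0x7B: "ä", 0x7C: "ö", 0x7D: "å"}

def decode_swedish_ascii(raw: bytes) -> str:
    """Read bytes as the Swedish ISO 646 variant instead of US-ASCII."""
    return "".join(SWEDISH_REMAP.get(b, chr(b)) for b in raw)

raw = b"N{ jag har sm|rg}sar"      # the bytes a Swede actually typed
print(raw.decode("ascii"))         # N{ jag har sm|rg}sar (the US reading)
print(decode_swedish_ascii(raw))   # Nä jag har smörgåsar (the Swedish reading)
```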
This meant that every time you sent a message or a file to someone in another country, you didn't just have to worry about the language barrier - you also had to make sure the encoding was exactly right. Otherwise your code points would get all mixed up! That's a lot of trouble, right?
Don’t get too depressed yet, because it only gets worse (…kind of).
Despite all of the other digital ink we'll spill, this is still a problem. When you open a file, if you don't get the character set (a.k.a. charset) right, you'll end up with all sorts of weird characters substituting for the more uncommon ones. Slashes, pipes, accent marks, and special kinds of spaces are all especially prone to this.
† Russia only using capital letters is a bit of an exaggeration. Although they didn’t have lowercase and uppercase, they replaced another two codes to mean “the text after this point is lowercase” and “the text after this point is uppercase.”
Eventually ANSI - they're the ones in charge of this - did add in that other binary digit to make a new standard called Extended ASCII, an 8-bit character encoding. Now that we could count up to `1111 1111`, the range upgraded from 0-127 to 0-255, which meant you could include another 128 brand new code points (and, as a result, 128 brand new characters).
What characters? It wasn't the same for everyone, as there still wasn't enough space to cover all the world's languages. In the same way that ASCII had fragmented into a million regional versions, the trend continued with Extended ASCII.
Typically the first 128 code points stayed the same as (or similar to) US-ASCII, but the extended region of 128 new characters split off into variant scripts.
Code Point | ISO 8859-1 (Latin 1) | ISO 8859-2 (Latin 2) | ISO 8859-5 (Latin/Cyrillic) | JIS X 0201 (Japanese) | TSCII (Tamil) |
---|---|---|---|---|---|
70 | F | F | F | F | F |
71 | G | G | G | G | G |
72 | H | H | H | H | H |
… | … | … | … | … | … |
216 | Ø | Ř | и | リ | ழு |
217 | Ù | Ů | й | ル | ளு |
218 | Ú | Ú | к | レ | று |
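Python still ships codecs for most of these character sets (TSCII is a notable exception), so you can watch a single byte change identity - a quick sketch:

```python
# One byte, four readings: code point 216 under different charsets.
raw = bytes([216])
for codec in ["iso-8859-1", "iso-8859-2", "iso-8859-5", "shift_jis"]:
    print(codec, raw.decode(codec))
# iso-8859-1 Ø
# iso-8859-2 Ř
# iso-8859-5 и
# shift_jis ﾘ   (the halfwidth form of リ)
```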
Even for those who tried to adapt, the new system caused problems: Japanese has three alphabets*, each with at least 46 characters. Since there was no way all of that could fit into Extended ASCII's 128 new spots, in the beginning Japan just completely gave up two of its alphabets and only included one in the character set.
Chinese is an even bigger issue: to read a newspaper in Chinese you need to know about 2,000 different characters. Overall the language has more than fifty thousand possible characters. So yes, that's not fitting into 128 new spots (or even 256 total).
And with modern-day 20/20 hindsight: on top of all of this, where do we put the emoji?
This extra space to play around in led to many, many years of many, many language-specific versions of ASCII. When you opened a file, you needed to know exactly what encoding it was. Whereas before there were only a few characters that might conflict (see the Swedish example above), now a whole half of the character set might not match up.
If you got your character set wrong, the first 128 ASCII characters might all look fine, but your `カ`s might be `¶`s and your `ň`s might be `ò`s and you'd never ever get any real work done.
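That カ-to-¶ mix-up is real, by the way - Shift-JIS stores the halfwidth form of the character as a single byte, and Latin-1 reads that very same byte as a pilcrow:

```python
raw = "ｶ".encode("shift_jis")   # halfwidth katakana KA becomes b'\xb6'
print(raw.decode("latin_1"))    # ¶: same byte, wrong character set
```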
Encoding | Input | US-ASCII Output |
---|---|---|
Shift-JIS (Japanese) | それはあなたの鉛筆ですか | ????͂??Ȃ??̉??M?ł??? |
KOI8-R (Russian) | Здравствуйте, я медведь | ????????????, ? ??????? |
ISO 8859-15 (Latin 9, Swedish) | Jag förstår inte | Jag f?rst?r inte |
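That last row is easy to reproduce: force non-ASCII text through US-ASCII with replacement, and every character that doesn't fit becomes a question mark. A minimal sketch:

```python
text = "Jag förstår inte"
mangled = text.encode("ascii", errors="replace").decode("ascii")
print(mangled)   # Jag f?rst?r inte
```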
It's a wreck, right? Take a look at the 16 different versions of the ISO 8859 standard to see what kind of terrifying variations are in there. If you open a file and see gibberish, I guarantee this is what's happening.
This definitely still happens today, but there’s a light at the end of the tunnel. If 7 bits isn’t enough, and 8 bits isn’t enough, is 9 the next step? No ma’am, we’re going to conquer the Tower of Babel with thirty-two bits.
* Yes, technically speaking Japanese doesn’t have three alphabets, it has two syllabaries and one set of logograms.
Once upon a time some Americans thought 128 characters would be enough for everyone, and made a list of which numbers meant which characters. This was called ASCII.
Other countries found `ö` much more useful than `|`, so they shuffled around the US version of ASCII to create their own encodings. As a result, if you guess the wrong encoding for a file you'll end up looking at all sorts of weird characters.
Then a new version of ASCII was made with a whole 256 characters. Every country took those extra 128 and ran with them, creating dozens of different character sets with even more chances for perilous textual conflicts.
Think you’ve got it? Here’s a tiny quiz:
What is `ISO 8859-1` better known as? And in whose version of ASCII did `\` become `Ö`? (Answers are below.)

Now you can head back to UTF-8, Unicode, Character Encoding, and International Data Journalism → to learn about the 32-bit power of Unicode.