Characters in the wild

This is part 4 of a 5-part series on character encodings in international data journalism. Check out the other pieces.

When I was making my friend proofread the Unicode one, she wrote in all caps halfway through: WHY DOESN’T EVERYONE JUST USE UNICODE?? Good question!

This piece won’t necessarily answer the question, but it will help you get around the problems that you encounter because not everyone uses Unicode and UTF-8.

NOTE: We talk about specific character encodings a lot in this one! Just a reminder: ISO 8859 is a family of Extended ASCII standards, and the variations ISO 8859-1, ISO 8859-6, ISO 8859-15, etc., are regional adaptations that swap different characters into the upper 128 slots.

Review from "Unicode and UTF-8 save the day"

ASCII is a very simple 7-bit encoding for representing characters (or 8-bit, if you’re counting Extended ASCII). It’s absolutely fantastic until you leave the continental United States, at which point everything goes wrong: it can’t support accents without terrible amounts of hacking, and it’s probably the #1 impediment to modern cross-cultural exchange.

Then Unicode showed up on the scene, a wonderful system for mapping numbers to over a million possible characters. The UTF-xx encodings are methods for saving those numbers as bytes, with UTF-16 and UTF-8 using clever tricks to save space. UTF-8 is pretty standard, although UTF-16 shows up a lot with Asian languages, since it stores those characters more compactly.

Character Encodings on the Web

Back in the olden days before HTML5 (pre-2012), web pages defaulted to the Extended ASCII character set known as ISO 8859-1. This is also known as “Latin 1,” and is focused on English and Western European languages.

That meant a Czech site written in a different version of Extended ASCII had to specify it was ISO 8859-2. Japanese sites had to specify that they were Shift JIS, and Finnish sites had to note ISO 8859-15. Some people used UTF-8 encoding to easily support universal text, but those people were few and far between.

You could specify an encoding with a line in the HTML code like:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-8">

or for HTML5 you use:

<meta charset="UTF-8">

That way your browser knows what encoding to display. The problem shows up when you don’t specify an encoding and the browser has to guess. If it guesses wrong, the és become Ã©s, other characters turn into empty boxes, and everything looks terrible.

People didn’t specify encodings because they always lived on the same part of the web - Americans stayed on American sites, Koreans stayed on the Korean web, etc. If your computer defaults to Latin 1 and all of the sites you view are Latin 1, that usually won’t cause a problem.

But say you had a Hebrew computer which defaulted to ISO-8859-8 for your encoding. All your documents, all your spreadsheets, etc are all in ISO-8859-8 encoding. You upload a web page and don’t specify an encoding, but it looks perfectly fine to you because your computer always assumes things are ISO-8859-8.

When a Swede visits your site, though, their computer would assume your site was Swedish, show it as ISO-8859-15, and everyone’s characters would get mixed up.

Note: While you’re browsing the web, you can always change the encoding a site is displayed with! In Chrome, go to the View menu, then Encoding, and pick something else. Maybe change it a few times on this page to see the characters switch up, but make sure you switch back to Auto Detect at the end.

In 2012 HTML5 officially came out, which decreed that the default encoding for the web was now UTF-8. Now a single page can have Ö and ♥ and 激 all living together in peace and harmony, by default (see: Defaults matter).

You’ll still have problems, though - not everyone is using the new standard, and plenty of sites are from the HTML4 era. If you’re scraping the web, this is especially important: if you download a (for example) ISO-8859-3 file, be sure you open it as ISO-8859-3.
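If you’re scraping in Python, one way to handle this is to pull the declared charset out of the raw bytes before you decode them. This is a hedged sketch using invented example bytes (a real scraper would also check the Content-Type HTTP header, and might fall back to a detection library):

```python
import re

# Pretend these raw bytes came from an old HTML4-era page.
# The trailing byte 0xfc is "ü" in ISO-8859-3, but isn't valid UTF-8 on its own.
raw = b'<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-3">\xfc'

# Sniff the declared charset out of the raw bytes before decoding
match = re.search(rb'charset=([-\w]+)', raw)
encoding = match.group(1).decode('ascii') if match else 'utf-8'

# Decode with the encoding the page actually claims to use
text = raw.decode(encoding)
print(encoding)   # ISO-8859-3
print(text[-1])   # ü
```

If we’d blindly called `raw.decode('utf-8')` instead, Python would have thrown a `UnicodeDecodeError` on that last byte.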

Special Characters

Even before UTF-8, special characters abounded: ♥ and ™ and → were all over the place. If your text editor only supported ASCII or your site was in Latin-1 (ISO 8859-1), how’d you get them? Through the magic of HTML entities!

HTML entities are special codes in the text of your web page prefixed by an ampersand & and post-fixed by a ;. They let you display symbols your character encoding wouldn’t normally allow.

HTML Entity Code    Displays as
&hearts;            ♥
&copy;              ©
&Dagger;            ‡
&#28608;            激
&agrave;            à

There are a lot of them!

Sometimes they have special names, and sometimes you just type in the decimal number of the Unicode code point you’re looking for (or the hex number with an x prefix, like &#x6FC0;). Even though on a UTF-8 page you can just type ♥ instead of &hearts;, HTML entities are still a common way of representing symbols that are difficult to type. You’d never know unless you viewed the source of a page, though!

HTML entities are a pain when you’re scraping, since every à might actually show up as a &agrave;. You’ll usually want to find a way to convert them before you start working on your text (here’s a simple BeautifulSoup example from StackOverflow).
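If you’re in Python 3, the standard library can do the conversion on its own, no BeautifulSoup required. A minimal sketch using the entities from the table above:

```python
from html import unescape  # Python 3 standard library

# unescape() converts both named entities and numeric character references
print(unescape("&agrave; &hearts; &#28608;"))  # à ♥ 激
```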

Spreadsheets and other files

Sometimes you open up a file and you’re presented with a string of ?????s or little boxes or other nonsense. This is, as you now know, a character encoding problem. Typically you’re opening a file from some bureaucratic arm of a government or business that isn’t cool enough to use UTF-8 yet.

Word/Excel/other Microsoft products: Choose text encoding when you open and save files

Sublime Text: Open your file, then select File > Reopen with Encoding and select the appropriate encoding. You can then do File > Save with Encoding and pick UTF-8 to help save the world. It mentions something called a BOM (byte order mark) a few times; you can read up on that here.

Programming Languages

We’ll cover Python in-depth in the next piece, but character encodings when programming can be a huge huge headache.

Under the hood

When you’re coding, a series of characters like "Cat" is called a string. The way "Cat" is treated by different programming languages varies widely.

Sometimes they’re just strings of bits and bytes.

"Cat" might be 0100 0011 0110 0001 0111 0100. The computer could read those bytes as three UTF-8 characters, or three ISO 8859-15 characters, or three ISO 8859-1 Latin 1 characters, or maybe part of a couple of UTF-16 characters! It’s up to you as the programmer to keep track of what your text is.

Possible problems:

  1. What’s the length of the string? If you read in UTF-16 Chinese but your code assumes it’s UTF-8, you’re going to have a bad time.
  2. Maybe you’re using ISO 8859-15, save it, and then try to open it up as UTF-8. Is it going to display correctly?
  3. You feed a bunch of strings into some sort of analysis function. Does the function care about the encoding? Does it expect a different encoding?

An example of a language that treats strings like this is Ruby 1.8.
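A quick Python 3 sketch of the ambiguity: the exact same bytes decode to entirely different characters depending on which encoding you claim they’re in.

```python
# Two bytes: "é" if treated as UTF-8, "Ã©" if treated as Latin-1
data = b'\xc3\xa9'

print(data.decode('utf-8'))       # é
print(data.decode('iso-8859-1'))  # Ã©
```

Nothing about the bytes themselves tells you which reading is “right”; you have to know.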

Sometimes UTF-8 is special

Python 2 is insane. They do the above for most strings, but then also have a special version of strings just for Unicode text. You’ll learn more about that later.

Sometimes they remember their encoding

Some languages keep a little wrapper around strings to tell you what the encoding is. "Cat" is still 0100 0011 0110 0001 0111 0100, but there’s a note attached that says “this is Swedish Extended ASCII” or “this is encoded with UTF-8.”

Ruby 1.9 and later does this.

Sometimes they are UTF-16

Java apparently loves the world because it defaults to using UTF-16 for its strings.

When do you know?

You know you’re working with UTF-8 (or another encoding fancier than ASCII) because you see a ton of \x escapes in your output, paired with weird numbers and letters.

For example, if we try to look at 你好世界 in Python 2, it displays as this:

\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c

NOTE: The \x means “this is hexadecimal,” which counts from 0-F instead of from 0-9. Each hex digit counts from 0-F (a.k.a. 0-15) because that’s what you can represent with 4 bits (half a byte): 0000 through 1111 = 15 = F. Two hex digits, like \xe4, make one byte.

Seeing this weirdness doesn’t mean something is wrong; it’s not like seeing question marks and empty boxes in Word. It just means your code isn’t printing the actual characters to your screen, it’s printing the numbers that represent the characters. When you save it to a file you’ll probably be fine.
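You can see the same round trip from Python 3’s side, where the \x escapes show up when you print raw bytes rather than decoded characters. A small sketch using the same string as above:

```python
s = '你好世界'

# Encoding produces the raw bytes; printing them shows the \x hex escapes
encoded = s.encode('utf-8')
print(encoded)  # b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'

# Decoding with the right encoding brings the characters back unharmed
print(encoded.decode('utf-8'))  # 你好世界
```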

I unfortunately don’t have a good answer to “when do you know something is wrong” other than “something eventually breaks.” I’m open to suggestions!

Running scripts

Sometimes (but not always) you have to specify your encoding for the Python or Ruby scripts you’re running.

If you don’t, the interpreter (the python or ruby command) will think you wrote your code in ASCII, and then freak out when it encounters something non-ASCII. You can put one of the following lines at the top of your .py or .rb file to force it to read your script using a specific encoding.

# -*- coding: utf-8 -*-
# coding: utf-8
# encoding: UTF-8

They’ll work for both Python and Ruby. The interpreter just looks for something that says ‘coding: blahblahblah’, the capitalization and fancy -*- marks are simply convention.

Opening and saving files (when programming)

When you’re opening and saving files in a programming language, try to specify an encoding if you can. Your computer likes to guess, which is fine, but as we saw up above with HTML sometimes it can guess incorrectly and cause problems.

Python 2

Use the codecs library to specify an encoding. The third argument you pass is the encoding (utf-8, utf-16, ISO-8859-1, etc.).

import codecs
opened = codecs.open("filename.txt", "r", "ISO-8859-8")

Python 3

opened = open("filename.txt", encoding="utf-8")

Ruby

opened = File.read('filename.txt', encoding: 'iso-8859-1')

More details for Ruby 1.9

Databases

Sometimes your database uses one encoding and your programming language is in another, so you need to be sure they’re in agreement.

When creating your database, you use a line like the following to set the character set.

CREATE DATABASE mydatabasename
  DEFAULT CHARACTER SET utf8
  DEFAULT COLLATE utf8_general_ci;

I know my MySQL frontend always tries to default to Latin-1 and I have to strong-arm it into UTF-8. (Also note that MySQL’s utf8 character set only covers characters up to three bytes long; if you need emoji or other four-byte characters, use utf8mb4 instead.)

Comprehension Quiz

Think you’ve got it? Here’s a tiny quiz:

  1. Do you have to specify an encoding for a web page? How can you change it?
  2. When I’m programming and I see \xe4\xbd\xa0\xe5, what am I probably dealing with?
  3. Do programming languages know what encoding a string is in?
  4. How can I specify an encoding for a Python or Ruby script I’ve written?
  5. What’s the Python 2 library that helps me read/write files written in specific encodings?

Answers are below.

  1. No, you don’t! As the web page maker you can use a meta tag; as the viewer, you can change it through a menu in your browser.
  2. UTF-8/Unicode characters that won’t display via ASCII (Chinese characters, in that case)
  3. Generally not, although Python 2 has a special type for Unicode.
  4. Using a # encoding: utf-8 line up at the top
  5. codecs

Next steps

Head back to UTF-8, Unicode, Character Encoding, and International Data Journalism → to grab the next section!

Want to hear when I release new things?
My infrequent and sporadic newsletter can help with that.