Unicode, UTF-8, and Worldwide Character Encodings

One of the biggest problems with analyzing non-English-language data involves the way computers think about the characters that appear on your screen.

This five-part series explains why international data causes problems and walks through the steps to fix them. It seems like a lot, but you’ll practically be a genius by the time you’re finished.

1. Binary basics

First, you need to go read my summary of how binary works. You’re allowed to skip it if you can answer the following:

  1. How many bits is 01 1101 0100?
  2. What is 1111 in decimal?
  3. There are 10 types of people in the world, those who understand binary and those who don’t. LOL?

If you even questioned yourself a little bit, go read it. It won’t take long.
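
If you’d like to sanity-check your own binary arithmetic, a Python prompt will do the conversions for you. This is only a rough sketch: the values are made up for illustration (they are not the quiz questions above), and it uses nothing beyond the built-in `len`, `int` and `format`.

```python
# Example value chosen for illustration only; it is not one of the quiz answers.
bit_string = "0100 0001"

# The spaces are just for readability, so strip them before counting bits.
bits = bit_string.replace(" ", "")
bit_count = len(bits)

# int() with base 2 turns a string of binary digits into an ordinary integer.
value = int(bits, 2)

# format() with "b" goes the other way, from an integer back to binary digits.
# Note that it drops leading zeros.
round_trip = format(value, "b")

print(bit_count, value, round_trip)
```

Swap in any other string of ones and zeros to check your work.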

2. The story of ASCII and character encodings

Once we understand how binary and bits work, we can move on to ASCII and other character encodings: how those bits are used to represent letters and other symbols inside of a computer.

You’re allowed to skip it if you can answer the following:

  1. Why might \ show up as Ö when you open a file?
  2. How many more characters does Extended ASCII support when compared to ASCII?
  3. What language could never ever use ASCII?

No, honestly, go read it. I thought I knew a ton about encodings when I started writing this guide, but it turns out I was sorely mistaken. The world makes more sense now.
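
To get a feel for the idea before you dive in, here is a small sketch of characters as numbers. The byte `0xE9` and the two code pages (`latin-1` and `cp437`) are my own picks for illustration, not something the guide prescribes.

```python
# ASCII gives each character a number that fits in 7 bits.
letter = "A"
code = ord(letter)                 # character -> number
seven_bits = format(code, "07b")   # number -> 7 binary digits
back = chr(code)                   # number -> character

# One and the same byte can mean different characters, depending on
# which 8-bit encoding you assume when reading it.
mystery_byte = b"\xe9"
as_latin1 = mystery_byte.decode("latin-1")
as_cp437 = mystery_byte.decode("cp437")

# Plain ASCII has no slot for accented letters, so encoding them fails.
try:
    "caf\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ascii() escapes non-ASCII characters, so this prints safely on any terminal.
print(code, seven_bits, back, ascii(as_latin1), ascii(as_cp437), ascii_ok)
```

The two decoded values differ even though the byte is identical, which is the root of most "weird character" surprises when opening a file.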

3. Unicode and UTF-8 save the day

Now that we know ASCII, it’s time to meet its big brother Unicode.

You can skip it if you can answer the following:

  1. What’s the difference between Unicode and UTF-8?
  2. What numbering system is DEADBEEF written in?
  3. How can UTF-8, which works in 8-bit units, express characters that take up to 32 bits?
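
As a teaser, the sketch below looks at a few characters two ways: the number Unicode assigns to each one, and the bytes UTF-8 uses to store that number. The four sample characters are arbitrary choices of mine, and this assumes Python 3.

```python
# Sample characters, written as escapes so the source stays pure ASCII:
# "A", e with acute accent, a Chinese character, and an emoji.
samples = ["A", "\u00e9", "\u4f60", "\U0001F600"]

rows = []
for ch in samples:
    code_point = ord(ch)           # the number Unicode assigns to the character
    encoded = ch.encode("utf-8")   # the bytes UTF-8 uses to store that number
    rows.append((f"U+{code_point:04X}", encoded.hex(), len(encoded)))

# Each row is (code point, UTF-8 bytes in hex, number of bytes).
for row in rows:
    print(row)
```

Notice that the byte count grows as the code point gets larger, while plain ASCII characters still take a single byte.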

4. Character sets and encoding in the wild

Now that we know what UTF-8, Unicode, ASCII, Extended ASCII and everything else in the world are, we need to know where these things rear their fearsome little heads.

You can skip it if you can answer the following:

  1. Do you have to specify an encoding for a web page? How can you change it?
  2. When I’m programming and I see \xe4\xbd\xa0\xe5, what am I probably dealing with?
  3. What’s the Python 2 library that helps me read/write files written in specific encodings?
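
Escaped byte sequences like these are worth poking at in a Python 3 prompt. The sketch below is built on assumptions: the six-byte value is a hypothetical example of my own (not the sequence from the question), I am guessing that it is UTF-8, and the file name `greeting.txt` is made up. It uses Python 3’s built-in `open` with an explicit `encoding` argument.

```python
import os
import tempfile

# A hypothetical run of raw bytes, as you might see in a log or a debugger.
raw = b"\xe4\xbd\xa0\xe5\xa5\xbd"

# Decoding turns bytes into text, but only if you name the right encoding.
text = raw.decode("utf-8")      # assumption: the bytes are UTF-8
wrong = raw.decode("latin-1")   # "works" without an error, but yields gibberish

# Always state the encoding when writing and reading text files.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Reading in binary mode shows the bytes that actually landed on disk.
with open(path, "rb") as f:
    on_disk = f.read()

print(len(raw), len(text), len(wrong), read_back == text, on_disk == raw)
```

The wrong guess produces one junk character per byte instead of raising an error, which is why encoding bugs tend to slip through unnoticed.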

5. Python and UTF-8

This is the part where I convince you to upgrade to Python 3, and then give you a step-by-step guide on how to do it.
