One of the biggest problems with analyzing non-English-language data involves the way computers think about the characters that appear on your screen.
This 5-part series explains why you’ll run into problems with international data, and walks you through the steps to fix them. It seems like a lot, but you’ll practically be a genius by the time you’re finished.
First, you need to go read my summary of how binary works. You’re allowed to skip it if you can answer the following:
01 1101 0100
?1111
in decimal?
If you even questioned yourself a little bit, go read it. It won’t take long.
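If you'd like to check your own binary arithmetic, here is a minimal sketch using Python's built-ins. The value `1010` is just an example I picked for illustration, not one of the questions above.

```python
# Convert between binary strings and decimal integers with built-ins.
bits = "1010"                            # example value chosen for illustration
as_decimal = int(bits, 2)                # parse the string as base 2
print(as_decimal)                        # 10

back_to_bits = format(as_decimal, "b")   # render the integer in base 2 again
print(back_to_bits)                      # 1010

# Binary literals work too; underscores are only visual separators.
print(0b1010_0001)                       # 161
```

Swap in any string of ones and zeros to test yourself before peeking.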
Once we understand how binary and bits work, we can move on to ASCII and other character encodings - how those bits are used to represent letters and other symbols inside of a computer.
You’re allowed to skip it if you can answer the following:
\
show up as Ö
when you open a file?
No, honestly, go read it. I thought I knew a ton about encodings when I started writing this guide, but it turns out I was sorely mistaken. The world makes more sense now.
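As a rough taste of what "bits representing letters" means, the sketch below looks up the number behind a character and then deliberately decodes some bytes with the wrong encoding. The letter `A`, the character `é`, and the choice of Latin-1 as the wrong encoding are my own examples, assumed here only for illustration.

```python
# Every character is stored as a number, and every number as bits.
letter = "A"
code = ord(letter)                  # the number behind the character: 65
bits = format(code, "08b")          # that number as eight bits: "01000001"
print(letter, code, bits)

# Garbled text appears when bytes written in one encoding are read in another.
original = "é"
raw = original.encode("utf-8")      # two bytes: b'\xc3\xa9'
garbled = raw.decode("latin-1")     # each byte read as its own character: "Ã©"
print(raw, garbled)
```

The bytes never changed in the second half. Only the rulebook used to read them did.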
Now that we know ASCII, it’s time to meet its big brother Unicode.
You can skip it if you can answer the following:
DEADBEEF
written in?
Now we know what UTF-8, Unicode, ASCII, Extended ASCII, and everything else in the world are, so it’s time to find out where these things rear their fearsome little heads.
You can skip it if you can answer the following:
\xe4\xbd\xa0\xe5
, what am I probably dealing with?
This is the part where I convince you to upgrade to Python 3, and then give you a step-by-step on how to do it.
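One reason Python 3 helps, sketched very roughly: it keeps text (`str`) and raw bytes (`bytes`) as separate types and refuses to mix them silently. The string `€5` is just an example I made up for illustration.

```python
# Python 3 separates text (str) from encoded data (bytes).
text = "€5"                             # str: two Unicode characters
data = text.encode("utf-8")             # bytes: b'\xe2\x82\xac5'

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 4
print(data.decode("utf-8") == text)     # True

# Mixing the two types raises an error instead of guessing an encoding.
try:
    text + data
    refused = False
except TypeError:
    refused = True
print(refused)                          # True
```

The euro sign takes three bytes in UTF-8, which is why the lengths differ even though the content is the same.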