Python 2, Python 3 and UTF-8

This is part 5 of a 5-part series on character encodings in international data journalism. Check out the other pieces.

Once upon a time there was Python 2, and then there was Python 3. Unlike a lot of upgrades, you couldn’t run Python 2 code using the new version of Python! Because a lot of popular libraries were written in Python 2, people have been very very slow to adopt Python 3.

People who learn Python to this day are still learning Python 2, because hey, why not? Who cares? The answer is you care, and UTF-8 cares, and Unicode cares, and this article is going to force you to use Python 3.

Review from Character sets and encoding in the wild

UTF-8 is pretty great, but not all programming languages love it.

Python and character encodings

Python 2 doesn’t give a damn what your strings are encoded as. Latin 1, Latin 2, Shift-JIS, everything is fine. Doesn’t keep track of them, either, that’s up to you!

Python 2 also has a special Unicode string, where 'Cat' would be the normal string and u'Cat' would be the Unicode version.

In Python 3, every string is Unicode by default, and UTF-8 is the default encoding for source files. This doesn’t seem like that big of a change, but it makes a lot of things Just Work that used to be problematic.
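To make that concrete, here is a minimal sketch meant for Python 3 only. The sample text is the one used later in this article; the variable names are just for illustration. It shows that a string holds characters, and that bytes only appear once you encode.

```python
# Python 3: str holds Unicode characters; bytes appear only after encoding.
text = "你好世界"

# len() counts characters, not bytes.
print(len(text))

# Encoding to UTF-8 produces a bytes object, three bytes per character here.
data = text.encode("utf-8")
print(type(data).__name__, len(data))

# Decoding with the same codec returns the original string.
assert data.decode("utf-8") == text
```

The 12 bytes in data are the same ones Python 2 shows when you look at '你好世界' without printing it.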

Compare and Contrast

I put this together as two IPython notebooks, too: Python 2, Python 3.

Let’s compare what happens if you run the following code in an IPython notebook with Python 2 and Python 3.

Command | Python 3 | Python 2
print 'hello world' (written print('hello world') in Python 3) | hello world | hello world
'hello world' | 'hello world' | 'hello world'
print '你好世界' (written print('你好世界') in Python 3) | 你好世界 | 你好世界
'你好世界' | '你好世界' | '\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
requests.get("http://djchina.org").text | <h2>资源</h2> | <h2>\u8d44\u6e90</h2>
import pandas as pd; utf8_df.to_csv("../output.csv") | Works fine | UnicodeEncodeError: 'ascii' codec can't encode characters
Opening a UTF-8 file with accented characters | Works fine | Horrible errors

So more or less, Python 3 does everything right. It opens, saves, and looks at Unicode/UTF-8 perfectly, while Python 2 keeps forgetting it doesn’t care about what your strings are and tries to treat them as ASCII (and throws an error in the process).
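You can reproduce the flavor of that Python 2 error in Python 3 by asking for ASCII explicitly. This is only a sketch of the failure mode, not what Python 2 does internally, and try_ascii is a made-up helper name.

```python
# Python 3 sketch of the failure Python 2 hits when it assumes ASCII.
text = "你好世界"

def try_ascii(value):
    """Return the ASCII-encoded bytes, or a short message if encoding fails."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as err:
        return "UnicodeEncodeError: " + err.reason

# Plain English text fits in ASCII, so this succeeds.
print(try_ascii("hello world"))

# Chinese characters do not fit in ASCII, so this reports the error.
print(try_ascii(text))
```

The first call succeeds because every character fits in ASCII. The second fails for the same reason to_csv fails in the Python 2 column above.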

Working with Python 2

If you can’t use Python 3, you can try to make things work with Python 2. You have two main saviors in Python 2:

1. The codecs library

codecs allows you to specify an encoding when opening files for reading and writing. You can open a UTF-8 file for reading like so:

import codecs
opened = codecs.open("filename.txt", "r", "utf-8")
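For comparison, here is a rough sketch of the same idea in Python 3, where the built-in open takes an encoding argument directly. The file name and its contents are invented for the demo, and the file lives in a temporary directory. In Python 2.7, io.open accepts the same encoding argument.

```python
import os
import tempfile

# Hypothetical file, created in a temporary directory just for this demo.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Write text as UTF-8; the encoding argument controls the bytes on disk.
with open(path, "w", encoding="utf-8") as handle:
    handle.write("café 你好\n")

# Read it back with the same encoding to get the original characters.
with open(path, "r", encoding="utf-8") as handle:
    contents = handle.read()

print(contents == "café 你好\n")
```

Naming the encoding on both the write and the read is what keeps the round trip safe, whatever the system default happens to be.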

2. Hacking sys

The following code forces Python 2 to use UTF-8. It’s very discouraged, but it totally works better than anything else.

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The big issue that comes up is that you can’t use print from IPython Notebook any more (it prints to the command line, not to your notebook). There are other issues, but

3. .encode and .decode

Come on, don’t do this to yourself. Just move to Python 3! You could also try to read this summary if you are especially masochistic.
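If you do end up here, the core idea is small: decode bytes into text on the way in, and encode text into bytes on the way out. Below is a minimal sketch written for Python 3, where the two types are kept strictly apart. The sample bytes are an assumption chosen for illustration.

```python
# Python 3 sketch: decode bytes on the way in, encode text on the way out.
raw = b"caf\xc3\xa9"  # the UTF-8 bytes for "café"

# Decoding with the right codec gives the intended text.
good = raw.decode("utf-8")

# Decoding with the wrong codec does not fail, it just produces garbage.
bad = raw.decode("latin-1")

print(good)
print(bad)

# Encoding reverses the process, as long as the codec matches.
assert good.encode("utf-8") == raw
```

The second decode is the dangerous one: it raises no error and quietly hands you cafÃ© instead of café.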

Working with Python 3

The thing about Python 3 is that you can have Python 2 and Python 3 running on the same system. You have no good reason to only use Python 2!

Differences between Python 2 and Python 3

  1. You have to use print("Hello world") instead of print "Hello world"
  2. UTF-8 works easily and flawlessly

That is, more or less, the difference.
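Here is a short sketch of the first difference, written for Python 3. The greeting variable is only for illustration. If you are stuck on Python 2.6 or later, putting from __future__ import print_function at the top of a file enables the same syntax there.

```python
# In Python 3, print is a regular function, so parentheses are required.
greeting = "Hello world"
print(greeting)

# Because it is a function, print accepts keyword arguments such as sep and end.
print("Python", 3, sep=" ", end="!\n")
```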

How to install Python 3 if you used Anaconda to install Python 2/IPython Notebook

NOTE: If you didn’t use Anaconda to install Python, you should think about it. It’s a complete install that comes with all of the fun packages - numpy, scikit-learn, matplotlib, NLTK, IPython Notebook, etc - and none of the hassle.

After you do the following steps, your default Python will still be Python 2, but you’ll be able to run Python 3 using python3. Python 3 will also be available as an option in your IPython Notebook.

Run the following commands from the command line.

1. Check and see if there will be a problem.

First, you’ll need to look at your PYTHONPATH.

echo $PYTHONPATH

Did it display something like /usr/local/lib/python2.7/site-packages:? Then you’re in trouble, and you should email me for a longer explanation. If it didn’t display anything, then you’re good to go with the rest of this.
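If you would rather check from Python than from the shell, something like the following sketch could work. The helper name python2_entries is made up for this example, and treating any entry that contains "python2" as a problem is an assumption, not an official rule.

```python
import os

def python2_entries(pythonpath):
    """Return PYTHONPATH entries that mention a python2.x directory."""
    # Split on the platform's path separator and drop empty pieces.
    entries = [entry for entry in pythonpath.split(os.pathsep) if entry]
    return [entry for entry in entries if "python2" in entry]

# Read the real variable; an unset PYTHONPATH is treated as empty.
current = os.environ.get("PYTHONPATH", "")
print(python2_entries(current) or "No Python 2 entries found")
```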

2. Save a copy of the Python 2 kernel

The Python 2 kernel is the thing that runs Python 2 code (…kind of). We want to save a copy of it so that IPython Notebooks will always be able to use it.

ipython kernelspec install-self --user

3. Create a new environment that supports Python 3. This will cause Python 3 to be installed.

An ‘environment’ is just a specific version of Python with a bunch of libraries. By default you’re running everything on your system in one Python 2 environment. Anaconda can create a new environment that will install Python 3 for you.

conda create -n python3 python=3 anaconda

NOTE: You can also create environments for all sorts of other stuff, like maybe you have code that relies on Python 2.0.5 or a special matplotlib or something. If you’re especially responsible, every project you create has its own environment! If you hear about venv or virtualenv, that’s what people are talking about.

4. Switch to the Python 3 environment.

Even though you created the environment, you aren’t using it yet. Use the following command to switch to the Python 3 environment.

source activate python3

5. Save the Python 3 kernel.

Now, just like we did before with Python 2, we’re going to save the Python 3 kernel so IPython Notebook can use it.

ipython kernelspec install-self --user

6. Switch away from the Python 3 environment, so you’re back to your default Python 2 environment.

And since we don’t want to default to Python 3 (…do we?), let’s just get out of the environment for right now.

source deactivate

7. Start up a new Terminal window

In order to have the changes take effect, you’ll need to close your terminal window and open up a new one. And then if you want to run some IPython Notebooks it’s the same old command:

ipython notebook

When you click New it will give you the option of Python 3.

Additionally, if you want to run a Python 3 script from the command line, use

python3 your_script.py

And you should be good to go!
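To double-check which Python is actually running your script, a minimal sketch of what your_script.py could contain is shown here. It only uses the standard sys module; the variable names are illustrative.

```python
import sys

# sys.version_info describes the interpreter running this code.
major, minor = sys.version_info[:2]
print("Running Python {}.{}".format(major, minor))

# sys.executable is the path to that interpreter, which hints at the
# environment it belongs to.
print(sys.executable)

is_python3 = major == 3
print("Python 3 active:", is_python3)
```

If the last line says False, the script was started with the Python 2 interpreter rather than python3.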

NOTE: If you’re having problems with your kernel starting up, make sure you opened a new Terminal. You can also try restarting your computer.

8. Installing new packages

But!!! Let’s say you need a library like nameparser that doesn’t come with Anaconda? You’ll need to hop into your Python 3 environment and install the library:

source activate python3
pip install nameparser
source deactivate

It pops you into the Python 3 environment and uses the Python 3 version of pip to install nameparser. And then you should be good to go.
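One way to confirm that the install landed in the right place is to ask the Python 3 interpreter whether it can find the module. This is a sketch using the standard importlib machinery; is_installed is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the current interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None

# json ships with Python, so it should always be found.
print(is_installed("json"))

# nameparser is only found if it was installed into the active environment.
print(is_installed("nameparser"))
```

Run it with python3 so that it checks the Python 3 environment rather than your default Python 2 one.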

Comprehension Quiz

Think you’ve got it? Here’s a tiny quiz:

  1. Are you going to use Python 2 any more?
  2. What’s different about the print statement between Python 2 and Python 3?
  3. What’s different about Python 2 and Python 3 in regards to string handling/Unicode/UTF-8?
  4. When using Python, what is an environment?
  5. If you update to Python 3 using the directions above, can you run it without activating the Python 3 environment? What’s the command to run a script using Python 3?
  6. Are you going to use Python 2 any more????

Answers are below. I’ve typed out the numbers so it’s a little harder to accidentally cheat.

  1. NO.
  2. Python 3 does print("hello world") while Python 2 does print "hello world"
  3. Strings are Unicode by default in Python 3, and UTF-8 is the default source encoding.
  4. A specific version of Python + a set of libraries.
  5. python3
  6. NO!!!!

Mission complete

You’ve finished learning absolutely everything about character encodings and hopefully you’re now an A+ expert international Python programmer.

Head back to the international data journalism homepage and maybe there are some other goodies there for you.
