This is part 5 of a 5-part series on character encodings in international data journalism. Check out the other pieces.
Once upon a time there was Python 2, and then there was Python 3. Unlike a lot of upgrades, you couldn’t run Python 2 code using the new version of Python! Because a lot of popular libraries were written in Python 2, people have been very very slow to adopt Python 3.
People who learn Python to this day are still learning Python 2, because hey, why not? Who cares? The answer is you care, and UTF-8 cares, and Unicode cares, and this article is going to force you to use Python 3.
UTF-8 is pretty great, but not all programming languages love it.
Python 2 doesn’t give a damn what your strings are encoded as. Latin 1, Latin 2, Shift-JIS, everything is fine. Doesn’t keep track of them, either, that’s up to you!
Python 2 also has a special Unicode string, where 'Cat' would be the normal string and u'Cat' would be the Unicode version.
In Python 3, every string is Unicode by default, and UTF-8 is the default encoding for source code. This doesn’t seem like that big of a change, but it makes a lot of things Just Work that used to be problematic.
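Here is a minimal sketch, assuming Python 3, of how text and UTF-8 bytes relate to each other. The variable names are only illustrative.

```python
# Python 3: a str is a sequence of Unicode characters, not bytes.
greeting = "你好世界"
char_count = len(greeting)            # counts characters

# Encoding turns the text into UTF-8 bytes (three bytes per character here).
utf8_bytes = greeting.encode("utf-8")
byte_count = len(utf8_bytes)          # counts bytes

# Decoding with the same codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

print(char_count, byte_count, round_trip == greeting)
```

The character count and the byte count differ because each of these characters takes three bytes in UTF-8.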
I put this together as two IPython notebooks, too: Python 2, Python 3.
Let’s compare what happens if you run the following code in an IPython notebook with Python 2 and Python 3.
Command | Python 3 | Python 2 |
---|---|---|
print 'hello world' | hello world | hello world |
'hello world' | 'hello world' | 'hello world' |
print '你好世界' | 你好世界 | 你好世界 |
'你好世界' | '你好世界' | '\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c' |
requests.get("http://djchina.org").text | <h2>资源</h2> | <h2>\u8d44\u6e90</h2> |
import pandas as pd; utf8_df.to_csv("../output.csv") | Works fine | UnicodeEncodeError: 'ascii' codec can't encode characters |
Opening a UTF-8 file with accented characters | Works fine | Horrible errors |
So more or less, Python 3 does everything right. It opens, saves, and looks at Unicode/UTF-8 perfectly, while Python 2 keeps forgetting it doesn’t care about what your strings are and tries to treat them as ASCII (and throws an error in the process).
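As a rough illustration of the "opens and saves" part, here is a small Python 3 sketch that writes a UTF-8 file and reads it back. The scratch file path and sample text are made up for the example. Passing the encoding explicitly is a safe habit, since the default can depend on your system's locale.

```python
import os
import tempfile

text = "café, naïve, 资源"

# Hypothetical scratch file; any path would do.
path = os.path.join(tempfile.mkdtemp(), "accents.txt")

# Naming the encoding explicitly avoids depending on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(loaded == text)
```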
If you can’t use Python 3, you can try to make things work with Python 2. You have two main saviors in Python 2:
codecs library
codecs allows you to specify an encoding when opening files for reading and writing. You can open a UTF-8 file for reading like so:
import codecs
opened = codecs.open("filename.txt", "r", "utf-8")
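For a slightly fuller picture, here is a sketch of a write-then-read round trip with codecs. The scratch file and the sample name are invented for the example; the u'' prefix is there so the same lines make sense under Python 2.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "filename.txt")

# Write Unicode text; codecs encodes it to UTF-8 on the way out.
with codecs.open(path, "w", "utf-8") as out:
    out.write(u"Se\u00f1or Garc\u00eda")

# Read it back; codecs decodes the UTF-8 bytes into Unicode text.
with codecs.open(path, "r", "utf-8") as opened:
    contents = opened.read()

print(contents == u"Se\u00f1or Garc\u00eda")
```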
sys
The following code forces Python 2 to use UTF-8. It’s very discouraged, but it totally works better than anything else.
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
The big issue that comes up is that you can’t use print from IPython Notebook any more (it prints to the command line, not to your notebook). There are other issues, too.
.encode and .decode
Come on, don’t do this to yourself. Just move to Python 3! You could also try to read this summary if you are especially masochistic.
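If you do end up converting by hand, the general pattern looks something like the sketch below: decode bytes into text as they come in, and encode text back to bytes on the way out. It is written for Python 3, and the sample bytes are just the UTF-8 form of 资源 from the table above.

```python
# Bytes as they might arrive from a file or the network (UTF-8 encoded).
raw = b"\xe8\xb5\x84\xe6\xba\x90"

# bytes -> Unicode text
text = raw.decode("utf-8")

# Unicode text -> bytes
encoded = text.encode("utf-8")

# Decoding with the wrong codec is where the errors come from.
try:
    raw.decode("ascii")
    failure = None
except UnicodeDecodeError as err:
    failure = type(err).__name__

print(text, encoded == raw, failure)
```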
The thing about Python 3 is that you can have Python 2 and Python 3 running on the same system. You have no good reason to only use Python 2!
In Python 3 you write print("Hello world") instead of print "Hello world". That is, more or less, the difference.
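One practical consequence, sketched here for Python 3: because print is an ordinary function, it accepts keyword arguments. Python 2.6 and later can opt in to the same behavior with from __future__ import print_function. The buffer name is just for illustration.

```python
import io

# In Python 3, print is a function, so it takes keyword arguments.
print("Hello world")
print("Hello", "world", sep=", ")   # custom separator between items
print("no newline", end="")         # suppress the trailing newline
print()

# It can also write somewhere other than the screen.
buffer = io.StringIO()
print("Hello", "world", sep=", ", file=buffer)
captured = buffer.getvalue()
```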
NOTE: If you didn’t use Anaconda to install Python, you should think about it. It’s a complete install that comes with all of the fun packages - numpy, scikit-learn, matplotlib, NLTK, IPython Notebook, etc - and none of the hassle.
After you do the following steps, your default Python will still be Python 2, but you’ll be able to run Python 3 using python3. Python 3 will also be available as an option in your IPython Notebook.
Run the following commands from the command line.
1. Check and see if there will be a problem.
First, you’ll need to look at your PYTHONPATH.
echo $PYTHONPATH
Did it display something like /usr/local/lib/python2.7/site-packages:? Then you’re in trouble, and you should email me for a longer explanation. If it didn’t display anything, then you’re good to go with the rest of this.
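If you would rather do the same check from Python, a sketch might look like this. The helper name python2_entries is hypothetical, and looking for the text "python2" in each entry is only a rough heuristic, not an official test.

```python
import os

def python2_entries(pythonpath):
    """Return the PYTHONPATH entries that mention a Python 2 directory."""
    # Split on the platform's path separator and drop empty pieces.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    # Heuristic (an assumption): a Python 2 path contains "python2".
    return [p for p in entries if "python2" in p]

suspicious = python2_entries(os.environ.get("PYTHONPATH", ""))
print(suspicious if suspicious else "PYTHONPATH looks clear")
```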
2. Save a copy of the Python 2 kernel
The Python 2 kernel is the thing that runs Python 2 code (…kind of). We want to save a copy of it so that IPython Notebooks will always be able to use it.
ipython kernelspec install-self --user
3. Create a new environment that supports Python 3. This will cause Python 3 to be installed.
An ‘environment’ is just a specific version of Python with a bunch of libraries. By default you’re running everything on your system in one Python 2 environment. Anaconda can create a new environment that will install Python 3 for you.
conda create -n python3 python=3 anaconda
NOTE: You can also create environments for all sorts of other stuff, like maybe you have code that relies on Python 2.0.5 or a special matplotlib or something. If you’re especially responsible, every project you create has its own environment! If you hear about venv or virtualenv, that’s what people are talking about.
4. Switch to the Python 3 environment.
Even though you created the environment, you aren’t using it yet. Use the following command to switch to the Python 3 environment.
source activate python3
5. Save the Python 3 kernel.
Now, just like we did before with Python 2, we’re going to save the Python 3 kernel so IPython Notebook can use it.
ipython kernelspec install-self --user
6. Switch away from the Python 3 environment, so you’re back to your default Python 2 environment.
And since we don’t want to default to Python 3 (…do we?), let’s just get out of the environment for right now.
source deactivate
7. Start up a new Terminal window
In order to have the changes take effect, you’ll need to close your terminal window and open up a new one. And then if you want to run some IPython Notebooks it’s the same old command:
ipython notebook
When you click New it will give you the option of Python 3.
Additionally, if you want to run a Python 3 script from the command line, use
python3 your_script.py
And you should be good to go!
NOTE: If you’re having problems with your kernel starting up, make sure you opened a new Terminal. You can also try restarting your computer.
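To confirm which Python actually ran your script, you could drop something like this into your_script.py. The function name describe_interpreter is made up for the example.

```python
import sys

def describe_interpreter():
    """Summarize which Python is running this script."""
    major, minor = sys.version_info[0], sys.version_info[1]
    return "Python {}.{} (default encoding: {})".format(
        major, minor, sys.getdefaultencoding()
    )

print(describe_interpreter())
```

Run it once with python and once with python3 to see the difference between the two.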
8. Installing new packages
But!!! Let’s say you need a library like nameparser that doesn’t come with Anaconda? You’ll need to hop into your Python 3 environment and install the library:
source activate python3
pip install nameparser
source deactivate
It pops you into the Python 3 environment and uses the Python 3 version of pip to install nameparser. And then you should be good to go.
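To double-check that a package landed in the environment you expected, one option is a quick lookup from inside Python 3. This is only a sketch, and is_installed is a hypothetical helper name.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the current Python can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# json ships with Python, so this should always be found.
print(is_installed("json"))
# Whether this is found depends on the environment you are running in.
print(is_installed("nameparser"))
```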
Think you’ve got it? Here’s a tiny quiz:
What’s the difference in the print statement between Python 2 and Python 3?
Answers are below. I’ve typed out the numbers so it’s a little harder to accidentally cheat.
Python 3 uses print("hello world") while Python 2 does print "hello world"
python3
You’ve finished learning absolutely everything about character encodings and hopefully you’re now an A+ expert international Python programmer.
Head back to the international data journalism homepage and maybe there are some other goodies there for you.