Opening a Swiss CSV in pandas
The Error
In this example, we will open the "Grundbuch- und Notariatsverwaltung - Geschäftsfallstatistik Bereich Grundbuch" dataset from opendata.swiss, the official Swiss open data portal.
When you first open a dataset, always try a plain .read_csv first. In this case, we will get an error.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte
import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv")
df.head()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_65887/3862338470.py in <module>
1 import pandas as pd
2
----> 3 df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv")
4 df.head()
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header()
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte
If we look all the way at the bottom, we see the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte.
The Fix for UnicodeDecodeError 'utf-8' codec can't decode byte
To fix this issue, we need to set the character encoding to ISO 8859-1/Latin-1. By default pandas tries to use Unicode/UTF-8, but that's not the right match for this dataset.
We also need to tell it to use the semicolon ; as the separator, even though it's named as a CSV file!
import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-1", sep=';')
df.head()
JAHR | MONAT | BEZIRK_NUMMER_BFS | BEZIRK_NAME | BEURKUNDUNGEN | TAGEBUCHEINTRAEGE | HANDAENDERUNGEN | HOEHE_DER_KAUFSUMME | NEUE_GRUNDSTUECKE | DIENSTBARKEITEN | VORMERKUNGEN | ANMERKUNGEN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017 | Januar | 2011 | Arbon | 161 | 315 | 91 | 58069073.65 | 39 | 25 | 15 | 42 |
1 | 2017 | Februar | 2011 | Arbon | 123 | 255 | 80 | 42692836.00 | 4 | 14 | 4 | 22 |
2 | 2017 | März | 2011 | Arbon | 211 | 448 | 114 | 46158136.40 | 146 | 33 | 58 | 37 |
3 | 2017 | April | 2011 | Arbon | 105 | 206 | 62 | 37412230.00 | 39 | 16 | 7 | 28 |
4 | 2017 | Mai | 2011 | Arbon | 141 | 252 | 74 | 48671315.00 | 46 | 11 | 20 | 35 |
The explanation for UnicodeDecodeError 'utf-8' codec can't decode byte
Computers only understand numbers, not text. To let computers read text, we use something called a character encoding, where every letter is assigned a number. That way the computer can understand it!
Here's an example:
character | number (encoding) |
---|---|
A | 65 |
B | 66 |
a | 97 |
b | 98 |
& | 38 |
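You can see these numbers yourself with Python's built-in ord and chr functions:

```python
# ord() gives the number behind a character; chr() goes the other way
print(ord("A"))  # 65
print(ord("&"))  # 38
print(chr(97))   # a
```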
Previously, each language or country had its own list of letters and numbers. For example, English doesn't need a ř, so ř didn't have a number. It's like the character didn't exist!
That system wasn't very convenient when you want to talk across borders. These days most people use a character encoding called Unicode or UTF-8 that lets your computer know about every character everywhere. It lets your computer talk about bread or パン or хлеб easily!
By default, pandas assumes your data is Unicode/UTF-8. If not: surprise, you get an error! As pandas tried to use Unicode to decode numbers into letters, it got something that just didn't make sense. This tells you your data probably isn't using UTF-8.
Unfortunately, when working with a CSV file you often just have to guess what the correct encoding is.
For Swiss government datasets (and most of Western Europe), you will generally use ISO 8859-1 (also known as Latin-1). This is a character encoding that has most of the characters that Western Europe needs, but not much else.
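We can actually see why Latin-1 works here where UTF-8 fails. Byte 0xe4, the one from our error message, is the letter ä in ISO 8859-1, but on its own it is an invalid sequence in UTF-8. A quick sketch:

```python
raw = b"\xe4"  # the offending byte from the error message

# In ISO 8859-1 / Latin-1, 0xe4 maps directly to the letter 'ä'
print(raw.decode("iso8859-1"))  # ä

# In UTF-8, a lone 0xe4 is an incomplete sequence, so decoding fails
try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    print(error)
```

If guessing gets tedious, third-party libraries such as chardet or charset_normalizer can estimate an encoding from the raw bytes of a file.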
To specify an encoding for pandas, pass encoding= when using .read_csv.
import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-1")
df.head()
JAHR;MONAT;BEZIRK_NUMMER_BFS;BEZIRK_NAME;BEURKUNDUNGEN;TAGEBUCHEINTRAEGE;HANDAENDERUNGEN;HOEHE_DER_KAUFSUMME;NEUE_GRUNDSTUECKE;DIENSTBARKEITEN;VORMERKUNGEN;ANMERKUNGEN | |
---|---|
0 | 2017;Januar;2011;Arbon;161;315;91;58069073.65;... |
1 | 2017;Februar;2011;Arbon;123;255;80;42692836;4;... |
2 | 2017;März;2011;Arbon;211;448;114;46158136.4;14... |
3 | 2017;April;2011;Arbon;105;206;62;37412230;39;1... |
4 | 2017;Mai;2011;Arbon;141;252;74;48671315;46;11;... |
Another improvement: Using a semicolon as a separator
That output doesn't look right. But why? Because of CSVs that don't use commas!
CSV stands for comma-separated values and normally looks like this:
name,species
Jack,cat
Mulberry,cat
Rover,dog
For some reason, though, much of the world uses semicolons to separate the values instead of commas. In those situations, our dataset would look like this:
name;species
Jack;cat
Mulberry;cat
Rover;dog
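If you'd rather not eyeball the delimiter, Python's built-in csv.Sniffer can often guess it from a sample of the file. A small sketch using the example above:

```python
import csv

sample = "name;species\nJack;cat\nMulberry;cat\nRover;dog\n"

# Sniffer inspects the sample and reports the most likely delimiter
dialect = csv.Sniffer().sniff(sample, delimiters=",;")
print(dialect.delimiter)  # ;
```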
Let's look at what pandas shows us, especially the header row - JAHR;MONAT;BEZIRK_NUMMER_BFS;BEZIRK_NAME;.... There we go! It's because we need to use a semicolon separator instead of a comma.
To use a semicolon as a separator, just pass sep=';' when using pd.read_csv.
import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-1", sep=';')
df.head()
JAHR | MONAT | BEZIRK_NUMMER_BFS | BEZIRK_NAME | BEURKUNDUNGEN | TAGEBUCHEINTRAEGE | HANDAENDERUNGEN | HOEHE_DER_KAUFSUMME | NEUE_GRUNDSTUECKE | DIENSTBARKEITEN | VORMERKUNGEN | ANMERKUNGEN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017 | Januar | 2011 | Arbon | 161 | 315 | 91 | 58069073.65 | 39 | 25 | 15 | 42 |
1 | 2017 | Februar | 2011 | Arbon | 123 | 255 | 80 | 42692836.00 | 4 | 14 | 4 | 22 |
2 | 2017 | März | 2011 | Arbon | 211 | 448 | 114 | 46158136.40 | 146 | 33 | 58 | 37 |
3 | 2017 | April | 2011 | Arbon | 105 | 206 | 62 | 37412230.00 | 39 | 16 | 7 | 28 |
4 | 2017 | Mai | 2011 | Arbon | 141 | 252 | 74 | 48671315.00 | 46 | 11 | 20 | 35 |
Perfect!