← back to class-06

Opening a Swiss CSV in pandas

The Error

In this example, we will open the Grundbuch- und Notariatsverwaltung - Geschäftsfallstatistik Bereich Grundbuch from the official opendata.swiss open data website.

When you first try to open a dataset, always try plain .read_csv first. In this case, we will get an error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv")
df.head()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_65887/3862338470.py in <module>
      1 import pandas as pd
      2 
----> 3 df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv")
      4 df.head()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte

If we look all the way at the bottom we see the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte.

The Fix for UnicodeDecodeError 'utf-8' codec can't decode byte

To fix this issue, we need to set the character encoding to ISO 8859-1/Latin-1. By default pandas tries to use Unicode/UTF-8, but that's not the right match for this dataset.

We also need to tell it to use the ; semicolon as the separator, even though it's named as a CSV file!

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-1", sep=';')
df.head()
JAHR MONAT BEZIRK_NUMMER_BFS BEZIRK_NAME BEURKUNDUNGEN TAGEBUCHEINTRAEGE HANDAENDERUNGEN HOEHE_DER_KAUFSUMME NEUE_GRUNDSTUECKE DIENSTBARKEITEN VORMERKUNGEN ANMERKUNGEN
0 2017 Januar 2011 Arbon 161 315 91 58069073.65 39 25 15 42
1 2017 Februar 2011 Arbon 123 255 80 42692836.00 4 14 4 22
2 2017 März 2011 Arbon 211 448 114 46158136.40 146 33 58 37
3 2017 April 2011 Arbon 105 206 62 37412230.00 39 16 7 28
4 2017 Mai 2011 Arbon 141 252 74 48671315.00 46 11 20 35

The explanation for UnicodeDecodeError 'utf-8' codec can't decode byte

Computers only understand numbers, not text. To allow computers to read things, this thing called character encoding where every letter is given a number. That way the computer can understand it!

Here's an example:

character number (encoding)
A 65
B 66
a 97
b 98
& 38

Previously, each language or country had its own list of letters and numbers. For example, English doesn't need a ř, so ř didn't have a number. It's like the character didn't exist!

That system wasn't very convenient when you want to talk across borders. These days most people use a character encoding called Unicode or UTF-8 that lets your computer know about every character everywhere. It lets you computer talk about bread or パン or хлеб easily!

By default, pandas assumes your data is Unicode/UTF-8. If not: surprise, you get an error! As pandas tried to use Unicode to decode numbers into letters, it got something that just didn't make sense. This tells your your data probably isn't using UTF-8.

Unfortunately, when working with a CSV file you just have to guess what the correct encoding it.

For Swiss government datasets (and most of Western Europe), you will generally use ISO 8859-1 (also known as Latin-1). This is a character encoding that has most of the characters that Western Europe needs, but not much else.

To specify an encoding for pandas, pass encoding= when using .read_csv.

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-2")
df.head()
JAHR;MONAT;BEZIRK_NUMMER_BFS;BEZIRK_NAME;BEURKUNDUNGEN;TAGEBUCHEINTRAEGE;HANDAENDERUNGEN;HOEHE_DER_KAUFSUMME;NEUE_GRUNDSTUECKE;DIENSTBARKEITEN;VORMERKUNGEN;ANMERKUNGEN
0 2017;Januar;2011;Arbon;161;315;91;58069073.65;...
1 2017;Februar;2011;Arbon;123;255;80;42692836;4;...
2 2017;März;2011;Arbon;211;448;114;46158136.4;14...
3 2017;April;2011;Arbon;105;206;62;37412230;39;1...
4 2017;Mai;2011;Arbon;141;252;74;48671315;46;11;...

Another improvement: Using a semicolon as a separator

That output doesn't look right. But why? Because of CSVs that don't use commas!

CSV stands for comma-separated values and normally looks like this:

name,species
Jack,cat
Mulberry,cat
Rover,dog

For some reason, though, much of the world uses semicolons to separate the values instead of commas. In those situations, our dataset would look like this:

name;species
Jack;cat
Mulberry;cat
Rover;dog

Let's look at what pandas shows us, especially the header row - JAHR;MONAT;BEZIRK_NUMMER_BFS;BEZIRK_NAME;.... There we go! It's because we probably need to use a semicolon separator instead of a common.

To use a semicolon as a separator, just pass sep=';' when using pd.read_csv.

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-2", sep=';')
df.head()
JAHR MONAT BEZIRK_NUMMER_BFS BEZIRK_NAME BEURKUNDUNGEN TAGEBUCHEINTRAEGE HANDAENDERUNGEN HOEHE_DER_KAUFSUMME NEUE_GRUNDSTUECKE DIENSTBARKEITEN VORMERKUNGEN ANMERKUNGEN
0 2017 Januar 2011 Arbon 161 315 91 58069073.65 39 25 15 42
1 2017 Februar 2011 Arbon 123 255 80 42692836.00 4 14 4 22
2 2017 März 2011 Arbon 211 448 114 46158136.40 146 33 58 37
3 2017 April 2011 Arbon 105 206 62 37412230.00 39 16 7 28
4 2017 Mai 2011 Arbon 141 252 74 48671315.00 46 11 20 35

Perfect!

← back to class-06