Opening a Swiss CSV in pandas

The Error

In this example, we will open the Grundbuch- und Notariatsverwaltung - Geschäftsfallstatistik Bereich Grundbuch from the official opendata.swiss open data website.

When you first try to open a dataset, always try plain .read_csv first. In this case, we will get an error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv")
df.head()

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_65887/3862338470.py in <module>
      1 import pandas as pd
      2 
----> 3 df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv")
      4 df.head()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte

If we look all the way at the bottom we see the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 289: invalid continuation byte.

The Fix for UnicodeDecodeError 'utf-8' codec can't decode byte

To fix this issue, we need to set the character encoding to ISO 8859-1/Latin-1. By default pandas tries to use Unicode/UTF-8, but that's not the right match for this dataset.

We also need to tell it to use the ; semicolon as the separator, even though it's named as a CSV file!

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-1", sep=';')
df.head()

	JAHR	MONAT	BEZIRK_NUMMER_BFS	BEZIRK_NAME	BEURKUNDUNGEN	TAGEBUCHEINTRAEGE	HANDAENDERUNGEN	HOEHE_DER_KAUFSUMME	NEUE_GRUNDSTUECKE	DIENSTBARKEITEN	VORMERKUNGEN	ANMERKUNGEN
0	2017	Januar	2011	Arbon	161	315	91	58069073.65	39	25	15	42
1	2017	Februar	2011	Arbon	123	255	80	42692836.00	4	14	4	22
2	2017	März	2011	Arbon	211	448	114	46158136.40	146	33	58	37
3	2017	April	2011	Arbon	105	206	62	37412230.00	39	16	7	28
4	2017	Mai	2011	Arbon	141	252	74	48671315.00	46	11	20	35

The explanation for UnicodeDecodeError 'utf-8' codec can't decode byte

Computers only understand numbers, not text. To allow computers to read things, this thing called character encoding where every letter is given a number. That way the computer can understand it!

Here's an example:

character	number (encoding)
A	65
B	66
a	97
b	98
&	38

Previously, each language or country had its own list of letters and numbers. For example, English doesn't need a ř, so ř didn't have a number. It's like the character didn't exist!

That system wasn't very convenient when you want to talk across borders. These days most people use a character encoding called Unicode or UTF-8 that lets your computer know about every character everywhere. It lets you computer talk about bread or パン or хлеб easily!

By default, pandas assumes your data is Unicode/UTF-8. If not: surprise, you get an error! As pandas tried to use Unicode to decode numbers into letters, it got something that just didn't make sense. This tells your your data probably isn't using UTF-8.

Unfortunately, when working with a CSV file you just have to guess what the correct encoding it.

For Swiss government datasets (and most of Western Europe), you will generally use ISO 8859-1 (also known as Latin-1). This is a character encoding that has most of the characters that Western Europe needs, but not much else.

To specify an encoding for pandas, pass encoding= when using .read_csv.

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-2")
df.head()

	JAHR;MONAT;BEZIRK_NUMMER_BFS;BEZIRK_NAME;BEURKUNDUNGEN;TAGEBUCHEINTRAEGE;HANDAENDERUNGEN;HOEHE_DER_KAUFSUMME;NEUE_GRUNDSTUECKE;DIENSTBARKEITEN;VORMERKUNGEN;ANMERKUNGEN
0	2017;Januar;2011;Arbon;161;315;91;58069073.65;...
1	2017;Februar;2011;Arbon;123;255;80;42692836;4;...
2	2017;März;2011;Arbon;211;448;114;46158136.4;14...
3	2017;April;2011;Arbon;105;206;62;37412230;39;1...
4	2017;Mai;2011;Arbon;141;252;74;48671315;46;11;...

Another improvement: Using a semicolon as a separator

That output doesn't look right. But why? Because of CSVs that don't use commas!

CSV stands for comma-separated values and normally looks like this:

name,species
Jack,cat
Mulberry,cat
Rover,dog

For some reason, though, much of the world uses semicolons to separate the values instead of commas. In those situations, our dataset would look like this:

name;species
Jack;cat
Mulberry;cat
Rover;dog

Let's look at what pandas shows us, especially the header row - JAHR;MONAT;BEZIRK_NUMMER_BFS;BEZIRK_NAME;.... There we go! It's because we probably need to use a semicolon separator instead of a common.

To use a semicolon as a separator, just pass sep=';' when using pd.read_csv.

import pandas as pd

df = pd.read_csv("Geschaeftsfaelle_Grundbuch_ab2017.csv", encoding="iso8859-2", sep=';')
df.head()

	JAHR	MONAT	BEZIRK_NUMMER_BFS	BEZIRK_NAME	BEURKUNDUNGEN	TAGEBUCHEINTRAEGE	HANDAENDERUNGEN	HOEHE_DER_KAUFSUMME	NEUE_GRUNDSTUECKE	DIENSTBARKEITEN	VORMERKUNGEN	ANMERKUNGEN
0	2017	Januar	2011	Arbon	161	315	91	58069073.65	39	25	15	42
1	2017	Februar	2011	Arbon	123	255	80	42692836.00	4	14	4	22
2	2017	März	2011	Arbon	211	448	114	46158136.40	146	33	58	37
3	2017	April	2011	Arbon	105	206	62	37412230.00	39	16	7	28
4	2017	Mai	2011	Arbon	141	252	74	48671315.00	46	11	20	35

Perfect!

← back to class-06