Processing change over time with pandas

Sometimes data is easy to graph and communicate, and sometimes it takes a few extra steps. Let’s take a look at visas-by-continent.csv in the data/ folder.

Open it from the data/ folder without moving it.

%matplotlib inline
import pandas as pd
df = pd.read_csv("data/visas-by-continent.csv")
df.head()
year Africa Asia Europe North America Oceania South America Unknown
0 1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1 1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
2 1999 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
3 2000 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
4 2001 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0

Checking our data

  • How many columns and rows does it have?
  • What is the data type of each column?
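The second question gets answered below with .dtypes. For the first, .shape is the quickest check (a minimal sketch, not part of the original notebook):

# .shape returns (rows, columns); for this file it should be (20, 8)
df.shape
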
df.dtypes
year               int64
Africa           float64
Asia             float64
Europe           float64
North America    float64
Oceania          float64
South America    float64
Unknown          float64
dtype: object
Making a simple sample plot

I worked on this data a bit for you, so it’s very easy to graph. Just take the dataframe and .plot() it.

df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10835ada0>

[png: a line for every column, including year, on one crowded chart]

That x axis looks terrible, and “year” isn’t really a continent. Let’s fix that.

Improving our data to improve our plot

We can improve the x axis and get rid of the ‘year’ column if we switch to a datetime index. This requires the following steps:

  • Convert the year into a datetime column,
  • set it as our index,
  • and graph it again (sketched below).

Be sure to specify the format when converting to a date.

df['year'] = pd.to_datetime(df['year'], format="%Y")
df.head()
year Africa Asia Europe North America Oceania South America Unknown
0 1997-01-01 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1 1998-01-01 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
2 1999-01-01 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
3 2000-01-01 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
4 2001-01-01 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0
df.dtypes
year             datetime64[ns]
Africa                  float64
Asia                    float64
Europe                  float64
North America           float64
Oceania                 float64
South America           float64
Unknown                 float64
dtype: object
df.set_index('year', inplace=True)
df.head()
Africa Asia Europe North America Oceania South America Unknown
year
1997-01-01 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998-01-01 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
1999-01-01 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
2000-01-01 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
2001-01-01 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0
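The third step is graphing it again. A minimal sketch (the re-plotted figure isn’t reproduced here):

# With a datetime index, .plot() gets a readable time axis,
# and 'year' no longer shows up as a line of its own
df.plot()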

How can we compare the change in visas between different continents when their populations are so different?

Tough comparisons made easy with percent change

If we’re looking at one set of values in the billions and another in the thousands, it’s very tough to see them both on the same graphic.

Instead of looking at raw values, a better way of thinking about this is percent change. If one thing goes up 10% and the other goes up 50%, we can make a simple comparison!

You can do this in pandas by selecting a column and asking for .pct_change().
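Under the hood, pct_change is just (new - old) / old, returned as a fraction rather than a percent. A quick check by hand, using the first two Oceania values from the data above:

# Oceania went from 92818 visas in 1997 to 94106 in 1998
(94106.0 - 92818.0) / 92818.0  # about 0.013877, a 1.4% increase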

df['Oceania'].pct_change()
year
1997-01-01          NaN
1998-01-01    0.013877
1999-01-01    0.117144
2000-01-01    0.005136
2001-01-01    0.050648
2002-01-01   -0.154312
2003-01-01   -0.028672
2004-01-01    0.079739
2005-01-01   -0.031705
2006-01-01    0.018438
2007-01-01    0.029864
2008-01-01    0.061716
2009-01-01   -0.129066
2010-01-01    0.033242
2011-01-01    0.029451
2012-01-01    0.009191
2013-01-01    0.066508
2014-01-01    0.069199
2015-01-01    0.077089
2016-01-01    0.136811
Name: Oceania, dtype: float64
It’s missing the first value because it wasn’t changing from anything before! You can also easily plot the results.

df['Oceania'].pct_change().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1086f63c8>

[png: Oceania percent change over time]

And because everything is now measured as percent change rather than raw counts, continents of very different sizes can share the same axes.

import matplotlib.pyplot as plt
plt.style.use('ggplot')
df['Oceania'].pct_change().plot()
df['Asia'].pct_change().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x108901668>

[png: Oceania and Asia percent change on the same axes]

Friendly, friendly data

If we look at our data, every column is a category we’d like to calculate percent change on. We can actually use df.pct_change() to do it for every single column at once.

df.pct_change().head()
Africa Asia Europe North America Oceania South America Unknown
year
1997-01-01 NaN NaN NaN NaN NaN NaN NaN
1998-01-01 0.063100 -0.143139 -0.040213 0.268923 0.013877 0.005183 -0.028842
1999-01-01 0.055109 0.038919 -0.015658 0.357353 0.117144 -0.132360 0.064419
2000-01-01 0.092892 0.154722 0.012378 0.417448 0.005136 -0.122526 0.151590
2001-01-01 0.065229 -0.006005 -0.027624 0.183268 0.050648 0.047057 0.061225

Let’s graph it, and use figsize to make it big enough that the legend doesn’t get in the way of everything.

df.pct_change().plot(figsize=(20, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x1090a99e8>

[png: percent change for every continent on one large chart]

Long vs. wide data

Using pct_change is easy when each subject we’re tracking has its own column. Other calculations get more difficult in this shape, though: different analysis and graphing approaches require different formats of data.

The current format - where a column is a list of observations - is called wide format. The more subjects we have, the more columns we have.

wide_df = pd.read_csv("data/visas-by-continent.csv")
wide_df.set_index('year', inplace=True)
wide_df.head()
Africa Asia Europe North America Oceania South America Unknown
year
1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
1999 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
2000 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
2001 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0
Wide data makes some calculations easy, too: summing across each row gives us the total number of visas for each year.

df.sum(axis=1)
year
1997-01-01    17826183.00
1998-01-01    17442459.00
1999-01-01    18577434.00
2000-01-01    21424908.00
2001-01-01    22766334.00
2002-01-01    17308311.00
2003-01-01    14644902.00
2004-01-01    15147297.00
2005-01-01    16166853.00
2006-01-01    17510190.00
2007-01-01    19332855.00
2008-01-01    19809228.00
2009-01-01    17412546.00
2010-01-01    19268253.00
2011-01-01    22523817.00
2012-01-01    26781270.00
2013-01-01    27493047.00
2014-01-01    29797440.00
2015-01-01    32675235.72
2016-01-01    31144473.00
dtype: float64

The other option is called long data, where every row is a single observation tagged with a categorical variable.

long_df = pd.read_csv("data/visas-by-continent-long.csv")
long_df.head()
year continent visas
0 1997 Africa 534420.0
1 1998 Africa 568142.0
2 1999 Africa 599452.0
3 2000 Africa 655136.0
4 2001 Africa 697870.0
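Long data makes grouped calculations straightforward. For instance, the per-year totals from earlier can be rebuilt with a groupby (a quick sketch, assuming long_df is loaded as above):

# Same totals as df.sum(axis=1), computed from the long format
long_df.groupby('year')['visas'].sum()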

Converting long data to wide data

You can convert from long data to wide data using .pivot.

# Reminder of what long data looks like
long_df.head(5)
year continent visas
0 1997 Africa 534420.0
1 1998 Africa 568142.0
2 1999 Africa 599452.0
3 2000 Africa 655136.0
4 2001 Africa 697870.0
# Turn the year into an index
# And make columns out of all of the values in 'continent'
# and turn the visas column into the cells
long_df.pivot(index='year', columns='continent', values='visas').head()
continent Africa Asia Europe North America Oceania South America Unknown
year
1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
1999 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
2000 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
2001 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0

Converting wide data to long data

You can convert from wide data to long data using .melt.

# Let's remind ourselves what it looks like
wide_df.head(2)
Africa Asia Europe North America Oceania South America Unknown
year
1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
# Bring the year out as a real column again
wide_df.head(2).reset_index()
year Africa Asia Europe North America Oceania South America Unknown
0 1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1 1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
# Turn the column names into a 'continent' column,
# put the cell values into a 'visas' column,
# and keep 'year' as the row identifier
wide_df.reset_index().melt(var_name='continent', value_name='visas', id_vars=['year'])
year continent visas
0 1997 Africa 534420.00
1 1998 Africa 568142.00
2 1999 Africa 599452.00
3 2000 Africa 655136.00
4 2001 Africa 697870.00
5 2002 Africa 498682.00
6 2003 Africa 478112.00
7 2004 Africa 501824.00
8 2005 Africa 505656.00
9 2006 Africa 509844.00
10 2007 Africa 553236.00
11 2008 Africa 624568.00
12 2009 Africa 582636.00
13 2010 Africa 607702.00
14 2011 Africa 651340.00
15 2012 Africa 774964.00
16 2013 Africa 807014.00
17 2014 Africa 939464.00
18 2015 Africa 987590.00
19 2016 Africa 1102308.00
20 1997 Asia 4779368.00
21 1998 Asia 4095252.00
22 1999 Asia 4254634.00
23 2000 Asia 4912918.00
24 2001 Asia 4883414.00
25 2002 Asia 3709890.00
26 2003 Asia 3409702.00
27 2004 Asia 3679070.00
28 2005 Asia 4059790.00
29 2006 Asia 4491056.00
... ... ... ...
110 2007 South America 1972496.00
111 2008 South America 2205672.00
112 2009 South America 2312754.00
113 2010 South America 2515038.00
114 2011 South America 3397106.00
115 2012 South America 4353292.00
116 2013 South America 4541802.00
117 2014 South America 4686358.00
118 2015 South America 4345696.00
119 2016 South America 3515940.00
120 1997 Unknown 6034989.00
121 1998 Unknown 5860927.00
122 1999 Unknown 6238484.00
123 2000 Unknown 7184174.00
124 2001 Unknown 7624022.00
125 2002 Unknown 5796777.00
126 2003 Unknown 4897012.00
127 2004 Unknown 5057073.00
128 2005 Unknown 5395153.00
129 2006 Unknown 5842172.00
130 2007 Unknown 6450295.00
131 2008 Unknown 6609864.00
132 2009 Unknown 5809512.00
133 2010 Unknown 6428425.00
134 2011 Unknown 7514001.00
135 2012 Unknown 8933756.00
136 2013 Unknown 9172705.00
137 2014 Unknown 9941932.00
138 2015 Unknown 10898813.24
139 2016 Unknown 10388767.00

140 rows × 3 columns
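
As a sanity check, .melt and .pivot undo each other. A minimal sketch, assuming wide_df from above:

# Melt to long, pivot back to wide: we should end up where we started
roundtrip = wide_df.reset_index().melt(
    var_name='continent', value_name='visas', id_vars=['year']
).pivot(index='year', columns='continent', values='visas')
roundtrip.equals(wide_df)  # should be True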