Processing change over time with pandas

Sometimes data is easy to graph and communicate, and sometimes it takes a few extra steps. Let’s take a look at visas-by-continent.csv in the data/ folder.

Open it from the data/ folder without moving it.

%matplotlib inline
import pandas as pd
df = pd.read_csv("data/visas-by-continent.csv")
df.head()
year Africa Asia Europe North America Oceania South America Unknown
0 1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1 1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
2 1999 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
3 2000 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
4 2001 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0

Checking our data

  • How many columns and rows does it have?
  • What is the data type of each column?
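The second question gets answered below with .dtypes. For the first, .shape is the quickest check (a minimal sketch, not part of the original notebook):

# .shape returns (rows, columns); for this file it should be (20, 8)
df.shape
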
df.dtypes
year               int64
Africa           float64
Asia             float64
Europe           float64
North America    float64
Oceania          float64
South America    float64
Unknown          float64
dtype: object
Making a simple sample plot

I worked on this data a bit for you, so it’s very easy to graph. Just take the dataframe and .plot() it.

df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10835ada0>

[png: a line for every column, including year, on one crowded chart]

That x axis looks terrible, and “year” isn’t really a continent. Let’s fix that.

Improving our data to improve our plot

We can improve the x axis and get rid of the ‘year’ column if we switch to a datetime index. This requires the following steps:

  • Convert the year into a datetime column,
  • set it as our index,
  • and graph it again (sketched below).

Be sure to specify the format when converting to a date.

df['year'] = pd.to_datetime(df['year'], format="%Y")
df.head()
year Africa Asia Europe North America Oceania South America Unknown
0 1997-01-01 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1 1998-01-01 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
2 1999-01-01 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
3 2000-01-01 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
4 2001-01-01 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0
df.dtypes
year             datetime64[ns]
Africa                  float64
Asia                    float64
Europe                  float64
North America           float64
Oceania                 float64
South America           float64
Unknown                 float64
dtype: object
df.set_index('year', inplace=True)
df.head()
Africa Asia Europe North America Oceania South America Unknown
year
1997-01-01 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998-01-01 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
1999-01-01 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
2000-01-01 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
2001-01-01 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0
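The third step is graphing it again. A minimal sketch (the re-plotted figure isn’t reproduced here):

# With a datetime index, .plot() gets a readable time axis,
# and 'year' no longer shows up as a line of its own
df.plot()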

How can we compare the change in visas between different continents when their populations are so different?

Tough comparisons made easy with percent change

If we’re looking at one set of values in the billions and another in the thousands, it’s very tough to see them both on the same graphic.

Instead of looking at raw values, a better way of thinking about this is percent change. If one thing goes up 10% and the other goes up 50%, we can make a simple comparison!

You can do this in pandas by selecting a column and asking for .pct_change().
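Under the hood, pct_change is just (new - old) / old, returned as a fraction rather than a percent. A quick check by hand, using the first two Oceania values from the data above:

# Oceania went from 92818 visas in 1997 to 94106 in 1998
(94106.0 - 92818.0) / 92818.0  # about 0.013877, a 1.4% increase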

df['Oceania'].pct_change()
year
1997-01-01          NaN
1998-01-01    0.013877
1999-01-01    0.117144
2000-01-01    0.005136
2001-01-01    0.050648
2002-01-01   -0.154312
2003-01-01   -0.028672
2004-01-01    0.079739
2005-01-01   -0.031705
2006-01-01    0.018438
2007-01-01    0.029864
2008-01-01    0.061716
2009-01-01   -0.129066
2010-01-01    0.033242
2011-01-01    0.029451
2012-01-01    0.009191
2013-01-01    0.066508
2014-01-01    0.069199
2015-01-01    0.077089
2016-01-01    0.136811
Name: Oceania, dtype: float64
It’s missing the first value because it wasn’t changing from anything before! You can also easily plot the results.

df['Oceania'].pct_change().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1086f63c8>

[png: Oceania percent change over time]

And because everything is now measured as percent change rather than raw counts, continents of very different sizes can share the same axes.

import matplotlib.pyplot as plt
plt.style.use('ggplot')
df['Oceania'].pct_change().plot()
df['Asia'].pct_change().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x108901668>

[png: Oceania and Asia percent change on the same axes]

Friendly, friendly data

If we look at our data, every column is a category we’d like to calculate percent change on. We can actually use df.pct_change() to do it for every single column at once.

df.pct_change().head()
Africa Asia Europe North America Oceania South America Unknown
year
1997-01-01 NaN NaN NaN NaN NaN NaN NaN
1998-01-01 0.063100 -0.143139 -0.040213 0.268923 0.013877 0.005183 -0.028842
1999-01-01 0.055109 0.038919 -0.015658 0.357353 0.117144 -0.132360 0.064419
2000-01-01 0.092892 0.154722 0.012378 0.417448 0.005136 -0.122526 0.151590
2001-01-01 0.065229 -0.006005 -0.027624 0.183268 0.050648 0.047057 0.061225

Let’s graph it, and use figsize to make it big enough that the legend doesn’t get in the way of everything.

df.pct_change().plot(figsize=(20, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x1090a99e8>

[png: percent change for every continent on one large chart]

Long vs. wide data

Using pct_change is easy when each subject we’re tracking has its own column. Other calculations get more difficult in this shape, though: different analysis and graphing approaches require different formats of data.

The current format - where a column is a list of observations - is called wide format. The more subjects we have, the more columns we have.

wide_df = pd.read_csv("data/visas-by-continent.csv")
wide_df.set_index('year', inplace=True)
wide_df.head()
Africa Asia Europe North America Oceania South America Unknown
year
1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
1999 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
2000 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
2001 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0
Wide data makes some calculations easy, too: summing across each row gives us the total number of visas for each year.

df.sum(axis=1)
year
1997-01-01    17826183.00
1998-01-01    17442459.00
1999-01-01    18577434.00
2000-01-01    21424908.00
2001-01-01    22766334.00
2002-01-01    17308311.00
2003-01-01    14644902.00
2004-01-01    15147297.00
2005-01-01    16166853.00
2006-01-01    17510190.00
2007-01-01    19332855.00
2008-01-01    19809228.00
2009-01-01    17412546.00
2010-01-01    19268253.00
2011-01-01    22523817.00
2012-01-01    26781270.00
2013-01-01    27493047.00
2014-01-01    29797440.00
2015-01-01    32675235.72
2016-01-01    31144473.00
dtype: float64

The other option is called long data, where every row is a single observation tagged with a categorical variable.

long_df = pd.read_csv("data/visas-by-continent-long.csv")
long_df.head()
year continent visas
0 1997 Africa 534420.0
1 1998 Africa 568142.0
2 1999 Africa 599452.0
3 2000 Africa 655136.0
4 2001 Africa 697870.0
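Long data makes grouped calculations straightforward. For instance, the per-year totals from earlier can be rebuilt with a groupby (a quick sketch, assuming long_df is loaded as above):

# Same totals as df.sum(axis=1), computed from the long format
long_df.groupby('year')['visas'].sum()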

Converting long data to wide data

You can convert from long data to wide data using .pivot.

# Reminder of what long data looks like
long_df.head(5)
year continent visas
0 1997 Africa 534420.0
1 1998 Africa 568142.0
2 1999 Africa 599452.0
3 2000 Africa 655136.0
4 2001 Africa 697870.0
# Turn the year into an index
# And make columns out of all of the values in 'continent'
# and turn the visas column into the cells
long_df.pivot(index='year', columns='continent', values='visas').head()
continent Africa Asia Europe North America Oceania South America Unknown
year
1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
1999 599452.0 4254634.0 2163808.0 3332712.0 105130.0 1883214.0 6238484.0
2000 655136.0 4912918.0 2190592.0 4723946.0 105670.0 1652472.0 7184174.0
2001 697870.0 4883414.0 2130078.0 5589696.0 111022.0 1730232.0 7624022.0

Converting wide data to long data

You can convert from wide data to long data using .melt.

# Let's remind ourselves what it looks like
wide_df.head(2)
Africa Asia Europe North America Oceania South America Unknown
year
1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
# Bring the year out as a real column again
wide_df.head(2).reset_index()
year Africa Asia Europe North America Oceania South America Unknown
0 1997 534420.0 4779368.0 2290328.0 1934950.0 92818.0 2159310.0 6034989.0
1 1998 568142.0 4095252.0 2198228.0 2455302.0 94106.0 2170502.0 5860927.0
# Turn the column names into a 'continent' column,
# put the cell values into a 'visas' column,
# and keep 'year' as the row identifier
wide_df.reset_index().melt(var_name='continent', value_name='visas', id_vars=['year'])
year continent visas
0 1997 Africa 534420.00
1 1998 Africa 568142.00
2 1999 Africa 599452.00
3 2000 Africa 655136.00
4 2001 Africa 697870.00
5 2002 Africa 498682.00
6 2003 Africa 478112.00
7 2004 Africa 501824.00
8 2005 Africa 505656.00
9 2006 Africa 509844.00
10 2007 Africa 553236.00
11 2008 Africa 624568.00
12 2009 Africa 582636.00
13 2010 Africa 607702.00
14 2011 Africa 651340.00
15 2012 Africa 774964.00
16 2013 Africa 807014.00
17 2014 Africa 939464.00
18 2015 Africa 987590.00
19 2016 Africa 1102308.00
20 1997 Asia 4779368.00
21 1998 Asia 4095252.00
22 1999 Asia 4254634.00
23 2000 Asia 4912918.00
24 2001 Asia 4883414.00
25 2002 Asia 3709890.00
26 2003 Asia 3409702.00
27 2004 Asia 3679070.00
28 2005 Asia 4059790.00
29 2006 Asia 4491056.00
... ... ... ...
110 2007 South America 1972496.00
111 2008 South America 2205672.00
112 2009 South America 2312754.00
113 2010 South America 2515038.00
114 2011 South America 3397106.00
115 2012 South America 4353292.00
116 2013 South America 4541802.00
117 2014 South America 4686358.00
118 2015 South America 4345696.00
119 2016 South America 3515940.00
120 1997 Unknown 6034989.00
121 1998 Unknown 5860927.00
122 1999 Unknown 6238484.00
123 2000 Unknown 7184174.00
124 2001 Unknown 7624022.00
125 2002 Unknown 5796777.00
126 2003 Unknown 4897012.00
127 2004 Unknown 5057073.00
128 2005 Unknown 5395153.00
129 2006 Unknown 5842172.00
130 2007 Unknown 6450295.00
131 2008 Unknown 6609864.00
132 2009 Unknown 5809512.00
133 2010 Unknown 6428425.00
134 2011 Unknown 7514001.00
135 2012 Unknown 8933756.00
136 2013 Unknown 9172705.00
137 2014 Unknown 9941932.00
138 2015 Unknown 10898813.24
139 2016 Unknown 10388767.00

140 rows × 3 columns
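
As a sanity check, .melt and .pivot undo each other. A minimal sketch, assuming wide_df from above:

# Melt to long, pivot back to wide: we should end up where we started
roundtrip = wide_df.reset_index().melt(
    var_name='continent', value_name='visas', id_vars=['year']
).pivot(index='year', columns='continent', values='visas')
roundtrip.equals(wide_df)  # should be True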