Processing change over time with pandas
Sometimes data is easy to graph and communicate, and sometimes it takes a few
extra steps. Let’s take a look at visas-by-continent.csv
in the data/
folder.
Open it without moving it.
year | Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|---|
0 | 1997 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1 | 1998 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
2 | 1999 | 599452.0 | 4254634.0 | 2163808.0 | 3332712.0 | 105130.0 | 1883214.0 | 6238484.0 |
3 | 2000 | 655136.0 | 4912918.0 | 2190592.0 | 4723946.0 | 105670.0 | 1652472.0 | 7184174.0 |
4 | 2001 | 697870.0 | 4883414.0 | 2130078.0 | 5589696.0 | 111022.0 | 1730232.0 | 7624022.0 |
Checking our data
- How many columns and rows does it have?
- What is the data type of each column?
year int64
Africa float64
Asia float64
Europe float64
North America float64
Oceania float64
South America float64
Unknown float64
dtype: object
year | Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|---|
0 | 1997-01-01 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1 | 1998-01-01 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
2 | 1999-01-01 | 599452.0 | 4254634.0 | 2163808.0 | 3332712.0 | 105130.0 | 1883214.0 | 6238484.0 |
3 | 2000-01-01 | 655136.0 | 4912918.0 | 2190592.0 | 4723946.0 | 105670.0 | 1652472.0 | 7184174.0 |
4 | 2001-01-01 | 697870.0 | 4883414.0 | 2130078.0 | 5589696.0 | 111022.0 | 1730232.0 | 7624022.0 |
year datetime64[ns]
Africa float64
Asia float64
Europe float64
North America float64
Oceania float64
South America float64
Unknown float64
dtype: object
Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|
year | |||||||
1997-01-01 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1998-01-01 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
1999-01-01 | 599452.0 | 4254634.0 | 2163808.0 | 3332712.0 | 105130.0 | 1883214.0 | 6238484.0 |
2000-01-01 | 655136.0 | 4912918.0 | 2190592.0 | 4723946.0 | 105670.0 | 1652472.0 | 7184174.0 |
2001-01-01 | 697870.0 | 4883414.0 | 2130078.0 | 5589696.0 | 111022.0 | 1730232.0 | 7624022.0 |
Making a simple sample plot
I worked on this data a bit for you, so it’s almost very easy to graph. Just
take the dataframe and .plot()
it.
<matplotlib.axes._subplots.AxesSubplot at 0x10835ada0>
That x axis looks terrible, and “year” isn’t really a country. Let’s fix that.
Improving our data to improve our plot
We can improve the x axis and get rid of the ‘year’ column if we switch to a datetime index. This requires the following steps:
- Convert the year into a datetime column,
- Set it as our index,
- and Graph it again
Be sure to specify the format when converting to a date.
How can we compare the change in visas between different continents when their populations are so much different?
Tough comparisons made easy with percent change
If we’re looking at one set of values that’s in the billions and another in the thousands, it’s very tough to see on a graphic.
Instead of looking at raw values, a better way of thinking about this is percent change. If one thing goes up 10% and the other goes up 50%, we can make a simple comparison!
You can do this in pandas by selecting a column and asking for .pct_change()
.
year
1997-01-01 NaN
1998-01-01 1.387662
1999-01-01 11.714450
2000-01-01 0.513650
2001-01-01 5.064824
2002-01-01 -15.431176
2003-01-01 -2.867185
2004-01-01 7.973859
2005-01-01 -3.170509
2006-01-01 1.843772
2007-01-01 2.986427
2008-01-01 6.171630
2009-01-01 -12.906628
2010-01-01 3.324178
2011-01-01 2.945117
2012-01-01 0.919054
2013-01-01 6.650817
2014-01-01 6.919938
2015-01-01 7.708889
2016-01-01 13.681102
Name: Oceania, dtype: float64
<matplotlib.axes._subplots.AxesSubplot at 0x1086f63c8>
It’s missing the first value because it wasn’t changing from anything before! You can also easily plot the results.
<matplotlib.axes._subplots.AxesSubplot at 0x108901668>
Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|
year | |||||||
1997-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1998-01-01 | 0.063100 | -0.143139 | -0.040213 | 0.268923 | 0.013877 | 0.005183 | -0.028842 |
1999-01-01 | 0.055109 | 0.038919 | -0.015658 | 0.357353 | 0.117144 | -0.132360 | 0.064419 |
2000-01-01 | 0.092892 | 0.154722 | 0.012378 | 0.417448 | 0.005136 | -0.122526 | 0.151590 |
2001-01-01 | 0.065229 | -0.006005 | -0.027624 | 0.183268 | 0.050648 | 0.047057 | 0.061225 |
<matplotlib.axes._subplots.AxesSubplot at 0x1090a99e8>
Friendly, friendly data
If we look at our data, every column is a category we’d like to calculate
percent change on. We can actually use df.pct_change()
to do it for every
single column.
Let’s graph it, and use figsize
to make it big enough that the legend doesn’t
get in the way of everything.
Long vs. wide data
Using pct_change
is easy when we have our set of data as a column. This makes
a lot of other calculations more difficult, though. Different analysis and
graphing approaches require different formats of data.
The current format - where a column is a list of observations - is called wide format. The more subjects we have, the more columns we have.
Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|
year | |||||||
1997 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1998 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
1999 | 599452.0 | 4254634.0 | 2163808.0 | 3332712.0 | 105130.0 | 1883214.0 | 6238484.0 |
2000 | 655136.0 | 4912918.0 | 2190592.0 | 4723946.0 | 105670.0 | 1652472.0 | 7184174.0 |
2001 | 697870.0 | 4883414.0 | 2130078.0 | 5589696.0 | 111022.0 | 1730232.0 | 7624022.0 |
year
1997-01-01 17826183.00
1998-01-01 17442459.00
1999-01-01 18577434.00
2000-01-01 21424908.00
2001-01-01 22766334.00
2002-01-01 17308311.00
2003-01-01 14644902.00
2004-01-01 15147297.00
2005-01-01 16166853.00
2006-01-01 17510190.00
2007-01-01 19332855.00
2008-01-01 19809228.00
2009-01-01 17412546.00
2010-01-01 19268253.00
2011-01-01 22523817.00
2012-01-01 26781270.00
2013-01-01 27493047.00
2014-01-01 29797440.00
2015-01-01 32675235.72
2016-01-01 31144473.00
dtype: float64
The other option is called long data, where every row is an observation with a category variable.
year | continent | visas | |
---|---|---|---|
0 | 1997 | Africa | 534420.0 |
1 | 1998 | Africa | 568142.0 |
2 | 1999 | Africa | 599452.0 |
3 | 2000 | Africa | 655136.0 |
4 | 2001 | Africa | 697870.0 |
Converting long data to wide data
You can convert from long data to wide data using .pivot
year | continent | visas | |
---|---|---|---|
0 | 1997 | Africa | 534420.0 |
1 | 1998 | Africa | 568142.0 |
2 | 1999 | Africa | 599452.0 |
3 | 2000 | Africa | 655136.0 |
4 | 2001 | Africa | 697870.0 |
continent | Africa | Asia | Europe | North America | Oceania | South America | Unknown |
---|---|---|---|---|---|---|---|
year | |||||||
1997 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1998 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
1999 | 599452.0 | 4254634.0 | 2163808.0 | 3332712.0 | 105130.0 | 1883214.0 | 6238484.0 |
2000 | 655136.0 | 4912918.0 | 2190592.0 | 4723946.0 | 105670.0 | 1652472.0 | 7184174.0 |
2001 | 697870.0 | 4883414.0 | 2130078.0 | 5589696.0 | 111022.0 | 1730232.0 | 7624022.0 |
Converting wide data to long data
You can convert from wide data to long data using .melt
Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|
year | |||||||
1997 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1998 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
year | Africa | Asia | Europe | North America | Oceania | South America | Unknown | |
---|---|---|---|---|---|---|---|---|
0 | 1997 | 534420.0 | 4779368.0 | 2290328.0 | 1934950.0 | 92818.0 | 2159310.0 | 6034989.0 |
1 | 1998 | 568142.0 | 4095252.0 | 2198228.0 | 2455302.0 | 94106.0 | 2170502.0 | 5860927.0 |
year | continent | visas | |
---|---|---|---|
0 | 1997 | Africa | 534420.00 |
1 | 1998 | Africa | 568142.00 |
2 | 1999 | Africa | 599452.00 |
3 | 2000 | Africa | 655136.00 |
4 | 2001 | Africa | 697870.00 |
5 | 2002 | Africa | 498682.00 |
6 | 2003 | Africa | 478112.00 |
7 | 2004 | Africa | 501824.00 |
8 | 2005 | Africa | 505656.00 |
9 | 2006 | Africa | 509844.00 |
10 | 2007 | Africa | 553236.00 |
11 | 2008 | Africa | 624568.00 |
12 | 2009 | Africa | 582636.00 |
13 | 2010 | Africa | 607702.00 |
14 | 2011 | Africa | 651340.00 |
15 | 2012 | Africa | 774964.00 |
16 | 2013 | Africa | 807014.00 |
17 | 2014 | Africa | 939464.00 |
18 | 2015 | Africa | 987590.00 |
19 | 2016 | Africa | 1102308.00 |
20 | 1997 | Asia | 4779368.00 |
21 | 1998 | Asia | 4095252.00 |
22 | 1999 | Asia | 4254634.00 |
23 | 2000 | Asia | 4912918.00 |
24 | 2001 | Asia | 4883414.00 |
25 | 2002 | Asia | 3709890.00 |
26 | 2003 | Asia | 3409702.00 |
27 | 2004 | Asia | 3679070.00 |
28 | 2005 | Asia | 4059790.00 |
29 | 2006 | Asia | 4491056.00 |
... | ... | ... | ... |
110 | 2007 | South America | 1972496.00 |
111 | 2008 | South America | 2205672.00 |
112 | 2009 | South America | 2312754.00 |
113 | 2010 | South America | 2515038.00 |
114 | 2011 | South America | 3397106.00 |
115 | 2012 | South America | 4353292.00 |
116 | 2013 | South America | 4541802.00 |
117 | 2014 | South America | 4686358.00 |
118 | 2015 | South America | 4345696.00 |
119 | 2016 | South America | 3515940.00 |
120 | 1997 | Unknown | 6034989.00 |
121 | 1998 | Unknown | 5860927.00 |
122 | 1999 | Unknown | 6238484.00 |
123 | 2000 | Unknown | 7184174.00 |
124 | 2001 | Unknown | 7624022.00 |
125 | 2002 | Unknown | 5796777.00 |
126 | 2003 | Unknown | 4897012.00 |
127 | 2004 | Unknown | 5057073.00 |
128 | 2005 | Unknown | 5395153.00 |
129 | 2006 | Unknown | 5842172.00 |
130 | 2007 | Unknown | 6450295.00 |
131 | 2008 | Unknown | 6609864.00 |
132 | 2009 | Unknown | 5809512.00 |
133 | 2010 | Unknown | 6428425.00 |
134 | 2011 | Unknown | 7514001.00 |
135 | 2012 | Unknown | 8933756.00 |
136 | 2013 | Unknown | 9172705.00 |
137 | 2014 | Unknown | 9941932.00 |
138 | 2015 | Unknown | 10898813.24 |
139 | 2016 | Unknown | 10388767.00 |
140 rows × 3 columns