%matplotlib inline
import pandas as pd

Processing time with pandas

Let’s open up some data from the Census bureau - we’re going to use New Home Sales. The data is formatted… oddly, so I’ve cleaned it up for you as home-sales.csv inside of the data folder.

Open it without moving it. Tab autocomplete will help you.

df = pd.read_csv("data/home-sales.csv", usecols=['val','per_name'])
df.head()
val per_name
0 42.0 1963-01-01
1 35.0 1963-02-01
2 44.0 1963-03-01
3 52.0 1963-04-01
4 58.0 1963-05-01
pd.set_option("display.max_columns", 100)
pd.options.display.max_columns
100
df.tail()
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
636 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-01-01
637 0 45.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-02-01
638 0 49.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-03-01
639 0 57.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-04-01
640 0 51.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-05-01
df.plot(x='per_name', y='val')
<matplotlib.axes._subplots.AxesSubplot at 0x105de97b8>

png

df.plot(y='val')
<matplotlib.axes._subplots.AxesSubplot at 0x105d4a2b0>

png

Creating a datetime column

df.per_name.head()
0    1963-01-01
1    1963-02-01
2    1963-03-01
3    1963-04-01
4    1963-05-01
Name: per_name, dtype: object
%%time
pd.to_datetime(df.per_name).head()
CPU times: user 963 µs, sys: 0 ns, total: 963 µs
Wall time: 978 µs





0   1963-01-01
1   1963-02-01
2   1963-03-01
3   1963-04-01
4   1963-05-01
Name: per_name, dtype: datetime64[ns]
%%time
pd.to_datetime(df.per_name, format="%Y-%m-%d").head()
CPU times: user 1.46 ms, sys: 2 µs, total: 1.46 ms
Wall time: 1.48 ms





0   1963-01-01
1   1963-02-01
2   1963-03-01
3   1963-04-01
4   1963-05-01
Name: per_name, dtype: datetime64[ns]
df['date'] = pd.to_datetime(df.per_name, format="%Y-%m-%d")
df.head()
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name date
0 0 42.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-01-01 1963-01-01
1 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-02-01 1963-02-01
2 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-03-01 1963-03-01
3 0 52.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-04-01 1963-04-01
4 0 58.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-05-01 1963-05-01
df.dtypes
is_adj                 int64
val                  float64
cat_code              object
cat_desc              object
cat_indent             int64
dt_code               object
dt_desc               object
dt_unit               object
geo_code              object
geo_desc              object
per_name              object
date          datetime64[ns]
dtype: object

Changing the index to the datetime

Normally the index of the column is just a number.

df.set_index('date', inplace=True)
df.head()
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
1963-01-01 0 42.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-01-01
1963-02-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-02-01
1963-03-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-03-01
1963-04-01 0 52.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-04-01
1963-05-01 0 58.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-05-01
df.head()
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name date
0 0 42.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-01-01 1963-01-01
1 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-02-01 1963-02-01
2 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-03-01 1963-03-01
3 0 52.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-04-01 1963-04-01
4 0 58.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-05-01 1963-05-01

It’s the column on the far left - 0, 1, 2, 3, 4… boring and useless! If we use .set_index to replace the index with the datetime, though, we can start to have some fun

Be sure you use inplace=True or else it won’t save the new index!

Selecting specific(-ish) dates via the index

Now that our index is a datetime, we can select date ranges super super easily.

Selecting by year

df['1975']
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
1975-01-01 0 29.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-01-01
1975-02-01 0 34.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-02-01
1975-03-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-03-01
1975-04-01 0 54.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-04-01
1975-05-01 0 57.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-05-01
1975-06-01 0 51.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-06-01
1975-07-01 0 51.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-07-01
1975-08-01 0 53.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-08-01
1975-09-01 0 46.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-09-01
1975-10-01 0 46.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-10-01
1975-11-01 0 46.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-11-01
1975-12-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1975-12-01

List slices with datetimes

We can also use list slicing with datetimes! Usually we would say things like `df10

Just for review, you can use : to only select certain parts of a list:

# Make our list of fruits
ranked_fruits = ('banana', 'orange', 'apple', 'blueberries', 'strawberries')
ranked_fruits
('banana', 'orange', 'apple', 'blueberries', 'strawberries')
# Start from the beginning, get the first two
ranked_fruits[:2]
('banana', 'orange')
# Start from two, get up until the fourth element
ranked_fruits[1:4]
('orange', 'apple', 'blueberries')
# Starting from the third element, get all the rest
ranked_fruits[3:]
('blueberries', 'strawberries')

Instead of using boring ol’ numbers, we can use dates instead.

Getting rows after a certain date

df["1970":]
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
1970-01-01 0 34.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-01-01
1970-02-01 0 29.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-02-01
1970-03-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-03-01
1970-04-01 0 42.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-04-01
1970-05-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-05-01
1970-06-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-06-01
1970-07-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-07-01
1970-08-01 0 48.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-08-01
1970-09-01 0 45.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-09-01
1970-10-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-10-01
1970-11-01 0 40.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-11-01
1970-12-01 0 37.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-12-01
1971-01-01 0 45.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-01-01
1971-02-01 0 49.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-02-01
1971-03-01 0 62.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-03-01
1971-04-01 0 62.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-04-01
1971-05-01 0 58.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-05-01
1971-06-01 0 59.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-06-01
1971-07-01 0 64.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-07-01
1971-08-01 0 62.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-08-01
1971-09-01 0 50.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-09-01
1971-10-01 0 52.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-10-01
1971-11-01 0 50.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-11-01
1971-12-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-12-01
1972-01-01 0 51.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-01-01
1972-02-01 0 56.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-02-01
1972-03-01 0 60.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-03-01
1972-04-01 0 65.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-04-01
1972-05-01 0 64.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-05-01
1972-06-01 0 63.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-06-01
... ... ... ... ... ... ... ... ... ... ... ...
2013-12-01 0 31.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2013-12-01
2014-01-01 0 33.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-01-01
2014-02-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-02-01
2014-03-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-03-01
2014-04-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-04-01
2014-05-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-05-01
2014-06-01 0 38.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-06-01
2014-07-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-07-01
2014-08-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-08-01
2014-09-01 0 37.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-09-01
2014-10-01 0 38.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-10-01
2014-11-01 0 31.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-11-01
2014-12-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-12-01
2015-01-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-01-01
2015-02-01 0 45.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-02-01
2015-03-01 0 46.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-03-01
2015-04-01 0 48.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-04-01
2015-05-01 0 47.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-05-01
2015-06-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-06-01
2015-07-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-07-01
2015-08-01 0 41.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-08-01
2015-09-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-09-01
2015-10-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-10-01
2015-11-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-11-01
2015-12-01 0 38.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-12-01
2016-01-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-01-01
2016-02-01 0 45.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-02-01
2016-03-01 0 49.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-03-01
2016-04-01 0 57.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-04-01
2016-05-01 0 51.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2016-05-01

557 rows × 11 columns

Getting rows between a certain date

# Between Feb of 1973
# and July of 1975
df["1970-02":"1975-07"].head()
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
1970-02-01 0 29.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-02-01
1970-03-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-03-01
1970-04-01 0 42.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-04-01
1970-05-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-05-01
1970-06-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-06-01

Count the number of houses sold in the 70’s and count the number sold in the

80’s

# "val" is thousands of houses sold
# so say, give me everything in the 70's
# and then grab the 'val' column
# and sum it up
df["1970":"1979"].val.sum()
6557.0
# We'll do the same thing for the 80's!
df["1980":"1989"].val.sum()
6088.0

How do we select every single February?

df.per_name.str.contains("-02-").head()
date
1963-01-01    False
1963-02-01     True
1963-03-01    False
1963-04-01    False
1963-05-01    False
Name: per_name, dtype: bool
df[df.index.month.isin([6,7,8])]
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
1963-06-01 0 48.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-06-01
1963-07-01 0 62.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-07-01
1963-08-01 0 56.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-08-01
1964-06-01 0 53.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1964-06-01
1964-07-01 0 54.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1964-07-01
1964-08-01 0 56.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1964-08-01
1965-06-01 0 57.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1965-06-01
1965-07-01 0 51.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1965-07-01
1965-08-01 0 58.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1965-08-01
1966-06-01 0 40.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1966-06-01
1966-07-01 0 40.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1966-07-01
1966-08-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1966-08-01
1967-06-01 0 47.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1967-06-01
1967-07-01 0 46.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1967-07-01
1967-08-01 0 47.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1967-08-01
1968-06-01 0 41.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1968-06-01
1968-07-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1968-07-01
1968-08-01 0 47.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1968-08-01
1969-06-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1969-06-01
1969-07-01 0 39.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1969-07-01
1969-08-01 0 40.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1969-08-01
1970-06-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-06-01
1970-07-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-07-01
1970-08-01 0 48.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1970-08-01
1971-06-01 0 59.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-06-01
1971-07-01 0 64.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-07-01
1971-08-01 0 62.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1971-08-01
1972-06-01 0 63.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-06-01
1972-07-01 0 63.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-07-01
1972-08-01 0 72.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1972-08-01
... ... ... ... ... ... ... ... ... ... ... ...
2006-06-01 0 98.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2006-06-01
2006-07-01 0 83.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2006-07-01
2006-08-01 0 88.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2006-08-01
2007-06-01 0 73.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2007-06-01
2007-07-01 0 68.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2007-07-01
2007-08-01 0 60.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2007-08-01
2008-06-01 0 45.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2008-06-01
2008-07-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2008-07-01
2008-08-01 0 38.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2008-08-01
2009-06-01 0 37.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2009-06-01
2009-07-01 0 38.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2009-07-01
2009-08-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2009-08-01
2010-06-01 0 28.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2010-06-01
2010-07-01 0 26.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2010-07-01
2010-08-01 0 23.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2010-08-01
2011-06-01 0 28.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2011-06-01
2011-07-01 0 27.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2011-07-01
2011-08-01 0 25.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2011-08-01
2012-06-01 0 34.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2012-06-01
2012-07-01 0 33.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2012-07-01
2012-08-01 0 31.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2012-08-01
2013-06-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2013-06-01
2013-07-01 0 33.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2013-07-01
2013-08-01 0 31.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2013-08-01
2014-06-01 0 38.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-06-01
2014-07-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-07-01
2014-08-01 0 36.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2014-08-01
2015-06-01 0 44.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-06-01
2015-07-01 0 43.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-07-01
2015-08-01 0 41.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2015-08-01

159 rows × 11 columns

Info on our time series

If you try to .plot, pandas will automatically use the index (the date) as the x axis for you. This makes like perfect. because you don’t have to think about anything, and calculations automatically have a good axis.

df.plot(y='val')
<matplotlib.axes._subplots.AxesSubplot at 0x105d4ae80>

png

df["1990":"1999"].plot(y='val')
<matplotlib.axes._subplots.AxesSubplot at 0x10615b748>

png

Grouping with resample, not with groupby

Hmmm, looks like something bad might have happened to the housing industry t some point. Maybe we want to see some numbers instead of a graph? To do aggregate statistics on time series in pandas we use a method called .resample(), and we’re going to tell it to group the data by year.

When we tell it to group by year, we need to give it a special code. I always get mine from this StackOverflow post http://stackoverflow.com/a/17001474 because it’s much more convenient than the pandas documentation.

df.head(2)
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
1963-01-01 0 42.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-01-01
1963-02-01 0 35.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 1963-02-01
df.resample('A').mean().plot(y='val')
<matplotlib.axes._subplots.AxesSubplot at 0x106296ef0>

png

df.resample('A').mean()['val'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1063e8320>

png

df['val'].resample('A').mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x106506390>

png

df['val'].resample('2Q').mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x106216860>

png

df['val'].resample('9M').mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10689aef0>

png

Notice that it’s December of every year. That still looks like too much data, though. What if we back out to every decade?

Cool, right?

Graphing

We can graph these instead of just look at them! Get ready!

Graphing all on one plot

We’ve done this before, but it’s more exciting now - save the first plot as ax and pass it to the others as the confusingly- or conveniently-named ax=ax.

import matplotlib.pyplot as plt
plt.style.use('ggplot')
df['val'].plot()
df['val'].resample('A').mean().plot()
df['val'].resample('5A').mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x107422f98>

png

Cyclical data (actually using groupby)

What were the top 5 worst months?

df.sort_values(by='val').head()
is_adj val cat_code cat_desc cat_indent dt_code dt_desc dt_unit geo_code geo_desc per_name
date
2010-11-01 0 20.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2010-11-01
2011-01-01 0 21.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2011-01-01
2011-02-01 0 22.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2011-02-01
2010-08-01 0 23.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2010-08-01
2010-10-01 0 23.0 SOLD New Single-family Houses Sold 0 TOTAL All Houses K US United States 2010-10-01

It seems like there might be a cycle ever year. Maybe houses are sold in the summer and not the winter? To do this we can’t use resample - it’s for putting time into buckets - we need to group by the month.

Getting the month

We can’t ask for the index column as “year” any more, but we can just use df.index instead.

df.index.month
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
             8,  9, 10, 11, 12,  1,  2,  3,  4,  5],
           dtype='int64', name='date', length=641)

To get the month of each date, it’s simply df.index.month.

Doing the groupby to view data by month

So when we do our groupby, we’ll say hey, we made the groups for you already. Then we ask for the median number of houses sold.

df.index.dayofweek
Int64Index([1, 4, 4, 0, 2, 5, 0, 3, 6, 1,
            ...
            5, 1, 3, 6, 1, 4, 0, 1, 4, 6],
           dtype='int64', name='date', length=641)
df.groupby(by=df.index.month).mean()
is_adj val cat_indent
date
1 0 47.685185 0
2 0 53.592593 0
3 0 62.685185 0
4 0 61.444444 0
5 0 61.129630 0
6 0 59.188679 0
7 0 56.735849 0
8 0 57.566038 0
9 0 52.037736 0
10 0 51.584906 0
11 0 45.471698 0
12 0 42.792453 0

Plot the results

df.groupby(by=df.index.month).mean().plot(y='val')
<matplotlib.axes._subplots.AxesSubplot at 0x10710cef0>

png

df['val'].groupby(by=df.index.month).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10713e630>

png

More details

You can also use max and min and all of your other aggregate friends with .resample. For example, what’s the largest number of houses sold in a given year?

df['val'].resample('A').sum().sort_values(ascending=False).head(3)
date
2005-12-31    1283.0
2004-12-31    1203.0
2003-12-31    1088.0
Name: val, dtype: float64

How about the fewest?

df['val'].resample('A').sum().sort_values().head(3)
date
2016-12-31    241.0
2011-12-31    305.0
2010-12-31    322.0
Name: val, dtype: float64

Since we’re feeling crazy, why don’t we plot the average, max and min for each year?

df['val'].resample('A').agg(['mean','min','max']).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1082cb7f0>

png

df['val'].resample('A').mean().plot(color='green')
df['val'].resample('A').min().plot(color='blue')
df['val'].resample('A').max().plot(color='red')
<matplotlib.axes._subplots.AxesSubplot at 0x1082c34e0>

png

Using .fill_between to draw between lines

Getting maximums and minimums for years is a very common way of dealing with resampled time series. A fun graph to make is the fill_between graph, which colors between two areas.

Step 1: Calculate the maxes and mins and get a list of x values

ax.fill_between requires three things

- a list of x values (the dates),
- a list of minimum values for those x values,
- and a list of maximum values for those x values
# This is what our x values will look like
df['val'].resample('A').index
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 
.resample() is now a deferred operation
You called index(...) on this deferred object which materialized it into a series
by implicitly taking the mean.  Use .resample(...).mean() instead
  





DatetimeIndex(['1963-12-31', '1964-12-31', '1965-12-31', '1966-12-31',
               '1967-12-31', '1968-12-31', '1969-12-31', '1970-12-31',
               '1971-12-31', '1972-12-31', '1973-12-31', '1974-12-31',
               '1975-12-31', '1976-12-31', '1977-12-31', '1978-12-31',
               '1979-12-31', '1980-12-31', '1981-12-31', '1982-12-31',
               '1983-12-31', '1984-12-31', '1985-12-31', '1986-12-31',
               '1987-12-31', '1988-12-31', '1989-12-31', '1990-12-31',
               '1991-12-31', '1992-12-31', '1993-12-31', '1994-12-31',
               '1995-12-31', '1996-12-31', '1997-12-31', '1998-12-31',
               '1999-12-31', '2000-12-31', '2001-12-31', '2002-12-31',
               '2003-12-31', '2004-12-31', '2005-12-31', '2006-12-31',
               '2007-12-31', '2008-12-31', '2009-12-31', '2010-12-31',
               '2011-12-31', '2012-12-31', '2013-12-31', '2014-12-31',
               '2015-12-31', '2016-12-31'],
              dtype='datetime64[ns]', name='date', freq='A-DEC')
y_maxes = df['val'].resample('A').max()
y_mins = df['val'].resample('A').min()
y_maxes.head()
date
1963-12-31    62.0
1964-12-31    56.0
1965-12-31    58.0
1966-12-31    53.0
1967-12-31    49.0
Freq: A-DEC, Name: val, dtype: float64

Step 2: Build the graph

df['val'].resample('A').mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x108527668>

png

Step 3: Add the fill

ax.fill_between requires three things

- a list of x values
- a list of minimum values
- and a list of maximum values

We have all of those, so let’s draw it!

ax = df['val'].resample('A').mean().plot()

x_values = df['val'].resample('A').index
y_maxes = df['val'].resample('A').max()
y_mins = df['val'].resample('A').min()

ax.fill_between(x_values, y_maxes, y_mins, alpha=0.5)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: 
.resample() is now a deferred operation
You called index(...) on this deferred object which materialized it into a series
by implicitly taking the mean.  Use .resample(...).mean() instead
  This is separate from the ipykernel package so we can avoid doing imports until





<matplotlib.collections.PolyCollection at 0x109244eb8>

png