import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams.update({
    'axes.spines.left': False,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.spines.bottom': False
})

%matplotlib inline

Adjusting for population

The “Texas/California Problem”

“When you have states with big populations or places with a large population of one specific type, your data may simply be representing population and nothing else. So normalizing for population or understanding your data, to begin with, are both important steps. It’s not enough to scrape it and chart it, you need to step back and understand what you are looking at. Just watch out for big populations.”

Let’s say we’re graphing how many central libraries are in each state. We pull in the data, it has the state and the number of libraries, and we graph it.

Adjusting for population with long (normal) data

df = pd.read_csv("libraries.csv")
df.head()
state libraries
0 Alabama 284
1 Alaska 102
2 Arizona 185
3 Arkansas 210
4 California 1084
ax = df.sort_values(by='libraries').tail(10).plot(x='state', y='libraries', kind='barh', color='#F28FBE')
ax.set_title('Libraries per state', loc='left')
Text(0,1,'Libraries per state')

[bar chart: Libraries per state]

According to this chart, California, New York, and Texas all have the most libraries. There’s something we’re forgetting, though: CA, NY, and TX also have the most people.

More people = more libraries. We’re basically making a graph of libraries that’s a proxy for population!

To fix this, we need to adjust for population, and make it a per-capita (per person) graph.

Step 1: Read in your population data

Make sure it has a matching column to your original data, whether it’s state name, country name, borough name, census code, etc.

pop = pd.read_csv("../population.csv")
pop.head()
State or territory Population
0 California 39250017
1 Texas 27862596
2 Florida 20612439
3 New York 19745289
4 Illinois 12801539
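Before merging, it’s worth a quick check that the state names in the two files actually line up - a stray abbreviation or typo will silently drop rows later. A minimal sketch, assuming the column names shown above:

# States in the library data with no match in the population data (should be empty)
set(df['state']) - set(pop['State or territory'])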

Step 2: Join the population data to your original data

Using .merge with left_on and right_on (the matching column in each dataframe), you can easily join your population data to your original data.

merged = df.merge(pop, left_on='state', right_on='State or territory')
merged.head()
state libraries State or territory Population
0 Alabama 284 Alabama 4863300
1 Alaska 102 Alaska 741894
2 Arizona 185 Arizona 6931071
3 Arkansas 210 Arkansas 2988248
4 California 1084 California 39250017
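One caution: by default .merge only keeps rows that match on both sides, so a misspelled state just disappears without warning. A quick sanity check, assuming the dataframes above:

# If any states were dropped during the merge, these two numbers won't match
print(len(df), len(merged))

# Or, to see exactly what didn't match, merge with how='left' and indicator=True
check = df.merge(pop, left_on='state', right_on='State or territory', how='left', indicator=True)
check[check['_merge'] != 'both']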

Step 3: Do the math

merged['libraries_per_capita'] = merged['libraries'] / merged['Population']
merged.head()
state libraries State or territory Population libraries_per_capita
0 Alabama 284 Alabama 4863300 0.000058
1 Alaska 102 Alaska 741894 0.000137
2 Arizona 185 Arizona 6931071 0.000027
3 Arkansas 210 Arkansas 2988248 0.000070
4 California 1084 California 39250017 0.000028
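The merge also left us with two copies of the state name ('state' and 'State or territory'). You don’t need both; if you want to tidy up, you could drop one (previewed here without saving it, so the tables below still show every column):

# What merged would look like without the duplicate key column
merged.drop(columns=['State or territory']).head()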

Step 4: Adjust the math

“0.000058 libraries per person” just doesn’t sound interesting! For smaller numbers, you’ll want to make it “per 100 people,” “per 10,000 people,” or even “per million people.”

merged['libraries_per_100k_people'] = merged['libraries'] / merged['Population'] * 100000
merged.head()
state libraries State or territory Population libraries_per_capita libraries_per_100k_people
0 Alabama 284 Alabama 4863300 0.000058 5.839656
1 Alaska 102 Alaska 741894 0.000137 13.748595
2 Arizona 185 Arizona 6931071 0.000027 2.669140
3 Arkansas 210 Arkansas 2988248 0.000070 7.027529
4 California 1084 California 39250017 0.000028 2.761782
ax = merged.sort_values(by='libraries_per_100k_people').tail(10).plot(x='state', y='libraries_per_100k_people', kind='barh', color='#F28FBE', figsize=(5,3.5))
ax.set_title("Libraries per 100,000 residents", loc='left')
Text(0,1,'Libraries per 100,000 residents')

[bar chart: Libraries per 100,000 residents]

Step 5: Celebrate

Look at that, we’ve actually got a somewhat-interesting story! Instead of New York, California and Texas having the most, it looks like Vermont is way ahead of everyone else. Something to investigate, perhaps?
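If you’d like the actual numbers behind that chart instead of eyeballing bar lengths, sorting works just as well. A small sketch, assuming the columns we created above:

# The five states with the most libraries per 100,000 residents
top = merged.sort_values(by='libraries_per_100k_people', ascending=False)
top[['state', 'libraries', 'libraries_per_100k_people']].head(5)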

Adjusting for population with wide data (or adjusting for multiple places over multiple years)

Adjusting for population with wide data is a little bit different. Not just because you’ll need to adjust for multiple places over multiple years, but because you need to do it across multiple columns.

If you had long data you could just join on multiple columns and do simple math, but with wide data you gotta get a little fancy.

Step 1: Read in your data

This is how many ice cream shops were in Staten Island, Manhattan, and Brooklyn in 1850, 1900, 1950 and 2000 (not real data, sadly).

df = pd.read_csv("../borough-ice-cream.csv")
df.head()
year Staten Island Brooklyn Manhattan
0 1850 2 8 10
1 1900 4 44 34
2 1950 10 34 44
3 2000 50 250 100
ax = df.plot(x='year', kind='bar')
ax.set_title("Number of ice cream shops", loc='left')
Text(0,1,'Number of ice cream shops')

[bar chart: Number of ice cream shops]

But of course, if there are more people there should be more ice cream shops, so we should adjust for population. It’s kind of weird since we have wide data, but we can do it!

Step 2: Prepare your population data

Because we have multiple areas over multiple years, we can’t just say “I want the population of Staten Island in 1850, 1900, 1950 and 2000” - we need the populations of every borough in each year. Our data will look like this:

pop = pd.read_csv("borough-population.csv")
pop.head()
year Staten Island Brooklyn Manhattan
0 1850 15061 138822 515547
1 1900 67021 1166582 1850093
2 1950 191555 2738175 1960101
3 2000 443728 2465326 1537195

I actually had to make this data by hand! You probably will, too. Make the rows and columns match up with your dataframe. Notice how I have the same years and the same column headers.
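If you end up typing the numbers yourself anyway, you can also build the dataframe right in the notebook instead of saving a CSV first. An equivalent sketch using the exact numbers from the table above:

# Hand-built population data, one column per borough to match the ice cream dataframe
pop = pd.DataFrame({
    'year': [1850, 1900, 1950, 2000],
    'Staten Island': [15061, 67021, 191555, 443728],
    'Brooklyn': [138822, 1166582, 2738175, 2465326],
    'Manhattan': [515547, 1850093, 1960101, 1537195],
})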

Step 3: Divide your data

Just like we did before, we divide. But this time we need to divide each cell by the same cell in the other dataframe!

To make this work best, we’re going to change the index of the dataframes. When we do our math, it’s going to do it to every single cell, and we don’t want to divide the years, so we keep them safe in the index.

# Ice cream shops
df.set_index('year', inplace=True)
df.head()
Staten Island Brooklyn Manhattan
year
1850 2 8 10
1900 4 44 34
1950 10 34 44
2000 50 250 100
# Population
pop.set_index('year', inplace=True)
pop.head()
Staten Island Brooklyn Manhattan
year
1850 15061 138822 515547
1900 67021 1166582 1850093
1950 191555 2738175 1960101
2000 443728 2465326 1537195
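Before dividing, it can’t hurt to confirm the two dataframes really do line up. A quick check, assuming the indexes and columns shown above:

# Both should print True if the years and boroughs match up exactly
print(df.index.equals(pop.index))
print(df.columns.equals(pop.columns))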

Once we’ve made sure they match, we’ll do df.div to divide a dataframe by another dataframe. This divides each cell by the matching cell in the other dataframe.

per_capita = df.div(pop)
per_capita
Staten Island Brooklyn Manhattan
year
1850 0.000133 0.000058 0.000019
1900 0.000060 0.000038 0.000018
1950 0.000052 0.000012 0.000022
2000 0.000113 0.000101 0.000065

Looking pretty good! Let’s multiply by 100,000 to make it per 100,000 people.

per_capita = df.div(pop) * 100000
per_capita
Staten Island Brooklyn Manhattan
year
1850 13.279331 5.762775 1.939687
1900 5.968279 3.771702 1.837745
1950 5.220433 1.241703 2.244782
2000 11.268164 10.140647 6.505356
per_capita.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x115b1e160>

[bar chart: ice cream shops per 100,000 residents]

While our original chart had Brooklyn as loving ice cream the most, it looks like Staten Island wins if we look at things per-capita!
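(An aside: if the wide-data gymnastics feel awkward, another route is to melt both dataframes into long format and use the merge-and-divide approach from the first section. A rough sketch, assuming the column names used above:)

# Reshape both wide dataframes into long format: one row per year + borough
shops_long = df.reset_index().melt(id_vars='year', var_name='borough', value_name='shops')
pop_long = pop.reset_index().melt(id_vars='year', var_name='borough', value_name='population')

# Same trick as the libraries example: merge, then divide
long_merged = shops_long.merge(pop_long, on=['year', 'borough'])
long_merged['shops_per_100k'] = long_merged['shops'] / long_merged['population'] * 100000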

Step 4: Getting our old data back

Remember how we set the index to be the year?

df
Staten Island Brooklyn Manhattan
year
1850 2 8 10
1900 4 44 34
1950 10 34 44
2000 50 250 100

If we want the year back as a normal column, we just do a .reset_index(inplace=True).

df.reset_index(inplace=True)
df.head()
year Staten Island Brooklyn Manhattan
0 1850 2 8 10
1 1900 4 44 34
2 1950 10 34 44
3 2000 50 250 100
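The same trick brings per_capita’s years back, too. And if you’d rather not modify things in place, you can always skip inplace=True and assign the result instead - a small sketch:

# Same idea, without inplace: reset_index returns a new dataframe
per_capita = per_capita.reset_index()
per_capita.head()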