The setup

A lot of times when you’re using pandas you’re also going to use matplotlib, and when you’re using matplotlib you’re also going to be making inline graphics. So why not do it all at once?

# Let's do all of our setup
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

The import

Always name your initial dataframe df. Always always always. Why? Because everyone else does, and because

  • when you slice it and dice it into different pieces (only really rich people, only cats, only flights to Finland, etc)
  • you can give each of those names, while knowing df is your original dataframe.
df = pd.read_excel("rich_people.xlsx")

The examining

Your best friends for a quick peek are are

  • df.head() to see the first ten
  • df.columns to see the column names
  • df.info() to see more detailed info about each column
  • df.describe() to see summary statistics about each numeric column

Once you do these few steps, you should be able to draw up a list of ten or so questions to ask your data.

df.head(2)
year name rank citizenship countrycode networthusbillion selfmade typeofwealth gender age ... relationshiptocompany foundingdate gdpcurrentus sourceofwealth notes notes2 source source_2 source_3 source_4
0 2001 A Jerrold Perenchio 151 United States USA 3.0 self-made executive male 70.0 ... former chairman and CEO 1955.0 1.062180e+13 NaN represented Marlon Brando and Elizabeth Taylor NaN http://en.wikipedia.org/wiki/Jerry_Perenchio http://www.forbes.com/profile/a-jerrold-perenc... COLUMN ONE; A Hollywood Player Who Owns the Ga... NaN
1 2014 A. Jerrold Perenchio 663 United States USA 2.6 self-made executive male 83.0 ... former chairman and CEO 1955.0 NaN television, Univision represented Marlon Brando and Elizabeth Taylor NaN http://en.wikipedia.org/wiki/Jerry_Perenchio http://www.forbes.com/profile/a-jerrold-perenc... COLUMN ONE; A Hollywood Player Who Owns the Ga... NaN

2 rows × 30 columns

df.columns
Index(['year', 'name', 'rank', 'citizenship', 'countrycode',
       'networthusbillion', 'selfmade', 'typeofwealth', 'gender', 'age',
       'industry', 'IndustryAggregates', 'region', 'north',
       'politicalconnection', 'founder', 'generationofinheritance', 'sector',
       'company', 'companytype', 'relationshiptocompany', 'foundingdate',
       'gdpcurrentus', 'sourceofwealth', 'notes', 'notes2', 'source',
       'source_2', 'source_3', 'source_4'],
      dtype='object')

Slicing your data

I made a terrible mistake on Monday about how I taught sub-selects (the things that are basically SQL WHERE statements), and it’s all because of how I named my variables.

We’re going to take a sample of the data to work with so it looks a little cleaner - only take a few columns of the first 8. It’s a complicated line, don’t sweat it. Just think about the result for now.

sample = df[df['year'] == 2014][['name', 'countrycode', 'selfmade', 'rank', 'networthusbillion']].sort_values(by='rank').head(8)
sample
name countrycode selfmade rank networthusbillion
284 Bill Gates USA self-made 1 76.0
348 Carlos Slim Helu MEX self-made 2 72.0
124 Amancio Ortega ESP self-made 3 64.0
2491 Warren Buffett USA self-made 4 58.2
1377 Larry Ellison USA self-made 5 48.0
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0
2185 Sheldon Adelson USA self-made 8 38.0
plt.style.use('ggplot')
df['networthusbillion'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x10945b8d0>

png

Who is from the USA? There two different things we can do. This was a big point of confusion in class

  1. We can list whether someone is from the USA or not.
  2. Or, we can use that list to select only the people from the USA.

Let’s look at them both, then see how they’re used.

sample['countrycode'] == 'USA'
284      True
348     False
124     False
2491     True
1377     True
509      True
381      True
2185     True
Name: countrycode, dtype: bool

This just gave us a list of True and False values.

  • First person? Definitely from the USA.
  • Next row? Not from the USA.
  • Third person? Not from the USA.
  • Fourth? Yes, they’re from the USA.

On the other hand, we can also use similar syntax to get a list of every person from the USA.

sample[sample['countrycode'] == 'USA']
name countrycode selfmade rank networthusbillion
284 Bill Gates USA self-made 1 76.0
2491 Warren Buffett USA self-made 4 58.2
1377 Larry Ellison USA self-made 5 48.0
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0
2185 Sheldon Adelson USA self-made 8 38.0

That one was a two-step process,

  1. we got a list of True/False values (are they from the USA?), then
  2. fed that True/False list back into the dataframe.

For every row that’s True, it let us keep the row. If it was False, it doesn’t let us keep the row.

When to use what

Let’s say we wanted to select every single row that is from USA, has inherited wealth, and is in the top 10 wealthiest people in the world. If we wanted to do it all on one line, we could!

sample[(sample['countrycode'] == 'USA') & (sample['selfmade'] == 'inherited') & (sample['rank'] < 10)]
name countrycode selfmade rank networthusbillion
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0

That looks super confusing. In it we have three separate things we’re testing for (a.k.a. getting True/False back)

  • sample['countrycode'] == 'USA'
  • sample['selfmade'] == 'inherited'
  • sample['rank'] < 10

And we’re jamming them all into sample[blah blah] to get back a dataframe for which all of those are true. It works, but it’s ugly.

Again, let’s remind ourselves what a piece like that looks like, it’s just a list of True and False values

sample['countrycode'] == 'USA'
284      True
348     False
124     False
2491     True
1377     True
509      True
381      True
2185     True
Name: countrycode, dtype: bool

If we felt like the code we had before was a little too complicated, we could break it down into manageable, well-named pieces like this:

is_american = sample['countrycode'] == 'USA'
is_inherited = sample['selfmade'] == 'inherited'
is_top_ten = sample['rank'] < 10

sample[is_american & is_inherited & is_top_ten]
name countrycode selfmade rank networthusbillion
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0

Let’s look at each individual piece.

is_american, which is the same thing as sample['countrycode'] == 'USA', is not a list of people. It is not a dataframe. It is a list of true and false values.

is_american
284      True
348     False
124     False
2491     True
1377     True
509      True
381      True
2185     True
Name: countrycode, dtype: bool

It’s the same thing with is_inherited and is_top_ten. It’s just a shortcut for me, so I don’t have to type out sample['selfmade'] == 'inherited'. We can combine them all using & to get a list of True/False values in which every one of those conditions is true

is_american & is_inherited & is_top_ten
284     False
348     False
124     False
2491    False
1377    False
509      True
381      True
2185    False
dtype: bool

Which we can then feed into the dataframe

sample[is_american & is_inherited & is_top_ten]
name countrycode selfmade rank networthusbillion
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0

So what I’m saying is, these two things are the same:

Method One: Breaking into separate readable things

is_american = sample['countrycode'] == 'USA'
is_inherited = sample['selfmade'] == 'inherited'
is_top_ten = sample['rank'] < 10

kochs = sample[is_american & is_inherited & is_top_ten]
kochs
name countrycode selfmade rank networthusbillion
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0

Method Two: Doing it all at once

kochs = sample[(sample['countrycode'] == 'USA') & (sample['selfmade'] == 'inherited') & (sample['rank'] < 10)]
kochs
name countrycode selfmade rank networthusbillion
509 David Koch USA inherited 6 40.0
381 Charles Koch USA inherited 6 40.0

Make sure you know the difference between these two

  • df['countrycode'] == 'USA'
  • df[df['countrycode'] == 'USA']

Joining data

Now, let’s say we wanted to join a bunch of data together. We have a lot of countries here, right? And we’d like to know how much of their wealth is concentrated in these rich people.

df.head()
year name rank citizenship countrycode networthusbillion selfmade typeofwealth gender age ... relationshiptocompany foundingdate gdpcurrentus sourceofwealth notes notes2 source source_2 source_3 source_4
0 2001 A Jerrold Perenchio 151 United States USA 3.0 self-made executive male 70.0 ... former chairman and CEO 1955.0 1.062180e+13 NaN represented Marlon Brando and Elizabeth Taylor NaN http://en.wikipedia.org/wiki/Jerry_Perenchio http://www.forbes.com/profile/a-jerrold-perenc... COLUMN ONE; A Hollywood Player Who Owns the Ga... NaN
1 2014 A. Jerrold Perenchio 663 United States USA 2.6 self-made executive male 83.0 ... former chairman and CEO 1955.0 NaN television, Univision represented Marlon Brando and Elizabeth Taylor NaN http://en.wikipedia.org/wiki/Jerry_Perenchio http://www.forbes.com/profile/a-jerrold-perenc... COLUMN ONE; A Hollywood Player Who Owns the Ga... NaN
2 2001 Abdul Al Rahman Al Jeraisy 336 Saudi Arabia SAU 1.5 self-made founder non-finance male NaN ... founder 1956.0 1.830120e+11 NaN NaN NaN http://www.jeraisy.com.sa/index.php/pages/rend... NaN NaN NaN
3 2001 Abdul Aziz Al Ghurair 251 United Arab Emirates ARE 1.9 inherited inherited male 47.0 ... relation 1960.0 1.030000e+11 NaN inherited from father NaN NaN NaN NaN NaN
4 1996 Abdul Aziz Al-Sulaiman 404 Saudi Arabia SAU 1.0 self-made self-made finance male 0.0 ... founder 1968.0 1.577430e+11 NaN NaN NaN http://www.arabianbusiness.com/arabian-busines... NaN NaN NaN

5 rows × 30 columns

What do we combine it with?

I’m going to get my GDP data from this web site, because… I googled GDP 2014 and it came up.

We could manually say okay, so A. Jerrold Perenchio has 2.6 billion dollars, and USA’s GDP was 17.418 trillion (17418 billion), so he’s at about 0.015% of the GDP. But we are lazy, so we want the computer to do it. In order to do that, though, we need to merge that information into our existing dataset.

I’m also too lazy to look up the GDP for every country, so I’m just going to pick the top ten.

df['countrycode'].value_counts().head(10)
USA    903
DEU    160
CHN    153
RUS    119
JPN     96
BRA     81
HKG     77
FRA     72
GBR     65
IND     63
Name: countrycode, dtype: int64

I’m also too lazy to find a csv that has the data I need, so I’m going to make a dataframe out of a dictionary. It’s pretty easy.

# Make a list of dictionaries
country_gdp = [
    { 'country': 'USA', 'gdp': 17.418 },
    { 'country': 'DEU', 'gdp': 3.859 },
    { 'country': 'CHN', 'gdp': 10.380 },
    { 'country': 'RUS', 'gdp': 1.857 },
    { 'country': 'JPN', 'gdp': 4.616 },
    { 'country': 'BRA', 'gdp': 2.353 },
    { 'country': 'HKG', 'gdp': 0.289 },
    { 'country': 'FRA', 'gdp': 2.846 },
    { 'country': 'GBR', 'gdp': 2.945 },
    { 'country': 'IND', 'gdp': 2.049 }
]

# Convert it into a DataFrame
gdp_df = pd.DataFrame(country_gdp)
gdp_df
country gdp
0 USA 17.418
1 DEU 3.859
2 CHN 10.380
3 RUS 1.857
4 JPN 4.616
5 BRA 2.353
6 HKG 0.289
7 FRA 2.846
8 GBR 2.945
9 IND 2.049

I could have read it in from a csv, but since I didn’t have a csv I might as well just type it out. I also want it in billions, not in trillions, so let’s add a column real quick

gdp_df['gdp_billions'] = gdp_df['gdp'] * 1000
gdp_df
country gdp gdp_billions
0 USA 17.418 17418.0
1 DEU 3.859 3859.0
2 CHN 10.380 10380.0
3 RUS 1.857 1857.0
4 JPN 4.616 4616.0
5 BRA 2.353 2353.0
6 HKG 0.289 289.0
7 FRA 2.846 2846.0
8 GBR 2.945 2945.0
9 IND 2.049 2049.0

STEP ONE: Finding a join key

My favorite resource for joining data with pandas is this page. It compares everything to SQL JOINs, which I understand. You could also check out this page which I find kind of confusing.

So what we need to do is match up a column between our two dataframes.

df.head(1)
year name rank citizenship countrycode networthusbillion selfmade typeofwealth gender age ... relationshiptocompany foundingdate gdpcurrentus sourceofwealth notes notes2 source source_2 source_3 source_4
0 2001 A Jerrold Perenchio 151 United States USA 3.0 self-made executive male 70.0 ... former chairman and CEO 1955.0 1.062180e+13 NaN represented Marlon Brando and Elizabeth Taylor NaN http://en.wikipedia.org/wiki/Jerry_Perenchio http://www.forbes.com/profile/a-jerrold-perenc... COLUMN ONE; A Hollywood Player Who Owns the Ga... NaN

1 rows × 30 columns

gdp_df.head(1)
country gdp gdp_billions
0 USA 17.418 17418.0

It looks like we’re going to be matching up countrycode from df with country from gdp_df.

STEP TWO: Doing the merge

You merge using .merge, and then give two parameters:

  • left_on, the first dataframe’s column name to merge on
  • right_on, the second dataframe’s column name to merge on
# Since we're only working with 2014's GDP, let's make a variable
# that's only 2014 scores
recent = df[df['year'] == 2014]
recent.head(1)
year name rank citizenship countrycode networthusbillion selfmade typeofwealth gender age ... relationshiptocompany foundingdate gdpcurrentus sourceofwealth notes notes2 source source_2 source_3 source_4
1 2014 A. Jerrold Perenchio 663 United States USA 2.6 self-made executive male 83.0 ... former chairman and CEO 1955.0 NaN television, Univision represented Marlon Brando and Elizabeth Taylor NaN http://en.wikipedia.org/wiki/Jerry_Perenchio http://www.forbes.com/profile/a-jerrold-perenc... COLUMN ONE; A Hollywood Player Who Owns the Ga... NaN

1 rows × 30 columns

# Let's do the merge
people_with_gdp = recent.merge(gdp_df, left_on='countrycode', right_on='country')

# Look at the top 10
people_with_gdp.sort_values(by='rank').head()
year name rank citizenship countrycode networthusbillion selfmade typeofwealth gender age ... sourceofwealth notes notes2 source source_2 source_3 source_4 country gdp gdp_billions
33 2014 Bill Gates 1 United States USA 76.0 self-made founder non-finance male 58.0 ... Microsoft NaN NaN http://www.forbes.com/profile/bill-gates/ NaN NaN NaN USA 17.418 17418.0
484 2014 Warren Buffett 4 United States USA 58.2 self-made founder non-finance male 83.0 ... Berkshire Hathaway NaN NaN http://www.forbes.com/lists/2009/10/billionair... http://www.forbes.com/companies/berkshire-hath... NaN NaN USA 17.418 17418.0
280 2014 Larry Ellison 5 United States USA 48.0 self-made founder non-finance male 69.0 ... Oracle NaN NaN http://www.forbes.com/profile/larry-ellison/ http://www.businessinsider.com/how-larry-ellis... NaN NaN USA 17.418 17418.0
57 2014 Charles Koch 6 United States USA 40.0 inherited inherited male 78.0 ... diversified inherited from father NaN http://www.kochind.com/About_Koch/History_Time... NaN NaN NaN USA 17.418 17418.0
93 2014 David Koch 6 United States USA 40.0 inherited inherited male 73.0 ... diversified inherited from father NaN http://www.kochind.com/About_Koch/History_Time... NaN NaN NaN USA 17.418 17418.0

5 rows × 33 columns

Hmmm see how rank skips through 1, 4, 5, 6, 6? This is because we only kept rows we have matches for. The second wealthiest person is from Mexico, but since we didn’t have GDP from Mexico it left. You can see if we just do a simple .count()

len(recent)
1653
len(people_with_gdp)
1130

We lost like 500 people! Maybe even if we don’t have the GDP of their country we want to keep them?

Types of joins

Let’s just look at our data real quick before I explain the two major types of joins.

# How many rows in our 2014 dataframe?
print("We start with", len(recent))
recent[['rank','name', 'countrycode']].sort_values(by='rank').head(3)
We start with 1653
rank name countrycode
284 1 Bill Gates USA
348 2 Carlos Slim Helu MEX
124 3 Amancio Ortega ESP

Inner joins

The default kind of join is an inner join. It requires you to have a match on the left hand side AND the right hand side.

If you’re in the original dataframe and you don’t have a match in the joining dataframe, you disappear. Watch how Carlos “Slim” Helu and Amancio Ortega both disappear.

# Inner join must have a match in both tables
inner_join_df = recent.merge(gdp_df, left_on='countrycode', right_on='country')
print("We now have", len(inner_join_df))
inner_join_df[['rank','name', 'countrycode', 'gdp_billions']].sort_values(by='rank').head(3)
We now have 1130
rank name countrycode gdp_billions
33 1 Bill Gates USA 17418.0
484 4 Warren Buffett USA 17418.0
280 5 Larry Ellison USA 17418.0

Left join

The other major kind of join is a left join. The “left” dataframe (the one you type first), gets to keep its rows, even if there aren’t any matches in the joining dataframe.

# Do a left join, we keep everyone even if they don't have a match in the gdp table
left_join_df = recent.merge(gdp_df, left_on='countrycode', right_on='country', how='left')
print("We now have", len(left_join_df))
left_join_df[['rank', 'name','countrycode', 'gdp_billions']].sort_values(by='rank').head(3)
We now have 1653
rank name countrycode gdp_billions
184 1 Bill Gates USA 17418.0
226 2 Carlos Slim Helu MEX NaN
79 3 Amancio Ortega ESP NaN

A pretty join picture

Did you see this image already? Well, now you have.

Functions

We should have a whole class dedicated to functions, but just in case I’m going to put it down in here!

A function is a reuseable chunk of code. You know how we write len whenever we want the length of a list? We can make our own things like that!

A simple function

Let’s say my job is to take two numbers, add them together, and multiply by two.

# 3 and 4
print((3 + 4) * 2)

# 4 and 6
print((4 + 6) * 2)

# 100 and 40
print((100 + 40) * 2)
14
20
280

I’m clearly prettty good at my job, but I get tired of writing the same thing over and over again, and what if I accidentally make a typo? It would be a lot easier if there was a function like len that, instead of giving me the length of a list, could take two numbers and print out the answer.

So, let’s make that.

def calc(a, b):
    added = a + b
    return added * 2;

print(calc(3, 4))
print(calc(4, 6))
print(calc(100, 40))
14
20
280

I can even use it down here, save it to a variable, whatever

result = calc(10, 400)
print("The calculated result is", result)
The calculated result is 820

Parts of a function

Our function looks like this:

def calc(a, b):
    added = a + b
    return added * 2;

It’s a function called calc that takes two parameters - we call them a and b - and then sends back a calculated result.

  • def means “I’m going to define a function now
  • calc is the function’s name
  • (a, b) are the parameters it takes - when I use calc, what can I send it?
  • return lets the function know that I’m finished and ready to send something back.