01: Building a pandas Cheat Sheet, Part 1: Animals

  • Use the csv I’ve attached to answer the following questions
  • Import pandas with the right name
  • Set all graphics from matplotlib to display inline
  • Read the csv in (it should be UTF-8 already so you don’t have to worry about encoding), save it with the proper boring name
  • Display the names of the columns in the csv
  • Display the first 3 animals.
  • Sort the animals to see the 3 longest animals.
  • What are the counts of the different values of the “animal” column? In other words, how many cats and how many dogs?
  • Only select the dogs.
  • Display all of the animals that are greater than 40 cm.
  • ‘length’ is the animal’s length in cm. Create a new column called inches that is the length in inches.
  • Save the cats to a separate variable called “cats.” Save the dogs to a separate variable called “dogs.”
  • Display all of the animals that are cats and above 12 inches long. First do it using the “cats” variable, then do it using your normal dataframe.
  • What’s the mean length of a cat?
  • What’s the mean length of a dog?
  • Use groupby to accomplish both of the above tasks at once.
  • Make a histogram of the length of dogs. I apologize that it is so boring.
  • Change your graphing style to be something else (anything else!)
  • Make a horizontal bar graph of the length of the animals, with their name as the label (look at the billionaires notebook I put on Slack!)
  • Make a sorted horizontal bar graph of the cats, with the larger cats on top.
# Import pandas with the right name
# Set all graphics from matplotlib to display inline
import pandas as pd
import matplotlib.pyplot as plt
# don't do this, but it means the thing above
#from matplotlib import pyplot as plt
%matplotlib inline
  • Read the csv in (it should be UTF-8 already so you don’t have to worry about encoding), save it with the proper boring name
  • Display the names of the columns in the csv
  • Display the first 3 animals.
df = pd.read_csv("07-hw-animals.csv")
df.columns
Index(['animal', 'name', 'length'], dtype='object')
df.head(3)
   animal        name  length
0     cat        Anne      35
1     cat         Bob      45
2     dog  Egglesburg      65
# * Sort the animals to see the 3 longest animals.
df.sort_values(by='length', ascending=False).head(3)
   animal        name  length
2     dog  Egglesburg      65
3     dog       Devon      50
1     cat         Bob      45
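An alternative to sorting and then taking the head is `nlargest`, which returns the top rows for a column in one call. This is only a minimal sketch: the csv is not included here, so the dataframe is rebuilt by hand from the six rows displayed in this notebook, on the assumption that they match the file.

```python
import pandas as pd

# Rebuilt by hand from the rows shown in this notebook's output
# (assumption: these six rows are the whole csv).
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# Same rows as df.sort_values(by='length', ascending=False).head(3)
top3 = df.nlargest(3, "length")
print(top3)
```

Either form works for this task; `sort_values` is the more general tool when the full ordering is also needed.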
# * What are the counts of the different values of the "animal" column? a.k.a. how many cats and how many dogs.
df['animal'].value_counts()
cat    3
dog    3
Name: animal, dtype: int64
# * Only select the dogs.
df['animal'] == 'dog'
0    False
1    False
2     True
3     True
4    False
5     True
Name: animal, dtype: bool
# If you want the rows back, you have to put a df[ ] on the outside
df[df['animal'] == 'dog']
   animal        name  length
2     dog  Egglesburg      65
3     dog       Devon      50
5     dog    Fontaine      35
is_dog = df['animal'] == 'dog'
is_dog
0    False
1    False
2     True
3     True
4    False
5     True
Name: animal, dtype: bool
dogs = df[is_dog]
dogs
   animal        name  length
2     dog  Egglesburg      65
3     dog       Devon      50
5     dog    Fontaine      35
# Build the filter from dogs itself; a mask taken from df has a different index and triggers a reindexing UserWarning
dogs[dogs['length'] > 40]
   animal        name  length
2     dog  Egglesburg      65
3     dog       Devon      50
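A boolean mask built from `df` has one entry per animal, while `dogs` only has three rows, so pandas has to realign the mask by index whenever the two are mixed. Building the mask from the same dataframe that is being filtered avoids that. The sketch below rebuilds the data by hand from the rows shown in this notebook (an assumption, since the csv is not included here).

```python
import pandas as pd

# Rebuilt by hand from the rows shown in this notebook's output
# (assumption: these six rows are the whole csv).
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})
dogs = df[df["animal"] == "dog"]

mask_from_df = df["length"] > 40      # six entries, one per animal
mask_from_dogs = dogs["length"] > 40  # three entries, one per dog

# Filter dogs with a mask that has the same index as dogs
long_dogs = dogs[mask_from_dogs]

# Equivalent single step on the full dataframe
long_dogs_alt = df[(df["animal"] == "dog") & (df["length"] > 40)]
print(long_dogs)
```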
#* Display all of the animals that are greater than 40 cm.
# 'length' is the animal's length in cm. Create a new column called inches that is the length in inches.
#* Save the cats to a separate variable called "cats." Save the dogs to a separate variable called "dogs."
#* Display all of the animals that are cats and above 12 inches long. First do it using the "cats" variable, then do it using your normal dataframe.
df[df['length'] > 40]
   animal        name  length
1     cat         Bob      45
2     dog  Egglesburg      65
3     dog       Devon      50
# save a new column using a calculation on an existing column
df['inches'] = df['length'] / 2.54
df
   animal        name  length     inches
0     cat        Anne      35  13.779528
1     cat         Bob      45  17.716535
2     dog  Egglesburg      65  25.590551
3     dog       Devon      50  19.685039
4     cat     Charlie      32  12.598425
5     dog    Fontaine      35  13.779528
#* Display all of the animals that are cats and above 12 inches long.
# First do it using the "cats" variable, then do it using your normal dataframe.
cats = df[df['animal'] == 'cat']
cats
   animal     name  length     inches
0     cat     Anne      35  13.779528
1     cat      Bob      45  17.716535
4     cat  Charlie      32  12.598425
cats[cats['inches'] > 12]
   animal     name  length     inches
0     cat     Anne      35  13.779528
1     cat      Bob      45  17.716535
4     cat  Charlie      32  12.598425
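# This version fails: & binds more tightly than == and >, so each comparison needs its own parentheses
#df[df['animal'] == 'cat' & df['inches'] > 12]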
big_cats = df[(df['animal'] == 'cat') & (df['inches'] > 12)]
big_cats
   animal     name  length     inches
0     cat     Anne      35  13.779528
1     cat      Bob      45  17.716535
4     cat  Charlie      32  12.598425
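The parentheses matter because Python's `&` has higher precedence than `==` and `>`. Without them, Python tries to evaluate `'cat' & df['inches']` first, which is not a meaningful operation. A small sketch of both forms, with the data rebuilt by hand from the rows shown in this notebook (an assumption, since the csv is not included here):

```python
import pandas as pd

# Rebuilt by hand from the rows shown in this notebook's output
# (assumption: these six rows are the whole csv).
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})
df["inches"] = df["length"] / 2.54

# Unparenthesized: parsed as df['animal'] == ('cat' & df['inches']) > 12,
# which raises an error instead of filtering.
try:
    df[df["animal"] == "cat" & df["inches"] > 12]
    failed = False
except (TypeError, ValueError):
    failed = True

# Parenthesized: each comparison yields a boolean Series, then & combines them.
big_cats = df[(df["animal"] == "cat") & (df["inches"] > 12)]
print(failed)
print(big_cats)
```

Naming the masks first, as in `is_cat & is_over_twelve_inches`, sidesteps the precedence problem entirely.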
is_cat = df['animal'] == 'cat'
is_over_twelve_inches = df['inches'] > 12
df[is_cat & is_over_twelve_inches]
   animal     name  length     inches
0     cat     Anne      35  13.779528
1     cat      Bob      45  17.716535
4     cat  Charlie      32  12.598425
#* What's the mean length of a cat?
#* What's the mean length of a dog?
#* Use groupby to accomplish both of the above tasks at once.
#* Make a histogram of the length of dogs. I apologize that it is so boring.
#* Change your graphing style to be something else (anything else!)
#* Make a horizontal bar graph of the length of the animals, with their name as the label (look at the billionaires notebook I put on Slack!)
#* Make a sorted horizontal bar graph of the cats, with the larger cats on top.
cats['length'].mean()
37.333333333333336
cats['length'].describe()
count     3.000000
mean     37.333333
std       6.806859
min      32.000000
25%      33.500000
50%      35.000000
75%      40.000000
max      45.000000
Name: length, dtype: float64
dogs = df[df['animal'] == 'dog']
dogs['length'].mean()
50.0
dogs['length'].describe()
count     3.0
mean     50.0
std      15.0
min      35.0
25%      42.5
50%      50.0
75%      57.5
max      65.0
Name: length, dtype: float64
#* Use groupby to accomplish both of the above tasks at once.
df.groupby('animal').describe()
                 inches     length
animal
cat    count   3.000000   3.000000
       mean   14.698163  37.333333
       std     2.679866   6.806859
       min    12.598425  32.000000
       25%    13.188976  33.500000
       50%    13.779528  35.000000
       75%    15.748031  40.000000
       max    17.716535  45.000000
dog    count   3.000000   3.000000
       mean   19.685039  50.000000
       std     5.905512  15.000000
       min    13.779528  35.000000
       25%    16.732283  42.500000
       50%    19.685039  50.000000
       75%    22.637795  57.500000
       max    25.590551  65.000000
df.groupby('animal')['length'].sum()
animal
cat    112
dog    150
Name: length, dtype: int64
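Since the task asks for the mean length of each animal, the same groupby pattern can end in `.mean()` instead of `.sum()`. A minimal sketch, with the data rebuilt by hand from the rows shown in this notebook (an assumption, since the csv is not included here):

```python
import pandas as pd

# Rebuilt by hand from the rows shown in this notebook's output
# (assumption: these six rows are the whole csv).
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# One mean per group, indexed by the values of the 'animal' column
mean_length = df.groupby("animal")["length"].mean()
print(mean_length)
```

The result should agree with the separate `cats['length'].mean()` and `dogs['length'].mean()` calls earlier in the notebook.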
dogs['length'].hist()
[Figure: histogram of dog lengths]

plt.style.use("ggplot")
dogs['length'].hist()
[Figure: histogram of dog lengths, ggplot style]

#* Make a horizontal bar graph of the length of the animals, with their name as the label (look at the billionaires notebook I put on Slack!)
#* Make a sorted horizontal bar graph of the cats, with the larger cats on top.
df.plot(kind='bar', x='name', y='length')
[Figure: vertical bar graph of animal lengths, labeled by name]

df.plot(kind='barh', x='name', y='length', legend=False)
[Figure: horizontal bar graph of animal lengths, labeled by name]

#* Make a sorted horizontal bar graph of the cats,
# with the larger cats on top.
df[df['animal'] == 'cat'].sort_values(by='length').plot(kind='barh', x='name', y='length', legend=False)
[Figure: horizontal bar graph of cat lengths, sorted with the largest cat on top]
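
A horizontal bar plot draws the first row at the bottom and the last row at the top, which is why an ascending sort puts the largest cat on top. The sketch below checks that ordering by reading the tick labels back from the axes. The data is rebuilt by hand from the rows shown in this notebook (an assumption, since the csv is not included here), and the `Agg` backend is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed with %matplotlib inline
import pandas as pd

# Rebuilt by hand from the rows shown in this notebook's output
# (assumption: these six rows are the whole csv).
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# Ascending sort: the shortest cat comes first, the longest last
cats_sorted = df[df["animal"] == "cat"].sort_values(by="length")
ax = cats_sorted.plot(kind="barh", x="name", y="length", legend=False)

# Tick labels are listed from the bottom of the plot to the top
labels = [tick.get_text() for tick in ax.get_yticklabels()]
print(labels)
```

Passing `ascending=False` to `sort_values` would flip the plot so the largest cat ends up at the bottom.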

Part 02: Doing some research (billionaires)

Answer your own selection out of the following questions, or any other questions you might be able to think of. Write the question down first in a markdown cell (use a # to make the question a nice header), THEN try to get an answer to it. A lot of these are remarkably similar, and some you’ll need to do manual work for - the GDP ones, for example.

If you are trying to figure out some other question that we didn’t cover in class and it does not have to do with joining to another data set, we’re happy to help you figure it out during lab!

  • Take a peek at the billionaires notebook I uploaded into Slack, it should be helpful for the graphs (I added a few other styles and options, too). You’ll probably also want to look at the “sum()” line I added.
  • What country are most billionaires from? For the top ones, how many billionaires per billion people?
  • Who are the top 10 richest billionaires?
  • What’s the average wealth of a billionaire? Male? Female?
  • Who is the poorest billionaire? Who are the top 10 poorest billionaires?
  • What is ‘relationship to company’? And what are the most common relationships?
  • Most common source of wealth? Male vs. female?
  • Given the richest person in a country, what % of the GDP is their wealth?
  • Add up the wealth of all of the billionaires in a given country (or a few countries) and then compare it to the GDP of the country, or other billionaires, so like pit the US vs India
  • What are the most common industries for billionaires to come from? What’s the total amount of billionaire money from each industry?
  • How many self made billionaires vs. others?
  • How old are billionaires? How old are billionaires self made vs. non self made? or different industries?
  • Who are the youngest billionaires? The oldest? Age distribution - maybe make a graph about it?
  • Maybe just make a graph about how wealthy they are in general?
  • Maybe plot their net worth vs age (scatterplot)
  • Make a bar graph of the top 10 or 20 richest