Class notes about dogs and beer - NaN, value_counts, converting, and more
Reading in files
Not reading everything in
If you’d like to try to analyze a subset of your data first, you can pass nrows to read_csv.
# Only read in 3000 rows
df = pd.read_csv("../dataset.csv", nrows=3000)
It doesn’t work for read_excel, though! In that case you can either save the file as a CSV or read the whole thing in, then take the first 3000.
# Read in the whole file, but then
# only use the first 3000
df = pd.read_excel("dataset.xlsx")
df = df.head(3000)
And yes, list slicing would work here, too.
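A minimal sketch of the slicing version, assuming the same dataset.xlsx file as above:
# Read in the whole file, then keep only
# the first 3000 rows with a slice
df = pd.read_excel("dataset.xlsx")
df = df[:3000]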
Leaving one set of data out of a graph
Say you’re making a graphic of the top industries. You might use:
# Plot the count of every industry
df['industry'].value_counts().plot()
But oh no, you don’t like the Oil industry! You want to get it out of the graphic! There are a few ways to do it, but the easiest is to **filter the oil companies out before you do your .value_counts()**.
# First, get rid of the oil industry
# by doing df[df['industry'] != 'Oil']
# then use ['industry'].value_counts()
# to get the counts of the remaining industries
df[df['industry'] != 'Oil']['industry'].value_counts().plot()
Count the number of rows that meet some condition
When you want to count the rows of a dataframe, you often use .shape.
# Get the number of rows of a dataframe
df.shape
You can also do this after you filter!
# First, filter for people over age 64
# Then, get the row and column count
df[df['age'] > 64].shape
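.shape gives you a (rows, columns) pair, so if you only care about the number of rows, one way (not from class) is to grab the first element:
# .shape is (rows, columns), so [0] is just the row count
df[df['age'] > 64].shape[0]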
You can also use .count(), but it’s a little different.
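The difference is that .count() counts the non-missing values in each column rather than the rows, so the numbers can be smaller if you have NaN values. A rough sketch with the same dataframe:
# Count non-missing values in every column
# for people over age 64
df[df['age'] > 64].count()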
Working with strings
Converting strings to floats or ints
You can convert a string to a float by using astype.
# Convert height to a float
df['height'].astype(float)
Make sure you only have numbers before you do this. If you get an error, it’s probably because you have a percent sign or a measurement (lbs, mm, etc.) or something else in there that can’t be converted to a number.
# Convert height to a float after getting
# rid of the measurement
df['height'].str.replace("cm", "").astype(float)
It works for ints, too!
# Convert age to an int
df['age'].astype(int)
Replacing or removing part of a string
You can use .str.replace to replace part of a string with some other string.
# Replace "Mister" with "Mr"
df['name'] = df['name'].str.replace('Mister', 'Mr')
To remove something, just “replace” it with '', an empty string!
# Remove "lbs" completely
df['weight'] = df['weight'].str.replace('lbs', '')
Exact matches
# Exactly match "Strawberry"
df['name'] == 'Strawberry'
Partial matches
# Match strawberry, blueberry, raspberry
# and any other 'berry'
df['name'].str.contains("berry")
But NaN values will be trouble! To ignore them, use na=False
# Match strawberry, blueberry, raspberry
# and any other 'berry'
# Does NOT have an error if a name is missing
df['name'].str.contains("berry", na=False)
Match many possibilities
df['name'].isin(['strawberry', 'blueberry', 'raspberry'])
This looks nicer when you do it with a separate variable
# Get an exact match with a few different options
berries = ['strawberry', 'blueberry', 'raspberry']
df['name'].isin(berries)
Value counts
If you ever get excited and think “I should use groupby!” you might first want to think: “wait, maybe I should just use .value_counts() instead?”
Normal counting
.value_counts() is great for figuring out how often something shows up. Use it all of the time with non-numeric data.
# How many rows from each state?
df['state'].value_counts()
Get the counts as percentages
# How many rows from each state as a percent?
df['state'].value_counts(normalize=True)
Count missing data
By default, value_counts() ignores missing data. If you have 100 dogs and 25 are male and 25 are female, value_counts() will tell you the sex split is 50/50. If you’d like to know it’s really 50/25/25, use dropna=False.
# How many rows from each state, including missing values?
df['state'].value_counts(dropna=False)
You can combine this with normalize=True if you’d like.
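That combination would look something like this:
# Percentages for each state, including missing values
df['state'].value_counts(normalize=True, dropna=False)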
Graphing value_counts() with big numbers on top
When you do value_counts() and want the top values, you normally do this:
# Get the 10 most popular states and graph them
df['state'].value_counts().sort_values(ascending=False).head(10).plot()
But if you try to graph it, the longest bars will be on the bottom! To fix that, you need to reverse your sort and use .tail() instead of head().
# Get the 10 most popular states and graph them
df['state'].value_counts().sort_values(ascending=True).tail(10).plot()
You could also just remove ascending=True since it’s the default.
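In other words, this shorter version should do the same thing:
# sort_values() sorts ascending by default
df['state'].value_counts().sort_values().tail(10).plot()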
Value counts for different categories
A popular way to phrase this is the most popular CATEGORY_X based on CATEGORY_Y (jobs based on gender, crime based on city, export based on country).
If you just take .head(3) without the second groupby, you’ll only get the top values for one category instead of the top values within each one.
# Top 3 most popular occupations by gender
df.groupby("Gender")['Occupation'].value_counts().groupby(level=0).head(3)
You can also use nlargest instead of head.
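A quick sketch of the nlargest version, using the same Gender and Occupation columns:
# Top 3 most popular occupations by gender,
# using nlargest(3) instead of head(3)
df.groupby("Gender")['Occupation'].value_counts().groupby(level=0).nlargest(3)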
Graphing
Plotting charts on top of each other
We didn’t talk about this in class, but we use ax=ax, my least favorite thing in the pandas world!
# First draw one plot, save it as "ax"
# Then draw another plot, passing it "ax=ax"
ax = df.plot(kind='scatter', x='Age', y='Min Income', c='yellow')
df.plot(kind='scatter', x='Age', y='Max Income', c='blue', ax=ax)
We’ll talk more about this in the future
Changing the number of bins in a histogram
By default, a histogram uses 10 bins. Use bins= to increase that number.
# Plot a histogram of ages, with 30 bins
df['age'].hist(bins=30)
You can also use bins to set the specific bins.
# Set specific boundaries for the bins
df['age'].hist(bins=[0, 18, 24, 34, 55, 64, 108])
If you want to get really fancy, you can use range to automatically generate even bins for a given range.
# Set bins to be between 0 and 100, with 10 step increments
# range(0, 101, 10) gives you
# [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# (you need to make it be 101 because the end isn't inclusive)
boundaries = range(0, 101, 10)
df['age'].hist(bins=boundaries)
Joining together data files
Using .merge to make SQL-like joins in pandas
Use .merge to combine a “left” and “right” dataframe - you can remember which one is which because “left” is the one on the… left. right_on and left_on are the names of the columns you’re matching up.
# match 'zip' from df
# to 'zip code' from other_df
# save it as a new dataframe
combined_df = df.merge(other_df, left_on="zip", right_on="zip code")
combined_df
Check out the documentation for more options.
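For example, the how= option controls what happens to rows that don’t have a match (just a sketch of one option, not something we covered in class):
# Keep every row from df, even if its zip code
# has no match in other_df (a "left" join)
combined_df = df.merge(other_df, left_on="zip", right_on="zip code", how="left")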
Missing data
Missing data has a lot of names! NaN is the official pandas name for it, but we can also call it na or nan or null or just “missing.”
Changing values to NaN when reading in your file
When you’re reading in your file, there’s usually missing data. Sometimes the missing data isn’t empty, though - it’s “Unknown” or “-999” or “???”. You can tell pandas to replace this data with NaN when reading it in.
# Read in a CSV, forcing 'Unknown', '-999' and '???' to be NaN
df = pd.read_csv("my-file.csv", na_values=["Unknown", "-999", "???"])
Make sure you always give it a list, even if there’s only one value you want to replace.
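Even with a single value, keep the square brackets:
# Still a list, even though it's only one value
df = pd.read_csv("my-file.csv", na_values=["Unknown"])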
To search for NaN values
To find something that’s NaN, you do not type == "NaN" - that’s a string! NaN is a special value, so it gets special ways to find it.
Filter a column to show NaN values
# Find NaN values
df['colname'].isnull()
Filter a column to show values that are NOT NaN
# Find values that are **not** NaN
df['colname'].notnull()
Getting rid of NaN values with .dropna()
If you have a Series (the two-column ugly thing) and you want to get rid of NaN values, use .dropna().
# Get every name that is not NaN
df['Name'].dropna()
If you have a DataFrame (the nice table thing) and you want to get rid of every row with a NaN value, use .dropna() with a subset.
# Get rid of every row where Name is NaN
# BUT!!!! you don't overwrite df,
# df still has the NaN values in it
df.dropna(subset=['Name'])
If you want to permanently remove the NaN rows, use inplace=True.
# Get rid of every row where Name is NaN
df.dropna(subset=['Name'], inplace=True)
Counting NaN values with value_counts()
By default, value_counts() ignores missing data. It might say “50% of your dogs are white and 50% of your dogs are black,” even if half of your dogs are missing colors!
# Have value_counts() pay attention to missing values
df['colname'].value_counts(dropna=False)
Replacing NaN values with other values
Maybe you think NaN values should be something else instead - “Unknown” or “0” or “Butterscotch.” You can fill in missing values with fillna. It does NOT save it back to the dataframe, though!
# Get everyone's name, but if it's NaN
# replace it with "Not sure"
df['Name'].fillna("Not sure")
To save it back to the dataframe, you can add inplace=True.
# Permanently replace all NaN names
# with the string "Not sure"
df['Name'].fillna("Not sure", inplace=True)
Send NaN to a certain position when sorting
When sorting, you can easily tell NaN values to go to the top or the bottom.
df.sort_values(by='colname', na_position='first')
df.sort_values(by='colname', na_position='last')
Or, if you’re doing a value_counts(), you don’t need to give it a column name:
df['colname'].value_counts().sort_values(na_position='first')
df['colname'].value_counts().sort_values(na_position='last')
Avoiding errors with .str.contains and missing data
By default, .str.contains gives you True for a match, False for not a match, and NaN for a NaN value. That NaN will screw things up if you’re trying to filter rows! To prevent this trouble, add na=False.
# Match strawberry, blueberry, raspberry
# and any other 'berry'
# Does NOT have an error if a name is missing
df['name'].str.contains("berry", na=False)