Reading in files

Not reading everything in

If you’d like to try analyzing just a subset of your data first, you can pass nrows to read_csv.

# Only read in 3000 rows
df = pd.read_csv("../dataset.csv", nrows=3000)

In older versions of pandas, nrows doesn’t work for read_excel, though! In that case you can either save the file as a CSV or read the whole thing in, then take the first 3000.

# Read in the whole file, but then
# only use the first 3000
df = pd.read_excel("dataset.xlsx")
df = df.head(3000)

And yes, list slicing would work here, too.
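
For instance, here’s a minimal sketch of the slicing approach (it does the same thing as .head(3000)):

# Keep only the first 3000 rows by slicing
df = df[:3000]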

Leaving one set of data out of a graph

Say you’re making a graph of the top industries in your data. You might use

# Plot the count of every industry
df['industry'].value_counts().plot()

But oh no, you don’t like the Oil industry! You want to get it out of the graphic! There are a few ways to do it, but the easiest is to filter the oil companies out before you do your .value_counts().

# First, get rid of the oil industry
# by doing df[df['industry'] != 'Oil']
# then use ['industry'].value_counts()
# to get the counts of the remaining industries
df[df['industry'] != 'Oil']['industry'].value_counts().plot()
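
Another way to do it (just a sketch, and it assumes 'Oil' actually shows up in your data) is to count everything first and then drop the row you don’t want from the resulting counts:

# Count every industry, then drop Oil from the counts before plotting
df['industry'].value_counts().drop('Oil').plot()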

Count the number of rows that meet some condition

When you want to count the rows of a dataframe, you often use .shape.

# Get the number of rows of a dataframe
df.shape

You can also do this after you filter!

# First, filter for people over age 64
# Then, get the row and column count
df[df['age'] > 64].shape
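
.shape gives you both the rows and the columns. If you only want the number of rows, grab the first element, or use len():

# Just the number of rows
df[df['age'] > 64].shape[0]

# len() gives you the row count, too
len(df[df['age'] > 64])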

You can also use .count(), but it’s a little different: it counts the non-missing values in each column instead of the number of rows.
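
A quick sketch of the difference:

# .count() counts the non-missing values in each column,
# so columns with NaN in them will show smaller numbers
df[df['age'] > 64].count()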

Working with strings

Converting strings to floats or ints

You can convert a string to a float by using astype.

# Convert height to a float
df['height'].astype(float)

Make sure you only have numbers before you do this. If you get an error, it’s probably because you have a percent sign or a measurement (lbs, mm, etc) or something else in there that can’t be converted to a number.

# Convert height to a float after getting
# rid of the measurement
df['height'].str.replace("cm", "").astype(float)

It works for ints, too!

# Convert age to an int
df['age'].astype(int)
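
If you can’t easily clean out every stray character, another option is pd.to_numeric with errors='coerce' - just a sketch, assuming pandas is imported as pd like everywhere else on this page. Be careful: anything that can’t be converted quietly becomes NaN.

# Convert height to a number, turning anything
# unconvertible into NaN instead of raising an error
pd.to_numeric(df['height'], errors='coerce')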

Replacing or removing part of a string

You can use .str.replace to replace part of a string with some other string.

# Replace "Mister" with "Mr"
df['name'] = df['name'].str.replace('Mister', 'Mr')

To remove something, just “replace” it with '', an empty string!

# Remove "lbs" completely
df['weight'] = df['weight'].str.replace('lbs.', '')
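
One thing to watch out for: in some versions of pandas, .str.replace treats the pattern as a regular expression, and the . in 'lbs.' would match any character. If you want to be explicit that it’s plain text, pass regex=False (a sketch):

# Treat 'lbs.' as literal text, not a regular expression
df['weight'] = df['weight'].str.replace('lbs.', '', regex=False)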

Exact matches

# Exactly match "Strawberry"
df['name'] == 'Strawberry'

Partial matches

# Match strawberry, blueberry, raspberry
# and any other 'berry'
df['name'].str.contains("berry")

But NaN values will be trouble! To ignore them, use na=False

# Match strawberry, blueberry, raspberry
# and any other 'berry'
# Does NOT have an error if a name is missing
df['name'].str.contains("berry", na=False)

Match many possibilities

df['name'].isin(['strawberry', 'blueberry', 'raspberry'])

This looks nicer when you do it with a separate variable

# Get an exact match with a few different options
berries = ['strawberry', 'blueberry', 'raspberry']
df['name'].isin(berries)
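
By itself .isin only gives you True and False for each row. To actually pull out the matching rows, wrap it in df[...]:

# Keep only the rows whose name is one of the berries
berries = ['strawberry', 'blueberry', 'raspberry']
df[df['name'].isin(berries)]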

Value counts

If you ever get excited and think “I should use groupby!” you might first want to ask yourself: “Wait, maybe I should just use .value_counts() instead?”

Normal counting

.value_counts() is great for figuring out how often something shows up. Use it all of the time with non-numeric data.

# How many rows from each state?
df['state'].value_counts()

Get the counts as percentages

# How many rows from each state, as a fraction of the total?
# (normalize=True gives values between 0 and 1 - multiply by 100 for percentages)
df['state'].value_counts(normalize=True)

Count missing data

By default, value_counts() ignores missing data. If you have 100 dogs and 25 are male and 25 are female, value_counts() will tell you the sex split is 50/50. If you’d like to know it’s really 50/25/25, use dropna=False.

# How many rows from each state, including missing values?
df['state'].value_counts(dropna=False)

You can combine this with normalize=True if you’d like.
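
A sketch of the combined version:

# Fraction of rows from each state, with missing values
# counted as their own category
df['state'].value_counts(normalize=True, dropna=False)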

Graphing value_counts() with big numbers on top

When you do value_counts() and want the top values, you normally do this

# Get the 10 most popular states and graph them
df['state'].value_counts().sort_values(ascending=False).head(10).plot(kind='barh')

But if you try to graph it, the longest bars will be on the bottom! To fix that, you need to reverse your sort and use .tail() instead of head().

# Get the 10 most popular states and graph them
df['state'].value_counts().sort_values(ascending=True).tail(10).plot(kind='barh')

You could also just remove ascending=True since it’s the default.

Value counts for different categories

A popular way to phrase this is the most popular CATEGORY_X based on CATEGORY_Y (jobs based on gender, crime based on city, export based on country).

If you skip that second .groupby, .head(3) just takes the first three rows of the whole result, so you’ll get the top values for one category and nothing at all for the others.

# Top 3 most popular occupations by gender
df.groupby("Gender")['Occupation'].value_counts().groupby(level=0).head(3)

You can also use nlargest instead of head.
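
A rough sketch of the nlargest version - note that it repeats the group key in the index, so the output looks a little different from the head version:

# Top 3 most popular occupations by gender, using nlargest
df.groupby("Gender")['Occupation'].value_counts().groupby(level=0).nlargest(3)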

Graphing

Plotting charts on top of each other

We didn’t talk about this in class, but we use ax=ax, my least favorite thing in the pandas world!

# First draw one plot, save it as "ax"
# Then draw another plot, passing it "ax=ax"
ax = df.plot(kind='scatter', x='Age', y='Min Income', c='yellow')
df.plot(kind='scatter', x='Age', y='Max Income', c='blue', ax=ax)

We’ll talk more about this in the future
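
If you’d rather build the ax yourself with matplotlib, that works too - just another way of doing the same thing, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Create a figure and an axes, then tell both plots to draw on that axes
fig, ax = plt.subplots()
df.plot(kind='scatter', x='Age', y='Min Income', c='yellow', ax=ax)
df.plot(kind='scatter', x='Age', y='Max Income', c='blue', ax=ax)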

Changing the number of bins in a histogram

By default, a histogram uses 10 bins. Use bins= to increase that number

# Plot a histogram of ages, with 30 bins
df['age'].hist(bins=30)

You can also use bins to set the specific bins

# Set specific boundaries for the bins
df['age'].hist(bins=[0, 18, 24, 34, 55, 64, 108])

If you want to get really fancy, you can use range to automatically generate even bins for a given range.

# Set bins to be between 0 and 100, with 10 step increments
# range(0, 101, 10) gives you
# [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# (you need to make it be 101 because the end isn't inclusive)
boundaries = range(0, 101, 10)
df['age'].hist(bins=boundaries)

Joining together data files

Using .merge to make SQL-like joins in pandas

Use .merge to combine a “left” and a “right” dataframe - the dataframe you call .merge on is the left one, and the one you pass in is the right one (you can remember which is which because “left” is the one on the… left). left_on and right_on are the names of the columns you’re matching up.

# match 'zip' from df
# to 'zip code' from other_df
# save it as a new dataframe
combined_df = df.merge(other_df, left_on="zip", right_on="zip code")
combined_df

Check out the documentation for more options.
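
One of the most useful options is how=, which works like SQL join types. The default is an inner join, which silently drops any rows that don’t have a match - here’s a sketch of keeping everything from the left dataframe instead:

# Keep every row from df, even if it doesn't
# have a match in other_df (a "left join")
combined_df = df.merge(other_df, left_on="zip", right_on="zip code", how="left")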

Missing data

Missing data has a lot of names! NaN is the official pandas name for it, but we can also call it na or nan or null or just “missing.”

Changing values to NaN when reading in your file

When you’re reading in your file, there’s usually missing data. Sometimes the missing data isn’t empty, though, it’s “Unknown” or “-999” or “???”. You can tell pandas to replace this data with NaN when reading it in.

# Read in a CSV, forcing 'Unknown', '-999' and '???' to be NaN
df = pd.read_csv("my-file.csv", na_values=["Unknown", "-999", "???"])

It’s easiest to always pass a list, even if you only have one value to replace.

To search for NaN values

To find something that’s NaN, you do not type == "NaN" - that’s a string! NaN is a special value, so it gets special ways to find it.

Filter a column to show NaN values

# Find NaN values
df['colname'].isnull()

Filter a column to show values that are NOT NaN

# Find values that are **not** NaN
df['colname'].notnull()
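
By themselves .isnull() and .notnull() only give you True and False. To actually see the rows, wrap them in df[...]:

# Show every row where colname is missing
df[df['colname'].isnull()]

# Show every row where colname is NOT missing
df[df['colname'].notnull()]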

Getting rid of NaN values with .dropna()

If you have a Series (the two-column ugly thing) and you want to get rid of NaN values, use .dropna()

# Get every name that is not NaN
df['Name'].dropna()

If you have a DataFrame (the nice table thing) and you want to get rid of every row with a NaN value, use .dropna() with a subset.

# Get rid of every row where Name is NaN
# BUT!!!! you don't overwrite df,
# df still has the NaN values in it
df.dropna(subset=['Name'])

If you want to permanently remove the NaN rows, use inplace=True

# Get rid of every row where Name is NaN
df.dropna(subset=['Name'], inplace=True)
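
Another way to make it stick - and a perfectly fine alternative to inplace=True - is to assign the result back to df yourself:

# Same effect as inplace=True: save the cleaned-up
# dataframe back into the df variable
df = df.dropna(subset=['Name'])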

Counting NaN values with value_counts()

By default, value_counts() ignores missing data. It might say “50% of your dogs are white and 50% of your dogs are black,” even if half of your dogs are missing colors!

# Have value_counts() pay attention to missing values
df['colname'].value_counts(dropna=False)

Replacing NaN values with other values

Maybe you think NaN values should be something else instead - “Unknown” or “0” or “Butterscotch.” You can fill in missing values with fillna. It does NOT save it back to the dataframe, though!

# Get everyone's name, but if it's NaN
# replace it with "Not sure"
df['Name'].fillna("Not sure")

To save it back to the dataframe, you can add inplace=True.

# Permanently replace all NaN names
# with the string "Not sure"
df['Name'].fillna("Not sure", inplace=True)

Send NaN to a certain position when sorting

When sorting, you can easily tell NaN values to go to the top or the bottom.

df.sort_values(by='colname', na_position='first')
df.sort_values(by='colname', na_position='last')

or if you’re doing a value_counts(), you don’t need to give it a column name

df['colname'].value_counts().sort_values(na_position='first')
df['colname'].value_counts().sort_values(na_position='last')

Avoiding errors with .str.contains and missing data

By default, .str.contains gives you True for a match, False for not a match, and NaN for an NaN value. That NaN will screw things up if you’re trying to filter rows! To prevent this trouble, add na=False.

# Match strawberry, blueberry, raspberry
# and any other 'berry'
# Does NOT have an error if a name is missing
df['name'].str.contains("berry", na=False)