Working with many files in pandas
Dealing with files
Opening a file not in your notebook directory
You can use relative paths to use files not in your current notebook directory. Usually this means “start from the current directory, and go inside of a directory, and then find a file in there.”
# Open an Excel file inside of the 'data' directory
pd.read_excel("data/cooldata.xlsx")
Renaming columns
If you only want to rename some columns, you can use a dictionary. Be sure to use inplace=True to save the result!
df.rename(columns={'id_no': 'ID Number', 'cntry': 'Country'}, inplace=True)
If you know the order of the columns, you can just assign a new list of names on top of the dataframe’s columns.
# You know that id is first and country is second
df.columns = ['ID Number', 'Country']
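Both approaches can be tried end to end on a tiny made-up dataframe (the column names here are just for illustration):

```python
import pandas as pd

# A tiny made-up dataframe with awkward column names
df = pd.DataFrame({'id_no': [1, 2], 'cntry': ['Peru', 'Chad']})

# Rename only the columns you name in the dictionary
df.rename(columns={'id_no': 'ID Number', 'cntry': 'Country'}, inplace=True)

print(df.columns.tolist())  # ['ID Number', 'Country']
```

Assigning `df.columns = ['ID Number', 'Country']` to the original dataframe would have given the same result, as long as the list is the right length and in the right order.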
Selecting data
Selecting null or non-null rows
You can use .notnull() and .isnull() to select rows based on whether a column is missing data.
# Select everywhere that name is NOT null
df[df.name.notnull()]
# Select everywhere that name IS null
df[df.name.isnull()]
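A quick sketch with a made-up dataframe shows how the two filters split the rows:

```python
import numpy as np
import pandas as pd

# A made-up dataframe with one missing name
df = pd.DataFrame({'name': ['Ana', np.nan, 'Bo'], 'age': [34, 28, np.nan]})

# Rows where name is NOT missing
has_name = df[df.name.notnull()]
# Rows where name IS missing
no_name = df[df.name.isnull()]

print(len(has_name), len(no_name))  # 2 1
```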
Selecting rows that do NOT contain specific text
If you wanted to find rows where the country is a “-stan,” you can use .str.contains.
# Select rows where Country ends in 'stan'
df[df.Country.str.contains("stan$")]
In order to get the opposite, you can use ~ to negate your condition (aka “do the opposite”).
df[~df.Country.str.contains("stan$")]
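Here is the same pattern on a small made-up dataframe, showing both the match and its negation:

```python
import pandas as pd

# Made-up country data
df = pd.DataFrame({'Country': ['Kazakhstan', 'Peru', 'Tajikistan']})

# Countries ending in 'stan' ($ anchors the match to the end)
stans = df[df.Country.str.contains("stan$")]
# The ~ flips the True/False values, selecting everything else
not_stans = df[~df.Country.str.contains("stan$")]

print(not_stans.Country.tolist())  # ['Peru']
```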
You can do this for many other things, but it’s most common with situations like .str.contains.
Reading in many files
When you’re working with a lot of files (or spreadsheet pages, etc.), you normally go through a four-step process.
- Get a list of filenames (or sheets)
- Turn those into a list of dataframes
- Add the filename as a column to each dataframe
- Combine the dataframes into one big huge dataframe
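The whole four-step process can be sketched in one go. To keep this self-contained, in-memory CSV strings stand in for real files on disk (with real files, step one would be `glob.glob("datafile-*.csv")` instead):

```python
import io
import pandas as pd

# In-memory stand-ins for files on disk (made up for illustration)
fake_files = {
    'datafile-0.csv': "name,score\nAna,1\nBo,2\n",
    'datafile-1.csv': "name,score\nCy,3\n",
}

# STEP ONE: get a list of filenames
filenames = list(fake_files.keys())
# STEP TWO: turn those into a list of dataframes
list_of_dfs = [pd.read_csv(io.StringIO(fake_files[f])) for f in filenames]
# STEP THREE: add the filename as a column to each dataframe
for dataframe, filename in zip(list_of_dfs, filenames):
    dataframe['filename'] = filename
# STEP FOUR: combine into one big huge dataframe
combined = pd.concat(list_of_dfs, ignore_index=True)

print(combined.shape)  # (3, 3)
```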
STEP ONE: Getting a list of filenames
We use the glob library to get a list of files matching a pattern. It’s such a fun name that I try to say it as much as possible.
import glob
# The * is not a regex, it just means "match anything"
# This matches datafile-0.csv, datafile-1.csv, etc.
glob.glob("datafile-*.csv")
You do not do this if you have an Excel sheet. Hopefully you just have the sheet names in a list, because I don’t really want to tell you about pd.ExcelFile.
STEP TWO: Turn those into a list of dataframes
Working with a list of csv filenames
Now that we have a list of filenames, we can convert those into a list of dataframes with a list comprehension. Each filename goes in, and the appropriate dataframe comes out.
list_of_dfs = [pd.read_csv(filename) for filename in filenames]
Working with a list of Excel sheet names
If you’re working with an Excel file, it’s basically the same, you just use the variable to change sheet_name.
sheetnames = ['Sheet1', 'Sheet2', 'Sheet3']
list_of_dfs = [pd.read_excel("datastuff.xlsx", sheet_name=sheet) for sheet in sheetnames]
STEP THREE: Adding the filename to each dataframe
Each dataframe is from a specific file, and it’s usually helpful to take that filename and make it a column. Sometimes your filenames are 2015-data.csv and 2014-data.csv but there is no ‘year’ column! So you need to take the filename, add it to a column, and then extract the year.
# zip loops through TWO THINGS AT ONCE
# so you're looking at dataframe #1 and filename #1
# then dataframe #2 and filename #2
# etc
# and assigning that filename as a new column in the dataframe
for dataframe, filename in zip(list_of_dfs, filenames):
    dataframe['filename'] = filename
You can see if it worked by looking at list_of_dfs[0] and list_of_dfs[-1] to see if the first and last both have the filename in there.
After you merge them in the next step you’ll then be able to extract any useful data from the filename.
STEP FOUR: Combining multiple dataframes
Your best friend when stacking dataframes on top of each other is pd.concat. It accepts a list of dataframes. You have a list of dataframes already, so it’s going to be easy!
You should always add ignore_index=True when using pd.concat or else spooky bugs will come and eat you in your sleep. It re-indexes your data; otherwise, if you have 100 dataframes you’re combining, you’ll have 100 rows with index 0 and 100 with index 1 and 100 with index 2. We don’t want that, so ignore_index=True!
# Combine a list of dataframes, on top of each other
combined_df = pd.concat(list_of_dfs, ignore_index=True)
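You can see exactly what ignore_index=True saves you from with two tiny made-up dataframes:

```python
import pandas as pd

# Two made-up dataframes that each have indexes 0 and 1
df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})

# Without ignore_index, the old indexes are kept and repeat
messy = pd.concat([df_a, df_b])
print(messy.index.tolist())  # [0, 1, 0, 1]

# With ignore_index=True, every row gets a fresh index
clean = pd.concat([df_a, df_b], ignore_index=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```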
Deleting rows
Deleting rows based on a condition
The easiest way to delete a row isn’t to delete it, but just to create a new dataframe! For example, “delete rows where age is null” can be rephrased as “create a new dataframe where age is not null.”
# Find the rows where age isn't null
# And save them into your new dataframe
df = df[df.age.notnull()]
Deleting rows based on null values
You can also use .dropna() to drop rows that have missing data. You usually use this with subset to specify which columns to pay attention to.
# Drop all rows where age or name are missing
df.dropna(subset=['age', 'name'], inplace=True)
Be sure to use inplace=True to save the result!
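A small made-up example shows that only the columns in subset matter, so missing data elsewhere survives:

```python
import numpy as np
import pandas as pd

# Made-up data: each row is missing something different
df = pd.DataFrame({
    'name': ['Ana', np.nan, 'Cy'],
    'age': [34, 28, np.nan],
    'city': [np.nan, 'Lima', 'Oslo'],
})

# Drop rows missing age or name; a missing city is fine
df.dropna(subset=['age', 'name'], inplace=True)

print(df.name.tolist())  # ['Ana']
```

Only the first row survives: the second is missing a name and the third is missing an age, but a missing city never triggers the drop.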
Filling in missing data
Filling in all missing data
If you want to fill in some missing data, fillna is great. It will take every NaN and replace it with some data.
For example, let’s say you had a bunch of null values and you wanted to replace them with the word “Unknown”
# Use inplace=True to save it back to the dataframe
df['height'].fillna("Unknown", inplace=True)
This is similar to using .replace with np.nan.
import numpy as np
df['height'].replace(np.nan, "Unknown", inplace=True)
But fillna doesn’t need an import!
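Here is fillna on a made-up height column with one missing value:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing height
df = pd.DataFrame({'height': [160.0, np.nan, 175.0]})

# Every NaN in the column becomes "Unknown"
filled = df['height'].fillna("Unknown")

print(filled.tolist())  # [160.0, 'Unknown', 175.0]
```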
Filling DOWN from a column through missing data
Let’s say we had something that looked like this
Neighborhood | Borough
---|---
NaN | Brooklyn
Prospect Heights | NaN
Fort Greene | NaN
Clinton Hill | NaN
NaN | Manhattan
Murray Hill | NaN
Flatiron | NaN
And we wanted to pull down “Brooklyn” and “Manhattan” into the rows below them. This is very common when using data from Excel sheets, because people love to put weird headers everywhere and not continue the data down.
To make this work, you use fillna, but instead of giving it a value you say “use the value above the missing data.”
df['Borough'].fillna(method='ffill', inplace=True)
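The same fill-down on a shortened made-up version of the table (newer pandas prefers the spelling df['Borough'] = df['Borough'].ffill(), which is what this sketch uses):

```python
import numpy as np
import pandas as pd

# A made-up, shortened version of the neighborhood table
df = pd.DataFrame({
    'Neighborhood': [np.nan, 'Prospect Heights', 'Fort Greene', np.nan, 'Murray Hill'],
    'Borough': ['Brooklyn', np.nan, np.nan, 'Manhattan', np.nan],
})

# ffill copies each value downward into the NaNs below it
df['Borough'] = df['Borough'].ffill()

print(df.Borough.tolist())
# ['Brooklyn', 'Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan']
```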
Then you’ll end up with this:
Neighborhood | Borough
---|---
NaN | Brooklyn
Prospect Heights | Brooklyn
Fort Greene | Brooklyn
Clinton Hill | Brooklyn
NaN | Manhattan
Murray Hill | Manhattan
Flatiron | Manhattan
And then you can delete those boring header rows!
It has other options, too, like filling up with method='bfill', etc.
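Deleting the boring header rows is just the notnull filter from earlier, on a made-up version of the table; filling up works the same way with bfill:

```python
import numpy as np
import pandas as pd

# A made-up, shortened version of the neighborhood table
df = pd.DataFrame({
    'Neighborhood': [np.nan, 'Prospect Heights', np.nan, 'Murray Hill'],
    'Borough': ['Brooklyn', np.nan, 'Manhattan', np.nan],
})

# Fill the boroughs down, then drop the header rows (no Neighborhood)
df['Borough'] = df['Borough'].ffill()
df = df[df.Neighborhood.notnull()]
print(df.Neighborhood.tolist())  # ['Prospect Heights', 'Murray Hill']

# bfill is the same idea in the other direction: fill UP
print(pd.Series([np.nan, 5.0]).bfill().tolist())  # [5.0, 5.0]
```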
Customizing your graphics
The matplotlib gallery is a great resource for kind-of-understandable code. Just be sure to read the comments and spend some time with them. For example, customizing ticks might be useful.
You can also check the actual documentation which is pretty good SOMETIMES, for example this page - you can just read it and figure out what’s possible!
Overview of plot customization
Sometimes you can set options when using .plot…
df.plot(x='Country', y='GDP', kind='barh', title='GDP of different countries')
…or you can save the plot as an ax and use it to make tweaks.
ax = df.plot(x='Country', y='GDP', kind='barh')
ax.set_title('GDP of different countries')
Honestly, the best way to do this is to type ax.set_ and press tab to see what your options might be. It’s usually better to try that first than fighting Stack Overflow.
Setting titles and axis labels
There are simple .set_ options available for you!
ax.set_title("This is the title")
ax.set_xlabel("Here is the x axis label")
ax.set_ylabel("Here is the y axis label")
The axis labels can also be set in a slightly more complicated way…
ax.xaxis.set_label_text("This is the x axis label")
ax.yaxis.set_label_text("This is the y axis label")
Removing the box around your graph (or part of it)
The lines that box in your graph are called spines, and you can turn them off and on.
# If you wanted to turn all of them off
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
Adjusting ticks
Ticks are the little marks on the edge of each axis. You can move ticks to the top, bottom, left, right, or make them disappear with .set_ticks_position.
# Put the x axis ticks on top
ax.xaxis.set_ticks_position('top')
# Remove the x axis ticks
ax.xaxis.set_ticks_position('none')
You can set the values where tick marks show up by using .set_ticks
ax.xaxis.set_ticks([100, 200, 300, 400])
Grid lines
Grids can be set with .grid, where you can turn them off and on and use fun things like linestyle and linewidth
# Turn on all grids
ax.grid(True)
# Turn on grids on the y axis
ax.grid(axis='y')
Format axis numbers
We’re going to use weird formats like ${x:,.0f} to say “make this have a dollar sign and commas, and even if it’s a floating point number, use 0 digits after the decimal point.”
import matplotlib.ticker as ticker
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter("${x:,.0f}"))
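You can poke at the format string without drawing anything, since the formatter is just a Python format string applied to each tick value:

```python
import matplotlib.ticker as ticker

# The formatter fills {x} with each tick value
fmt = ticker.StrMethodFormatter("${x:,.0f}")

print(fmt(1234567))  # $1,234,567
```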
You can play around with that to get what you need.
Setting max/min values
Maximum and minimum values on each axis can be set up with xlim and ylim.
# Set it when plotting
df.plot(x='age', y='gdp', xlim=(0, 100), ylim=(0, 10000000))
# Use .set after you make the ax
ax = df.plot(x='age', y='gdp')
ax.set_xlim([0, 100])
ax.set_ylim([0, 10000000])
Not sure what your largest numbers are? You could do something like df.age.max() to get the maximum, and then feed it into xlim!
Turn off legends
df.plot(x='age', y='gdp', legend=False)
Combining dataframes
Stacking dataframes “on top” of one another
Given a list of dataframes, you can use pd.concat to push them all together. You also need to use ignore_index to say “give them new indexes” so that you don’t have dozens of index 0s and index 1s and all of that.
combined = pd.concat(list_of_dfs, ignore_index=True)
Pushing dataframes “next to” one another
We can also stack side by side, kind of like merge, by changing the axis we tell concat to pay attention to.
combined = pd.concat(list_of_dfs, axis=1)
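With axis=1, concat matches rows up by index and glues the columns side by side. A sketch with two made-up dataframes:

```python
import pandas as pd

# Two made-up dataframes that share an index
left = pd.DataFrame({'name': ['Ana', 'Bo']})
right = pd.DataFrame({'score': [1, 2]})

# axis=1 puts them next to each other instead of on top
combined = pd.concat([left, right], axis=1)

print(combined.columns.tolist())  # ['name', 'score']
```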
Extracting data from a single column
Extracting multiple columns out of a single text column
We’ve used .str.extract before to pull out one piece of information and assign it to a column, but you can also grab multiple pieces of information at the same time!
# expand=True is default, but I add it to keep it from yelling
df.sentence.str.extract('My name is (.*) and I like (.*)', expand=True)
Looking for multiple matches gives you a dataframe instead of a series, so you can’t just push them into a column in your original dataframe.
Making those a nice dataframe
In order to add those to a dataframe, you should probably rename the columns first.
extracted = df.sentence.str.extract('My name is (.*) and I like (.*)', expand=True)
extracted.columns = ['name', 'food']
Once you have extracted, you can put it side-by-side with the original dataframe using concat (see above) in order to create the new columns.
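The whole extract-rename-concat sequence, on a couple of made-up sentences:

```python
import pandas as pd

# Made-up sentences that all follow the same pattern
df = pd.DataFrame({'sentence': [
    'My name is Ana and I like tacos',
    'My name is Bo and I like soup',
]})

# Two capture groups give you a two-column dataframe
extracted = df.sentence.str.extract('My name is (.*) and I like (.*)', expand=True)
extracted.columns = ['name', 'food']

# Side-by-side concat attaches the new columns to the original
df = pd.concat([df, extracted], axis=1)

print(df.food.tolist())  # ['tacos', 'soup']
```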