Working with many files in pandas
Dealing with files
Opening a file not in your notebook directory
You can use relative paths to use files not in your current notebook directory. Usually this means “start from the current directory, and go inside of a directory, and then find a file in there.”
# Open an Excel file inside of the 'data' directory
pd.read_excel("data/cooldata.xlsx")
Renaming columns
If you only want to rename some columns, you can use a dictionary. Be sure to use inplace=True to save the result!
df.rename(columns={'id_no': 'ID Number', 'cntry': 'Country'}, inplace=True)
If you know the order of the columns, you can just assign a new list of names on top of the dataframe’s columns.
# You know that id is first and country is second
df.columns = ['ID Number', 'Country']
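Both approaches can be tried end to end on a tiny made-up dataframe (the column names here are just for illustration):

```python
import pandas as pd

# A tiny made-up dataframe with awkward column names
df = pd.DataFrame({'id_no': [1, 2], 'cntry': ['Peru', 'Chad']})

# Rename only the columns you name in the dictionary
df.rename(columns={'id_no': 'ID Number', 'cntry': 'Country'}, inplace=True)

print(df.columns.tolist())  # ['ID Number', 'Country']
```

Assigning `df.columns = ['ID Number', 'Country']` to the original dataframe would have given the same result, as long as the list is the right length and in the right order.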
Selecting data
Selecting null or non-null rows
You can use .notnull() and .isnull() to select rows based on whether a column is missing data.
# Select everywhere that name is NOT null
df[df.name.notnull()]
# Select everywhere that name IS null
df[df.name.isnull()]
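A quick sketch with a made-up dataframe shows how the two filters split the rows:

```python
import numpy as np
import pandas as pd

# A made-up dataframe with one missing name
df = pd.DataFrame({'name': ['Ana', np.nan, 'Bo'], 'age': [34, 28, np.nan]})

# Rows where name is NOT missing
has_name = df[df.name.notnull()]
# Rows where name IS missing
no_name = df[df.name.isnull()]

print(len(has_name), len(no_name))  # 2 1
```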
Selecting rows that do NOT contain specific text
If you wanted to find rows where the country is a “-stan,” you can use .str.contains.
# Select rows where Country ends in 'stan'
df[df.Country.str.contains("stan$")]
In order to get the opposite, you can use ~ to negate your condition (aka “do the opposite”).
df[~df.Country.str.contains("stan$")]
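Here is the same pattern on a small made-up dataframe, showing both the match and its negation:

```python
import pandas as pd

# Made-up country data
df = pd.DataFrame({'Country': ['Kazakhstan', 'Peru', 'Tajikistan']})

# Countries ending in 'stan' ($ anchors the match to the end)
stans = df[df.Country.str.contains("stan$")]
# The ~ flips the True/False values, selecting everything else
not_stans = df[~df.Country.str.contains("stan$")]

print(not_stans.Country.tolist())  # ['Peru']
```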
You can do this for many other things, but it’s most common with situations like .str.contains.
Reading in many files
When you’re working with a lot of files (or spreadsheet pages, etc.), you normally go through a four-step process.
- Get a list of filenames (or sheets)
- Turn those into a list of dataframes
- Add the filename as a column to each dataframe
- Combine the dataframes into one big huge dataframe
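The whole four-step process can be sketched in one go. To keep this self-contained, in-memory CSV strings stand in for real files on disk (with real files, step one would be `glob.glob("datafile-*.csv")` instead):

```python
import io
import pandas as pd

# In-memory stand-ins for files on disk (made up for illustration)
fake_files = {
    'datafile-0.csv': "name,score\nAna,1\nBo,2\n",
    'datafile-1.csv': "name,score\nCy,3\n",
}

# STEP ONE: get a list of filenames
filenames = list(fake_files.keys())
# STEP TWO: turn those into a list of dataframes
list_of_dfs = [pd.read_csv(io.StringIO(fake_files[f])) for f in filenames]
# STEP THREE: add the filename as a column to each dataframe
for dataframe, filename in zip(list_of_dfs, filenames):
    dataframe['filename'] = filename
# STEP FOUR: combine into one big huge dataframe
combined = pd.concat(list_of_dfs, ignore_index=True)

print(combined.shape)  # (3, 3)
```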
STEP ONE: Getting a list of filenames
We use the glob library to get a list of files matching a pattern. It’s such a fun name that I try to say it as much as possible.
import glob
# The * is not a regex, it just means "match anything"
# This matches datafile-0.csv, datafile-1.csv, etc.
glob.glob("datafile-*.csv")
You do not do this if you have an Excel sheet. Hopefully you just have the sheet names in a list, because I don’t really want to tell you about pd.ExcelFile.
STEP TWO: Turn those into a list of dataframes
Working with a list of csv filenames
Now that we have a list of filenames, we can convert those into a list of dataframes with a list comprehension. Each filename goes in, and the appropriate dataframe comes out.
list_of_dfs = [pd.read_csv(filename) for filename in filenames]
Working with a list of Excel sheet names
If you’re working with an Excel file, it’s basically the same, you just use the variable to change sheet_name.
sheetnames = ['Sheet1', 'Sheet2', 'Sheet3']
list_of_dfs = [pd.read_excel("datastuff.xlsx", sheet_name=sheet) for sheet in sheetnames]
STEP THREE: Adding the filename to each dataframe
Each dataframe is from a specific file, and it’s usually helpful to take that filename and make it a column. Sometimes your filenames are 2015-data.csv and 2014-data.csv but there is no ‘year’ column! So you need to take the filename, add it to a column, and then extract the year.
# zip loops through TWO THINGS AT ONCE
# so you're looking at dataframe #1 and filename #1
# then dataframe #2 and filename #2
# etc
# and assigning that filename as a new column in the dataframe
for dataframe, filename in zip(list_of_dfs, filenames):
    dataframe['filename'] = filename
You can see if it worked by looking at list_of_dfs[0] and list_of_dfs[-1] to see if the first and last both have the filename in there.
After you merge them in the next step you’ll then be able to extract any useful data from the filename.
STEP FOUR: Combining multiple dataframes
Your best friend when stacking dataframes on top of each other is pd.concat. It accepts a list of dataframes. You have a list of dataframes already, so it’s going to be easy!
You should always add ignore_index=True when using pd.concat or else spooky bugs will come and eat you in your sleep. It re-indexes your data; otherwise, if you have 100 dataframes you’re combining, you’ll have 100 rows with index 0 and 100 with index 1 and 100 with index 2. We don’t want that, so ignore_index=True!
# Combine a list of dataframes, on top of each other
combined_df = pd.concat(list_of_dfs, ignore_index=True)
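You can see exactly what ignore_index=True saves you from with two tiny made-up dataframes:

```python
import pandas as pd

# Two made-up dataframes that each have indexes 0 and 1
df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})

# Without ignore_index, the old indexes are kept and repeat
messy = pd.concat([df_a, df_b])
print(messy.index.tolist())  # [0, 1, 0, 1]

# With ignore_index=True, every row gets a fresh index
clean = pd.concat([df_a, df_b], ignore_index=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```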
Deleting rows
Deleting rows based on a condition
The easiest way to delete a row isn’t to delete it, but just to create a new dataframe! For example, “delete rows where age is null” can be rephrased as “create a new dataframe where age is not null.”
# Find the rows where age isn't null
# And save them into your new dataframe
df = df[df.age.notnull()]
Deleting rows based on null values
You can also use .dropna() to drop rows that have missing data. You usually use this with subset to specify which columns to pay attention to.
# Drop all rows where age or name are missing
df.dropna(subset=['age', 'name'], inplace=True)
Be sure to use inplace=True to save the result!
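A small made-up example shows that only the columns in subset matter, so missing data elsewhere survives:

```python
import numpy as np
import pandas as pd

# Made-up data: each row is missing something different
df = pd.DataFrame({
    'name': ['Ana', np.nan, 'Cy'],
    'age': [34, 28, np.nan],
    'city': [np.nan, 'Lima', 'Oslo'],
})

# Drop rows missing age or name; a missing city is fine
df.dropna(subset=['age', 'name'], inplace=True)

print(df.name.tolist())  # ['Ana']
```

Only the first row survives: the second is missing a name and the third is missing an age, but a missing city never triggers the drop.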
Filling in missing data
Filling in all missing data
If you want to fill in some missing data, fillna is great. It will take every NaN and replace it with some data.
For example, let’s say you had a bunch of null values and you wanted to replace them with the word “Unknown”
# Use inplace=True to save it back to the dataframe
df['height'].fillna("Unknown", inplace=True)
This is similar to using .replace with np.nan.
import numpy as np
df['height'].replace(np.nan, "Unknown", inplace=True)
But fillna doesn’t need an import!
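Here is fillna on a made-up height column with one missing value:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing height
df = pd.DataFrame({'height': [160.0, np.nan, 175.0]})

# Every NaN in the column becomes "Unknown"
filled = df['height'].fillna("Unknown")

print(filled.tolist())  # [160.0, 'Unknown', 175.0]
```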
Filling DOWN from a column through missing data
Let’s say we had something that looked like this
Neighborhood | Borough
---|---
NaN | Brooklyn
Prospect Heights | NaN
Fort Greene | NaN
Clinton Hill | NaN
NaN | Manhattan
Murray Hill | NaN
Flatiron | NaN
And we wanted to pull down “Brooklyn” and “Manhattan” into the rows below them. This is very common when using data from Excel sheets, because people love to put weird headers everywhere and not continue the data down.
To make this work, you use fillna, but instead of giving it a value you say “use the value above the missing data.”
df['Borough'].fillna(method='ffill', inplace=True)
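The same fill-down on a shortened made-up version of the table (newer pandas prefers the spelling df['Borough'] = df['Borough'].ffill(), which is what this sketch uses):

```python
import numpy as np
import pandas as pd

# A made-up, shortened version of the neighborhood table
df = pd.DataFrame({
    'Neighborhood': [np.nan, 'Prospect Heights', 'Fort Greene', np.nan, 'Murray Hill'],
    'Borough': ['Brooklyn', np.nan, np.nan, 'Manhattan', np.nan],
})

# ffill copies each value downward into the NaNs below it
df['Borough'] = df['Borough'].ffill()

print(df.Borough.tolist())
# ['Brooklyn', 'Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan']
```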
Then you’ll end up with this:
Neighborhood | Borough
---|---
NaN | Brooklyn
Prospect Heights | Brooklyn
Fort Greene | Brooklyn
Clinton Hill | Brooklyn
NaN | Manhattan
Murray Hill | Manhattan
Flatiron | Manhattan
And then you can delete those boring header rows!
It has other options, too, like filling up with method='bfill', etc.
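Deleting the boring header rows is just the notnull filter from earlier, on a made-up version of the table; filling up works the same way with bfill:

```python
import numpy as np
import pandas as pd

# A made-up, shortened version of the neighborhood table
df = pd.DataFrame({
    'Neighborhood': [np.nan, 'Prospect Heights', np.nan, 'Murray Hill'],
    'Borough': ['Brooklyn', np.nan, 'Manhattan', np.nan],
})

# Fill the boroughs down, then drop the header rows (no Neighborhood)
df['Borough'] = df['Borough'].ffill()
df = df[df.Neighborhood.notnull()]
print(df.Neighborhood.tolist())  # ['Prospect Heights', 'Murray Hill']

# bfill is the same idea in the other direction: fill UP
print(pd.Series([np.nan, 5.0]).bfill().tolist())  # [5.0, 5.0]
```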
Customizing your graphics
The matplotlib gallery is a great resource for kind-of-understandable code. Just be sure to read the comments and spend some time with them. For example, customizing ticks might be useful.
You can also check the actual documentation which is pretty good SOMETIMES, for example this page - you can just read it and figure out what’s possible!
Overview of plot customization
Sometimes you can set options when using .plot…
df.plot(x='Country', y='GDP', kind='barh', title='GDP of different countries')
…or you can save the plot as an ax and use it to make tweaks.
ax = df.plot(x='Country', y='GDP', kind='barh')
ax.set_title('GDP of different countries')
Honestly, the best way to do this is to type ax.set_ and press tab to see what your options might be. It’s usually better to try that first than fighting Stack Overflow.
Setting titles and axis labels
There are simple .set_ options available for you!
ax.set_title("This is the title")
ax.set_xlabel("Here is the x axis label")
ax.set_ylabel("Here is the y axis label")
The axis labels can also be set in a slightly more complicated way…
ax.xaxis.set_label_text("This is the x axis label")
ax.yaxis.set_label_text("This is the y axis label")
Removing the box around your graph (or part of it)
The lines that box in your graph are called spines, and you can turn them off and on.
# If you wanted to turn all of them off
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
Adjusting ticks
Ticks are the little marks on the edge of each axis. You can move ticks to the top, bottom, left, right, or make them disappear with .set_ticks_position.
# Put the x axis ticks on top
ax.xaxis.set_ticks_position('top')
# Remove the x axis ticks
ax.xaxis.set_ticks_position('none')
You can set the values where tick marks show up by using .set_ticks
ax.xaxis.set_ticks([100, 200, 300, 400])
Grid lines
Grids can be set with .grid, where you can turn them off and on and use fun things like linestyle and linewidth
# Turn on all grids
ax.grid(True)
# Turn on grids on the y axis
ax.grid(axis='y')
Format axis numbers
We’re going to use weird formats like ${x:,.0f} to say “make this have a dollar sign and commas, and even if it’s a floating point number, use 0 digits after the decimal point.”
import matplotlib.ticker as ticker
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter("${x:,.0f}"))
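You can poke at the format string without drawing anything, since the formatter is just a Python format string applied to each tick value:

```python
import matplotlib.ticker as ticker

# The formatter fills {x} with each tick value
fmt = ticker.StrMethodFormatter("${x:,.0f}")

print(fmt(1234567))  # $1,234,567
```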
You can play around with that to get what you need.
Setting max/min values
Maximum and minimum values on each axis can be set up with xlim and ylim.
# Set it when plotting
df.plot(x='age', y='gdp', xlim=(0, 100), ylim=(0, 10000000))
# Use .set after you make the ax
ax = df.plot(x='age', y='gdp')
ax.set_xlim([0, 100])
ax.set_ylim([0, 10000000])
Not sure what your largest numbers are? You could do something like df.age.max() to get the maximum, and then feed it into xlim!
Turn off legends
df.plot(x='age', y='gdp', legend=False)
Combining dataframes
Stacking dataframes “on top” of one another
Given a list of dataframes, you can use pd.concat to push them all together. You also need to use ignore_index to say “give them new indexes” so that you don’t have dozens of index 0s and index 1s and all of that.
combined = pd.concat(list_of_dfs, ignore_index=True)
Pushing dataframes “next to” one another
We can also stack side by side, kind of like merge, by changing the axis we tell concat to pay attention to.
combined = pd.concat(list_of_dfs, axis=1)
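With axis=1, concat matches rows up by index and glues the columns side by side. A sketch with two made-up dataframes:

```python
import pandas as pd

# Two made-up dataframes that share an index
left = pd.DataFrame({'name': ['Ana', 'Bo']})
right = pd.DataFrame({'score': [1, 2]})

# axis=1 puts them next to each other instead of on top
combined = pd.concat([left, right], axis=1)

print(combined.columns.tolist())  # ['name', 'score']
```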
Extracting data from a single column
Extracting multiple columns out of a single text column
We’ve used .str.extract before to pull out one piece of information and assign it to a column, but you can also grab multiple pieces of information at the same time!
# expand=True is default, but I add it to keep it from yelling
df.sentence.str.extract('My name is (.*) and I like (.*)', expand=True)
Looking for multiple matches gives you a dataframe instead of a series, so you can’t just push them into a column in your original dataframe.
Making those a nice dataframe
In order to add those to a dataframe, you should probably rename the columns first.
extracted = df.sentence.str.extract('My name is (.*) and I like (.*)', expand=True)
extracted.columns = ['name', 'food']
Once you have extracted, you can put it side-by-side with the original dataframe using concat (see above) in order to create the new columns.
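The whole extract-rename-concat sequence, on a couple of made-up sentences:

```python
import pandas as pd

# Made-up sentences that all follow the same pattern
df = pd.DataFrame({'sentence': [
    'My name is Ana and I like tacos',
    'My name is Bo and I like soup',
]})

# Two capture groups give you a two-column dataframe
extracted = df.sentence.str.extract('My name is (.*) and I like (.*)', expand=True)
extracted.columns = ['name', 'food']

# Side-by-side concat attaches the new columns to the original
df = pd.concat([df, extracted], axis=1)

print(df.food.tolist())  # ['tacos', 'soup']
```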