Combining Excel tabs or sheets with pandas¶
Sometimes you have an Excel file with a lot of tabs (sheets, I think, technically?) and want to combine them all into one dataframe. The below will work as long as they all have the same column names.
We'll start by reading in the pd.ExcelFile
, which is not the same as pd.read_excel
.
import pandas as pd
# Notice this is not pd.read_excel, which only reads in one sheet
spreadsheet = pd.ExcelFile('multiple-sheets.xlsx')
spreadsheet
<pandas.io.excel._base.ExcelFile at 0x127852a40>
You can use .sheet_names
to get a list of all of the sheets in the spreadsheet, and .parse(...)
to grab individual spreadsheets.
# What are all of my sheets?
spreadsheet.sheet_names
['Sheet1', 'Sheet2']
# If you want one sheet
spreadsheet.parse('Sheet1')
name | amount | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
2 | c | 3 |
To combine the spreadsheets, we'll do three steps:
- Read in each tab (or sheet), so we have a list of dataframes
- Add in the name of the tab as a new column called
source_sheet
(not necessary, but often useful!) - Combine them all with
pd.concat
.
We'll start by reading them in. If you have a list of specific sheets you'd like to combine, you could use sheet_names = ['Sheet1', 'Sheet2']
instead of getting all of the sheets.
# Read in each sheet
sheet_names = spreadsheet.sheet_names
sheets = [spreadsheet.parse(name) for name in sheet_names]
# Add a new column so you know which sheet it came from
for sheet_name, sheet in zip(sheet_names, sheets):
sheet['source_sheet'] = sheet_name
# We have two spreadsheets!
len(sheets)
2
Since sheets
is just a list, we can examine each spreadsheets individually with [0]
, [1]
, etc. Compare this to up above where we had to use their actual sheet name.
sheets[0]
name | amount | source_sheet | |
---|---|---|---|
0 | a | 1 | Sheet1 |
1 | b | 2 | Sheet1 |
2 | c | 3 | Sheet1 |
sheets[1]
name | amount | source_sheet | |
---|---|---|---|
0 | d | 4 | Sheet2 |
1 | e | 5 | Sheet2 |
2 | f | 6 | Sheet2 |
Notice how they're each a dataframe? That means we can use pd.concat
to combine them into a single dataframe. We make sure to use ignore_index=True
so that the new dataframe doesn't have any repeating indices.
# Combine them into one dataframe
df = pd.concat(sheets, ignore_index=True)
df
name | amount | source_sheet | |
---|---|---|---|
0 | a | 1 | Sheet1 |
1 | b | 2 | Sheet1 |
2 | c | 3 | Sheet1 |
3 | d | 4 | Sheet2 |
4 | e | 5 | Sheet2 |
5 | f | 6 | Sheet2 |