Combining Excel tabs or sheets with pandas¶
Sometimes you have an Excel file with a lot of tabs (sheets, I think, technically?) and want to combine them all into one dataframe. The below will work as long as they all have the same column names.
We'll start by reading in the pd.ExcelFile, which is not the same as pd.read_excel.
import pandas as pd
# Notice this is not pd.read_excel, which only reads in one sheet
spreadsheet = pd.ExcelFile('multiple-sheets.xlsx')
spreadsheet
<pandas.io.excel._base.ExcelFile at 0x127852a40>
You can use .sheet_names to get a list of all of the sheets in the spreadsheet, and .parse(...) to grab individual spreadsheets.
# What are all of my sheets?
spreadsheet.sheet_names
['Sheet1', 'Sheet2']
# If you want one sheet
spreadsheet.parse('Sheet1')
| name | amount | |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 3 |
To combine the spreadsheets, we'll do three steps:
- Read in each tab (or sheet), so we have a list of dataframes
- Add in the name of the tab as a new column called
source_sheet(not necessary, but often useful!) - Combine them all with
pd.concat.
We'll start by reading them in. If you have a list of specific sheets you'd like to combine, you could use sheet_names = ['Sheet1', 'Sheet2'] instead of getting all of the sheets.
# Read in each sheet
sheet_names = spreadsheet.sheet_names
sheets = [spreadsheet.parse(name) for name in sheet_names]
# Add a new column so you know which sheet it came from
for sheet_name, sheet in zip(sheet_names, sheets):
sheet['source_sheet'] = sheet_name
# We have two spreadsheets!
len(sheets)
2
Since sheets is just a list, we can examine each spreadsheets individually with [0], [1], etc. Compare this to up above where we had to use their actual sheet name.
sheets[0]
| name | amount | source_sheet | |
|---|---|---|---|
| 0 | a | 1 | Sheet1 |
| 1 | b | 2 | Sheet1 |
| 2 | c | 3 | Sheet1 |
sheets[1]
| name | amount | source_sheet | |
|---|---|---|---|
| 0 | d | 4 | Sheet2 |
| 1 | e | 5 | Sheet2 |
| 2 | f | 6 | Sheet2 |
Notice how they're each a dataframe? That means we can use pd.concat to combine them into a single dataframe. We make sure to use ignore_index=True so that the new dataframe doesn't have any repeating indices.
# Combine them into one dataframe
df = pd.concat(sheets, ignore_index=True)
df
| name | amount | source_sheet | |
|---|---|---|---|
| 0 | a | 1 | Sheet1 |
| 1 | b | 2 | Sheet1 |
| 2 | c | 3 | Sheet1 |
| 3 | d | 4 | Sheet2 |
| 4 | e | 5 | Sheet2 |
| 5 | f | 6 | Sheet2 |