Combining Excel tabs or sheets with pandas¶

Sometimes you have an Excel file with a lot of tabs (sheets, I think, technically?) and want to combine them all into one dataframe. The below will work as long as they all have the same column names.

We'll start by reading in the pd.ExcelFile, which is not the same as pd.read_excel.

In [20]:

            
                Copied!
                
import pandas as pd

# Notice this is not pd.read_excel, which only reads in one sheet
spreadsheet = pd.ExcelFile('multiple-sheets.xlsx')
spreadsheet
import pandas as pd

# Notice this is not pd.read_excel, which only reads in one sheet
spreadsheet = pd.ExcelFile('multiple-sheets.xlsx')
spreadsheet

Out[20]:

<pandas.io.excel._base.ExcelFile at 0x127852a40>

You can use .sheet_names to get a list of all of the sheets in the spreadsheet, and .parse(...) to grab individual spreadsheets.

In [26]:

            
                Copied!
                
# What are all of my sheets?
spreadsheet.sheet_names
# What are all of my sheets?
spreadsheet.sheet_names

Out[26]:

['Sheet1', 'Sheet2']

In [27]:

            
                Copied!
                
# If you want one sheet
spreadsheet.parse('Sheet1')
# If you want one sheet
spreadsheet.parse('Sheet1')

Out[27]:

	name	amount
0	a	1
1	b	2
2	c	3

To combine the spreadsheets, we'll do three steps:

Read in each tab (or sheet), so we have a list of dataframes
Add in the name of the tab as a new column called source_sheet (not necessary, but often useful!)
Combine them all with pd.concat.

We'll start by reading them in. If you have a list of specific sheets you'd like to combine, you could use sheet_names = ['Sheet1', 'Sheet2'] instead of getting all of the sheets.

In [33]:

            
                Copied!
                
# Read in each sheet
sheet_names = spreadsheet.sheet_names
sheets = [spreadsheet.parse(name) for name in sheet_names]

# Add a new column so you know which sheet it came from
for sheet_name, sheet in zip(sheet_names, sheets):
    sheet['source_sheet'] = sheet_name

# We have two spreadsheets!
len(sheets)
# Read in each sheet
sheet_names = spreadsheet.sheet_names
sheets = [spreadsheet.parse(name) for name in sheet_names]

# Add a new column so you know which sheet it came from
for sheet_name, sheet in zip(sheet_names, sheets):
    sheet['source_sheet'] = sheet_name

# We have two spreadsheets!
len(sheets)

Out[33]:

Since sheets is just a list, we can examine each spreadsheets individually with [0], [1], etc. Compare this to up above where we had to use their actual sheet name.

In [34]:

            
                Copied!
                
sheets[0]
sheets[0]

Out[34]:

	name	amount	source_sheet
0	a	1	Sheet1
1	b	2	Sheet1
2	c	3	Sheet1

In [35]:

            
                Copied!
                
sheets[1]
sheets[1]

Out[35]:

	name	amount	source_sheet
0	d	4	Sheet2
1	e	5	Sheet2
2	f	6	Sheet2

Notice how they're each a dataframe? That means we can use pd.concat to combine them into a single dataframe. We make sure to use ignore_index=True so that the new dataframe doesn't have any repeating indices.

In [24]:

            
                Copied!
                
# Combine them into one dataframe

df = pd.concat(sheets, ignore_index=True)
df
# Combine them into one dataframe

df = pd.concat(sheets, ignore_index=True)
df

Out[24]:

	name	amount	source_sheet
0	a	1	Sheet1
1	b	2	Sheet1
2	c	3	Sheet1
3	d	4	Sheet2
4	e	5	Sheet2
5	f	6	Sheet2