Editing HTML with BeautifulSoup to remove extra headers and junk rows for `pd.read_html`¶

The pd.read_html tool from pandas is great: if you give it a webpage, it will try to convert all of the tables on the page into dataframes.

When it works correctly it's a joy, but sometimes we end up with a bunch of extra junk data! Maybe it's more headers or more rows, or extra info hiding in each and every cell. Let's see how we can edit the HTML of a web page with BeautifulSoup to make the pd.read_html process go a bit more smoothly.

Using `pd.read_html`¶

We'll start by reading the table off of this page.

An ugly table

We'll just use pd.read_html to pull in the table. It shouldn't be any trouble, right??

In [36]:

            
                Copied!
                
import pandas as pd

tables = pd.read_html("https://jsoma.github.io/scraping-examples/ugly-table.html")
df = tables[0]

df.tail(10)
import pandas as pd

tables = pd.read_html("https://jsoma.github.io/scraping-examples/ugly-table.html")
df = tables[0]

df.tail(10)

Out[36]:

	Data pager 12345678910... 150959 items in 15096 pages
	(Estimated) Report Publish Date(s)	Report(s)	Event Date	Location	Make/Model	RegistrationNumber	NTSB No.	EventSeverity	Type of Air Carrier Operation and Carrier Name (Doing Business As)
	Data pager	Unnamed: 1_level_2	Unnamed: 2_level_2	Unnamed: 3_level_2	Unnamed: 4_level_2	Unnamed: 5_level_2	Unnamed: 6_level_2	Unnamed: 7_level_2	Unnamed: 8_level_2
	Data pager	Unnamed: 1_level_3	Unnamed: 2_level_3	Unnamed: 3_level_3	Unnamed: 4_level_3	Unnamed: 5_level_3	Unnamed: 6_level_3	Unnamed: 7_level_3	Unnamed: 8_level_3
7	Factual 01/15/2009Final 01/15/2009	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Cherokee, AL	BELL 206L-3	N109AE	NYC08FA071	Fatal(3)	NaN
8	Factual 11/11/2008Final 12/03/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Crowley, TX	ALTHOUSE RAF 2000 GTX	N136DG	DFW08LA052	Fatal(2)	NaN
9	Factual 01/29/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Gunnison, CO	Piper PA-46-310P	N33MF	DEN08CA045	Nonfatal	NaN
10	Factual 02/14/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Abingdon, IL	Ercoupe (Eng & Research Corp.) 415D	N94131	CHI08CA056	Nonfatal	NaN
11	Factual 02/15/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Crystal Falls, MI	Cessna 172F	N8191U	CHI08CA058	Nonfatal	NaN
12	Factual 12/31/2008Final 07/15/2009	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Venice, LA	Bell 206L1	N211EL	DFW08FA053	Fatal(1)	NaN
13	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...	Data pager 12345678910... Page size: select ...
14	Data pager	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	12345678910... Page size: select 150959 item...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	NaN	select	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Unfortunately, that was awful. Extra headers (multi-index headers!), extra rows, lots of Unnamed columns.

What happened? It turns out that there are other tables inside of the table, which makes their headers get confused with the headers of the main table. There was also some pagination down at the bottom which pandas mistook for real data.

tables in tables

Editing HTML and removing elements with BeautifulSoup¶

Sometimes you can use skiprows= and columns= and other parameters of pd.read_html to clean things up. But we're lazy! We want it all done in a single pass! To make things simple, we're going to remove the unwanted elements with BeautifulSoup before we send it to pd.read_html.

Instead of providing the URL to pd.read_html directly, we'll start by reading the page in from requests.

In [37]:

            
                Copied!
                
from bs4 import BeautifulSoup

response = requests.get("https://jsoma.github.io/scraping-examples/ugly-table.html")
doc = BeautifulSoup(response.text)
from bs4 import BeautifulSoup

response = requests.get("https://jsoma.github.io/scraping-examples/ugly-table.html")
doc = BeautifulSoup(response.text)

How should we select the tags are that we want to remove? Do they have classes or ids, or is it something more complex?

If we poke around a little bit, we see that the pagination on top is inside of a row with the class rgPager. If we scroll down to the bottom, we see it's also inside of .rgPager.

rgPager 1

rgPager 2

Now that we've identified them, it's time to remove them. In BeautifulSoup, you can remove elements with decompose().

In [38]:

            
                Copied!
                
# Select all elements with the class rgPager, then remove them
pagers = doc.select(".rgPager")
for pager in pagers:
    pager.decompose()
# Select all elements with the class rgPager, then remove them
pagers = doc.select(".rgPager")
for pager in pagers:
    pager.decompose()

Feeding the edited HTML to `pd.read_html`¶

The unwanted parts of the page have been removed, we can now send the HTML right to pandas. To convert the edited document to HTML we'll use str(doc).

Be sure to not use requests.text! That's the original, unedited document.

In [40]:

            
                Copied!
                
import pandas as pd

tables = pd.read_html(str(doc))
df = tables[0]
df
import pandas as pd

tables = pd.read_html(str(doc))
df = tables[0]
df

Out[40]:

	(Estimated) Report Publish Date(s)	Report(s)	Event Date	Location	Make/Model	RegistrationNumber	NTSB No.	EventSeverity	Type of Air Carrier Operation and Carrier Name (Doing Business As)
0	Foreign	Foreign (HTML)	12/31/2007	Guernsey, United Kingdom	Cessna T303	NaN	CHI08WA075	Nonfatal	NaN
1	Factual 01/25/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/31/2007	Santa Ana, CA	Piper PA-12	N2800D	SEA08CA056	Nonfatal	NaN
2	Factual 01/29/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Alexandria, MN	Lerohl RV-8	N5093F	CHI08CA057	Nonfatal	NaN
3	Factual 09/05/2008Final 06/20/2014	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Paso Robles, CA	Cirrus Design Corp. SR22	N254SR	LAX08FA043	Fatal(1)	NaN
4	Factual 01/15/2009Final 01/15/2009	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Cherokee, AL	BELL 206L-3	N109AE	NYC08FA071	Fatal(3)	NaN
5	Factual 11/11/2008Final 12/03/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Crowley, TX	ALTHOUSE RAF 2000 GTX	N136DG	DFW08LA052	Fatal(2)	NaN
6	Factual 01/29/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Gunnison, CO	Piper PA-46-310P	N33MF	DEN08CA045	Nonfatal	NaN
7	Factual 02/14/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Abingdon, IL	Ercoupe (Eng & Research Corp.) 415D	N94131	CHI08CA056	Nonfatal	NaN
8	Factual 02/15/2008Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Crystal Falls, MI	Cessna 172F	N8191U	CHI08CA058	Nonfatal	NaN
9	Factual 12/31/2008Final 07/15/2009	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Venice, LA	Bell 206L1	N211EL	DFW08FA053	Fatal(1)	NaN

A bonus edit¶

A troublesome part of this is the (Estimated) Report Publish Date(s) column, which seems all smashed together.

In [43]:

            
                Copied!
                
df['(Estimated) Report Publish Date(s)'].loc[1]
df['(Estimated) Report Publish Date(s)'].loc[1]

Out[43]:

'Factual 01/25/2008Final 02/28/2008'

This is happening because of the <br> elements. You can read more about fixing that on the br element solutions page, but for now we'll just replace each br with a newline.

In [44]:

            
                Copied!
                
brs = doc.find_all("br")
for br in brs:
    br.replace_with("\n")
brs = doc.find_all("br")
for br in brs:
    br.replace_with("\n")

Now when we take a look at our results, they're formatted in the way we'd expect!

In [48]:

            
                Copied!
                
tables = pd.read_html(str(doc))
df = tables[0]
df
tables = pd.read_html(str(doc))
df = tables[0]
df

Out[48]:

	(Estimated) Report Publish Date(s)	Report(s)	Event Date	Location	Make/Model	Registration Number	NTSB No.	Event Severity	Type of Air Carrier Operation and Carrier Name (Doing Business As)
0	Foreign	Foreign (HTML)	12/31/2007	Guernsey, United Kingdom	Cessna T303	NaN	CHI08WA075	Nonfatal	NaN
1	Factual 01/25/2008 Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/31/2007	Santa Ana, CA	Piper PA-12	N2800D	SEA08CA056	Nonfatal	NaN
2	Factual 01/29/2008 Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Alexandria, MN	Lerohl RV-8	N5093F	CHI08CA057	Nonfatal	NaN
3	Factual 09/05/2008 Final 06/20/2014	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Paso Robles, CA	Cirrus Design Corp. SR22	N254SR	LAX08FA043	Fatal(1)	NaN
4	Factual 01/15/2009 Final 01/15/2009	Final Report PDF \| HTML Data Summary (PDF)	12/30/2007	Cherokee, AL	BELL 206L-3	N109AE	NYC08FA071	Fatal(3)	NaN
5	Factual 11/11/2008 Final 12/03/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Crowley, TX	ALTHOUSE RAF 2000 GTX	N136DG	DFW08LA052	Fatal(2)	NaN
6	Factual 01/29/2008 Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Gunnison, CO	Piper PA-46-310P	N33MF	DEN08CA045	Nonfatal	NaN
7	Factual 02/14/2008 Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Abingdon, IL	Ercoupe (Eng & Research Corp.) 415D	N94131	CHI08CA056	Nonfatal	NaN
8	Factual 02/15/2008 Final 02/28/2008	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Crystal Falls, MI	Cessna 172F	N8191U	CHI08CA058	Nonfatal	NaN
9	Factual 12/31/2008 Final 07/15/2009	Final Report PDF \| HTML Data Summary (PDF)	12/29/2007	Venice, LA	Bell 206L1	N211EL	DFW08FA053	Fatal(1)	NaN

Works great!

Depending on what you're looking for and how you like to work with your data, you might want to replace the <br> with a comma, space, or anything else.

In [ ]:

Editing HTML with BeautifulSoup to remove extra headers and junk rows for pd.read_html¶

Using pd.read_html¶

Editing HTML and removing elements with BeautifulSoup¶

Feeding the edited HTML to pd.read_html¶

A bonus edit¶

Editing HTML with BeautifulSoup to remove extra headers and junk rows for `pd.read_html`¶

Using `pd.read_html`¶

Feeding the edited HTML to `pd.read_html`¶