Editing HTML with BeautifulSoup to remove extra headers and junk rows for pd.read_html
¶
The pd.read_html
tool from pandas is great: if you give it a webpage, it will try to convert all of the tables on the page into dataframes.
When it works correctly it's a joy, but sometimes we end up with a bunch of extra junk data! Maybe it's more headers or more rows, or extra info hiding in each and every cell. Let's see how we can edit the HTML of a web page with BeautifulSoup to make the pd.read_html
process go a bit more smoothly.
Using pd.read_html
¶
We'll start by reading the table off of this page.
We'll just use pd.read_html
to pull in the table. It shouldn't be any trouble, right??
import pandas as pd
tables = pd.read_html("https://jsoma.github.io/scraping-examples/ugly-table.html")
df = tables[0]
df.tail(10)
Data pager 12345678910... 150959 items in 15096 pages | |||||||||
---|---|---|---|---|---|---|---|---|---|
(Estimated) Report Publish Date(s) | Report(s) | Event Date | Location | Make/Model | RegistrationNumber | NTSB No. | EventSeverity | Type of Air Carrier Operation and Carrier Name (Doing Business As) | |
Data pager | Unnamed: 1_level_2 | Unnamed: 2_level_2 | Unnamed: 3_level_2 | Unnamed: 4_level_2 | Unnamed: 5_level_2 | Unnamed: 6_level_2 | Unnamed: 7_level_2 | Unnamed: 8_level_2 | |
Data pager | Unnamed: 1_level_3 | Unnamed: 2_level_3 | Unnamed: 3_level_3 | Unnamed: 4_level_3 | Unnamed: 5_level_3 | Unnamed: 6_level_3 | Unnamed: 7_level_3 | Unnamed: 8_level_3 | |
7 | Factual 01/15/2009Final 01/15/2009 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Cherokee, AL | BELL 206L-3 | N109AE | NYC08FA071 | Fatal(3) | NaN |
8 | Factual 11/11/2008Final 12/03/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Crowley, TX | ALTHOUSE RAF 2000 GTX | N136DG | DFW08LA052 | Fatal(2) | NaN |
9 | Factual 01/29/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Gunnison, CO | Piper PA-46-310P | N33MF | DEN08CA045 | Nonfatal | NaN |
10 | Factual 02/14/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Abingdon, IL | Ercoupe (Eng & Research Corp.) 415D | N94131 | CHI08CA056 | Nonfatal | NaN |
11 | Factual 02/15/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Crystal Falls, MI | Cessna 172F | N8191U | CHI08CA058 | Nonfatal | NaN |
12 | Factual 12/31/2008Final 07/15/2009 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Venice, LA | Bell 206L1 | N211EL | DFW08FA053 | Fatal(1) | NaN |
13 | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... | Data pager 12345678910... Page size: select ... |
14 | Data pager | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | 12345678910... Page size: select 150959 item... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | NaN | select | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Unfortunately, that was awful. Extra headers (multi-index headers!), extra rows, lots of Unnamed
columns.
What happened? It turns out that there are other tables inside of the table, which makes their headers get confused with the headers of the main table. There was also some pagination down at the bottom which pandas mistook for real data.
Editing HTML and removing elements with BeautifulSoup¶
Sometimes you can use skiprows=
and columns=
and other parameters of pd.read_html
to clean things up. But we're lazy! We want it all done in a single pass! To make things simple, we're going to remove the unwanted elements with BeautifulSoup before we send it to pd.read_html
.
Instead of providing the URL to pd.read_html
directly, we'll start by reading the page in from requests
.
from bs4 import BeautifulSoup
response = requests.get("https://jsoma.github.io/scraping-examples/ugly-table.html")
doc = BeautifulSoup(response.text)
How should we select the tags are that we want to remove? Do they have classes or ids, or is it something more complex?
If we poke around a little bit, we see that the pagination on top is inside of a row with the class rgPager
. If we scroll down to the bottom, we see it's also inside of .rgPager
.
Now that we've identified them, it's time to remove them. In BeautifulSoup, you can remove elements with decompose()
.
# Select all elements with the class rgPager, then remove them
pagers = doc.select(".rgPager")
for pager in pagers:
pager.decompose()
Feeding the edited HTML to pd.read_html
¶
The unwanted parts of the page have been removed, we can now send the HTML right to pandas. To convert the edited document to HTML we'll use str(doc)
.
Be sure to not use
requests.text
! That's the original, unedited document.
import pandas as pd
tables = pd.read_html(str(doc))
df = tables[0]
df
(Estimated) Report Publish Date(s) | Report(s) | Event Date | Location | Make/Model | RegistrationNumber | NTSB No. | EventSeverity | Type of Air Carrier Operation and Carrier Name (Doing Business As) | |
---|---|---|---|---|---|---|---|---|---|
0 | Foreign | Foreign (HTML) | 12/31/2007 | Guernsey, United Kingdom | Cessna T303 | NaN | CHI08WA075 | Nonfatal | NaN |
1 | Factual 01/25/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/31/2007 | Santa Ana, CA | Piper PA-12 | N2800D | SEA08CA056 | Nonfatal | NaN |
2 | Factual 01/29/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Alexandria, MN | Lerohl RV-8 | N5093F | CHI08CA057 | Nonfatal | NaN |
3 | Factual 09/05/2008Final 06/20/2014 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Paso Robles, CA | Cirrus Design Corp. SR22 | N254SR | LAX08FA043 | Fatal(1) | NaN |
4 | Factual 01/15/2009Final 01/15/2009 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Cherokee, AL | BELL 206L-3 | N109AE | NYC08FA071 | Fatal(3) | NaN |
5 | Factual 11/11/2008Final 12/03/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Crowley, TX | ALTHOUSE RAF 2000 GTX | N136DG | DFW08LA052 | Fatal(2) | NaN |
6 | Factual 01/29/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Gunnison, CO | Piper PA-46-310P | N33MF | DEN08CA045 | Nonfatal | NaN |
7 | Factual 02/14/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Abingdon, IL | Ercoupe (Eng & Research Corp.) 415D | N94131 | CHI08CA056 | Nonfatal | NaN |
8 | Factual 02/15/2008Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Crystal Falls, MI | Cessna 172F | N8191U | CHI08CA058 | Nonfatal | NaN |
9 | Factual 12/31/2008Final 07/15/2009 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Venice, LA | Bell 206L1 | N211EL | DFW08FA053 | Fatal(1) | NaN |
A bonus edit¶
A troublesome part of this is the (Estimated) Report Publish Date(s)
column, which seems all smashed together.
df['(Estimated) Report Publish Date(s)'].loc[1]
'Factual 01/25/2008Final 02/28/2008'
This is happening because of the <br>
elements. You can read more about fixing that on the br element solutions page, but for now we'll just replace each br with a newline.
brs = doc.find_all("br")
for br in brs:
br.replace_with("\n")
Now when we take a look at our results, they're formatted in the way we'd expect!
tables = pd.read_html(str(doc))
df = tables[0]
df
(Estimated) Report Publish Date(s) | Report(s) | Event Date | Location | Make/Model | Registration Number | NTSB No. | Event Severity | Type of Air Carrier Operation and Carrier Name (Doing Business As) | |
---|---|---|---|---|---|---|---|---|---|
0 | Foreign | Foreign (HTML) | 12/31/2007 | Guernsey, United Kingdom | Cessna T303 | NaN | CHI08WA075 | Nonfatal | NaN |
1 | Factual 01/25/2008 Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/31/2007 | Santa Ana, CA | Piper PA-12 | N2800D | SEA08CA056 | Nonfatal | NaN |
2 | Factual 01/29/2008 Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Alexandria, MN | Lerohl RV-8 | N5093F | CHI08CA057 | Nonfatal | NaN |
3 | Factual 09/05/2008 Final 06/20/2014 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Paso Robles, CA | Cirrus Design Corp. SR22 | N254SR | LAX08FA043 | Fatal(1) | NaN |
4 | Factual 01/15/2009 Final 01/15/2009 | Final Report PDF | HTML Data Summary (PDF) | 12/30/2007 | Cherokee, AL | BELL 206L-3 | N109AE | NYC08FA071 | Fatal(3) | NaN |
5 | Factual 11/11/2008 Final 12/03/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Crowley, TX | ALTHOUSE RAF 2000 GTX | N136DG | DFW08LA052 | Fatal(2) | NaN |
6 | Factual 01/29/2008 Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Gunnison, CO | Piper PA-46-310P | N33MF | DEN08CA045 | Nonfatal | NaN |
7 | Factual 02/14/2008 Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Abingdon, IL | Ercoupe (Eng & Research Corp.) 415D | N94131 | CHI08CA056 | Nonfatal | NaN |
8 | Factual 02/15/2008 Final 02/28/2008 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Crystal Falls, MI | Cessna 172F | N8191U | CHI08CA058 | Nonfatal | NaN |
9 | Factual 12/31/2008 Final 07/15/2009 | Final Report PDF | HTML Data Summary (PDF) | 12/29/2007 | Venice, LA | Bell 206L1 | N211EL | DFW08FA053 | Fatal(1) | NaN |
Works great!
Depending on what you're looking for and how you like to work with your data, you might want to replace the
<br>
with a comma, space, or anything else.