Fixing br elements when scraping¶
When you're scraping a web page with BeautifulSoup, <br> elements can turn a typical scraping session into an irritating one. If we look at our page at https://jsoma.github.io/scraping-examples/br-elements.html, we see a nice list of items, spread out across multiple lines.
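import requests
from bs4 import BeautifulSoup

# Fetch the example page and parse it with Python's built-in HTML parser
response = requests.get("https://jsoma.github.io/scraping-examples/br-elements.html")
doc = BeautifulSoup(response.text, "html.parser")
element = doc.select_one("#the-important-list")
element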
<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>
But they aren't a list, even though they're on different lines: it turns out they're all separated by the line break tag <br>! This smashes them together when we ask for .text.
lines = element.text
print(lines)
1. Lorem2. Ipsum3. Dolor4. Sit5. Amet
If we want to do some sort of analysis on our list, we need a way to pull the items apart.
Just ignore the br elements!¶
Depending on how your content is formatted and what you're interested in doing, you have a few options. A simple one is to ignore the <br> tags completely and tell BeautifulSoup to grab only the text nodes inside the element. To do this, use .find_all(string=True). Note that this also returns text nested inside child elements; our paragraph has none, so we get exactly the list items.
# Get every string inside #the-important-list
doc.select_one("#the-important-list").find_all(string=True)
['1. Lorem', '2. Ipsum', '3. Dolor', '4. Sit', '5. Amet']
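To see how .find_all(string=True) behaves when an item contains a nested tag, here is a minimal sketch. The HTML snippet and the #demo id are made up for illustration and are not part of the example page; recursive=False is the standard find_all option that limits the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the tutorial page): one item is wrapped in <b>
html = '<p id="demo">1. Lorem<br>2. <b>Ipsum</b><br>3. Dolor<br></p>'
doc = BeautifulSoup(html, "html.parser")
element = doc.select_one("#demo")

# Every text node, including the one nested inside <b>
all_strings = [str(s) for s in element.find_all(string=True)]

# Only the text nodes that are direct children of the <p>
direct_strings = [str(s) for s in element.find_all(string=True, recursive=False)]

print(all_strings)
print(direct_strings)
```

If your items contain inline formatting such as bold or links, the nested text shows up as separate strings, so this approach works best when each item is plain text.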
You can also use .strings to get the same information, although it gives you a generator instead of a list, so you'll need to loop over it.
for string in doc.select_one("#the-important-list").strings:
    print(string)
1. Lorem
2. Ipsum
3. Dolor
4. Sit
5. Amet
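Real pages often have stray whitespace around each item. As a rough sketch, assuming a made-up snippet with that kind of padding, .stripped_strings and .get_text() with a separator give cleaned-up results without touching the <br> tags:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with extra whitespace around each item
html = '<p id="demo">\n 1. Lorem <br>\n 2. Ipsum <br>\n</p>'
doc = BeautifulSoup(html, "html.parser")
element = doc.select_one("#demo")

# .strings is a generator, so wrap it in list() to keep the results
raw_items = list(element.strings)

# .stripped_strings trims each string and skips whitespace-only ones
clean_items = list(element.stripped_strings)

# .get_text() can join the text nodes with a separator of your choice
joined = element.get_text(separator="\n", strip=True)

print(clean_items)
print(joined)
```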
Replacing line breaks with newlines¶
The <br> tag makes a new line in HTML, but not in normal text! The computer-y way to make a new line in a normal text file is \n.
One option to get around the <br> problem is to replace the <br> elements with a newline character, \n.
for br in doc.select("br"):
    br.replace_with("\n")
print(element.text)
1. Lorem
2. Ipsum
3. Dolor
4. Sit
5. Amet
Perfect!
Depending on what you're looking for and how you like to work with your data, you might want to replace the <br> with a comma, space, or anything else.
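Once the <br> tags are replaced, the text is easy to split into a Python list. This is a minimal sketch that inlines the paragraph's HTML so it runs without a network request; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# The same paragraph as on the example page, inlined as a string
html = (
    '<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>'
    '3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>'
)
doc = BeautifulSoup(html, "html.parser")
element = doc.select_one("#the-important-list")

# Swap every <br> inside the paragraph for a newline character
for br in element.select("br"):
    br.replace_with("\n")

# Split on the newlines, dropping the empty piece left by the trailing <br>
items = [line.strip() for line in element.text.splitlines() if line.strip()]
print(items)
```

Scoping the loop to element.select("br") instead of doc.select("br") leaves the rest of the page untouched, which can matter if you still need to scrape other parts of it.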
Creating elements for the items around the br¶
Oftentimes <br> elements are used to separate items on a page that should have been in separate tags. The elements we're working with here should clearly be a list, not a paragraph separated by <br> tags.
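import requests
from bs4 import BeautifulSoup

# Fetch a fresh copy of the page so the <br> tags are back in place
response = requests.get("https://jsoma.github.io/scraping-examples/br-elements.html")
doc = BeautifulSoup(response.text, "html.parser")
element = doc.select_one("#the-important-list")
element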
<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>
So we'll force them to be different tags! Let's find all of the <br> tags, grab the text element before each one with previous_sibling, and wrap it in a div tag with a specific class.
for br in doc.select("#the-important-list br"):
    text_element = br.previous_sibling
    # Pass the class through attrs=; a bare dict as the second argument is treated as the namespace
    text_element.wrap(doc.new_tag("div", attrs={"class": "list-item"}))
Now if we look at #the-important-list, each of our text elements is wrapped in a div.
doc.select_one("#the-important-list")
<p id="the-important-list"><div class="list-item">1. Lorem</div><br/><div class="list-item">2. Ipsum</div><br/><div class="list-item">3. Dolor</div><br/><div class="list-item">4. Sit</div><br/><div class="list-item">5. Amet</div><br/></p>
This allows us to use our "normal" way of scraping instead of splitting on \n or anything wild like that.
elements = doc.select("#the-important-list > div")
for element in elements:
    print(element.text)
1. Lorem
2. Ipsum
3. Dolor
4. Sit
5. Amet
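On messier pages, a <br> might follow another <br> or a whitespace-only text node, in which case previous_sibling is not something you want to wrap. The sketch below is one possible defensive version, using a made-up snippet with a doubled <br>; the list-item class name is just an illustrative choice.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical snippet: note the doubled <br/> after the second item
html = '<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/><br/>3. Dolor<br/></p>'
doc = BeautifulSoup(html, "html.parser")
container = doc.select_one("#the-important-list")

for br in container.select("br"):
    previous = br.previous_sibling
    # Only wrap real text: skip a missing sibling, another tag,
    # or a text node that contains nothing but whitespace
    if isinstance(previous, NavigableString) and previous.strip():
        previous.wrap(doc.new_tag("div", attrs={"class": "list-item"}))

# Select by class so unrelated divs are never picked up
items = [div.get_text(strip=True) for div in container.select("div.list-item")]
print(items)
```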