Fixing br elements when scraping

When you're scraping a web page with BeautifulSoup, <br> elements can turn a typical scraping session into an irritating one. If we look at our page at https://jsoma.github.io/scraping-examples/br-elements.html, we see a nice list of items, spread out across multiple lines.

[Image: a list, maybe?]

In [74]:

            
import requests
from bs4 import BeautifulSoup

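response = requests.get("https://jsoma.github.io/scraping-examples/br-elements.html")
# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
doc = BeautifulSoup(response.text, "html.parser")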

element = doc.select_one("#the-important-list")

In [75]:

            
element

Out[75]:

<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>

But they aren't a list, even though they appear on different lines: it turns out they're all separated by the line break tag <br>! As a result, they get smashed together when we ask for .text.

In [51]:

            
lines = element.text
print(lines)

1. Lorem2. Ipsum3. Dolor4. Sit5. Amet
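
To follow along without fetching the page, the same behavior can be reproduced from an inline string. This is only a minimal sketch: the html value is copied from the output shown above, and the names sample_doc and paragraph are placeholders introduced here, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Assumption: this markup mirrors the paragraph printed above
html = '<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>'

sample_doc = BeautifulSoup(html, "html.parser")
paragraph = sample_doc.select_one("#the-important-list")

# <br> contains no text of its own, so .text just glues the strings together
smashed = paragraph.text
print(smashed)

# Counting the tags confirms what is separating the items
br_count = len(paragraph.find_all("br"))
print(br_count)
```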

If we want to do some sort of analysis on our items, we need a way to pull them apart.

Just ignore the br elements!

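Based on how your content is formatted and what you're interested in doing, you have a few options. A simple one is to ignore the <br> tags completely and tell BeautifulSoup to grab only the pieces of text. To do this, use .find_all(string=True). On this page every piece of text sits directly inside the paragraph, so that is all we need; if the paragraph contained nested tags, their text would come back too unless you also passed recursive=False.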

In [82]:

            
# Get the strings directly under #the-important-list
doc.select_one("#the-important-list").find_all(string=True)

Out[82]:

['1. Lorem', '2. Ipsum', '3. Dolor', '4. Sit', '5. Amet']
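
One thing to keep in mind: .find_all(string=True) searches every level below the element by default, so text inside nested tags is returned as well. The sketch below uses a made-up paragraph (not the page we are scraping) with a b tag in one item to show the difference, and passes recursive=False to limit the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: like our list, but one item has a nested <b> tag
html = '<p id="nested-list">1. Lorem<br/>2. <b>Ipsum</b><br/>3. Dolor<br/></p>'
nested_doc = BeautifulSoup(html, "html.parser")
paragraph = nested_doc.select_one("#nested-list")

# Default: every string at any depth below the <p>
all_strings = [str(s) for s in paragraph.find_all(string=True)]
print(all_strings)

# recursive=False: only strings that are direct children of the <p>
direct_strings = [str(s) for s in paragraph.find_all(string=True, recursive=False)]
print(direct_strings)
```

Which one you want depends on the page: the default keeps bolded or linked text, while recursive=False leaves it out.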

You can also use .strings to get the same information. It gives you a generator rather than a list, so you'll need to loop over it (or wrap it in list()).

In [81]:

            
for string in doc.select_one("#the-important-list").strings:
    print(string)

1. Lorem
2. Ipsum
3. Dolor
4. Sit
5. Amet
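
Real pages often have stray whitespace around each line. As a sketch, assuming markup like the made-up snippet below, .stripped_strings trims each string and skips the ones that are only whitespace, and list() turns the generator into a list you can index.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same idea as our list, but indented across lines
html = """<p id="messy-list">
    1. Lorem<br/>
    2. Ipsum<br/>
    3. Dolor<br/>
</p>"""
messy_doc = BeautifulSoup(html, "html.parser")
paragraph = messy_doc.select_one("#messy-list")

# .strings would keep the surrounding whitespace; .stripped_strings removes it
items = list(paragraph.stripped_strings)
print(items)

# Drop the "1. " style prefix by splitting once on the first ". "
names = [item.split(". ", 1)[1] for item in items]
print(names)
```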

Replacing line breaks with newlines

The <br> tag makes a new line in HTML, but not in normal text! The computer-y way to make a new line in a normal text file is \n.

One option to get around the <br> problem is to replace the <br> elements with a newline character, \n.

In [52]:

            
for br in doc.select("br"):
    br.replace_with("\n")

In [53]:

            
print(element.text)

1. Lorem
2. Ipsum
3. Dolor
4. Sit
5. Amet

Perfect!
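
Once the text has real newlines in it, turning it into a Python list is one more step. The following is a small self-contained sketch, assuming the same markup as the page above; sample_doc, paragraph, and items are names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: this markup mirrors the paragraph on the example page
html = '<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>'
sample_doc = BeautifulSoup(html, "html.parser")
paragraph = sample_doc.select_one("#the-important-list")

# Swap each <br> for a newline character
for br in paragraph.select("br"):
    br.replace_with("\n")

# splitlines() breaks on the newlines; the filter drops any blank lines
items = [line for line in paragraph.text.splitlines() if line.strip()]
print(items)
```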

Depending on what you're looking for and how you like to work with your data, you might want to replace the <br> with a comma, space, or anything else.
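
If all you need is the joined text, BeautifulSoup's get_text() accepts a separator, which gets a similar result without modifying the document. Note that this is a different technique from replacing the tags: it joins the pieces of text rather than swapping out each br. A minimal sketch, again assuming the markup from the example page:

```python
from bs4 import BeautifulSoup

# Assumption: this markup mirrors the paragraph on the example page
html = '<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>'
sample_doc = BeautifulSoup(html, "html.parser")
paragraph = sample_doc.select_one("#the-important-list")

# Join the text pieces with ", " and trim whitespace around each one
comma_separated = paragraph.get_text(separator=", ", strip=True)
print(comma_separated)

# The <br> tags are still in the document afterwards
remaining_brs = len(paragraph.find_all("br"))
print(remaining_brs)
```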

Creating elements for the items around the br

Oftentimes <br> elements are used to separate items on a page that should have been in separate tags. The elements we're working with here should clearly be a list, not a paragraph separated by <br> tags.

In [58]:

            
import requests
from bs4 import BeautifulSoup

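response = requests.get("https://jsoma.github.io/scraping-examples/br-elements.html")
# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
doc = BeautifulSoup(response.text, "html.parser")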

element = doc.select_one("#the-important-list")
element

Out[58]:

<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>

So we'll force them to be different tags! Let's find all of the <br> tags, grab the text element before each one with previous_sibling, and wrap it in a div tag with a specific class.

In [59]:

            
for br in doc.select("#the-important-list br"):
    # The text node sitting right before each <br>
    text_element = br.previous_sibling
    # Attributes go in through attrs=; the second positional argument of new_tag is the namespace
    text_element.wrap(doc.new_tag("div", attrs={"class": "list-item"}))

Now if we look at #the-important-list, each of our text elements is wrapped in a div.

In [63]:

doc.select_one("#the-important-list")

Out[63]:

<p id="the-important-list"><div class="list-item">1. Lorem</div><br/><div class="list-item">2. Ipsum</div><br/><div class="list-item">3. Dolor</div><br/><div class="list-item">4. Sit</div><br/><div class="list-item">5. Amet</div><br/></p>

This allows us to use our "normal" way of scraping instead of splitting on \n or anything wild like that.

In [65]:

            
elements = doc.select("#the-important-list > div")
for element in elements:
    print(element.text)

1. Lorem
2. Ipsum
3. Dolor
4. Sit
5. Amet
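
If you need this on more than one page, the loop can be wrapped in a small function. The sketch below is one possible version, not the only way to do it: wrap_text_before_brs is a hypothetical helper name, and it adds two assumptions of its own, namely that anything before a br other than non-blank text should be skipped, and that each br can be removed once its text is wrapped.

```python
from bs4 import BeautifulSoup, NavigableString


def wrap_text_before_brs(soup, container_selector, class_name="list-item"):
    """Wrap the text node before each <br> in a div, then drop the <br>."""
    for br in soup.select(f"{container_selector} br"):
        previous = br.previous_sibling
        # Only wrap real, non-blank text (previous may be None or another tag)
        if isinstance(previous, NavigableString) and previous.strip():
            previous.wrap(soup.new_tag("div", attrs={"class": class_name}))
        # Assumption: the <br> is no longer needed once the text is wrapped
        br.decompose()


# Assumption: this markup mirrors the paragraph on the example page
html = '<p id="the-important-list">1. Lorem<br/>2. Ipsum<br/>3. Dolor<br/>4. Sit<br/>5. Amet<br/></p>'
sample_doc = BeautifulSoup(html, "html.parser")

wrap_text_before_brs(sample_doc, "#the-important-list")

# Select by class, the "normal" way
items = [div.text for div in sample_doc.select("#the-important-list > div.list-item")]
print(items)
```

Selecting on the class rather than on every div keeps the scrape from picking up unrelated div tags that might already be inside the container.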
