Selecting elements on a page
A big part of scraping is figuring out how to pick the pieces on the page you’re interested in.
Grabbing parts of the page
Let’s say we’d like to scrape the New York Times homepage. Maybe we’re interested in all of the headlines.
There are three options for how to grab the headlines.
- By tag name
- By class
- By id
Let’s talk about how (and when) to use BeautifulSoup to select tags.
Selecting by tag names
Basic HTML looks something like this:
<h1>This is a big header</h1>
<p>This is a paragraph</p>
Each tag describes the content it surrounds. A page can have a million h1 tags (headers), a million p tags (paragraphs), a million img tags (images), or anything else, and they’re very easy to find using BeautifulSoup.
# Find all of the paragraphs
paragraphs = doc.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)
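The snippet above assumes a doc object already exists. Here is a minimal, self-contained sketch of one way to build it, using a made-up HTML string in place of a real downloaded page. The html string and the paragraph_texts name are illustrative assumptions, not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page you downloaded
html = """
<h1>This is a big header</h1>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""

# "html.parser" is the parser that ships with Python
doc = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in page order
paragraph_texts = [paragraph.text for paragraph in doc.find_all("p")]
print(paragraph_texts)
```

With a real page you would pass the downloaded HTML to BeautifulSoup instead of a hard-coded string.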
Selecting by class names
Since you might not want every header to look the same, you can also give tags a class. That way we can have food headers be brown, sports headers be blue, and breaking news headers be red and huge and flashing.
<h3 class="food-header">This is a header</h3>
<h3 class="sports-header">This is a header</h3>
<h3 class="sports-header">This is a header</h3>
<h3 class="news-header breaking">This is a header</h3>
Classes let web developers hook into the HTML to give specific kinds of elements specific styles. They use CSS - cascading style sheets - to say things like “make things with the class of sports-header be blue.” We aren’t covering CSS right now, but I thought you should know!
We won’t use classes to style things; we’ll use them to grab certain elements on the page. Classes are the most common way of selecting page elements when scraping.
# Find all h3 tags with the sports-header class
sports_headers = doc.find_all('h3', attrs={'class': 'sports-header'})
# Find tags of ANY kind with the sports-header class
sports_headers = doc.find_all(class_='sports-header')
A big secret with classes is that you separate multiple classes with a space.
<h3 class="news-header breaking">This is a header</h3>
The h3 above has two classes, news-header and breaking, and you can find it using either.
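To see this in action, here is a small sketch that searches the same headers by each class in turn. The header text and the variable names are invented for illustration. The last line uses .select with a CSS selector, which is one way (not the only one) to require both classes at once.

```python
from bs4 import BeautifulSoup

# Hypothetical headers, modeled on the examples above
html = """
<h3 class="food-header">Food</h3>
<h3 class="sports-header">Sports one</h3>
<h3 class="sports-header">Sports two</h3>
<h3 class="news-header breaking">Breaking news</h3>
"""
doc = BeautifulSoup(html, "html.parser")

# Either class finds the same h3
by_news = doc.find_all(class_="news-header")
by_breaking = doc.find_all(class_="breaking")

# BeautifulSoup stores the class attribute as a list of names
classes = by_news[0]["class"]

# A CSS selector can require both classes together
both = doc.select("h3.news-header.breaking")

print(by_news[0].text, by_breaking[0].text, classes, len(both))
```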
Selecting by ID
IDs are similar to classes in that web developers use them to style certain elements on the page. Unlike classes, though, they should be unique on the page.
<div id="sidebar">This is a sidebar</div>
You would only have one id="sidebar", while with classes you can have many. As a result, you don’t usually use .find_all when selecting by ID.
sidebar = doc.find(id='sidebar')
You can use .find_all, but it returns a list, and you’ll usually only want the first element.
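The difference between the two methods is easier to see side by side. This is a rough sketch with a made-up page. The content div and the footer lookup are assumptions added to show what happens when an ID is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two uniquely identified divs
html = '<div id="sidebar">This is a sidebar</div><div id="content">Main text</div>'
doc = BeautifulSoup(html, "html.parser")

# .find gives back a single tag
sidebar = doc.find(id="sidebar")

# .find_all gives back a list, even when there is only one match
matches = doc.find_all(id="sidebar")

# .find gives back None when nothing matches, so check before using it
missing = doc.find(id="footer")

print(sidebar.text)
print(len(matches))
print(missing)
```

Since .find returns None for a missing ID, it is worth checking the result before asking for .text on it.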