Selecting elements on a page using Selenium
A big part of scraping is figuring out how to pick the pieces on the page you’re interested in.
Grabbing parts of the page
Let’s say we’d like to scrape the New York Times homepage. Maybe we’re interested in all of the headlines.
There are three options for how to grab the headlines.
- By tag name
- By class
- By id
Let’s talk about how (and when) to use Selenium to select tags.
Selecting by tag names
Basic HTML looks something like this:
<h1>This is a big header</h1>
<p>This is a paragraph</p>
Where each tag describes the content it surrounds. A page can have a million h1
tags (headers) or a million p
tags (paragraphs) or million img
tags (images) or anything else, and they’re very easy to find using Selenium.
# Find all of the paragraphs
paragraphs = driver.find_elements_by_tag_name('p')
for paragraph in paragraphs:
print(paragraph.text)
Selecting by class names
Since you might not want every header to look the same, you can also give tags a class. That way we can have food headers be brown, sports headers be blue, and breaking news headers be red and huge and flashing.
<h3 class="food-header">This is a header</h3>
<h3 class="sports-header">This is a header</h3>
<h3 class="sports-header">This is a header</h3>
<h3 class="news-header breaking">This is a header</h3>
Classes let web developers hook into the HTML to give specific kinds of elements specific styles. They use CSS - cascading style sheets - to say things like “make things with the class of sports-header be blue.” We aren’t covering CSS right now, but I thought you should know!
We won’t use classes to style things, we’ll use them to grab certain elements on the page. Classes are the most common way of selecting page elements when scraping.
# Find all h3 tags with the sports-header class
sports_headers = driver.find_elements_by_class_name('sports-header')
A big secret with classes is that you separate multiple classes with a space.
<h3 class="news-header breaking">This is a header</h3>
The h3
above has two classes - news-header
and breaking
- and you can find it using either.
Selecting by IDs
IDs are similar to classes in that web developers use them to style certain elements on the page. Unlike classes, though, they should be unique on the page.
<div id="sidebar">This is a sidebar</div>
You would only have one id="sidebar"
, while with classes you can have many. As a result, you don’t usually use .find_all
when selecting by ID.
sidebar = driver.find_element_by_id('sidebar')
You can use .find_all
, it’s just that you’ll usually just be working with the first element.
When things go missing
One of the big issues with Selenium is that if something doesn’t exist on the page, it freaks out and throws an error. For example, if I try to find #dinosaur-park
and it isn’t there, I get a NoSuchElementException
and my code stops working.
driver.find_element_by_id('dinosaur-park')
…will give me…
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"id","selector":"dinosaur-park"}
(Session info: chrome=67.0.3396.79)
(Driver info: chromedriver=2.38.552518 (183d19265345f54ce39cbb94cf81ba5f15905011),platform=Mac OS X 10.12.6 x86_64)
To get around that, we need to tell Python “hey, try to do this, but if it doesn’t work, that’s ok!” We can accomplish this by using try
and except
try:
driver.find_element_by_id('dinosaur-park')
except:
print("Couldn't find it!")
The part under try
is run, and if it throws an error… we just ignore it and skip down to except
!
This is very useful for clicking “next” buttons. When you get to the last page, there’s no ‘next’ button, and you get an error.