A big part of scraping is figuring out how to pick the pieces on the page you’re interested in.

Grabbing parts of the page

Let’s say we’d like to scrape the New York Times homepage. Maybe we’re interested in all of the headlines.

There are three options for how to grab the headlines.

  1. By tag name
  2. By class
  3. By id

Let’s talk about how (and when) to use Selenium to select tags.

Selecting by tag names

Basic HTML looks something like this:

<h1>This is a big header</h1>
<p>This is a paragraph</p>

Where each tag describes the content it surrounds. A page can have a million h1 tags (headers) or a million p tags (paragraphs) or million img tags (images) or anything else, and they’re very easy to find using Selenium.

# Find all of the paragraphs
paragraphs = driver.find_elements_by_tag_name('p')
for paragraph in paragraphs:
  print(paragraph.text)

Selecting by class names

Since you might not want every header to look the same, you can also give tags a class. That way we can have food headers be brown, sports headers be blue, and breaking news headers be red and huge and flashing.

<h3 class="food-header">This is a header</h3>
<h3 class="sports-header">This is a header</h3>
<h3 class="sports-header">This is a header</h3>
<h3 class="news-header breaking">This is a header</h3>

Classes let web developers hook into the HTML to give specific kinds of elements specific styles. They use CSS - cascading style sheets - to say things like “make things with the class of sports-header be blue.” We aren’t covering CSS right now, but I thought you should know!

We won’t use classes to style things, we’ll use them to grab certain elements on the page. Classes are the most common way of selecting page elements when scraping.

# Find all h3 tags with the sports-header class
sports_headers = driver.find_elements_by_class_name('sports-header')

A big secret with classes is that you separate multiple classes with a space.

<h3 class="news-header breaking">This is a header</h3>

The h3 above has two classes - news-header and breaking - and you can find it using either.

Selecting by IDs

IDs are similar to classes in that web developers use them to style certain elements on the page. Unlike classes, though, they should be unique on the page.

<div id="sidebar">This is a sidebar</div>

You would only have one id="sidebar", while with classes you can have many. As a result, you don’t usually use .find_all when selecting by ID.

sidebar = driver.find_element_by_id('sidebar')

You can use .find_all, it’s just that you’ll usually just be working with the first element.

When things go missing

One of the big issues with Selenium is that if something doesn’t exist on the page, it freaks out and throws an error. For example, if I try to find #dinosaur-park and it isn’t there, I get a NoSuchElementException and my code stops working.

driver.find_element_by_id('dinosaur-park')

…will give me…

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"id","selector":"dinosaur-park"}
  (Session info: chrome=67.0.3396.79)
  (Driver info: chromedriver=2.38.552518 (183d19265345f54ce39cbb94cf81ba5f15905011),platform=Mac OS X 10.12.6 x86_64)

To get around that, we need to tell Python “hey, try to do this, but if it doesn’t work, that’s ok!” We can accomplish this by using try and except

try:
  driver.find_element_by_id('dinosaur-park')
except:
  print("Couldn't find it!")

The part under try is run, and if it throws an error… we just ignore it and skip down to except!

This is very useful for clicking “next” buttons. When you get to the last page, there’s no ‘next’ button, and you get an error.