Scraping supplement

You should work through a notebook as you read this page. You can find the notebook here. We’re using Selenium for the code, but the same concepts can also apply to BeautifulSoup.

Normally when you’re scraping stuff from the Internet, you use classes to pick the parts you want.

This section is Part 1 in the notebook.

For example, let’s say we have a simple page about a book.

<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Some Supplemental Materials</h3>
<p class="byline">By Jonathan Soma</p>

If we wanted to select the title, we’d just ask for it.

driver.find_element_by_class_name('title')

Problem solved! But we don’t want the tag, we want the text. So let’s add a .text onto the end of it and print it.

print(driver.find_element_by_class_name('title').text)

Note: the find_element_by_... helpers used on this page were removed in Selenium 4.3. On newer versions the same lookup is written driver.find_element(By.CLASS_NAME, 'title'), with By imported from selenium.webdriver.common.by, and the plural form becomes driver.find_elements(By.TAG_NAME, 'p').

If we want the rest of it - the subhead and the byline - we just use the same classes that they have in the HTML.

print(driver.find_element_by_class_name('title').text)
print(driver.find_element_by_class_name('subhead').text)
print(driver.find_element_by_class_name('byline').text)
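
If you want to try the same idea without launching a browser, here is a minimal sketch that uses Python’s built-in xml.etree.ElementTree as a stand-in for Selenium. It is not what the notebook does: the wrapping <div>, the text_by_class helper, and the exact-match lookup on the class attribute are all assumptions made for illustration (Selenium will match a class even when the element has several).

```python
import xml.etree.ElementTree as ET

# The sample page from above, wrapped in a <div> so it has a single root element
html = """
<div>
  <h1 class="title">How to Scrape Things</h1>
  <h3 class="subhead">Some Supplemental Materials</h3>
  <p class="byline">By Jonathan Soma</p>
</div>
"""

page = ET.fromstring(html)


def text_by_class(root, class_name):
    """Return the text of the first element whose class attribute equals class_name."""
    element = root.find(f".//*[@class='{class_name}']")
    return element.text if element is not None else None


title = text_by_class(page, "title")
subhead = text_by_class(page, "subhead")
byline = text_by_class(page, "byline")

print(title)
print(subhead)
print(byline)
```

The helper returns None when nothing matches, which is a design choice of this sketch; Selenium raises an error instead.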

Downgrading to tag names

This section is Part 2 in the notebook.

Sometimes people don’t use classes, but it isn’t a big deal. For the previous example, maybe the page looks like this instead:

<h1>How to Scrape Things</h1>
<h3>Some Supplemental Materials</h3>
<p>By Jonathan Soma</p>

Luckily we can also select items on the page by their tag name, like h1 or h3 or p. No classes? No problem!

print(driver.find_element_by_tag_name('h1').text)
print(driver.find_element_by_tag_name('h3').text)
print(driver.find_element_by_tag_name('p').text)

Multiples of the same tag

This section is Part 3 in the notebook.

It starts to get more complicated when we have multiples of the same tag on the same page. Before it was easy - h1 was the title, h3 was the subhead, and p was the byline.

But what if they all use the same tag?

<p>How to Scrape Things</p>
<p>Some Supplemental Materials</p>
<p>By Jonathan Soma</p>

This is a little more complicated, but think about how you would explain it to me with words - “the title is the first paragraph. The subhead is the second paragraph. The byline is the third (or last) paragraph.” We can do the same thing with Python.

# Find all of the paragraphs on the page
paragraphs = driver.find_elements_by_tag_name('p')
# Print out the first one, the second one, the third one
print("The title is", paragraphs[0].text)
print("Subhead is", paragraphs[1].text)
print("Byline is", paragraphs[2].text)
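
The “third (or last)” wording maps directly onto Python list indexing, because a negative index counts from the end of the list. Here is a small sketch with plain strings standing in for the .text of each paragraph; the paragraph_texts list is made up for illustration.

```python
# Plain strings standing in for paragraphs[0].text, paragraphs[1].text, ...
paragraph_texts = [
    "How to Scrape Things",
    "Some Supplemental Materials",
    "By Jonathan Soma",
]

title = paragraph_texts[0]               # first paragraph
subhead = paragraph_texts[1]             # second paragraph
byline_by_position = paragraph_texts[2]  # third paragraph
byline_from_end = paragraph_texts[-1]    # last paragraph, however many there are

print("The title is", title)
print("Subhead is", subhead)
print("Byline is", byline_from_end)
```

Using -1 keeps working if the page later gains an extra paragraph in the middle, while 2 would then point at the wrong one.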

Moving toward tables

This section is Part 4 in the notebook. I had to cheat a little and add a <table> in the sample page; otherwise Chrome would ignore the row and cells.

We don’t usually have a lot of paragraphs in a row, though. Usually it’s table cells, which use the td tag. For example, a row in a table looks like this.

<tr>
  <td>How to Scrape Things</td>
  <td>Some Supplemental Materials</td>
  <td>By Jonathan Soma</td>
</tr>

You would say the same thing as before - “the title is the first table cell. The subhead is the second table cell. The byline is the third (or last) table cell.” The code is the same, too!

# Find all of the tds on the page
cells = driver.find_elements_by_tag_name('td')
# Print out the first one, the second one, the third one
print("The title is", cells[0].text)
print("Subhead is", cells[1].text)
print("Byline is", cells[2].text)

Storing in a dictionary

This section is Part 5 in the notebook.

Printing out data isn’t very useful; usually we’re more interested in saving it. We do that with a pandas dataframe.

When we make our own pandas dataframe, we build it from a list of dictionaries. But to make a list of dictionaries, we first need a single dictionary, right? Let’s use the same HTML…

<tr>
  <td>How to Scrape Things</td>
  <td>Some Supplemental Materials</td>
  <td>By Jonathan Soma</td>
</tr>

…and change our Python to create a dictionary instead of printing it out…

# Find all of the tds on the page
cells = driver.find_elements_by_tag_name('td')

# Start with an empty dictionary
book = {}

# Add the keys one by one
book['title'] = cells[0].text
book['subhead'] = cells[1].text
book['byline'] = cells[2].text

# Print it out
print("Book looks like", book)
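
When there are more than a few columns, adding keys one by one gets repetitive. One possible shortcut, sketched here with plain strings in place of Selenium elements, is to pair a list of field names with the cell texts using zip. The field_names and cell_texts variables are illustrative; with Selenium, cell_texts could be built as [cell.text for cell in cells].

```python
# Column names, in the same order as the cells appear in the row
field_names = ["title", "subhead", "byline"]

# Plain strings standing in for the .text of each td
cell_texts = [
    "How to Scrape Things",
    "Some Supplemental Materials",
    "By Jonathan Soma",
]

# zip pairs each name with the cell in the same position,
# and dict turns those pairs into keys and values
book = dict(zip(field_names, cell_texts))

print("Book looks like", book)
```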

Working with tables

This section is Part 6 in the notebook.

When we have a table in HTML, usually we want to go through each of the rows.

<tr>
  Row of stuff
</tr>
<tr>
  Row of stuff
</tr>
<tr>
  Row of stuff
</tr>

rows = driver.find_elements_by_tag_name('tr')

for row in rows:
  print("Our row looks like", row.text)

But each one of those rows always has stuff inside, right? It looks more like this:

<tr>
  <td>How to Scrape Things</td>
  <td>Some Supplemental Materials</td>
  <td>By Jonathan Soma</td>
</tr>
<tr>
  <td>How to Scrape Many Things</td>
  <td>But, Is It Even Possible?</td>
  <td>By Sonathan Joma</td>
</tr>
<tr>
  <td>The End of Scraping</td>
  <td>Let's All Use CSV Files</td>
  <td>By Amos Nathanos</td>
</tr>

We want to separate the titles and the subheads and the bylines instead of just using row.text. First, let’s talk about the wrong way to do this.

# Find all of the tds on the page
cells = driver.find_elements_by_tag_name('td')
# Print out the first one, the second one, the third one
print("The title is", cells[0].text)
print("Subhead is", cells[1].text)
print("Byline is", cells[2].text)

This is going to get all of the td elements on the page. You won’t know if they’re in the first row, or the second row, or the third row.

What we need to do instead is loop through each row (like we did before), and then ask for the td elements inside of that row.

rows = driver.find_elements_by_tag_name('tr')

for row in rows:
  # Find all of the tds inside of THAT ONE ROW
  cells = row.find_elements_by_tag_name('td')
  # Print out the first one, the second one, the third one
  print("The title is", cells[0].text)
  print("Subhead is", cells[1].text)
  print("Byline is", cells[2].text)

Before we were asking for the td elements on the whole page, now we are asking for the td elements inside of each row. That’s why we changed from driver.find_elements to row.find_elements.

When we loop through the rows and use row.find_elements instead of driver.find_elements, Selenium only sees what is inside of that row, so it looks like this:

<td>The End of Scraping</td>
<td>Let's All Use CSV Files</td>
<td>By Amos Nathanos</td>

Just like the simple scraping from before!

driver.find_elements looks on the entire page, while something.find_elements looks only inside of that element.
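
The difference between a page-wide search and a search scoped to one row can be seen without a browser. This is a rough sketch that uses Python’s built-in xml.etree.ElementTree in place of Selenium; the wrapping <table> and the variable names are assumptions for illustration, and iter and findall here play the roles of driver.find_elements and row.find_elements.

```python
import xml.etree.ElementTree as ET

# The three rows from above, wrapped in a <table> so there is a single root element
html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Some Supplemental Materials</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>
"""

page = ET.fromstring(html)

# Page-wide search: every td in the document, with no row information
all_cells = list(page.iter("td"))
print("Cells on the whole page:", len(all_cells))

# Scoped search: only the tds that sit inside one particular row
for row in page.iter("tr"):
    cells = row.findall("td")
    print("Cells in this row:", [cell.text for cell in cells])

# How many cells each row contains when searched on its own
cells_per_row = [len(row.findall("td")) for row in page.iter("tr")]
```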

Changing it to a real table

This section is Part 7 in the notebook.

What happens when this turns into a real table, with a <table> tag surrounding it?

<table id="booklist">
  <tr>
    <td>How to Scrape Things</td>
    <td>Some Supplemental Materials</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</table>

Well, exactly nothing. You still want to loop through each row, and then you still want to get the cells inside. Exact same code as before!

If we wanted to get a little crazier, though, or if there were multiple tables on the page, we could grab the table by its id, then pick out only the tr elements inside of it.

table = driver.find_element_by_id('booklist')
rows = table.find_elements_by_tag_name('tr')

for row in rows:
  # Find all of the tds inside of THAT ONE ROW
  cells = row.find_elements_by_tag_name('td')
  # Print out the first one, the second one, the third one
  print("The title is", cells[0].text)
  print("Subhead is", cells[1].text)
  print("Byline is", cells[2].text)

You don’t have to do that in this situation, though, since it’s the only table on the page.
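
To see why grabbing the table by its id matters once a page has more than one table, here is a browser-free sketch using Python’s built-in xml.etree.ElementTree instead of Selenium. The second table with the id "navigation" is hypothetical, added only to show the effect.

```python
import xml.etree.ElementTree as ET

# A page with two tables; the "navigation" table is made up for this example
html = """
<body>
  <table id="navigation">
    <tr><td>Home</td><td>About</td></tr>
  </table>
  <table id="booklist">
    <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
    <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
  </table>
</body>
"""

page = ET.fromstring(html)

# Searching the whole page picks up rows from BOTH tables
all_rows = list(page.iter("tr"))
print("Rows on the whole page:", len(all_rows))

# Selecting the table by id first limits the search to the book list
table = page.find(".//table[@id='booklist']")
book_rows = table.findall("tr")
print("Rows in the book list:", len(book_rows))

# The first cell of each book row is the title
titles = [row.findall("td")[0].text for row in book_rows]
print(titles)
```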

Changing it to a dictionary

This section is Part 8 in the notebook.

Note: We selected the table first last time, but it’s unnecessary. We’re going back to the original version because it’s simpler.

Now instead of printing, we’ll save it to a dictionary. It’s just a small change, same as we did before.

rows = driver.find_elements_by_tag_name('tr')

for row in rows:
  cells = row.find_elements_by_tag_name('td')

  book = {}
  book['title'] = cells[0].text
  book['subhead'] = cells[1].text
  book['byline'] = cells[2].text
  print("Our book looks like", book)

We build a new book each time we go through the loop, but what happens when we’re done? We don’t use it for anything! We throw it away! So sad, so sad.

Let’s change our code so that every time we make a new book, we save it into a list.

rows = driver.find_elements_by_tag_name('tr')

books = []
for row in rows:
  cells = row.find_elements_by_tag_name('td')

  book = {}
  book['title'] = cells[0].text
  book['subhead'] = cells[1].text
  book['byline'] = cells[2].text
  print("Our book looks like", book)
  books.append(book)

print("Done! Our list looks like", books)
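
Real tables often include a header row built from th tags, and asking such a row for its td elements gives back an empty list, so cells[0] would fail. The sample table has no header row, so this is only a hedged sketch of one way to guard against it. The rows_to_books function is hypothetical, and each row is represented as a plain list of strings standing in for the td texts.

```python
def rows_to_books(rows, expected_cells=3):
    """Build a list of book dictionaries, skipping rows with too few cells."""
    books = []
    for cells in rows:
        # A header row (or an empty spacer row) has no td cells, so skip it
        if len(cells) < expected_cells:
            continue
        book = {
            "title": cells[0],
            "subhead": cells[1],
            "byline": cells[2],
        }
        books.append(book)
    return books


# Each inner list stands in for [cell.text for cell in row's td elements]
rows = [
    [],  # what a header row made only of th cells would give us
    ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"],
    ["How to Scrape Many Things", "But, Is It Even Possible?", "By Sonathan Joma"],
    ["The End of Scraping", "Let's All Use CSV Files", "By Amos Nathanos"],
]

books = rows_to_books(rows)
print("Done! Our list looks like", books)
```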

Creating a dataframe and saving

This section is Part 9 in the notebook.

Now that we have a list of dictionaries, it’s easy to move it to a dataframe.

import pandas as pd

df = pd.DataFrame(books)
df.head()

Pandas takes the keys and turns them into columns, and each element in our list becomes a row in our dataframe. From there, it’s a quick jump to saving it as a csv.

# Use index=False to prevent the 'extra' number column
df.to_csv("output.csv", index=False)
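
If pandas is not available, Python’s built-in csv module can write the same list of dictionaries. This is an alternative sketch, not the approach the notebook takes. It writes into an in-memory io.StringIO buffer so it leaves no file behind; swapping in open("output.csv", "w", newline="") would write to disk instead.

```python
import csv
import io

# The same list of dictionaries built in the previous section
books = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
    {"title": "The End of Scraping",
     "subhead": "Let's All Use CSV Files",
     "byline": "By Amos Nathanos"},
]

# An in-memory text buffer that behaves like an open file
buffer = io.StringIO()

# fieldnames sets both the header row and the column order
writer = csv.DictWriter(buffer, fieldnames=["title", "subhead", "byline"])
writer.writeheader()
writer.writerows(books)

csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes any value that contains a comma, so a subhead like “But, Is It Even Possible?” still ends up in a single column.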

And then we’re all set!