Okay, so maybe I could teach you how .apply works. Maybe we could go deep into scraping, go big on columns vs. rows in pandas, and learn every single thing about how everything works. Or we could just figure out how to do it. That's easier for me to write, so I'm going to be lazy.
Scraping a single page on Yelp
Let’s say you’re scraping a page on Yelp. Our page is going to be this Shake Shack location. We’ll scrape it just like normal, and make a dictionary of the information on it.
{'address': '2957 Broadway\nNew York, NY 10025',
'categories': 'Hot Dogs, Burgers, Ice Cream & Frozen Yogurt',
'name': 'Shake Shack',
'stars': '3.5 star rating'}
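The scrape itself might look something like this. To keep it runnable here, this sketch parses a stripped-down, made-up copy of the page's HTML with BeautifulSoup — the tag and class names are invented for illustration, and Yelp's real markup will differ.

```python
from bs4 import BeautifulSoup

# A stand-in for the Yelp page's HTML. The class names here are
# made up for illustration; the real page's markup will differ.
html = """
<div>
  <h1 class="biz-page-title">Shake Shack</h1>
  <span class="star-rating">3.5 star rating</span>
  <span class="category-str-list">Hot Dogs, Burgers, Ice Cream &amp; Frozen Yogurt</span>
  <address>2957 Broadway\nNew York, NY 10025</address>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull each piece out of the page and put it into a dictionary
store = {
    'name': soup.find(class_='biz-page-title').text.strip(),
    'address': soup.find('address').text.strip(),
    'stars': soup.find(class_='star-rating').text.strip(),
    'categories': soup.find(class_='category-str-list').text.strip(),
}
store
```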
Scraping many pages on Yelp
But sometimes instead of scraping one page, you need to scrape many pages. In this case, you need two things:
- A dataframe, where you’re going to scrape for each row
- A function to do the actual scraping
Our dataframe
| | name | slug |
|---|---|---|
| 0 | Shake Shack | shake-shack-new-york-54 |
| 1 | Flat Top | flat-top-new-york |
| 2 | Friedman's | friedmans-new-york-62 |
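You might build a dataframe like that by hand, or (more likely) read it in from a CSV — a minimal sketch of the by-hand version:

```python
import pandas as pd

# Build the three-row dataframe of names and Yelp URL slugs by hand;
# in practice you'd probably read this from a CSV with pd.read_csv
df = pd.DataFrame([
    {'name': 'Shake Shack', 'slug': 'shake-shack-new-york-54'},
    {'name': 'Flat Top', 'slug': 'flat-top-new-york'},
    {'name': "Friedman's", 'slug': 'friedmans-new-york-62'},
])
df
```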
A function to do the scraping
We’re just going to take our old scraping code and make a few adjustments:
1. Add def somethingsomething(row) to turn it into a function

Because it's a function, we'll need to indent. We also want to make sure we do NOT have driver = webdriver.Chrome() inside the function, or else it will open a new Chrome every time we want to visit another page.
2. Use the row variable so it isn't always scraping the same page

Before, we always got the same URL from Yelp. We don't want to do that anymore!
Old code
driver.get("https://www.yelp.com/biz/shake-shack-new-york-54")
Now we have a row variable that is our row of data. If we want to build a URL, we take "https://www.yelp.com/biz/" and add row['slug'] (that's how Yelp URLs look).
New code
driver.get("https://www.yelp.com/biz/" + row['slug'])
3. Return a pd.Series of our data instead of creating a dictionary
Old code
store = {
'name': store_name,
'address': full_address,
'stars': stars,
'categories': categories
}
store
Because it's a function, we need to return something. To add columns to our dataframe, that something needs to be a pd.Series.
New code
return pd.Series({
'name': store_name,
'address': full_address,
'stars': stars,
'categories': categories
})
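Putting the three changes together, the function might look like the sketch below. To keep it runnable without opening Chrome, a dictionary of fake "pages" stands in for the browser — with Selenium you'd call driver.get(url) and pull the values out of the real page instead. The function name get_store_details is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the browser: fake "pages" keyed by URL, so we can show
# the shape of the function without Selenium. Real code would do
# driver.get(url) here and scrape the values from the page.
fake_pages = {
    'https://www.yelp.com/biz/shake-shack-new-york-54': {
        'name': 'Shake Shack',
        'address': '2957 Broadway\nNew York, NY 10025',
        'stars': '3.5 star rating',
        'categories': 'Hot Dogs, Burgers, Ice Cream & Frozen Yogurt',
    },
}

def get_store_details(row):
    # Change 2: build the URL from the row, like driver.get(... + row['slug'])
    url = 'https://www.yelp.com/biz/' + row['slug']
    page = fake_pages[url]
    # Change 3: return a pd.Series so .apply can turn it into columns
    return pd.Series({
        'name': page['name'],
        'address': page['address'],
        'stars': page['stars'],
        'categories': page['categories'],
    })

# Try it on a single row's worth of data
row = pd.Series({'name': 'Shake Shack', 'slug': 'shake-shack-new-york-54'})
get_store_details(row)
```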
Complete OLD code
Complete NEW code
Using our function
Now that we’ve made a function, we need to use it.
- Open up a new driver
- Use .apply to run the function on each row
- Use .join to add the columns to the dataframe

Basically you'll always cut and paste this code. Be sure to change your variable names.
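The cut-and-paste pattern looks roughly like this. The scraping function here is a stand-in (it only builds the URL) so the .apply/.join mechanics stay visible and runnable; in real code it would open the page and scrape it.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Shake Shack', 'slug': 'shake-shack-new-york-54'},
    {'name': 'Flat Top', 'slug': 'flat-top-new-york'},
])

# Stand-in scraping function: real code would visit the page with
# the driver here and scrape stars, address, etc.
def get_store_details(row):
    return pd.Series({
        'url': 'https://www.yelp.com/biz/' + row['slug'],
    })

# axis=1 means "feed the function one ROW at a time"
# (the default, axis=0, would feed it one COLUMN at a time)
new_columns = df.apply(get_store_details, axis=1)

# .join glues the new columns onto the original dataframe, matching on index
df = df.join(new_columns)
df
```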
| | address | categories | stars | store_name | name | slug | url |
|---|---|---|---|---|---|---|---|
| 0 | 2957 Broadway\nNew York, NY 10025 | Hot Dogs, Burgers, Ice Cream & Frozen Yogurt | 3.5 star rating | Shake Shack | Shake Shack | shake-shack-new-york-54 | https://www.yelp.com/biz/shake-shack-new-york-54 |
| 1 | 1241 Amsterdam Ave\nNew York, NY 10027 | American (New), Cafes, Breakfast & Brunch | 4.0 star rating | Flat Top | Flat Top | flat-top-new-york | https://www.yelp.com/biz/flat-top-new-york |
| 2 | 1187 Amsterdam Ave\nNew York, NY 10027 | American (Traditional), Breakfast & Brunch | 3.5 star rating | Friedman’s | Friedman's | friedmans-new-york-62 | https://www.yelp.com/biz/friedmans-new-york-62 |
| 3 | 2937 Broadway\nNew York, NY 10025 | Salad, Vegetarian | 3.0 star rating | sweetgreen | sweetgreen | sweetgreen-new-york-6 | https://www.yelp.com/biz/sweetgreen-new-york-6 |
| 4 | 2168 Frederick Douglass Blvd\nNew York, NY 10026 | Italian, Breakfast & Brunch, Cocktail Bars | 4.0 star rating | Lido | Lido | lido-new-york | https://www.yelp.com/biz/lido-new-york |
But what about BeautifulSoup?
No problem, you can do the exact same thing. This isn’t about Selenium, it’s about pandas!
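The BeautifulSoup version of the function looks the same, except the page comes from requests.get(url).text instead of the driver. To keep this sketch runnable offline, it parses a one-line made-up snippet of HTML — the class name is invented and won't match Yelp's real markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML stand-in for the downloaded page, so this runs offline
html = '<h1 class="biz-page-title">Shake Shack</h1>'

def get_store_details(row):
    # Real code would download the page first:
    #   response = requests.get('https://www.yelp.com/biz/' + row['slug'])
    #   soup = BeautifulSoup(response.text, 'html.parser')
    soup = BeautifulSoup(html, 'html.parser')
    return pd.Series({
        'store_name': soup.find(class_='biz-page-title').text,
    })

df = pd.DataFrame([{'slug': 'shake-shack-new-york-54'}])
# Exact same .apply/.join pattern as with Selenium
df.join(df.apply(get_store_details, axis=1))
```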
| | address | categories | stars | store_name | name | slug | url |
|---|---|---|---|---|---|---|---|
| 0 | \n\n 2957 BroadwayNew York, NY 10025\n ... | \nHot Dogs,\n Burgers,\n ... | 3.5 star rating | \n Shake Shack\n | Shake Shack | shake-shack-new-york-54 | https://www.yelp.com/biz/shake-shack-new-york-54 |
| 1 | \n\n 1241 Amsterdam AveNew York, NY 100... | \nAmerican (New),\n Cafes,\... | 4.0 star rating | \n Flat Top\n | Flat Top | flat-top-new-york | https://www.yelp.com/biz/flat-top-new-york |
| 2 | \n\n 1187 Amsterdam AveNew York, NY 100... | \nAmerican (Traditional),\n ... | 3.5 star rating | \n Friedman’s\n | Friedman's | friedmans-new-york-62 | https://www.yelp.com/biz/friedmans-new-york-62 |
| 3 | \n\n 2937 BroadwayNew York, NY 10025\n ... | \nSalad,\n Vegetarian\n | 3.0 star rating | \n sweetgreen\n | sweetgreen | sweetgreen-new-york-6 | https://www.yelp.com/biz/sweetgreen-new-york-6 |
| 4 | \n\n 2168 Frederick Douglass BlvdNew Yo... | \nItalian,\n Breakfast & Br... | 4.0 star rating | \n Lido\n | Lido | lido-new-york | https://www.yelp.com/biz/lido-new-york |
BONUS: How fast is Selenium vs. BeautifulSoup?
We can use the %%time magic to time them. IT'S A RACE!
First up, Selenium:
CPU times: user 89.9 ms, sys: 7.38 ms, total: 97.2 ms
Wall time: 53.9 s
Now let’s try BeautifulSoup and requests:
CPU times: user 6.46 s, sys: 85.3 ms, total: 6.54 s
Wall time: 41.2 s
Not that much different in this case!
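If you're not in a Jupyter notebook, %%time won't work — but you can get the same kind of wall-clock measurement in plain Python with time.perf_counter. A minimal sketch (the loop here is a stand-in for your scraping code):

```python
import time

# %%time only works at the top of a Jupyter cell; this is the
# plain-Python equivalent for measuring wall time
start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the scraping loop
elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed:.2f} s")
```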