Okay, so, well, maybe I could teach you how .apply works, right? Maybe we could go deep into scraping, go big on columns vs rows in pandas, learn every single thing about how everything works?

Or, we could just figure out how to do it. That’s easier for me to write, so I’m going to be lazy.

Scraping a single page on Yelp

Let’s say you’re scraping a page on Yelp. Our page is going to be this Shake Shack location.

We’ll scrape it just like normal, and make a dictionary of the information on it.

from selenium import webdriver

driver = webdriver.Chrome()

driver.get('https://www.yelp.com/biz/shake-shack-new-york-54')

store_name = driver.find_element_by_class_name("../biz-page-title").text
full_address = driver.find_element_by_class_name("street-address").text
stars = driver.find_element_by_class_name("i-stars").get_attribute('title')
categories = driver.find_element_by_class_name("category-str-list").text

store = {
    'name': store_name,
    'address': full_address,
    'stars': stars,
    'categories': categories
}

store
{'address': '2957 Broadway\nNew York, NY 10025',
 'categories': 'Hot Dogs, Burgers, Ice Cream & Frozen Yogurt',
 'name': 'Shake Shack',
 'stars': '3.5 star rating'}

Scraping many pages on Yelp

But sometimes instead of scraping one page, you need to scrape many pages. In this case, you need two things:

  1. A dataframe, where you’re going to scrape for each row
  2. A function to do the actual scraping

Our dataframe

import pandas as pd

df = pd.read_csv("yelp.csv")
df.head(3)
          name                     slug
0  Shake Shack  shake-shack-new-york-54
1     Flat Top        flat-top-new-york
2   Friedman's    friedmans-new-york-62

A function to do the scraping

We’re just going to take our old scraping code and make a few adjustments:

1. Add def somethingsomething(row) to turn it into a function

Because it’s a function, we’ll need to indent.

Also, we want to make sure we do NOT have driver = webdriver.Chrome() inside of the function, or else it will make a new Chrome every time we want to visit another page.
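
In other words, a quick sketch of the difference (with the actual scraping bits left out):

# BAD: a brand-new Chrome window opens for every single row
def get_yelp_info(row):
    driver = webdriver.Chrome()
    ...

# GOOD: one Chrome, opened once outside, reused by every row
driver = webdriver.Chrome()

def get_yelp_info(row):
    ...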

2. Use the row variable so it isn’t always scraping the same page

Before, we always got the same URL from Yelp. We don’t want to do that anymore!

Old code

driver.get("https://www.yelp.com/biz/shake-shack-new-york-54")

Now we have a row variable that is our row of data. If we want to build a URL, we take "https://www.yelp.com/biz/" and add row['slug'] (that’s how Yelp URLs look).

New code

driver.get("https://www.yelp.com/biz/" + row['slug'])

3. Return a pd.Series of our data instead of creating a dictionary

Old code

store = {
    'name': store_name,
    'address': full_address,
    'stars': stars,
    'categories': categories
}

store

Because it’s a function, we need to return something - and to add columns to our dataframe, it needs to be a pd.Series. We’ll also rename 'name' to 'store_name' so it won’t collide with the name column our dataframe already has when we .join them together later.

New code

return pd.Series({
    'store_name': store_name,
    'address': full_address,
    'stars': stars,
    'categories': categories
})
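
If the pd.Series part feels mysterious: when the function you .apply returns a pd.Series, pandas turns each key into a column. Here’s a toy example with completely made-up data - the fake_scrape function and the demo dataframe are just for illustration:

demo = pd.DataFrame({'slug': ['cafe-one', 'cafe-two']})

def fake_scrape(row):
    # Pretend we visited the page and found these things
    return pd.Series({'store_name': row['slug'].title(), 'stars': '5 star rating'})

demo.apply(fake_scrape, axis=1)
  store_name          stars
0   Cafe-One  5 star rating
1   Cafe-Two  5 star rating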

Complete OLD code

driver = webdriver.Chrome()

driver.get("https://www.yelp.com/biz/shake-shack-new-york-54")

store_name = driver.find_element_by_class_name("biz-page-title").text
full_address = driver.find_element_by_class_name("street-address").text
stars = driver.find_element_by_class_name("i-stars").get_attribute('title')
categories = driver.find_element_by_class_name("category-str-list").text

store = {
    'name': store_name,
    'address': full_address,
    'stars': stars,
    'categories': categories
}

store

Complete NEW code

def get_yelp_info(row):
    driver.get("https://www.yelp.com/biz/" + row['slug'])

    store_name = driver.find_element_by_class_name("biz-page-title").text
    full_address = driver.find_element_by_class_name("street-address").text
    stars = driver.find_element_by_class_name("i-stars").get_attribute('title')
    categories = driver.find_element_by_class_name("category-str-list").text

    return pd.Series({
        'store_name': store_name,
        'address': full_address,
        'stars': stars,
        'categories': categories
    })

Using our function

Now that we’ve made a function, we need to use it.

  1. Open up a new driver
  2. Use .apply to use the function on each row
  3. Use .join to add the columns to the dataframe

Basically you’ll always cut and paste this code. Be sure to change your variable names.

# Open up a new Chrome
driver = webdriver.Chrome()

# Take every row and send it to get_yelp_info, and combine with old data
new_df = df.apply(get_yelp_info, axis=1).join(df)
new_df.head()
   address                                          | categories                                   | stars           | store_name  | name        | slug                    | url
0  2957 Broadway\nNew York, NY 10025                | Hot Dogs, Burgers, Ice Cream & Frozen Yogurt | 3.5 star rating | Shake Shack | Shake Shack | shake-shack-new-york-54 | https://www.yelp.com/biz/shake-shack-new-york-54
1  1241 Amsterdam Ave\nNew York, NY 10027           | American (New), Cafes, Breakfast & Brunch    | 4.0 star rating | Flat Top    | Flat Top    | flat-top-new-york       | https://www.yelp.com/biz/flat-top-new-york
2  1187 Amsterdam Ave\nNew York, NY 10027           | American (Traditional), Breakfast & Brunch   | 3.5 star rating | Friedman’s  | Friedman's  | friedmans-new-york-62   | https://www.yelp.com/biz/friedmans-new-york-62
3  2937 Broadway\nNew York, NY 10025                | Salad, Vegetarian                            | 3.0 star rating | sweetgreen  | sweetgreen  | sweetgreen-new-york-6   | https://www.yelp.com/biz/sweetgreen-new-york-6
4  2168 Frederick Douglass Blvd\nNew York, NY 10026 | Italian, Breakfast & Brunch, Cocktail Bars   | 4.0 star rating | Lido        | Lido        | lido-new-york           | https://www.yelp.com/biz/lido-new-york
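
One warning before you scrape hundreds of rows: if any single page throws an error - a missing element, a timeout - the whole .apply run dies with it. One way around that is to catch the error and hand back empty values instead. This get_yelp_info_safely wrapper is just a sketch, not anything built into pandas:

import numpy as np

def get_yelp_info_safely(row):
    # If the page scrapes fine, great; if not, return NaNs instead of crashing
    try:
        return get_yelp_info(row)
    except Exception:
        return pd.Series({
            'store_name': np.nan,
            'address': np.nan,
            'stars': np.nan,
            'categories': np.nan
        })

# Used exactly the same way as before
new_df = df.apply(get_yelp_info_safely, axis=1).join(df)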

But what about BeautifulSoup?

No problem, you can do the exact same thing. This isn’t about Selenium, it’s about pandas!

import requests
from bs4 import BeautifulSoup

def get_yelp_with_bs(row):
    response = requests.get("https://www.yelp.com/biz/" + row['slug'])
    doc = BeautifulSoup(response.text, 'html.parser')
    
    store_name = doc.find(class_="biz-page-title").text
    full_address = doc.find(class_="street-address").text
    stars = doc.find(class_="i-stars")['title']
    categories = doc.find(class_="category-str-list").text

    return pd.Series({
        'store_name': store_name,
        'address': full_address,
        'stars': stars,
        'categories': categories
    })

bs_df = df.apply(get_yelp_with_bs, axis=1).join(df)
bs_df.head()
   address                                     | categories                      | stars           | store_name        | name        | slug                    | url
0  \n\n 2957 BroadwayNew York, NY 10025\n ...  | \nHot Dogs,\n Burgers,\n ...    | 3.5 star rating | \n Shake Shack\n  | Shake Shack | shake-shack-new-york-54 | https://www.yelp.com/biz/shake-shack-new-york-54
1  \n\n 1241 Amsterdam AveNew York, NY 100...  | \nAmerican (New),\n Cafes,\...  | 4.0 star rating | \n Flat Top\n     | Flat Top    | flat-top-new-york       | https://www.yelp.com/biz/flat-top-new-york
2  \n\n 1187 Amsterdam AveNew York, NY 100...  | \nAmerican (Traditional),\n ... | 3.5 star rating | \n Friedman’s\n   | Friedman's  | friedmans-new-york-62   | https://www.yelp.com/biz/friedmans-new-york-62
3  \n\n 2937 BroadwayNew York, NY 10025\n ...  | \nSalad,\n Vegetarian\n         | 3.0 star rating | \n sweetgreen\n   | sweetgreen  | sweetgreen-new-york-6   | https://www.yelp.com/biz/sweetgreen-new-york-6
4  \n\n 2168 Frederick Douglass BlvdNew Yo...  | \nItalian,\n Breakfast & Br...  | 4.0 star rating | \n Lido\n         | Lido        | lido-new-york           | https://www.yelp.com/biz/lido-new-york
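
See all that \n junk in this version? That’s because .text hands the whitespace back exactly as it sits in the HTML. If you want it cleaner, BeautifulSoup’s get_text(strip=True) trims it for you - something like this inside the function:

store_name = doc.find(class_="biz-page-title").get_text(strip=True)
full_address = doc.find(class_="street-address").get_text(" ", strip=True)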

BONUS: How fast is Selenium vs. BeautifulSoup?

We can use the %%time magic to time them. IT’S A RACE!

First up, Selenium:

%%time
new_df = df.apply(get_yelp_info, axis=1).join(df)
new_df.head()
CPU times: user 89.9 ms, sys: 7.38 ms, total: 97.2 ms
Wall time: 53.9 s

Now let’s try BeautifulSoup and requests:

%%time
bs_df = df.apply(get_yelp_with_bs, axis=1).join(df)
bs_df.head(2)
CPU times: user 6.46 s, sys: 85.3 ms, total: 6.54 s
Wall time: 41.2 s

Not that much different in this case! Notice how tiny the CPU times are compared to the wall times: either way, most of the time is spent waiting on Yelp’s servers, so the network matters a lot more than which scraping tool you pick.