Okay, so maybe I could teach you how .apply works. Maybe we could go deep into scraping, go big on columns vs. rows in pandas, and learn every single thing about how everything works. Or we could just figure out how to do it. That's easier for me to write, so I'm going to be lazy.
Scraping a single page on Yelp
Let’s say you’re scraping a page on Yelp. Our page is going to be this Shake Shack location. We’ll scrape it just like normal, and make a dictionary of the information on it.
{'address': '2957 Broadway\nNew York, NY 10025',
'categories': 'Hot Dogs, Burgers, Ice Cream & Frozen Yogurt',
'name': 'Shake Shack',
'stars': '3.5 star rating'}
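The scrape itself might look something like this. To keep it runnable here, this sketch parses a stripped-down, made-up copy of the page's HTML with BeautifulSoup — the tag and class names are invented for illustration, and Yelp's real markup will differ.

```python
from bs4 import BeautifulSoup

# A stand-in for the Yelp page's HTML. The class names here are
# made up for illustration; the real page's markup will differ.
html = """
<div>
  <h1 class="biz-page-title">Shake Shack</h1>
  <span class="star-rating">3.5 star rating</span>
  <span class="category-str-list">Hot Dogs, Burgers, Ice Cream &amp; Frozen Yogurt</span>
  <address>2957 Broadway\nNew York, NY 10025</address>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull each piece out of the page and put it into a dictionary
store = {
    'name': soup.find(class_='biz-page-title').text.strip(),
    'address': soup.find('address').text.strip(),
    'stars': soup.find(class_='star-rating').text.strip(),
    'categories': soup.find(class_='category-str-list').text.strip(),
}
store
```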
Scraping many pages on Yelp
But sometimes instead of scraping one page, you need to scrape many pages. In this case, you need two things:
- A dataframe, where you’re going to scrape for each row
- A function to do the actual scraping
Our dataframe
| | name | slug |
|---|---|---|
| 0 | Shake Shack | shake-shack-new-york-54 |
| 1 | Flat Top | flat-top-new-york |
| 2 | Friedman's | friedmans-new-york-62 |
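You might build a dataframe like that by hand, or (more likely) read it in from a CSV — a minimal sketch of the by-hand version:

```python
import pandas as pd

# Build the three-row dataframe of names and Yelp URL slugs by hand;
# in practice you'd probably read this from a CSV with pd.read_csv
df = pd.DataFrame([
    {'name': 'Shake Shack', 'slug': 'shake-shack-new-york-54'},
    {'name': 'Flat Top', 'slug': 'flat-top-new-york'},
    {'name': "Friedman's", 'slug': 'friedmans-new-york-62'},
])
df
```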
A function to do the scraping
We’re just going to take our old scraping code and make a few adjustments:
1. Add def somethingsomething(row) to turn it into a function

Because it's a function, we'll need to indent. We also want to make sure we do NOT have driver = webdriver.Chrome() inside the function, or else it will open a new Chrome every time we want to visit another page.
2. Use the row variable so it isn't always scraping the same page

Before, we always got the same URL from Yelp. We don't want to do that anymore!
Old code
driver.get("https://www.yelp.com/biz/shake-shack-new-york-54")
Now we have a row variable that is our row of data. If we want to build a URL, we take "https://www.yelp.com/biz/" and add row['slug'] (that's how Yelp URLs look).
New code
driver.get("https://www.yelp.com/biz/" + row['slug'])
3. Return a pd.Series of our data instead of creating a dictionary
Old code
store = {
'name': store_name,
'address': full_address,
'stars': stars,
'categories': categories
}
store
Because it's a function, we need to return something. To add columns to our dataframe, that something needs to be a pd.Series.
New code
return pd.Series({
'name': store_name,
'address': full_address,
'stars': stars,
'categories': categories
})
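Putting the three changes together, the function might look like the sketch below. To keep it runnable without opening Chrome, a dictionary of fake "pages" stands in for the browser — with Selenium you'd call driver.get(url) and pull the values out of the real page instead. The function name get_store_details is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the browser: fake "pages" keyed by URL, so we can show
# the shape of the function without Selenium. Real code would do
# driver.get(url) here and scrape the values from the page.
fake_pages = {
    'https://www.yelp.com/biz/shake-shack-new-york-54': {
        'name': 'Shake Shack',
        'address': '2957 Broadway\nNew York, NY 10025',
        'stars': '3.5 star rating',
        'categories': 'Hot Dogs, Burgers, Ice Cream & Frozen Yogurt',
    },
}

def get_store_details(row):
    # Change 2: build the URL from the row, like driver.get(... + row['slug'])
    url = 'https://www.yelp.com/biz/' + row['slug']
    page = fake_pages[url]
    # Change 3: return a pd.Series so .apply can turn it into columns
    return pd.Series({
        'name': page['name'],
        'address': page['address'],
        'stars': page['stars'],
        'categories': page['categories'],
    })

# Try it on a single row's worth of data
row = pd.Series({'name': 'Shake Shack', 'slug': 'shake-shack-new-york-54'})
get_store_details(row)
```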
Complete OLD code
Complete NEW code
Using our function
Now that we’ve made a function, we need to use it.
- Open up a new driver
- Use .apply to run the function on each row
- Use .join to add the columns to the dataframe

Basically you'll always cut and paste this code. Be sure to change your variable names.
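The cut-and-paste pattern looks roughly like this. The scraping function here is a stand-in (it only builds the URL) so the .apply/.join mechanics stay visible and runnable; in real code it would open the page and scrape it.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Shake Shack', 'slug': 'shake-shack-new-york-54'},
    {'name': 'Flat Top', 'slug': 'flat-top-new-york'},
])

# Stand-in scraping function: real code would visit the page with
# the driver here and scrape stars, address, etc.
def get_store_details(row):
    return pd.Series({
        'url': 'https://www.yelp.com/biz/' + row['slug'],
    })

# axis=1 means "feed the function one ROW at a time"
# (the default, axis=0, would feed it one COLUMN at a time)
new_columns = df.apply(get_store_details, axis=1)

# .join glues the new columns onto the original dataframe, matching on index
df = df.join(new_columns)
df
```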
| | address | categories | stars | store_name | name | slug | url |
|---|---|---|---|---|---|---|---|
| 0 | 2957 Broadway\nNew York, NY 10025 | Hot Dogs, Burgers, Ice Cream & Frozen Yogurt | 3.5 star rating | Shake Shack | Shake Shack | shake-shack-new-york-54 | https://www.yelp.com/biz/shake-shack-new-york-54 |
| 1 | 1241 Amsterdam Ave\nNew York, NY 10027 | American (New), Cafes, Breakfast & Brunch | 4.0 star rating | Flat Top | Flat Top | flat-top-new-york | https://www.yelp.com/biz/flat-top-new-york |
| 2 | 1187 Amsterdam Ave\nNew York, NY 10027 | American (Traditional), Breakfast & Brunch | 3.5 star rating | Friedman’s | Friedman's | friedmans-new-york-62 | https://www.yelp.com/biz/friedmans-new-york-62 |
| 3 | 2937 Broadway\nNew York, NY 10025 | Salad, Vegetarian | 3.0 star rating | sweetgreen | sweetgreen | sweetgreen-new-york-6 | https://www.yelp.com/biz/sweetgreen-new-york-6 |
| 4 | 2168 Frederick Douglass Blvd\nNew York, NY 10026 | Italian, Breakfast & Brunch, Cocktail Bars | 4.0 star rating | Lido | Lido | lido-new-york | https://www.yelp.com/biz/lido-new-york |
But what about BeautifulSoup?
No problem, you can do the exact same thing. This isn’t about Selenium, it’s about pandas!
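The BeautifulSoup version of the function looks the same, except the page comes from requests.get(url).text instead of the driver. To keep this sketch runnable offline, it parses a one-line made-up snippet of HTML — the class name is invented and won't match Yelp's real markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML stand-in for the downloaded page, so this runs offline
html = '<h1 class="biz-page-title">Shake Shack</h1>'

def get_store_details(row):
    # Real code would download the page first:
    #   response = requests.get('https://www.yelp.com/biz/' + row['slug'])
    #   soup = BeautifulSoup(response.text, 'html.parser')
    soup = BeautifulSoup(html, 'html.parser')
    return pd.Series({
        'store_name': soup.find(class_='biz-page-title').text,
    })

df = pd.DataFrame([{'slug': 'shake-shack-new-york-54'}])
# Exact same .apply/.join pattern as with Selenium
df.join(df.apply(get_store_details, axis=1))
```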
| | address | categories | stars | store_name | name | slug | url |
|---|---|---|---|---|---|---|---|
| 0 | \n\n 2957 BroadwayNew York, NY 10025\n ... | \nHot Dogs,\n Burgers,\n ... | 3.5 star rating | \n Shake Shack\n | Shake Shack | shake-shack-new-york-54 | https://www.yelp.com/biz/shake-shack-new-york-54 |
| 1 | \n\n 1241 Amsterdam AveNew York, NY 100... | \nAmerican (New),\n Cafes,\... | 4.0 star rating | \n Flat Top\n | Flat Top | flat-top-new-york | https://www.yelp.com/biz/flat-top-new-york |
| 2 | \n\n 1187 Amsterdam AveNew York, NY 100... | \nAmerican (Traditional),\n ... | 3.5 star rating | \n Friedman’s\n | Friedman's | friedmans-new-york-62 | https://www.yelp.com/biz/friedmans-new-york-62 |
| 3 | \n\n 2937 BroadwayNew York, NY 10025\n ... | \nSalad,\n Vegetarian\n | 3.0 star rating | \n sweetgreen\n | sweetgreen | sweetgreen-new-york-6 | https://www.yelp.com/biz/sweetgreen-new-york-6 |
| 4 | \n\n 2168 Frederick Douglass BlvdNew Yo... | \nItalian,\n Breakfast & Br... | 4.0 star rating | \n Lido\n | Lido | lido-new-york | https://www.yelp.com/biz/lido-new-york |
BONUS: How fast is Selenium vs. BeautifulSoup?
We can use the %%time magic to time them. IT'S A RACE!
First up, Selenium:
CPU times: user 89.9 ms, sys: 7.38 ms, total: 97.2 ms
Wall time: 53.9 s
Now let’s try BeautifulSoup and requests:
CPU times: user 6.46 s, sys: 85.3 ms, total: 6.54 s
Wall time: 41.2 s
Not that much different in this case!
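If you're not in a Jupyter notebook, %%time won't work — but you can get the same kind of wall-clock measurement in plain Python with time.perf_counter. A minimal sketch (the loop here is a stand-in for your scraping code):

```python
import time

# %%time only works at the top of a Jupyter cell; this is the
# plain-Python equivalent for measuring wall time
start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the scraping loop
elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed:.2f} s")
```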