Pagination when scraping¶
Sometimes you need to scrape multiple pages of a website. For example, if you want to scrape all the products from an e-commerce website, you might need to click the "next" button until you hit the last page of results.
There are a few different patterns to follow when scraping multiple pages.
A specific number of pages¶
A pretty common pattern is to use a for loop to iterate over a range of numbers.
On a simple site¶
For example, if you want to scrape the first 10 pages of a website, you might write:
import requests
import pandas as pd

rows = []

for page_num in range(1, 11):
    print("Scraping page", page_num)
    url = f"http://example.com/page/{page_num}"
    html = requests.get(url).content

    # Scrape scrape scrape
    # Scrape scrape scrape
    # Always adding to the 'rows' variable

df = pd.DataFrame(rows)
df.head()
This uses the page_num variable to plug a page number into the URL. If you use range for this, remember that it doesn't include the final number! range(1, 11) will give you the numbers 1 through 10.
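If you're wondering what the "scrape scrape scrape" part might look like, here's a minimal sketch of what could go inside the loop. It assumes you're parsing the HTML with BeautifulSoup and that each result lives in a hypothetical .product element, so adjust the selector (and the fields you pull out) to match your actual page.

# A minimal sketch of the scraping step, assuming BeautifulSoup and a
# hypothetical ".product" class on each result (adjust for your site)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for product in soup.select(".product"):
    rows.append({
        "page": page_num,
        "name": product.get_text(strip=True),
    })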
On an interactive site¶
If you're on an interactive site, you might need to click the next button a specific number of times.
# Visit the web page
driver.get("http://example.com")

rows = []

for page_num in range(10):
    print("Scraping page", page_num)

    # Scrape scrape scrape
    # Scrape scrape scrape
    # Always adding to the 'rows' variable

    driver.find_element(By.CSS_SELECTOR, "a.next").click()

# Build your dataframe
df = pd.DataFrame(rows)
df.head()
For this one you don't actually use the page_num variable for anything other than printing it out. range(10) counts from 0 to 9, so you'll click the next button 10 times. I find range(10) easier to think about as doing ten things than using range(1, 11) like we did in the previous example.
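The "scrape scrape scrape" step works much the same way here, except you pull the elements straight out of the browser instead of requesting the HTML yourself. A minimal sketch of what could go inside the loop, again assuming a hypothetical .product selector:

# A minimal sketch of the scraping step, assuming each result is a
# hypothetical ".product" element (adjust the selector for your site)
for product in driver.find_elements(By.CSS_SELECTOR, ".product"):
    rows.append({
        "page": page_num,
        "name": product.text,
    })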
Click "next" until it breaks¶
I personally like this pattern the best!
In this case, we tell the browser to keep clicking the "next" button after we're done scraping each page. At some point the button disappears, the click raises an error, and instead of crashing, the code catches it and quietly exits the loop.
# Visit the web page
driver.get("http://example.com")

rows = []

# This loop will run FOREVER (or until there's an error)
while True:
    # Scrape scrape scrape
    # Scrape scrape scrape
    # Always adding to the 'rows' variable

    try:
        # Try to click the next button
        driver.find_element(By.CSS_SELECTOR, "a.next").click()
    except:
        # Exit the loop if the button isn't there
        break

# Build your dataframe
df = pd.DataFrame(rows)
df.head()
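If the bare except feels too broad for your taste, one option is to only catch Selenium's "element not found" error, so other problems still show up instead of silently ending the loop. A sketch of that variation on the same pattern:

from selenium.common.exceptions import NoSuchElementException

try:
    # Try to click the next button
    driver.find_element(By.CSS_SELECTOR, "a.next").click()
except NoSuchElementException:
    # Only exit the loop when the "next" button is truly gone
    break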