Pretending to be a browser with requests and BeautifulSoup

Some web servers are insistent that no one scrape them. There are several ways to get around these sorts of restrictions. While a browser automation tool like Selenium or Playwright is the most powerful option, it's also more complicated and slower. If you're just getting started with web scraping or have a lot, lot, lot of pages to scrape, you might want to start with something simpler.

Faking your User-Agent

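The easiest way to pretend to be a browser is to change your User-Agent. A User-Agent is a string that tells the web server what kind of browser you're using. By default, requests identifies itself as python-requests, which makes it easy for a server to spot and block.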

import requests

# A User-Agent string in the format desktop Chrome on macOS sends
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
response = requests.get('https://www.example.com', headers=headers)
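If you want to confirm what requests will actually send, one option is to build the request without sending it and look at its headers. This is only a sketch: the helper name `headers_that_would_be_sent` is made up for illustration, and it relies on `Session.prepare_request`, which merges the session's default headers with your own.

```python
import requests

def headers_that_would_be_sent(url, headers=None):
    """Hypothetical helper: prepare a GET request without sending it."""
    session = requests.Session()
    request = requests.Request('GET', url, headers=headers)
    # prepare_request merges the session defaults with the headers passed in
    prepared = session.prepare_request(request)
    return prepared.headers

# With no custom headers, requests announces itself as python-requests/<version>
default_headers = headers_that_would_be_sent('https://www.example.com')
print(default_headers['User-Agent'])

# With a custom User-Agent, the browser-like string replaces the default
browser_headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
spoofed_headers = headers_that_would_be_sent('https://www.example.com', headers=browser_headers)
print(spoofed_headers['User-Agent'])
```

Nothing goes over the network here, so it's a safe way to check your headers before pointing the scraper at a real site.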

Spoofing a real web browser request

When you visit a website, your browser sends a lot more information to the web server than just your User-Agent, including cookies and a long list of other headers. Instead of building up the pieces from scratch, you can copy a request from your browser and send it to the web server using requests. Sometimes this gives the server enough information that you can just use requests and BeautifulSoup instead of a web automation tool.

There are a few ways to do this. My favorite is to use the Copy as cURL option from the Chrome developer tools.

Open up the developer tools with View > Developer > Developer Tools or ⌘⌥I
Click on the Network tab
Visit the web page you want to scrape
Right-click on the request you want to copy (it's probably the one at the top of the list) and select Copy > Copy as cURL (or Copy as cURL (bash) if you're on Windows).

It will give you something awful that is not Python code – it's curl code, for the command line – so you'll want to visit https://curlconverter.com/ to convert it to Python code. The result will still be long and awful, but you'll probably be able to understand a little more of it:

import requests

cookies = {
    'nyt-a': '02M776PT3GdGh6vDY0h',
    'purr-cache': '<K0<rC_<G',
    'NYT-T': 'ok',
    'nyt-auth-method': 'username',
    'b2b_cig_opt': '%7B%22isrpUser%22%7D',
    'nyt-gdpr': '0',
    'nyt-geo': 'US',
    'nyt-cmots': '',
    'datadome': '7WcR9I_RUGozflOibxruPWQ5ftYU7YvwVb1oUDgMUc95fL0qNbqHLGbBQAs4zzyVaUzKv22PnEkMMoKZ5pFXlYSzmA-G2xPd6owLFhf34wg',
    'nyt-m': '-46a4-8b1a-9c297914e621&prt=i.0&ft=i.0&fv=i.0&v=i.0&pr=l.4.0.0.0.0&ier=i.0&iru=i.1&t=i.2&imv=i.0&igf=i.0&ira=i.0&igu=i.1&e=i.1669903200&iir=i.0',
    'nyt-b3-traceid': 'dbc374352dfc4bd86067635aa1654',
    'nyt-purr': 'cfhhhhckfh',
    'SIDNY': 'CBMSKQJ-I--8luzWAMgGFADsN',
}

headers = {
    'authority': 'www.nytimes.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-US,en;q=0.9,ru;q=0.8',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'if-modified-since': 'Mon, 28 Nov 2022 00:06:21 GMT',
    'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'sec-gpc': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
}

response = requests.get('https://www.nytimes.com/', cookies=cookies, headers=headers)

Copy and paste that into your Python code and you might be able to scrape the page!
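Once a request goes through, the next step is usually to check that you got a real page back and hand the HTML to BeautifulSoup. Here's a minimal sketch of that. The function names `looks_like_real_page` and `extract_links` are made up for illustration, the "status 200 plus a non-empty title" check is just a rough heuristic, and the sample HTML stands in for `response.text` so the code runs without hitting a real site.

```python
from bs4 import BeautifulSoup

def looks_like_real_page(status_code, html):
    """Rough heuristic (an assumption, not a rule): 200 plus a non-empty <title>."""
    if status_code != 200:
        return False
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title is not None and bool(soup.title.get_text(strip=True))

def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]

# Stand-in for response.text from a real request
sample_html = """
<html><head><title>Example News</title></head>
<body>
<a href="/story-1">First story</a>
<a href="/story-2">Second story</a>
<a>No link here</a>
</body></html>
"""

if looks_like_real_page(200, sample_html):
    for text, href in extract_links(sample_html):
        print(text, href)
```

With a real request you would pass `response.status_code` and `response.text` instead of the stand-in values. If the check fails, the server may have noticed you're not a browser, and it might be time to reach for a browser automation tool after all.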