import requests
from bs4 import BeautifulSoup

In the beginning there were GET forms

When you’re searching for water at Walmart, the URL looks like this:

https://www.walmart.com/search/?query=water&cat_id=0

It’s easy to scrape! If you wanted to search for guns instead, you’d just change water to guns in the URL and off you go. This nice way of living is called passing parameters in the query string.

# Get the page
response = requests.get("https://www.walmart.com/search/?query=water&cat_id=0")
doc = BeautifulSoup(response.text, 'html.parser')

# Grab all of the titles
title_tags = doc.find_all(class_='prod-ProductTitle')

# Let's print the first 5
for title in title_tags[:5]:
    print(title.text.strip())
Sam's Choice Purified Drinking Water, 10 fl oz, 12 pack
Nestlé Pure Life Purified Water 12 x 16.9fl oz (202.8fl oz)
Fiji Natural Artesian Water, 6pk
Gerber Pure Purified Water, 1.0 GAL
ArrowHead 100% Mountain Spring Water 6 x 23.7 fl oz (142.2 fl oz)
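
By the way, if you don’t feel like building the query string by hand, requests can assemble it for you: pass your parameters as a dictionary using params= and it builds the same URL. A minimal sketch:

# Same search, but letting requests build the query string itself
params = {
    'query': 'water',
    'cat_id': '0'
}
response = requests.get("https://www.walmart.com/search/", params=params)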

But it isn’t always like that.

But then: POST Forms

For most forms, though, it isn’t that easy. You type in your info, you click “Search,” and there’s nothing in the URL. For example, try searching at California’s Engineer License Database.

The URL you end up at is something like http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery, which doesn’t mean anything. No parameters in that query string!

If you run the search in your browser you see a lot of table rows, but if you try the same page in Python it doesn’t give you anything.

# Get the page
response = requests.get("http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery")
doc = BeautifulSoup(response.text, 'html.parser')

# Grab all of the rows
row_tags = doc.find_all('tr')

# Let's print the first 5
for row in row_tags[:5]:
    print(row.text.strip())

Nothing at all! What did it give us? Let’s look at response.text.

response.text
'<HTML>\n<HEAD>\n<TITLE>License Holders : </TITLE>\n</HEAD>\n<BODY bgcolor="#ffffff">\n<H1>License Holders : </H1>\n<P>\n<B><font color="ff4040" size=+2><I>Error!</I></font><br></B>\n<B>The following unhandled error has occurred in the routine WLLQRYNA$LCEV2.ActionQuery:</B>\n<P>\nORA-01403: no data found\n<P>\n<B>Please contact your support representative.</B>\n<P>\n</BODY>\n</HTML>\n'

If you read closely, that’s an error. It’s because we didn’t send it any search data.

Looking at response.text is THE BEST WAY to find out whether your search worked. You can ctrl+f or just visually search for words you know should be on the page.
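
You can do the same check in code. Here’s a minimal sketch - it just looks for a word we already know shows up in the error page:

# ctrl+f, but in Python
if 'Error' in response.text:
    print("The server sent back an error page instead of results")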

Finding our form data

When we clicked “Search,” it also sent the server a bunch of data - all of the options we typed in, or the dropdowns we selected. Here are the steps to find out what data needs to be sent along with your request.

We’re going to use Chrome’s Network tools to analyze all of the requests our browser sends to the server, then imitate them in Python.

  1. Open up Developer Tools in Chrome by selecting View > Developer > Developer Tools.
  2. Select the Network Tab
  3. Visit the page you’re going to do your search from
  4. Click the Clear button up top - 🚫 - then submit your form
  5. The Network tab will fill with activity!
  6. Find the thing in the Network tab that looks like the same name as your webpage. Click it.
  7. On the right-hand side you get a new pane. If you scroll allllll the way down it lists Form Data.

This Form Data is what we need to send along with our request. We just need to convert it to a dictionary and send it along.

Sending data with the form request

Once we’ve converted our form data into a dictionary, we need to make sure of two things:

  1. We’re using requests.post to make our request
  2. We’re sending the form data with the request

Normal browser requests - just visiting a page - are sent as GET requests, but these very fancy ones are sent as POST. POST just means “hey, I’m sending extra data along with this.”

data = {
    'P_QTE_CODE': 'ENG',
    'P_QTE_PGM_CODE': '7500',
    'P_LAST_NAME': 'smith',
    'P_FIRST_NAME': '',
    'P_INITIAL': '',
    'P_LICENSE_NUM': '',
    'P_CITY': '',
    'P_COUNTY': 'LOS ANGELES',
    'P_RECORD_SET_SIZE': '50',
    'Z_ACTION': 'Find'
}

# Get the page
# use .post
# send the data
url = "http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery"
response = requests.post(url, data=data)
doc = BeautifulSoup(response.text, 'html.parser')

# Grab all of the rows
row_tags = doc.find_all('tr')

# Let's print the first 5
for row in row_tags[:5]:
    print(row.text.strip())
NameTypeNumberStatusAddressCityZipCounty
SMITH                              A              M  OM2554CANCELLED2245 ASHBOURNE DRSAN MARINO91108LOS ANGELES
SMITH                              ALLEN          EL2352CANCELLED713 N CALIFORNIA STBURBANK91505LOS ANGELES
SMITH                              ALVIN          JE490DECEASED5004 RAMSDELL AVELA CRESCENTA91214LOS ANGELES
SMITH                              ARTHUR         KERMITCS3124CANCELLED28803 CEDARBLUFF DRRANCHO PALOS VERDES90275LOS ANGELES

If we weren’t sure whether it worked, we could also check by looking at response.text.

Sending headers with your request

Sometimes that isn’t enough! Some web servers check to make sure you’re a real browser, or you came from their site, or other stuff like that.

We don’t need to do this for the Engineers page, but I’m going to do it anyway.

When you send a request, you also send things called “headers.” You can see them in the same Network tab pane where you found the Form Data. They’re listed as Request Headers - ignore the response headers.

Pretending to be the browser

The most common thing you’ll need to do is impersonate a browser by sending a User-Agent string. If we wanted to visit Columbia’s website pretending to be Chrome, we might do this:

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
requests.get("http://journalism.columbia.edu", headers=headers)
<Response [200]>
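
If you want to double-check what you actually sent, requests keeps the outgoing request attached to the response - response.request.headers shows every header that went out, User-Agent included.

# Confirm the User-Agent we sent along
response = requests.get("http://journalism.columbia.edu", headers=headers)
print(response.request.headers)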

Finding the appropriate headers

Sometimes pretending to be the browser just isn’t enough. If you want to 100% imitate your browser when sending a request, you need to copy aaaaalllll of the headers from the request.

It’s just above the Form Data information, but I’ll tell you how to find it again just to be sure:

  1. Open up Developer Tools in Chrome by selecting View > Developer > Developer Tools.
  2. Select the Network Tab
  3. Visit the page you’re going to do your search from
  4. Click the Clear button up top - 🚫 - then submit your form
  5. The Network tab will fill with activity!
  6. Find the thing in the Network tab that looks like the same name as your webpage. Click it.
  7. On the right-hand side you get a new pane. If you scroll near to the bottom it shows you Request Headers.

You just need to convert these into a dictionary, and send them along with your request.

Sending the appropriate headers

I just checked my results for the Engineers bit. It has a lot of headers!

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:156
Content-Type:application/x-www-form-urlencoded
Host:www2.dca.ca.gov
Origin:http://www2.dca.ca.gov
Referer:http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup?p_qte_code=ENG&p_qte_pgm_code=7500
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36

I’m usually too lazy to copy all of them, so I only take the ones I think I need - but if you’d like to grab them all, it’s probably easier than the weird curl thing I talked about in class.
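
If you do want every header without retyping them, here’s a sketch of a little helper that turns the copied Name:value lines into a dictionary. headers_to_dict is just something I made up, and the split(':', 1) matters because values like the Referer URL have colons of their own:

# Hypothetical helper: paste the copied headers in as one big string
def headers_to_dict(raw):
    headers = {}
    for line in raw.strip().splitlines():
        name, value = line.split(':', 1)
        headers[name.strip()] = value.strip()
    return headers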

Let’s make a request using both headers and POST data.

# Here are all of our headers
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'en-US,en;q=0.8',
    'Cache-Control':'max-age=0',
    'Connection':'keep-alive',
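    # (requests calculates Content-Length on its own, so copying this one is optional)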
    'Content-Length':'156',
    'Content-Type':'application/x-www-form-urlencoded',
    'Host':'www2.dca.ca.gov',
    'Origin':'http://www2.dca.ca.gov',
    'Referer':'http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup?p_qte_code=ENG&p_qte_pgm_code=7500',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Here is the form data
data = {
    'P_QTE_CODE': 'ENG',
    'P_QTE_PGM_CODE': '7500',
    'P_LAST_NAME': 'smith',
    'P_FIRST_NAME': '',
    'P_INITIAL': '',
    'P_LICENSE_NUM': '',
    'P_CITY': '',
    'P_COUNTY': 'LOS ANGELES',
    'P_RECORD_SET_SIZE': '50',
    'Z_ACTION': 'Find'
}

# Get the page
# use .post
# send the data
# send the headers
url = "http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery"
response = requests.post(url, data=data, headers=headers)
doc = BeautifulSoup(response.text, 'html.parser')

# Grab all of the rows
row_tags = doc.find_all('tr')

# Let's print the first 5
for row in row_tags[:5]:
    print(row.text.strip())
NameTypeNumberStatusAddressCityZipCounty
SMITH                              A              M  OM2554CANCELLED2245 ASHBOURNE DRSAN MARINO91108LOS ANGELES
SMITH                              ALLEN          EL2352CANCELLED713 N CALIFORNIA STBURBANK91505LOS ANGELES
SMITH                              ALVIN          JE490DECEASED5004 RAMSDELL AVELA CRESCENTA91214LOS ANGELES
SMITH                              ARTHUR         KERMITCS3124CANCELLED28803 CEDARBLUFF DRRANCHO PALOS VERDES90275LOS ANGELES

Perfect! Now that you know how .post requests, form data, and headers work, you’ll be able to scrape a lot of very difficult sites.