In the beginning there were GET forms
When you’re searching for water at Walmart, the URL looks like this:
https://www.walmart.com/search/?query=water&cat_id=0
It’s easy to scrape! If you wanted to search for guns instead, you’d just change water to guns in the URL and off you go. This nice way of living comes from GET forms, which put your search terms in the URL as parameters in the query string.
Sam's Choice Purified Drinking Water, 10 fl oz, 12 pack
Nestlé Pure Life Purified Water 12 x 16.9fl oz (202.8fl oz)
Fiji Natural Artesian Water, 6pk
Gerber Pure Purified Water, 1.0 GAL
ArrowHead 100% Mountain Spring Water 6 x 23.7 fl oz (142.2 fl oz)
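Here is a rough sketch of that URL trick in Python, using only the standard library. The helper name build_search_url is made up for illustration; the query and cat_id parameters are the ones visible in the Walmart URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.walmart.com/search/"

def build_search_url(term):
    """Build the search URL for a term (helper name is illustrative)."""
    # query and cat_id are the two parameters visible in the URL above
    params = {"query": term, "cat_id": 0}
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("water"))
print(build_search_url("guns"))
```

If you're using the requests library, requests.get(BASE_URL, params=params) builds the same kind of URL for you.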
But it isn’t always like that.
But then: POST Forms
For most forms, though, it isn’t that easy. You type in your info, you click “Search”, and there’s nothing in the URL. For example, try searching at California’s Engineer License Database.
The URL you end up at is something like http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery, which doesn’t mean anything. No parameters in that query string!
If you search through the browser you see a lot of table rows, but if you try it in Python it doesn’t give you anything.
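A minimal sketch of that failed attempt, assuming the requests library is installed. The helper names fetch_without_data and count_table_rows are made up for illustration, and the row counter uses Python's built-in HTML parser, which may not be the parser you normally reach for.

```python
from html.parser import HTMLParser

URL = "http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery"

class RowCounter(HTMLParser):
    """Counts <tr> tags, i.e. table rows, in a page."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TR> and <tr> both match
        if tag == "tr":
            self.rows += 1

def count_table_rows(html):
    counter = RowCounter()
    counter.feed(html)
    return counter.rows

def fetch_without_data():
    import requests  # third-party: pip install requests
    # A plain GET: we visit the URL but send none of the search fields
    return requests.get(URL, timeout=30)

# Uncomment to try it against the live site:
# response = fetch_without_data()
# print(count_table_rows(response.text))
```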
Nothing at all! What did it give us? Let’s look at response.text.
'<HTML>\n<HEAD>\n<TITLE>License Holders : </TITLE>\n</HEAD>\n<BODY bgcolor="#ffffff">\n<H1>License Holders : </H1>\n<P>\n<B><font color="ff4040" size=+2><I>Error!</I></font><br></B>\n<B>The following unhandled error has occurred in the routine WLLQRYNA$LCEV2.ActionQuery:</B>\n<P>\nORA-01403: no data found\n<P>\n<B>Please contact your support representative.</B>\n<P>\n</BODY>\n</HTML>\n'
If you read closely, that’s an error. It’s because we didn’t send it any search data.
Looking at response.text is THE BEST WAY to find out whether your search worked. You can ctrl+f or just visually search for words you know should be on the page.
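If you'd rather not ctrl+f by hand, the same check can be sketched in a couple of lines. The helper name page_contains is hypothetical, and error_page is a shortened copy of the error page shown above.

```python
def page_contains(text, *words):
    """Return True only if every word appears somewhere in the page text."""
    lowered = text.lower()
    return all(word.lower() in lowered for word in words)

# A shortened copy of the error page shown above
error_page = "<H1>License Holders : </H1>\nORA-01403: no data found\n"

print(page_contains(error_page, "SMITH"))          # False: no results in there
print(page_contains(error_page, "no data found"))  # True: it's the error page
```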
Finding our form data
When we clicked “Search,” it also sent the server a bunch of data - all of the options we typed in, or the dropdowns we selected. Here are the steps to find out what data needs to be sent along with your request.
We’re going to use Chrome’s Network tools to analyze all of the requests our browser sends to the server, then imitate them in Python.
- Open up Developer Tools in Chrome by selecting View > Developer > Developer Tools.
- Select the Network Tab
- Visit the page you’re going to do your search from
- Click the Clear button up top - 🚫 - then submit your form
- The Network tab will fill with activity!
- Find the thing in the Network tab that has the same name as your webpage. Click it.
- On the right-hand side you get a new pane. If you scroll allllll the way down it lists Form Data.
This Form Data is what we need to send along with our request. We just need to convert it to a dictionary and send it along.
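One way to do that conversion, sketched under the assumption that you copied the Form Data pane as "name: value" lines. The helper name form_data_to_dict and the field names in the sample are hypothetical; the real field names are whatever your own Network tab shows.

```python
def form_data_to_dict(pasted):
    """Turn 'name: value' lines copied from the Form Data pane into a dict."""
    form = {}
    for line in pasted.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, in case the value contains one
        name, _, value = line.partition(":")
        form[name.strip()] = value.strip()
    return form

# Hypothetical field names: copy the real ones from your own Network tab
pasted = """
last_name: smith
county: LOS ANGELES
license_type:
"""

print(form_data_to_dict(pasted))
```

If a field name shows up twice in the Form Data, a dictionary keeps only the last one. In that case a list of (name, value) tuples is the safer shape, and requests accepts that too.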
Sending data with the form request
Once we’ve converted our form data into a dictionary, we need to make sure of two things:
- We’re using requests.post to make our request
- We’re sending the form data with the request
Normal browser requests are sent as GET requests, but these very fancy ones are sent as POST. POST just means “hey I’m sending extra data along with this.”
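Put together, the request might look something like the sketch below. It assumes the requests library is installed, and the field names in form_data are hypothetical placeholders, not the real names the Engineers form uses.

```python
from urllib.parse import urlencode

URL = "http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery"

# Hypothetical field names: use whatever your Form Data pane listed
form_data = {
    "last_name": "smith",
    "county": "LOS ANGELES",
}

def search(data):
    import requests  # third-party: pip install requests
    # data= puts the fields in the request body rather than in the URL
    return requests.post(URL, data=data, timeout=30)

# For simple string values, the body that gets sent looks like this
print(urlencode(form_data))

# Uncomment to try it against the live site:
# response = search(form_data)
# print(response.text)
```

Notice the fields are encoded just like a query string. The difference with POST is that they travel in the body of the request, so the URL stays clean.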
NameTypeNumberStatusAddressCityZipCounty
SMITH A M OM2554CANCELLED2245 ASHBOURNE DRSAN MARINO91108LOS ANGELES
SMITH ALLEN EL2352CANCELLED713 N CALIFORNIA STBURBANK91505LOS ANGELES
SMITH ALVIN JE490DECEASED5004 RAMSDELL AVELA CRESCENTA91214LOS ANGELES
SMITH ARTHUR KERMITCS3124CANCELLED28803 CEDARBLUFF DRRANCHO PALOS VERDES90275LOS ANGELES
If we didn’t know whether it worked, we could also check the response by looking at response.text.
Sending headers with your request
Sometimes that isn’t enough! Some web servers check to make sure you’re a real browser, or you came from their site, or other stuff like that.
We don’t need to do this for the Engineers page, but I’m going to do it anyway.
When you send a request, you also send things called “Headers.” You can see the headers in the same part of the Network tab where you found Form Data. They’re listed as Request Headers - ignore the Response Headers.
Pretending to be the browser
The most common thing you’ll need to do is impersonate a browser by sending a User-Agent string. If we wanted to visit Columbia’s website pretending to be Chrome, we might do this:
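A minimal sketch, assuming the requests library is installed. The User-Agent string is the Chrome one listed with the Engineers headers later on; the Columbia URL and the helper name visit_as_chrome are my assumptions.

```python
# User-Agent string copied from a Chrome request
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )
}

def visit_as_chrome(url):
    import requests  # third-party: pip install requests
    # headers= replaces the default "python-requests/..." User-Agent
    return requests.get(url, headers=headers, timeout=30)

# Uncomment to try it (the URL is an assumption):
# response = visit_as_chrome("https://www.columbia.edu")
# print(response)
```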
<Response [200]>
Finding the appropriate headers
Sometimes pretending to be the browser just isn’t enough. If you want to 100% imitate your browser when sending a request, you need to copy aaaaalllll of the headers from the request.
It’s just above the Form Data information, but I’ll tell you how to find it again just to be sure:
- Open up Developer Tools in Chrome by selecting View > Developer > Developer Tools.
- Select the Network Tab
- Visit the page you’re going to do your search from
- Click the Clear button up top - 🚫 - then submit your form
- The Network tab will fill with activity!
- Find the thing in the Network tab that has the same name as your webpage. Click it.
- On the right-hand side you get a new pane. If you scroll near to the bottom it shows you Request Headers.
You just need to convert these into a dictionary, and send them along with your request.
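Typing the dictionary out by hand gets old fast, so here is one possible shortcut for pasted "Name:value" lines. The helper name headers_to_dict is hypothetical, and skipping Content-Length is my own choice: requests works that value out from the body it actually sends, so a copied number is at best redundant.

```python
def headers_to_dict(pasted, skip=("Content-Length",)):
    """Turn 'Name:value' lines copied from Request Headers into a dict."""
    headers = {}
    for line in pasted.strip().splitlines():
        # Split on the first colon only: values like URLs contain colons too
        name, sep, value = line.partition(":")
        if not sep or name.strip() in skip:
            continue
        headers[name.strip()] = value.strip()
    return headers

pasted = """
Host:www2.dca.ca.gov
Origin:http://www2.dca.ca.gov
Content-Length:156
Content-Type:application/x-www-form-urlencoded
"""

print(headers_to_dict(pasted))
```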
Sending the appropriate headers
I just checked my results for the Engineers bit. It has a lot of headers!
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:156
Content-Type:application/x-www-form-urlencoded
Host:www2.dca.ca.gov
Origin:http://www2.dca.ca.gov
Referer:http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup?p_qte_code=ENG&p_qte_pgm_code=7500
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
I’m usually too lazy to copy all of them, so I only take the ones I think I need. If you’d like to copy them all, though, it’s probably easier than the weird curl thing I talked about in class.
Let’s make a request using both headers and POST data.
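Roughly, that combined request could look like the sketch below, assuming the requests library is installed. The headers are a hand-picked subset of the list above; the field names in form_data are hypothetical placeholders, so swap in the ones from your own Form Data pane.

```python
URL = "http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery"

# A hand-picked subset of the request headers listed above
headers = {
    "Origin": "http://www2.dca.ca.gov",
    "Referer": (
        "http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup"
        "?p_qte_code=ENG&p_qte_pgm_code=7500"
    ),
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    ),
}

# Hypothetical field names: use whatever your Form Data pane listed
form_data = {
    "last_name": "smith",
    "county": "LOS ANGELES",
}

def search(data, headers):
    import requests  # third-party: pip install requests
    # data= is the form body, headers= is how we introduce ourselves
    return requests.post(URL, data=data, headers=headers, timeout=30)

# Uncomment to try it against the live site:
# response = search(form_data, headers)
# print(response.text)
```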
NameTypeNumberStatusAddressCityZipCounty
SMITH A M OM2554CANCELLED2245 ASHBOURNE DRSAN MARINO91108LOS ANGELES
SMITH ALLEN EL2352CANCELLED713 N CALIFORNIA STBURBANK91505LOS ANGELES
SMITH ALVIN JE490DECEASED5004 RAMSDELL AVELA CRESCENTA91214LOS ANGELES
SMITH ARTHUR KERMITCS3124CANCELLED28803 CEDARBLUFF DRRANCHO PALOS VERDES90275LOS ANGELES
Perfect! Now that you know how .post requests, form data and headers work, you’ll be able to scrape a lot of very difficult sites.