Normal scraping
By now we all know how to scrape normal sites (kind of, mostly, somewhat).
Jo Cox, Member of British Parliament, Is Killed in Attack
Common Sense: Microsoft-LinkedIn Deal Ignites Twitter Speculation
Op-Ed Contributor: Let Me Compete in Rio
In Orlando, a Son of a Muslim Immigrant Rushed to Heal Pain Caused by Another
The First Big Company to Say It’s Serving the Legal Marijuana Trade? Microsoft.
Review: In ‘Finding Dory,’ a Forgetful Fish and a Warm Celebration of Differences
News Analysis: Why the Orlando Shooting Is Unlikely to Lead to Major New Gun Laws
Noted: Two Men Kiss, an Act of Love and Activism
Review: In ‘Finding Dory,’ a Forgetful Fish and a Warm Celebration of Differences
De Blasio’s $325 Million Ferry Push: Rides to 5 Boroughs, at Subway Price
Sports of The Times: In Russian Doping Scandal, Time for a Punishment to Fit the Crime
Is ‘Shrew’ Worth Taming? Female Directors Keep Trying
Fighting ISIS With an Algorithm, Physicists Try to Predict Attacks
Melvin Dwork, Once Cast From Navy for Being Gay, Dies at 94
Critic's Notebook: After Orlando Shooting, Talk Show Hosts Suggest Talk Is Not Enough
C.D.C. Reports 234 Pregnant Women in U.S. With Zika
4 Roller Coasters That Put the Theme in Theme Park
Books of The Times: Review: Annie Proulx’s ‘Barkskins’ Is an Epic Tale of Logging and Doom
Race/Related: Moving to Make Amends, Georgetown President Meets With Descendant of Slaves
Hungry City: The Secret to District Saigon’s Broths: Slow Cooking
Judith Shulevitz: How to Fix Feminism
The Hunt: In Brooklyn, a Home and Home Brewery
Public Health: Soda Tax Passes in Philadelphia. Advocates Ask: Who’s Next?
Feature: The Parasite Underground
Wheels: Skeptics of Self-Driving Cars Span Generations
When the Family Business is a Gallery
Looking Back: 1948-2016 | A Times Art Treasure Comes to an Omaha Library
But… forms!
So the issue is that sometimes you need to submit forms on a web site. Why? Well, let’s look at an example.
This example is going to come from Dan Nguyen’s incredible Search, Script, Scrape, 101 scraping exercises.
The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient
Related URL: http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm
When you visit that URL, you’re going to type in “Fentanyl,” and select “Disc (Discontinued Drug Products).” Then you’ll hit search.
Hooray, results! Now look at the URL.
http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm
Does anything about that URL say “Fentanyl” or “Discontinued Drug Products”? Nope! And if you straight up visit it (might need to open an Incognito window) you’ll end up being redirected back to a different page.
This means requests.get just isn’t going to cut it. If you tell requests to download that page it’s going to get a whooole lot of uselessness.
Be my guest if you want to try it!
Submitting forms with requests
There are two kinds of forms, GET forms and POST forms (…this is 99% true).
GET forms
A GET form is one where you can see parameters in the URL. For example, if you searched for images of animals surfing on Bing you’d end up here:
http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2
It has a couple parameters - q and FORM. FORM is some sort of weird analytics thing that doesn’t affect the page, but q is definitely the term you’re searching for. With a GET form, the data you put into the form is kept in the URL.
Just for kicks, if we looked at the HTML for a GET form it might look like this:
<form method="GET" action="/search">
<input type="text" name="q">
</form>
It might also leave the whole method part off, too - GET is the default.
A fun part about GET forms is that you can share the URL to share the results. If you don’t believe me, visit http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2 to see animals surfing.
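If you want to convince yourself that the form data really does live in the URL, here's a small sketch using Python's standard urllib.parse (not requests, so it never touches the network). It pulls the parameters out of the Bing URL above, then rebuilds the same URL from a plain dictionary.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2"

# Everything after the "?" is the form data
query = urlsplit(url).query
params = parse_qs(query)
print(params)

# Going the other way: build the URL from the form fields
rebuilt = "http://www.bing.com/images/search?" + urlencode(
    {"q": "animals surfing", "FORM": "HDRSC2"}
)
print(rebuilt == url)
```

Note that the space in "animals surfing" turns into a + once it's in the URL.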
GET is how most web pages work. You’ve used it every time you invoke the unholy powers of requests.get.
requests.get("https://api.spotify.com/v1/search?query=90s&offset=20&limit=20&type=playlist")
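You don't have to glue the query string together by hand, either: requests.get accepts a params dictionary and builds the URL for you. The sketch below only prepares the request (nothing is sent) so you can see the URL requests would end up using. The dictionary is just the Spotify query above split into pieces.

```python
import requests

params = {"query": "90s", "offset": 20, "limit": 20, "type": "playlist"}

# Prepare the request without sending it, just to look at the final URL
prepared = requests.Request(
    "GET", "https://api.spotify.com/v1/search", params=params
).prepare()
print(prepared.url)

# Sending it for real would be:
# response = requests.get("https://api.spotify.com/v1/search", params=params)
```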
GET is nice. GET is easy. But GET is not all there is.
POST forms
The other kind of forms are POST forms. POST forms are not friendly!
Unlike GET forms, you can’t share the URL to get the same information. The parameters - the q for your search, for example - aren’t in the URL, they’re hidden in the actual request.
What this means is that every time you request something from a POST-based form, you have to pretend you filled out the form and clicked the button.
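Here's a rough sketch of what "hidden in the actual request" means. It uses the standard library's urllib.request rather than requests, purely so that nothing is actually sent. The URL http://example.com/search and the q field are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form with one text field named "q"
form_data = urlencode({"q": "animals surfing"}).encode("ascii")

# Passing data= turns this into a POST request
post_request = Request("http://example.com/search", data=form_data)

print(post_request.get_method())  # the HTTP method
print(post_request.full_url)      # no parameters in the URL
print(post_request.data)          # the form data travels in the body
```

Same search term as before, but this time the URL says nothing about it.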
Grabbing the parameters
First we need to find out what parameters we’re going to hunt down. To do this, first make your way to the form, then get prepared.
1) In Chrome, View > Developer > Developer Tools
2) Click the Network tab
3) Fill the form out, and submit it
4) Scroll up to the top of the Network pane, select the segment of the URL you’re at (I’m at tempai.cfm)
5) Click it
6) Select Headers on the right
7) Scroll down until you see Form Data
Okay, that seemed like a lot of work, but I promise it was actually simple and easy and you’re living life in a grand grand way. Two parameters are listed for the search we’re doing:
Generic_Name:Fentanyl
table1:OB_Disc
Seems simple enough! Now let’s put them to work.
Submitting POST forms with requests.post
This is going to be so easy you might have a heart attack as a result of your body being so amazed that it doesn’t have to do anything strenuous. All you have to do is
requests.post("http://whatever.com/url/to/something", data={ "param1": "val1", "param2": "val2" })
and treat it like a normal response! Here, I’ll prove it.
Fentanyl Citrate And Droperidol was produced by the company Astrazeneca
Fentanyl Citrate And Droperidol was produced by the company Astrazeneca
Fentanyl Citrate And Droperidol was produced by the company Astrazeneca
Fentanyl Citrate And Droperidol was produced by the company Hospira
Innovar was produced by the company Akorn Mfg
Fentanyl-100 was produced by the company Noven
Fentanyl-25 was produced by the company Noven
Fentanyl-50 was produced by the company Noven
Fentanyl-75 was produced by the company Noven
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Fentanyl Citrate was produced by the company Abbott
Fentanyl Citrate was produced by the company Abbott
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate Preservative Free was produced by the company Watson Labs Inc
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentora was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
It’s magic, I swear!
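For reference, the request behind a list like that could be put together along the lines of the sketch below. It assumes the form posts to the tempai.cfm URL seen in the Network tab, with the two Form Data fields found earlier (the FDA site may well have changed since). It only prepares the request so you can inspect it. The commented line shows how you'd send it, and the parsing that turns the response into the sentences above isn't shown.

```python
import requests

url = "http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm"
form_data = {"Generic_Name": "Fentanyl", "table1": "OB_Disc"}

# Build the POST request without sending it
prepared = requests.Request("POST", url, data=form_data).prepare()
print(prepared.method)
print(prepared.url)
print(prepared.body)

# To actually submit the form:
# response = requests.post(url, data=form_data)
```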
But then…
Sometimes requests.get just isn’t enough. Why? It mostly has to do with JavaScript or complicated forms - when a site reacts and changes without loading a new page, you can’t use requests for that (think “Load more” buttons on Instagram).
For those sites you need Selenium! Selenium = you put your browser on autopilot. As in, literally, it takes control over your browser. There are “headless” versions that use invisible browsers but if you don’t like to install a bunch of stuff, the normal version is usually fine.
Installing Selenium
Selenium isn’t just a Python package, but you’ll need to install the Python bindings in order to have Python talk to Selenium.
pip install selenium
You’ll also need the Firefox browser, since that’s the browser we’re going to be controlling.
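Selenium is built on WebDrivers, which are libraries that let you… drive a browser. Older versions of Selenium came with a Firefox WebDriver built in, while newer ones drive Firefox through a separate program called geckodriver (recent Selenium releases can usually fetch it for you). Safari/Chrome/etc each have their own driver, too.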
Using Selenium
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<html xmlns="http://www.w3.org/1999/xhtml"><head>\n\t\t<title>SearchResults</title>\n\t\t<link href="stylesheets/elicense2000.css" rel="stylesheet" />\n\t<script language="javascript">\nfunction SetValue(agencyID,licenseID)\n{\n\tvar href = window.opener.location.href;\n\tvar form = window.opener.document.forms[0];\n\tform.elements["license_id"].value = licenseID;\n\tform.elements["tempAgencyID"].value = agencyID;\n//\tvar action = href.substring(0,href.indexOf("?"));\n//\tform.action = action + \'?action=lu&agency_id=\' + agencyID + \'&license_id=\' + licenseID;\n\tform.submit();\n\twindow.close();\n}\n\t</script></head>\n\t\n\t<body>\n\t\t<form id="TheForm" action="SearchResults.aspx" method="post" name="TheForm">\n<input type="hidden" value="" id="__EVENTTARGET" name="__EVENTTARGET" />\n<input type="hidden" value="" id="__EVENTARGUMENT" name="__EVENTARGUMENT" />\n<input type="hidden" value="/wEPDwUJNzM2NTgwNzkyD2QWAgIBD2QWAmYPZBYCAgEPZBYCAgEPZBYCAgEPZBYCAgMPZBYCZg9kFgJmDxQrAAsPFhYeCVBhZ2VDb3VudAIBHhNBdXRvR2VuZXJhdGVDb2x1bW5zaB4IUGFnZVNpemUCKB4TVXNlQWNjZXNzaWJsZUhlYWRlcmceFV8hRGF0YVNvdXJjZUl0ZW1Db3VudAIEHghEYXRhS2V5cxYAHhBDdXJyZW50U29ydE9yZGVyBQ1mdWxsX25hbWUgQVNDHgxBbGxvd1NvcnRpbmdnHglmdWxsX25hbWUFBERFU0MeC18hSXRlbUNvdW50AgQeC0FsbG93UGFnaW5nZxYEHgVXaWR0aAUDOTYlHgVjbGFzcwULbW9kdWxlbGFiZWxkFgweCFBvc2l0aW9uCyonU3lzdGVtLldlYi5VSS5XZWJDb250cm9scy5QYWdlclBvc2l0aW9uAB4ETW9kZQsqI1N5c3RlbS5XZWIuVUkuV2ViQ29udHJvbHMuUGFnZXJNb2RlAR4PUGFnZUJ1dHRvbkNvdW50AigeCUJhY2tDb2xvcgnIoIP/HglGb3JlQ29sb3IKpAEeBF8hU0ICjICgBhYEHxAJyKCD/x8SAghkZGRkZGRkZGSIEw/cNbJxXUvombaLhwB7OaELE3j/8735+JXDjjDl1w==" id="__VIEWSTATE" name="__VIEWSTATE" />\n\n<script type="text/javascript">\n<!--\nvar theForm = document.forms[\'TheForm\'];\nif (!theForm) {\n theForm = document.TheForm;\n}\nfunction __doPostBack(eventTarget, eventArgument) {\n if (!theForm.onsubmit || (theForm.onsubmit() != false)) {\n theForm.__EVENTTARGET.value = eventTarget;\n 
theForm.__EVENTARGUMENT.value = eventArgument;\n theForm.submit();\n }\n}\n// -->\n</script>\n\n\n<input type="hidden" value="/wEWAwLU8a+SDwLHlavPBALHlb+qDcvadpJP4En4SIP0Wcqc7vac2W7GvimAE4UBJUh9MPev" id="__EVENTVALIDATION" name="__EVENTVALIDATION" /><table width="100%" cellspacing="0" cellpadding="0" border="0" align="left" class="layout" role="presentation">\n\t<tbody><tr>\n\t\t<td width="100%" height="10" colspan="2" rowspan="1"><table width="100%" cellspacing="0" align="center" role="presentation">\n\t\t\t<tbody><tr>\n\t\t\t\t<td height="10px" colspan="8" rowspan="1" style="background-color:#29498C"><span id="blue_bar"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td align="right" colspan="7" rowspan="1"><span id="top_row_links"><a target="_blank" class="bannerLinkTop" href="http://311.dc.gov/">311 Online</a> <a target="_blank" class="bannerLinkTop" href="http://dc.gov/directory">Agency Directory</a> <a target="_blank" class="bannerLinkTop" href="http://dc.gov/online-services">Online Services</a> <a target="_blank" class="bannerLinkTop" href="http://dc.gov/page/dcgov-accessibility-policy">Accessibility</a></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="8" rowspan="1"><span cellpadding="0" cellspacing="0" id="banner_image"><img alt="DC Logo" src="images/DC_pictures/dcgov_logo.jpg" /></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="8" rowspan="1"><span style="font-weight:bold; font-size:large" id="banner_label">Department of Health</span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td height="10px" colspan="8" rowspan="1"><span id="spacer_1"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td height="5px" colspan="8" rowspan="1" style="background-color: #CECFCE"><span id="gray_bar"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td height="10px" colspan="8" rowspan="1"><span id="spacer_2"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_1"><a target="_blank" class="bannerBarLink" 
href="http://doh.dc.gov/">DOH Home</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_2"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/services">Services</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_3"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/service/health-professionals">Health Professionals</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" style="width:150px;" class="bannerbar"><span id="link_4"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/service/infants-children-teens-and-school-health">Infants, Children and Teens</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_5"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/HIV/AIDS%20Services">HIV/AIDS</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_6"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/page/resources-01">Resources</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_7"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/service/vital-records">Vital Records</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_8"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/page/about-doh">About DOH</a></span></td>\n\t\t\t</tr>\n\t\t</tbody></table>\n\t\t</td>\n\t</tr>\n\t<tr>\n\t\t<td></td>\n\t\t<td valign="top" align="left" colspan="1" rowspan="1" cellpadding="0" cellspacing="0"><table width="100%" cellspacing="0" cellpadding="0" border="0" role="presentation">\n\t\t</table>\n\t\t<table width="90%" cellspacing="0" cellpadding="0" border="0" align="center" role="presentation">\n\t\t\t<tbody><tr>\n\t\t\t\t<td width="10" colspan="1" rowspan="20"><span id="col_spacer"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td valign="top" align="left" colspan="1" rowspan="0"><span 
class="moduleLabel" id="Title">Search Results</span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="1" rowspan="1"><span class="instructions" id="Back_to_Search">For a more detailed view of a licensee\'s background, select the licensee name from the alphabetical list below. Click the numbers below the grid to see additional pages of licensees. To return to the Search page, use the Search Again button below. (Do not use the browser Back key.)<br /><img width="650" border="0" height="10" src="images/dot.gif" alt="" /><br /><input type="reset" onclick="javascript:document.location.href=\'Search.aspx\' " maxlength="9" id="my_button" value="Search Again" name="my_button" /><br /><img width="650" border="0" height="10" src="images/dot.gif" alt="" /></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="2" rowspan="0"><table width="96%" cellspacing="0" border="1" style="border-collapse:collapse;" id="datagrid_results" class="modulelabel" rules="all">\n\t\t\t\t\t<tbody><tr style="background-color:#83A0C8;">\n\t\t\t\t\t\t<th scope="col"><a href="javascript:__doPostBack(\'datagrid_results$_ctl2$_ctl0\',\'\')"><font size="2" color="white" face="Arial"><b>Full Name</b></font></a></th><th scope="col"><a href="javascript:__doPostBack(\'datagrid_results$_ctl2$_ctl1\',\'\')"><font size="2" color="white" face="Arial"><b>Number</b></font></a></th><th scope="col"><font size="2" color="white" face="Arial"><b>Profession</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>Type</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>Status</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>City</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>State</b></font></th>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl3_result"><a target="_blank" href="Details.aspx?result=e37a7c27-0910-4478-9e63-4af93e5a4117" id="datagrid_results__ctl3_hl">KATHERINE A. 
SALE</a></td><td><span>AC30086</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Expired</span></td><td><span>ARNOLD</span></td><td><span>MD</span></td>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl4_result"><a target="_blank" href="Details.aspx?result=39420388-2cd0-4c88-b078-942713c2c4b5" id="datagrid_results__ctl4_hl">KATHERINE F SEARS</a></td><td><span>AC30023</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Active</span></td><td><span>Unknown</span></td><td><span>NA</span></td>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl5_result"><a target="_blank" href="Details.aspx?result=fdcbea2e-dd68-44c1-8a1e-6bad4df28699" id="datagrid_results__ctl5_hl">KATHERINE J KAPUSNIK</a></td><td><span>AC500105</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Expired</span></td><td><span>Unknown</span></td><td><span>NA</span></td>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl6_result"><a target="_blank" href="Details.aspx?result=02928cdf-eed1-49ba-b7b5-d67e8c9bfb23" id="datagrid_results__ctl6_hl">KATHERINE S. 
YONKERS</a></td><td><span>AC30057</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Active</span></td><td><span>WASHINGTON</span></td><td><span>DC</span></td>\n\t\t\t\t\t</tr><tr style="color:White;background-color:#83A0C8;">\n\t\t\t\t\t\t<td colspan="7"><span>1</span></td>\n\t\t\t\t\t</tr>\n\t\t\t\t</tbody></table></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td></td>\n\t\t\t\t<td colspan="1" rowspan="1"><span id="footer_spacer"><img width="10" border="0" height="5" src="images/dot.gif" alt="" /></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td align="center" colspan="4" rowspan="1"><span id="footer"> <br /><br /><hr /> <a target="_blank" href="http://doh.dc.gov/page/dcgov-accessibility-policy">Accessibility</a>\xa0\xa0\xa0<a target="_blank" href="http://doh.dc.gov/page/privacy-and-security">Privacy and Security</a>\xa0\xa0\xa0<a target="_blank" href="http://doh.dc.gov/page/terms-and-conditions-use">Terms and Conditions</a>\xa0\xa0\xa0<a target="_blank" href="http://dc.gov/page/about-district-government-website">About DC.Gov</a>\n </span></td>\n\t\t\t</tr>\n\t\t</tbody></table>\n\t\t</td>\n\t</tr>\n</tbody></table>\n</form>\n\t\n\n</body></html>'
[{'city': 'ARNOLD',
'name': 'KATHERINE A. SALE',
'number': 'AC30086',
'profession': 'MEDICINE',
'state': 'MD',
'status': 'Expired',
'type': 'ACUPUNCTURIST'},
{'city': 'Unknown',
'name': 'KATHERINE F SEARS',
'number': 'AC30023',
'profession': 'MEDICINE',
'state': 'NA',
'status': 'Active',
'type': 'ACUPUNCTURIST'},
{'city': 'Unknown',
'name': 'KATHERINE J KAPUSNIK',
'number': 'AC500105',
'profession': 'MEDICINE',
'state': 'NA',
'status': 'Expired',
'type': 'ACUPUNCTURIST'},
{'city': 'WASHINGTON',
'name': 'KATHERINE S. YONKERS',
'number': 'AC30057',
'profession': 'MEDICINE',
'state': 'DC',
'status': 'Active',
'type': 'ACUPUNCTURIST'}]
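One way to get from the page source to a list of dictionaries like that is sketched below. It is not necessarily how the list above was produced: it uses the standard library's html.parser and runs on a trimmed copy of two rows from the datagrid_results table, so the class name and key names are my own. The full page has nested layout tables, so on the real thing you'd want to isolate that one table first.

```python
from html.parser import HTMLParser

# Two data rows trimmed from the results table in the page source above
SAMPLE_HTML = """
<table id="datagrid_results">
<tr><th>Full Name</th><th>Number</th><th>Profession</th><th>Type</th>
<th>Status</th><th>City</th><th>State</th></tr>
<tr><td><a href="Details.aspx?result=e37a7c27-0910-4478-9e63-4af93e5a4117">
KATHERINE A. SALE</a></td><td><span>AC30086</span></td>
<td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td>
<td><span>Expired</span></td><td><span>ARNOLD</span></td>
<td><span>MD</span></td></tr>
<tr><td><a href="Details.aspx?result=02928cdf-eed1-49ba-b7b5-d67e8c9bfb23">
KATHERINE S. YONKERS</a></td><td><span>AC30057</span></td>
<td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td>
<td><span>Active</span></td><td><span>WASHINGTON</span></td>
<td><span>DC</span></td></tr>
<tr><td colspan="7"><span>1</span></td></tr>
</table>
"""

# Column order as it appears in the table header
KEYS = ["name", "number", "profession", "type", "status", "city", "state"]


class ResultsTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows, each a list of cell strings
        self.current_row = None  # cells of the row being read, if any
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag == "td" and self.current_row is not None:
            self.in_cell = True
            self.current_row.append("")

    def handle_data(self, data):
        if self.in_cell and self.current_row is not None:
            self.current_row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current_row is not None:
            self.rows.append([cell.strip() for cell in self.current_row])
            self.current_row = None


parser = ResultsTableParser()
parser.feed(SAMPLE_HTML)

# Keep only full rows: the header row has no <td> cells, the pager row has one
licensees = [dict(zip(KEYS, row)) for row in parser.rows if len(row) == len(KEYS)]
print(licensees)
```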
Closing the webdriver
Once we have all the data we want, we can close our webdriver.
Saving our data
Now what are we going to do with our list of dictionaries? We could use a csv.DictWriter like in this post, but it’s actually quicker to do it with pandas.
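For comparison, the csv.DictWriter route might look something like this sketch. The filename licensees.csv is made up, and only two of the rows are included to keep it short.

```python
import csv

licensees = [
    {"city": "ARNOLD", "name": "KATHERINE A. SALE", "number": "AC30086",
     "profession": "MEDICINE", "state": "MD", "status": "Expired",
     "type": "ACUPUNCTURIST"},
    {"city": "WASHINGTON", "name": "KATHERINE S. YONKERS", "number": "AC30057",
     "profession": "MEDICINE", "state": "DC", "status": "Active",
     "type": "ACUPUNCTURIST"},
]

# newline="" lets the csv module handle line endings itself
with open("licensees.csv", "w", newline="") as f:
    # Use the keys of the first dictionary as the column headers
    writer = csv.DictWriter(f, fieldnames=list(licensees[0].keys()))
    writer.writeheader()
    writer.writerows(licensees)
```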
Step One: import pandas
Step Two: Turn list into a DataFrame
| | city | name | number | profession | state | status | type |
|---|---|---|---|---|---|---|---|
| 0 | ARNOLD | KATHERINE A. SALE | AC30086 | MEDICINE | MD | Expired | ACUPUNCTURIST |
| 1 | Unknown | KATHERINE F SEARS | AC30023 | MEDICINE | NA | Active | ACUPUNCTURIST |
| 2 | Unknown | KATHERINE J KAPUSNIK | AC500105 | MEDICINE | NA | Expired | ACUPUNCTURIST |
| 3 | WASHINGTON | KATHERINE S. YONKERS | AC30057 | MEDICINE | DC | Active | ACUPUNCTURIST |
Step Three: Save it to a CSV
While you’re saving it, set index=False or else it will include 0, 1, 2, etc from the leftmost column (the index, of course).
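Put together, steps one through three might look like the sketch below. The filename licensees.csv is an assumption, and the list is cut down to two of the rows above.

```python
# Step One: import pandas
import pandas as pd

licensees = [
    {"city": "ARNOLD", "name": "KATHERINE A. SALE", "number": "AC30086",
     "profession": "MEDICINE", "state": "MD", "status": "Expired",
     "type": "ACUPUNCTURIST"},
    {"city": "WASHINGTON", "name": "KATHERINE S. YONKERS", "number": "AC30057",
     "profession": "MEDICINE", "state": "DC", "status": "Active",
     "type": "ACUPUNCTURIST"},
]

# Step Two: turn the list of dictionaries into a DataFrame
df = pd.DataFrame(licensees)

# Step Three: save it to a CSV, leaving out the index column
df.to_csv("licensees.csv", index=False)
```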
Step Four: Party down
I don’t have directions for this one