import requests
from bs4 import BeautifulSoup

Normal scraping

By now we all know how to scrape normal sites (kind of, mostly, somewhat).

# Grab the NYT's homepage
response = requests.get("http://nytimes.com")
soup = BeautifulSoup(response.text, 'html.parser')
# Snag all of the headlines (h3 tags with 'story-heading' class)
headlines = soup.find_all("h3", { 'class': 'story-heading'} )

# Getting the headline text out using list comprehensions
# is a lot more fun but I guess you just learned those
# like a day ago, so we'll go ahead and use a for loop.
# But for the curious:
#   [headline.text.strip() for headline in headlines]

# Print the text of the headlines
for headline in headlines:
    print(headline.text.strip())
Jo Cox, Member of British Parliament, Is Killed in Attack
Common Sense: Microsoft-LinkedIn Deal Ignites Twitter Speculation
Op-Ed Contributor: Let Me Compete in Rio
In Orlando, a Son of a Muslim Immigrant Rushed to Heal Pain Caused by Another
The First Big Company to Say It’s Serving the Legal Marijuana Trade? Microsoft.
Review: In ‘Finding Dory,’ a Forgetful Fish and a Warm Celebration of Differences
News Analysis: Why the Orlando Shooting Is Unlikely to Lead to Major New Gun Laws
Noted: Two Men Kiss, an Act of Love and Activism
Review: In ‘Finding Dory,’ a Forgetful Fish and a Warm Celebration of Differences
De Blasio’s $325 Million Ferry Push: Rides to 5 Boroughs, at Subway Price
Sports of The Times: In Russian Doping Scandal, Time for a Punishment to Fit the Crime
Is ‘Shrew’ Worth Taming? Female Directors Keep Trying
Fighting ISIS With an Algorithm, Physicists Try to Predict Attacks
Melvin Dwork, Once Cast From Navy for Being Gay, Dies at 94
Critic's Notebook: After Orlando Shooting, Talk Show Hosts Suggest Talk Is Not Enough
C.D.C. Reports 234 Pregnant Women in U.S. With Zika
4 Roller Coasters That Put the Theme in Theme Park
Books of The Times: Review: Annie Proulx’s ‘Barkskins’ Is an Epic Tale of Logging and Doom
Race/Related: Moving to Make Amends, Georgetown President Meets With Descendant of Slaves
Hungry City: The Secret to District Saigon’s Broths: Slow Cooking
Judith Shulevitz: How to Fix Feminism
The Hunt: In Brooklyn, a Home and Home Brewery
Public Health: Soda Tax Passes in Philadelphia. Advocates Ask: Who’s Next?
Feature: The Parasite Underground
Wheels: Skeptics of Self-Driving Cars Span Generations
When the Family Business is a Gallery
Looking Back: 1948-2016 | A Times Art Treasure Comes to an Omaha Library

But… forms!

So the issue is that sometimes you need to submit forms on a web site. Why? Well, let’s look at an example.

This example is going to come from Dan Nguyen’s incredible Search, Script, Scrape, 101 scraping exercises.

The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient

Related URL: http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm

When you visit that URL, you’re going to type in “Fentanyl,” and select “Disc (Discontinued Drug Products).” Then you’ll hit search.

Hooray, results! Now look at the URL.

http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm

Does anything about that URL say “Fentanyl” or “Discontinued Drug Products”? Nope! And if you straight up visit it (might need to open an Incognito window) you’ll end up being redirected back to a different page.

This means requests.get just isn’t going to cut it. If you tell requests to download that page it’s going to get a whooole lot of uselessness.

Be my guest if you want to try it!

Submitting forms with requests

There are two kinds of forms, GET forms and POST forms (…this is 99% true).

GET forms

A GET form is one where you can see parameters in the URL. For example, if you searched for images of animals surfing on Bing you’d end up here:

http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2

It has a couple parameters - q and FORM. FORM is some sort of weird analytics thing that doesn’t affect the page, but q is definitely the term you’re searching for. With a GET form, the data you put into the form is kept in the URL.
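
If you would rather not squint at the URL, here is a minimal sketch that pulls those parameters apart. It uses Python's built-in urllib.parse module, which is an addition of mine rather than something this walkthrough relies on.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2"

# Everything after the "?" is the query string
query_string = urlsplit(url).query

# parse_qs turns it into a dict; each value is a list because
# a parameter is allowed to appear more than once
params = parse_qs(query_string)

for name, values in params.items():
    print(name, "=", values)
```

Notice that the + in the URL comes back as a space: that is just how spaces get written inside a query string.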

Just for kicks, if we looked at the HTML for a GET form it might look like this:

<form method="GET" action="/search">
<input type="text" name="q">
</form>

It might also leave the whole method part off, too - GET is the default.
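
To check which kind of form you are dealing with, you can read the method straight out of the HTML. This is only a sketch: FormInspector is a hypothetical helper built on the standard library's html.parser, the second form (/lookup, drug) is made up to show the missing-method case, and it assumes every input sits inside the most recent form tag.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collects the method, action and input names of each form (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A missing method attribute means the browser falls back to GET
            method = (attrs.get("method") or "GET").upper()
            self.forms.append({"method": method, "action": attrs.get("action"), "fields": []})
        elif tag == "input" and self.forms and attrs.get("name"):
            # Assumption: this input belongs to the most recently opened form
            self.forms[-1]["fields"].append(attrs["name"])


html = """
<form method="GET" action="/search">
<input type="text" name="q">
</form>
<form action="/lookup">
<input type="text" name="drug">
</form>
"""

inspector = FormInspector()
inspector.feed(html)
for form in inspector.forms:
    print(form["method"], form["action"], form["fields"])
```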

A fun part about GET forms is that you can share the URL to share the results. If you don’t believe me, visit http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2 to see animals surfing.

GET is how most web pages work. You’ve used it every time you invoke the unholy powers of requests.get.

requests.get("https://api.spotify.com/v1/search?query=90s&offset=20&limit=20&type=playlist")
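
Long query strings like that one are easy to mistype. As a rough sketch, the same URL can be assembled from a dictionary with the standard library's urlencode; the variable names here are my own.

```python
from urllib.parse import urlencode

base_url = "https://api.spotify.com/v1/search"
params = {"query": "90s", "offset": 20, "limit": 20, "type": "playlist"}

# urlencode joins the pairs with "&" and escapes anything that needs it
full_url = base_url + "?" + urlencode(params)
print(full_url)
```

With requests you can skip the string-building entirely and write requests.get(base_url, params=params), which does the encoding for you.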

GET is nice. GET is easy. But GET is not all there is.

POST forms

The other kind of forms are POST forms. POST forms are not friendly!

Unlike GET forms, you can’t share the URL to get the same information. The parameters - the q for your search, for example - aren’t in the URL, they’re hidden in the actual request.

What this means is that every time you request something from a POST-based form, you have to pretend you filled out the form and clicked the button.
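
To make "hidden in the actual request" a little more concrete, here is a small sketch that builds a POST request without sending it. The URL and the q field are placeholders, and it uses the standard library's urllib.request instead of requests so that nothing touches the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search form: the URL and field name are placeholders
form_data = {"q": "animals surfing"}
body = urlencode(form_data).encode("ascii")

# Building a Request does not send anything; it only describes the request
request = Request("http://example.com/search", data=body, method="POST")

print(request.get_method())  # the HTTP verb
print(request.full_url)      # no parameters in the URL
print(request.data)          # the form fields live in the body
```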

Grabbing the parameters

First we need to find out what parameters we’re going to hunt down. To do this, first make your way to the form, then get prepared.

1) In Chrome, open View > Developer > Developer Tools
2) Click the Network tab
3) Fill the form out, and submit it
4) Scroll up to the top of the Network pane and find the request named after the last segment of the URL you’re at (I’m at tempai.cfm)
5) Click it
6) Select Headers on the right
7) Scroll down until you see Form Data

Okay, that seemed like a lot of work, but I promise it was actually simple and easy and you’re living life in a grand grand way. Two parameters are listed for the search we’re doing:

Generic_Name:Fentanyl
table1:OB_Disc

Seems simple enough! Now let’s put them to work.
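
If a form has more than a couple of fields, retyping them gets old. One possible shortcut, sketched below, is to paste the Form Data text into a string and split each line on its first colon. It assumes Chrome shows one name:value pair per line, as it does above.

```python
# Text copied from the Form Data section of Chrome's Network tab
copied = """
Generic_Name:Fentanyl
table1:OB_Disc
"""

post_params = {}
for line in copied.strip().splitlines():
    # Split on the first colon only, in case a value contains one
    name, _, value = line.partition(":")
    post_params[name.strip()] = value.strip()

print(post_params)
```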

Submitting POST forms with requests.post

This is going to be so easy you might have a heart attack as a result of your body being so amazed that it doesn’t have to do anything strenuous. All you have to do is

requests.post("http://whatever.com/url/to/something", data={"param1": "val1", "param2": "val2"})

and treat it like a normal response! Here, I’ll prove it.

# Just in case you didn't run it up there, I'll import again
import requests

url = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm'
post_params = {'Generic_Name': 'Fentanyl', 'table1': 'OB_Disc'}
response = requests.post(url, data=post_params)
soup = BeautifulSoup(response.text, 'html.parser')
# Using .select instead of .find is a little more
# readable to people from the web dev world, maybe?
rows = soup.select(".actual tbody tr")
for row in rows:
    columns = row.select("td")
    # Let's titlecase them SO THEY AREN'T ALL CAPS
    drug_name = columns[4].text.title()
    company_name = columns[5].text.title()
    print(drug_name, "was produced by the company", company_name)
Fentanyl Citrate And Droperidol was produced by the company Astrazeneca
Fentanyl Citrate And Droperidol was produced by the company Astrazeneca
Fentanyl Citrate And Droperidol was produced by the company Astrazeneca
Fentanyl Citrate And Droperidol was produced by the company Hospira
Innovar was produced by the company Akorn Mfg
Fentanyl-100 was produced by the company Noven
Fentanyl-25 was produced by the company Noven
Fentanyl-50 was produced by the company Noven
Fentanyl-75 was produced by the company Noven
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Onsolis was produced by the company Biodelivery Sci Intl
Fentanyl Citrate was produced by the company Abbott
Fentanyl Citrate was produced by the company Abbott
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate Preservative Free was produced by the company Watson Labs Inc
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentanyl Citrate was produced by the company Watson Labs
Fentora was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
Fentanyl was produced by the company Cephalon
Fentanyl was produced by the company Cephalon

It’s magic, I swear!

But then…

Sometimes requests just isn’t enough, whether you use .get or .post. Why? It mostly has to do with JavaScript or complicated forms - when a site reacts and changes without loading a new page, you can’t use requests for that (think “Load more” buttons on Instagram).

For those sites you need Selenium! Selenium = you put your browser on autopilot. As in, literally, it takes control over your browser. There are “headless” versions that use invisible browsers but if you don’t like to install a bunch of stuff, the normal version is usually fine.

Installing Selenium

Selenium isn’t just a Python package, but you’ll need to install the Python bindings in order to have Python talk to Selenium.

pip install selenium

You’ll also need the Firefox browser, since that’s the browser we’re going to be controlling.

Selenium is built on WebDrivers, which are libraries that let you… drive a browser. Firefox’s driver is called geckodriver; depending on your Selenium version you may need to install it yourself. Safari/Chrome/etc. each have their own driver and can take a little more effort to set up.

Using Selenium

# Imports, of course
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
# Initialize a Firefox webdriver
driver = webdriver.Firefox()
# Grab the web page
driver.get("https://app.hpla.doh.dc.gov/Weblookup/")
# You'll use selenium.webdriver.support.ui.Select
# that we imported above to grab the Select element called
# t_web_lookup__license_type_name, then select Acupuncturists

# We use By.NAME here because we know the name
dropdown = Select(driver.find_element(By.NAME, "t_web_lookup__license_type_name"))
dropdown.select_by_value("ACUPUNCTURIST")
# We use By.ID here because we know the id
text_input = driver.find_element(By.ID, "t_web_lookup__first_name")

# Then we'll fake typing into it
text_input.send_keys("KATHERINE")
# Now we can grab the search button and click it
search_button = driver.find_element(By.ID, "sch_button")
search_button.click()
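
WebDriverWait gets imported above but never used. On a slow connection the results table might not exist yet when you ask for the page source, so here is a hedged sketch of how a wait could look. wait_for_results is a hypothetical helper, the 10-second timeout is an arbitrary pick, and it assumes the results table keeps the id datagrid_results that appears in this page's HTML.

```python
def wait_for_results(driver, timeout=10):
    """Block until the results table exists, then return it (hypothetical helper)."""
    # Imported here so the function can be defined even before Selenium is installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and raises TimeoutException if time runs out
    locator = (By.ID, "datagrid_results")
    return WebDriverWait(driver, timeout).until(EC.presence_of_element_located(locator))
```

You would call wait_for_results(driver) right after search_button.click() and only then read driver.page_source.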
# Instead of using requests.get, we just look at .page_source of the driver
driver.page_source
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<html xmlns="http://www.w3.org/1999/xhtml"><head>\n\t\t<title>SearchResults</title>\n\t\t<link href="stylesheets/elicense2000.css" rel="stylesheet" />\n\t<script language="javascript">\nfunction SetValue(agencyID,licenseID)\n{\n\tvar href = window.opener.location.href;\n\tvar form = window.opener.document.forms[0];\n\tform.elements["license_id"].value = licenseID;\n\tform.elements["tempAgencyID"].value = agencyID;\n//\tvar action = href.substring(0,href.indexOf("?"));\n//\tform.action = action + \'?action=lu&amp;agency_id=\' + agencyID + \'&amp;license_id=\' + licenseID;\n\tform.submit();\n\twindow.close();\n}\n\t</script></head>\n\t\n\t<body>\n\t\t<form id="TheForm" action="SearchResults.aspx" method="post" name="TheForm">\n<input type="hidden" value="" id="__EVENTTARGET" name="__EVENTTARGET" />\n<input type="hidden" value="" id="__EVENTARGUMENT" name="__EVENTARGUMENT" />\n<input type="hidden" value="/wEPDwUJNzM2NTgwNzkyD2QWAgIBD2QWAmYPZBYCAgEPZBYCAgEPZBYCAgEPZBYCAgMPZBYCZg9kFgJmDxQrAAsPFhYeCVBhZ2VDb3VudAIBHhNBdXRvR2VuZXJhdGVDb2x1bW5zaB4IUGFnZVNpemUCKB4TVXNlQWNjZXNzaWJsZUhlYWRlcmceFV8hRGF0YVNvdXJjZUl0ZW1Db3VudAIEHghEYXRhS2V5cxYAHhBDdXJyZW50U29ydE9yZGVyBQ1mdWxsX25hbWUgQVNDHgxBbGxvd1NvcnRpbmdnHglmdWxsX25hbWUFBERFU0MeC18hSXRlbUNvdW50AgQeC0FsbG93UGFnaW5nZxYEHgVXaWR0aAUDOTYlHgVjbGFzcwULbW9kdWxlbGFiZWxkFgweCFBvc2l0aW9uCyonU3lzdGVtLldlYi5VSS5XZWJDb250cm9scy5QYWdlclBvc2l0aW9uAB4ETW9kZQsqI1N5c3RlbS5XZWIuVUkuV2ViQ29udHJvbHMuUGFnZXJNb2RlAR4PUGFnZUJ1dHRvbkNvdW50AigeCUJhY2tDb2xvcgnIoIP/HglGb3JlQ29sb3IKpAEeBF8hU0ICjICgBhYEHxAJyKCD/x8SAghkZGRkZGRkZGSIEw/cNbJxXUvombaLhwB7OaELE3j/8735+JXDjjDl1w==" id="__VIEWSTATE" name="__VIEWSTATE" />\n\n<script type="text/javascript">\n&lt;!--\nvar theForm = document.forms[\'TheForm\'];\nif (!theForm) {\n    theForm = document.TheForm;\n}\nfunction __doPostBack(eventTarget, eventArgument) {\n    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {\n        
theForm.__EVENTTARGET.value = eventTarget;\n        theForm.__EVENTARGUMENT.value = eventArgument;\n        theForm.submit();\n    }\n}\n// --&gt;\n</script>\n\n\n<input type="hidden" value="/wEWAwLU8a+SDwLHlavPBALHlb+qDcvadpJP4En4SIP0Wcqc7vac2W7GvimAE4UBJUh9MPev" id="__EVENTVALIDATION" name="__EVENTVALIDATION" /><table width="100%" cellspacing="0" cellpadding="0" border="0" align="left" class="layout" role="presentation">\n\t<tbody><tr>\n\t\t<td width="100%" height="10" colspan="2" rowspan="1"><table width="100%" cellspacing="0" align="center" role="presentation">\n\t\t\t<tbody><tr>\n\t\t\t\t<td height="10px" colspan="8" rowspan="1" style="background-color:#29498C"><span id="blue_bar"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td align="right" colspan="7" rowspan="1"><span id="top_row_links"><a target="_blank" class="bannerLinkTop" href="http://311.dc.gov/">311 Online</a> <a target="_blank" class="bannerLinkTop" href="http://dc.gov/directory">Agency Directory</a> <a target="_blank" class="bannerLinkTop" href="http://dc.gov/online-services">Online Services</a> <a target="_blank" class="bannerLinkTop" href="http://dc.gov/page/dcgov-accessibility-policy">Accessibility</a></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="8" rowspan="1"><span cellpadding="0" cellspacing="0" id="banner_image"><img alt="DC Logo" src="images/DC_pictures/dcgov_logo.jpg" /></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="8" rowspan="1"><span style="font-weight:bold; font-size:large" id="banner_label">Department of Health</span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td height="10px" colspan="8" rowspan="1"><span id="spacer_1"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td height="5px" colspan="8" rowspan="1" style="background-color: #CECFCE"><span id="gray_bar"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td height="10px" colspan="8" rowspan="1"><span id="spacer_2"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="1" rowspan="1" 
class="bannerbar"><span id="link_1"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/">DOH Home</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_2"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/services">Services</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_3"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/service/health-professionals">Health Professionals</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" style="width:150px;" class="bannerbar"><span id="link_4"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/service/infants-children-teens-and-school-health">Infants, Children and Teens</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_5"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/HIV/AIDS%20Services">HIV/AIDS</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_6"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/page/resources-01">Resources</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_7"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/service/vital-records">Vital Records</a></span></td>\n\t\t\t\t<td colspan="1" rowspan="1" class="bannerbar"><span id="link_8"><a target="_blank" class="bannerBarLink" href="http://doh.dc.gov/page/about-doh">About DOH</a></span></td>\n\t\t\t</tr>\n\t\t</tbody></table>\n\t\t</td>\n\t</tr>\n\t<tr>\n\t\t<td></td>\n\t\t<td valign="top" align="left" colspan="1" rowspan="1" cellpadding="0" cellspacing="0"><table width="100%" cellspacing="0" cellpadding="0" border="0" role="presentation">\n\t\t</table>\n\t\t<table width="90%" cellspacing="0" cellpadding="0" border="0" align="center" role="presentation">\n\t\t\t<tbody><tr>\n\t\t\t\t<td width="10" colspan="1" rowspan="20"><span 
id="col_spacer"></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td valign="top" align="left" colspan="1" rowspan="0"><span class="moduleLabel" id="Title">Search Results</span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="1" rowspan="1"><span class="instructions" id="Back_to_Search">For a more detailed view of a licensee\'s background, select the licensee name from the alphabetical list below. Click the numbers below the grid to see additional pages of licensees. To return to the Search page, use the Search Again button below. (Do not use the browser Back key.)<br /><img width="650" border="0" height="10" src="images/dot.gif" alt="" /><br /><input type="reset" onclick="javascript:document.location.href=\'Search.aspx\' " maxlength="9" id="my_button" value="Search Again" name="my_button" /><br /><img width="650" border="0" height="10" src="images/dot.gif" alt="" /></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td colspan="2" rowspan="0"><table width="96%" cellspacing="0" border="1" style="border-collapse:collapse;" id="datagrid_results" class="modulelabel" rules="all">\n\t\t\t\t\t<tbody><tr style="background-color:#83A0C8;">\n\t\t\t\t\t\t<th scope="col"><a href="javascript:__doPostBack(\'datagrid_results$_ctl2$_ctl0\',\'\')"><font size="2" color="white" face="Arial"><b>Full Name</b></font></a></th><th scope="col"><a href="javascript:__doPostBack(\'datagrid_results$_ctl2$_ctl1\',\'\')"><font size="2" color="white" face="Arial"><b>Number</b></font></a></th><th scope="col"><font size="2" color="white" face="Arial"><b>Profession</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>Type</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>Status</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>City</b></font></th><th scope="col"><font size="2" color="white" face="Arial"><b>State</b></font></th>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl3_result"><a target="_blank" 
href="Details.aspx?result=e37a7c27-0910-4478-9e63-4af93e5a4117" id="datagrid_results__ctl3_hl">KATHERINE A. SALE</a></td><td><span>AC30086</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Expired</span></td><td><span>ARNOLD</span></td><td><span>MD</span></td>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl4_result"><a target="_blank" href="Details.aspx?result=39420388-2cd0-4c88-b078-942713c2c4b5" id="datagrid_results__ctl4_hl">KATHERINE F SEARS</a></td><td><span>AC30023</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Active</span></td><td><span>Unknown</span></td><td><span>NA</span></td>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl5_result"><a target="_blank" href="Details.aspx?result=fdcbea2e-dd68-44c1-8a1e-6bad4df28699" id="datagrid_results__ctl5_hl">KATHERINE J KAPUSNIK</a></td><td><span>AC500105</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Expired</span></td><td><span>Unknown</span></td><td><span>NA</span></td>\n\t\t\t\t\t</tr><tr>\n\t\t\t\t\t\t<td id="datagrid_results__ctl6_result"><a target="_blank" href="Details.aspx?result=02928cdf-eed1-49ba-b7b5-d67e8c9bfb23" id="datagrid_results__ctl6_hl">KATHERINE S. 
YONKERS</a></td><td><span>AC30057</span></td><td><span>MEDICINE</span></td><td><span>ACUPUNCTURIST</span></td><td><span>Active</span></td><td><span>WASHINGTON</span></td><td><span>DC</span></td>\n\t\t\t\t\t</tr><tr style="color:White;background-color:#83A0C8;">\n\t\t\t\t\t\t<td colspan="7"><span>1</span></td>\n\t\t\t\t\t</tr>\n\t\t\t\t</tbody></table></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td></td>\n\t\t\t\t<td colspan="1" rowspan="1"><span id="footer_spacer"><img width="10" border="0" height="5" src="images/dot.gif" alt="" /></span></td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td align="center" colspan="4" rowspan="1"><span id="footer">  <br /><br /><hr /> <a target="_blank" href="http://doh.dc.gov/page/dcgov-accessibility-policy">Accessibility</a>\xa0\xa0\xa0<a target="_blank" href="http://doh.dc.gov/page/privacy-and-security">Privacy and Security</a>\xa0\xa0\xa0<a target="_blank" href="http://doh.dc.gov/page/terms-and-conditions-use">Terms and Conditions</a>\xa0\xa0\xa0<a target="_blank" href="http://dc.gov/page/about-district-government-website">About DC.Gov</a>\n                </span></td>\n\t\t\t</tr>\n\t\t</tbody></table>\n\t\t</td>\n\t</tr>\n</tbody></table>\n</form>\n\t\n\n</body></html>'
# We can feed that into Beautiful Soup
doc = BeautifulSoup(driver.page_source, "html.parser")
# It's a tricky table: grab every row in it that has no class attribute,
# then skip the header and footer rows in the loop below
#rows = doc.select("#datagrid_results tr")
rows = doc.find('table', id='datagrid_results').find_all('tr', attrs={'class': None})

doctors = []
for row in rows:
    # print(row.attrs)
    # Find the ones that don't have 'style' as an attribute
    if 'style' in row.attrs:
        # Skip it! It's a header or footer row
        pass
    else:
        cells = row.find_all("td")
        doctor = {
            'name': cells[0].text,
            'number': cells[1].text,
            'profession': cells[2].text,
            'type': cells[3].text,
            'status': cells[4].text,
            'city': cells[5].text,
            'state': cells[6].text
        }
        doctors.append(doctor)
doctors
[{'city': 'ARNOLD',
  'name': 'KATHERINE A. SALE',
  'number': 'AC30086',
  'profession': 'MEDICINE',
  'state': 'MD',
  'status': 'Expired',
  'type': 'ACUPUNCTURIST'},
 {'city': 'Unknown',
  'name': 'KATHERINE F SEARS',
  'number': 'AC30023',
  'profession': 'MEDICINE',
  'state': 'NA',
  'status': 'Active',
  'type': 'ACUPUNCTURIST'},
 {'city': 'Unknown',
  'name': 'KATHERINE J KAPUSNIK',
  'number': 'AC500105',
  'profession': 'MEDICINE',
  'state': 'NA',
  'status': 'Expired',
  'type': 'ACUPUNCTURIST'},
 {'city': 'WASHINGTON',
  'name': 'KATHERINE S. YONKERS',
  'number': 'AC30057',
  'profession': 'MEDICINE',
  'state': 'DC',
  'status': 'Active',
  'type': 'ACUPUNCTURIST'}]

Closing the webdriver

Once we have all the data we want, we can close our webdriver.

# Quit the webdriver (closes the browser and ends the session)
driver.quit()

Saving our data

Now what are we going to do with our list of dictionaries? We could use a csv.DictWriter like in this post, but it’s actually quicker to do it with pandas.
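
For comparison, this is roughly what the csv.DictWriter route looks like. It is a minimal sketch: sample_doctors is a trimmed-down copy of two of the rows scraped above, and it writes to an in-memory buffer instead of a file so it leaves nothing on disk.

```python
import csv
import io

# Two of the scraped rows, cut down to three columns for brevity
sample_doctors = [
    {"name": "KATHERINE A. SALE", "number": "AC30086", "status": "Expired"},
    {"name": "KATHERINE F SEARS", "number": "AC30023", "status": "Active"},
]

# Swap the StringIO for open("doctors.csv", "w", newline="") to write a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "number", "status"])
writer.writeheader()
writer.writerows(sample_doctors)

csv_text = buffer.getvalue()
print(csv_text)
```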

Step One: import pandas

import pandas as pd

Step Two: Turn list into a DataFrame

doctors_df = pd.DataFrame(doctors)
doctors_df
         city                  name    number  profession  state   status           type
0      ARNOLD     KATHERINE A. SALE   AC30086    MEDICINE     MD  Expired  ACUPUNCTURIST
1     Unknown     KATHERINE F SEARS   AC30023    MEDICINE     NA   Active  ACUPUNCTURIST
2     Unknown  KATHERINE J KAPUSNIK  AC500105    MEDICINE     NA  Expired  ACUPUNCTURIST
3  WASHINGTON  KATHERINE S. YONKERS   AC30057    MEDICINE     DC   Active  ACUPUNCTURIST

Step Three: Save it to a CSV

While you’re saving it, set index=False or else it will include 0, 1, 2, etc. from the leftmost column (the index, of course).

doctors_df.to_csv("../doctors.csv", index=False)

Step Four: Party down

I don’t have directions for this one