{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In the beginning there were GET forms\n", "\n", "When you're searching for [water at Walmart](https://www.walmart.com/search/?query=water&cat_id=0), the URL looks like this:\n", "\n", "```\n", "https://www.walmart.com/search/?query=water&cat_id=0\n", "```\n", "\n", "It's easy to scrape! If you wanted to search for `guns` instead, you'd just change `water` to `guns` in the URL and off you go. This nice way of living is **parameters in the query string**." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sam's Choice Purified Drinking Water, 10 fl oz, 12 pack\n", "Nestlé Pure Life Purified Water 12 x 16.9fl oz (202.8fl oz)\n", "Fiji Natural Artesian Water, 6pk\n", "Gerber Pure Purified Water, 1.0 GAL\n", "ArrowHead 100% Mountain Spring Water 6 x 23.7 fl oz (142.2 fl oz)\n" ] } ], "source": [ "# Get the page\n", "response = requests.get(\"https://www.walmart.com/search/?query=water&cat_id=0\")\n", "doc = BeautifulSoup(response.text, 'html.parser')\n", "\n", "# Grab all of the titles\n", "title_tags = doc.find_all(class_='prod-ProductTitle')\n", "\n", "# Let's print the first 5\n", "for title in title_tags[:5]:\n", " print(title.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*But it isn't always like that.*\n", "\n", "## But then: POST Forms\n", "\n", "But for most forms, though, it isn't that easy. You type in your info, you click \"Search\", and *there's nothing in the URL.* For example, try searching at [California's Engineer License Database](http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup?p_qte_code=ENG&p_qte_pgm_code=7500).\n", "\n", "The URL you end up at is something like `http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery`, which doesn't mean *anything*. No parameters in that query string!\n", "\n", "If you search through the browser you see a lot of table rows, but if you try it in Python it doesn't give you anything." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the page\n", "response = requests.get(\"http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery\")\n", "doc = BeautifulSoup(response.text, 'html.parser')\n", "\n", "# Grab all of the rows\n", "row_tags = doc.find_all('tr')\n", "\n", "# Let's print the first 5\n", "for row in row_tags[:5]:\n", " print(row.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing at all! What did it give us? Let's look at `response.text`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n\\nLicense Holders : \\n\\n\\n

License Holders :

\\n

\\nError!
\\nThe following unhandled error has occurred in the routine WLLQRYNA$LCEV2.ActionQuery:\\n

\\nORA-01403: no data found\\n

\\nPlease contact your support representative.\\n

\\n\\n\\n'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you read closely, **that's an error**. It's because **we didn't send it any search data.**\n", "\n", "> Looking at `response.text` is THE BEST WAY to find out whether your search worked. You can ctrl+f or just visually search for words you know should be on the page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding our form data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we clicked \"Search,\" it also sent the server a bunch of data - all of the options we typed in, or the dropdowns we selected. Here are the steps to find out what data needs to be sent along with your request.\n", "\n", "We're going to use Chrome's **Network tools** to analyze all of the requests our browser sends to the server, then imitate them in Python.\n", "\n", "1. Open up **Developer Tools** in Chrome by selecting `View > Developer > Developer Tools`.\n", "2. Select the **Network Tab**\n", "3. Visit the page you're going to do your search from\n", "4. Click the **Clear** button up top - 🚫 - then submit your form\n", "5. The Network tab will fill with activity!\n", "6. Find the thing in the Network tab that looks like the same name as your webpage. Click it.\n", "7. On the right-hand side you get a new pane. If you scroll allllll the way down it lists **Form Data**.\n", "\n", "This **Form Data** is what we need to send along with our request. We just need to convert it to a dictionary and send it along." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sending data with the form request\n", "\n", "Once we've converted our form data into a dictionary, we need to make sure of two things:\n", "\n", "1. We're using `requests.post` to make our request\n", "2. We're sending the form data with the request\n", "\n", "Normal browser requests are sent as `GET` requests, but these very fancy ones are sent as `POST`. `POST` just means \"hey I'm sending extra data along with this.\"" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NameTypeNumberStatusAddressCityZipCounty\n", "SMITH A M OM2554CANCELLED2245 ASHBOURNE DRSAN MARINO91108LOS ANGELES\n", "SMITH ALLEN EL2352CANCELLED713 N CALIFORNIA STBURBANK91505LOS ANGELES\n", "SMITH ALVIN JE490DECEASED5004 RAMSDELL AVELA CRESCENTA91214LOS ANGELES\n", "SMITH ARTHUR KERMITCS3124CANCELLED28803 CEDARBLUFF DRRANCHO PALOS VERDES90275LOS ANGELES\n" ] } ], "source": [ "data = {\n", " 'P_QTE_CODE': 'ENG',\n", " 'P_QTE_PGM_CODE': '7500',\n", " 'P_LAST_NAME': 'smith',\n", " 'P_FIRST_NAME': '',\n", " 'P_INITIAL': '',\n", " 'P_LICENSE_NUM': '',\n", " 'P_CITY': '',\n", " 'P_COUNTY': 'LOS ANGELES',\n", " 'P_RECORD_SET_SIZE': '50',\n", " 'Z_ACTION': 'Find'\n", "}\n", "\n", "# Get the page\n", "# use .post\n", "# send the data\n", "url = \"http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery\"\n", "response = requests.post(url, data=data)\n", "doc = BeautifulSoup(response.text, 'html.parser')\n", "\n", "# Grab all of the rows\n", "row_tags = doc.find_all('tr')\n", "\n", "# Let's print the first 5\n", "for row in row_tags[:5]:\n", " print(row.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we didn't know if it worked or not, we could also check the response by looking at `response.text`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sending headers with your request\n", "\n", "Sometimes that isn't enough! Some web servers check to make sure you're a real browser, or you came from their site, or other stuff like that.\n", "\n", "**We don't need to do this for the Engineers page, but I'm going to do it anyway.**\n", "\n", "When you send a request, you also send thing called \"Headers.\" You can see the headers inside of the same Network tab part where you found Form Data. It's listed as **Request Headers** - *ignore the response headers*.\n", "\n", "### Pretending to be the browser\n", "\n", "The most common thing you'll need to do is impersonate a browser by sending a `User-Agent` string. If we wanted to visit Columbia's website pretending to be Chrome, we might do this:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "headers = {\n", " 'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'\n", "}\n", "requests.get(\"http://journalism.columbia.edu\", headers=headers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the appropriate headers\n", "\n", "Sometimes pretending to be the browser just isn't enough. If you want to 100% imitate your browser when sending a request, you need to copy aaaaalllll of the headers from the request.\n", "\n", "It's just above the Form Data information, but I'll tell you how to find it again just to be sure:\n", "\n", "1. Open up **Developer Tools** in Chrome by selecting `View > Developer > Developer Tools`.\n", "2. Select the **Network Tab**\n", "3. Visit the page you're going to do your search from\n", "4. Click the **Clear** button up top - 🚫 - then submit your form\n", "5. The Network tab will fill with activity!\n", "6. Find the thing in the Network tab that looks like the same name as your webpage. Click it.\n", "7. On the right-hand side you get a new pane. If you scroll near to the bottom it shows you **Request Headers**.\n", "\n", "You just need to convert these into a dictionary, and send them along with your request.\n", "\n", "### Sending the appropriate headers\n", "\n", "I just checked my results for the Engineers bit. It has a lot of headers!\n", "\n", "```Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\n", "Accept-Encoding:gzip, deflate\n", "Accept-Language:en-US,en;q=0.8\n", "Cache-Control:max-age=0\n", "Connection:keep-alive\n", "Content-Length:156\n", "Content-Type:application/x-www-form-urlencoded\n", "Host:www2.dca.ca.gov\n", "Origin:http://www2.dca.ca.gov\n", "Referer:http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup?p_qte_code=ENG&p_qte_pgm_code=7500\n", "Upgrade-Insecure-Requests:1\n", "User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36```\n", "\n", "I'm usually too lazy to copy all of them so I only take the ones I think I need, but if you'd like to it's probably easier than the weird `curl` thing I talked about in class.\n", "\n", "Let's make a request using both **headers and POST data**." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NameTypeNumberStatusAddressCityZipCounty\n", "SMITH A M OM2554CANCELLED2245 ASHBOURNE DRSAN MARINO91108LOS ANGELES\n", "SMITH ALLEN EL2352CANCELLED713 N CALIFORNIA STBURBANK91505LOS ANGELES\n", "SMITH ALVIN JE490DECEASED5004 RAMSDELL AVELA CRESCENTA91214LOS ANGELES\n", "SMITH ARTHUR KERMITCS3124CANCELLED28803 CEDARBLUFF DRRANCHO PALOS VERDES90275LOS ANGELES\n" ] } ], "source": [ "# Here are all of our headers\n", "headers = {\n", " 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',\n", " 'Accept-Encoding':'gzip, deflate',\n", " 'Accept-Language':'en-US,en;q=0.8',\n", " 'Cache-Control':'max-age=0',\n", " 'Connection':'keep-alive',\n", " 'Content-Length':'156',\n", " 'Content-Type':'application/x-www-form-urlencoded',\n", " 'Host':'www2.dca.ca.gov',\n", " 'Origin':'http://www2.dca.ca.gov',\n", " 'Referer':'http://www2.dca.ca.gov/pls/wllpub/wllqryna$lcev2.startup?p_qte_code=ENG&p_qte_pgm_code=7500',\n", " 'Upgrade-Insecure-Requests':'1',\n", " 'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'\n", "}\n", "\n", "# Here is the form data\n", "data = {\n", " 'P_QTE_CODE': 'ENG',\n", " 'P_QTE_PGM_CODE': '7500',\n", " 'P_LAST_NAME': 'smith',\n", " 'P_FIRST_NAME': '',\n", " 'P_INITIAL': '',\n", " 'P_LICENSE_NUM': '',\n", " 'P_CITY': '',\n", " 'P_COUNTY': 'LOS ANGELES',\n", " 'P_RECORD_SET_SIZE': '50',\n", " 'Z_ACTION': 'Find'\n", "}\n", "\n", "# Get the page\n", "# use .post\n", "# send the data\n", "# send the headers\n", "url = \"http://www2.dca.ca.gov/pls/wllpub/WLLQRYNA$LCEV2.ActionQuery\"\n", "response = requests.post(url, data=data, headers=headers)\n", "doc = BeautifulSoup(response.text, 'html.parser')\n", "\n", "# Grab all of the rows\n", "row_tags = doc.find_all('tr')\n", "\n", "# Let's print the first 5\n", "for row in row_tags[:5]:\n", " print(row.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect! By learning how `.post` requests, form data and headers work, you're now going to be able to scrape a lot of very difficult sites." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }