Scraping into pandas

Let’s say we want to save a CSV of data from H&M.

Before we get started we’ll do all the normal imports. Notice we’re also importing pandas! We’re going to use pandas to save our content as a CSV once we’re done scraping. Why else would we scrape anything, if not to save it?

import requests
import pandas as pd
from bs4 import BeautifulSoup

First, we’ll just visit the page as usual. In this case H&M is trying to protect itself from bots, so we’re sending a browser-style User-Agent header to pretend we’re a totally normal human being using Chrome.

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

response = requests.get('https://www2.hm.com/en_us/sale/home/view-all.html', headers=headers)
# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
doc = BeautifulSoup(response.text, 'html.parser')

We then use the inspector to find out that the class of each product name is item-heading and the class of each product price is item-price.

# Using [:3] to only go through the first 3
names = doc.find_all(class_="item-heading")
for name in names[:3]:
    print(name.text.strip())

Knit Throw with Fringe
Patterned Duvet Cover
Cotton Pillowcase

# Using [:3] to only go through the first 3
# (it looks like more because there are 2 prices per item)
prices = doc.find_all(class_="item-price")
for price in prices[:3]:
    print(price.text.strip())

$29.99
$59.99
$44.99
$119.00
$9.99
$17.99

Converting to a list of dictionaries

The problem is that we want to keep the first name attached to the first price, the second name attached to the second price, and the third name attached to the third price. Right now they’re in two separate lists, when what we really want is one list, where each element has a name and a price. Like a list of dictionaries, right?
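To see why two separate lists are fragile, here’s a small sketch. The HTML is made up for illustration (it only borrows the class names from above, it is not H&M’s real markup), and the second product is deliberately missing a price. Pairing names and prices by position quietly goes wrong:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the second product has no price
toy_html = """
<div class="hm-product-item"><h3 class="item-heading">Throw</h3><div class="item-price">$29.99</div></div>
<div class="hm-product-item"><h3 class="item-heading">Duvet</h3></div>
<div class="hm-product-item"><h3 class="item-heading">Pillowcase</h3><div class="item-price">$9.99</div></div>
"""
toy_doc = BeautifulSoup(toy_html, "html.parser")

# Two separate lists, just like names and prices above
toy_names = [n.text.strip() for n in toy_doc.find_all(class_="item-heading")]
toy_prices = [p.text.strip() for p in toy_doc.find_all(class_="item-price")]

# Pair them up by position: first name with first price, and so on
pairs = list(zip(toy_names, toy_prices))
print(pairs)
# → [('Throw', '$29.99'), ('Duvet', '$9.99')]
```

The duvet ends up with the pillowcase’s price, and the pillowcase disappears completely. The real page may never have a gap like this, but we can’t know that ahead of time.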

First, let’s work on building our dictionaries. Instead of selecting all of the names and all of the prices, we need to figure out the container that has the name and the price inside.

Basically “find the thing that surrounds every item”. Now, instead of finding each name or each price or whatever, we’re going to find each one of these blocks.

# Using [:2] to only go through the first 2
items = doc.find_all(class_="hm-product-item")
for item in items[:2]:
    print("----this is an item------")
    print(item.text.strip())

----this is an item------
SAVE AS FAVORITE



Knit Throw with Fringe

$29.99
$59.99



				Dark gray
----this is an item------
SAVE AS FAVORITE



		CLASSIC COLLECTION

Patterned Duvet Cover

$44.99
$119.00



				White/striped

See? It has all of the information inside of it! Name, price, even the collection and the colorways. But we need it organized, not just in a weird random string.

We’re going to change what we do in the loop. Right now we just print out everything inside of the block. Instead, we’re going to just find the name, and then just find the price. It’s just like what we were doing before when we found all of the names, but we’re only looking for the one inside of each block, not across the whole page.

# Using [:5] to only go through the first 5
items = doc.find_all(class_="hm-product-item")
for item in items[:5]:
    print("----this is an item------")
    name = item.find(class_='item-heading').text.strip()
    price = item.find(class_='item-price').text.strip()
    print(name, price)

----this is an item------
Knit Throw with Fringe $29.99
$59.99
----this is an item------
Patterned Duvet Cover $44.99
$119.00
----this is an item------
Cotton Pillowcase $9.99
$17.99
----this is an item------
Pillowcase with Pin-tucks $9.99
$17.99
----this is an item------
Linen-blend Bedspread $54.99
$99.00

Notice we’re doing item.find, not doc.find! Calling .find on an item only searches inside that item, the same way .text only gives you the text inside an element. doc.find would search the whole page and hand back the first match every time.

If that doesn’t make sense, it’s ok to just memorize it! Use .find_all to find the big blocks, then use .find to find the individual pieces inside.
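If you’d like to see the difference instead of memorizing it, here’s a tiny sketch on made-up HTML (again, only the class names come from the real page, everything else is invented). It runs the same loop twice, once searching the whole page and once searching inside each block:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
toy_html = """
<div class="hm-product-item">
  <h3 class="item-heading">Throw</h3>
  <div class="item-price">$29.99</div>
</div>
<div class="hm-product-item">
  <h3 class="item-heading">Duvet</h3>
  <div class="item-price">$44.99</div>
</div>
"""
toy_doc = BeautifulSoup(toy_html, "html.parser")
toy_items = toy_doc.find_all(class_="hm-product-item")

page_level = []   # what we get by searching the whole page each time
block_level = []  # what we get by searching inside each block
for item in toy_items:
    page_level.append(toy_doc.find(class_="item-heading").text.strip())
    block_level.append(item.find(class_="item-heading").text.strip())

print(page_level)
# → ['Throw', 'Throw']
print(block_level)
# → ['Throw', 'Duvet']
```

Searching the whole page gives us the first product over and over. Searching inside the item gives us the name that belongs to that item.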

Now, we’re looking to put together some dictionaries. Each product will be a row in the CSV we want to create. What is each column? Oh, name and price - the same as the things we’re printing out! We’re going to make a dictionary out of them, where each key ends up being a column in our CSV.

# Find each product block
items = doc.find_all(class_="hm-product-item")

# Go through each of the blocks... (well, [:5] means the first 5)
for item in items[:5]:
    print("----this is an item------")

    # Create an empty row for our CSV file 
    row = {}
    
    # Fill in the 'name' and 'price' headers
    row['name'] = item.find(class_='item-heading').text.strip()
    row['price'] = item.find(class_='item-price').text.strip()

    # Print it out to double-check
    print(row)

----this is an item------
{'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'}
----this is an item------
{'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'}
----this is an item------
{'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Linen-blend Bedspread', 'price': '$54.99\n$99.00'}

Now that we’ve got these dictionaries, we need to save them as we go along. Let’s make an empty list, and every time we look at a new product we can save it to the list.

# Find each product block
items = doc.find_all(class_="hm-product-item")

# A list of rows. Each row will be a row in our final CSV
# We start without any!
rows = []

# Go through each of the blocks... (well, [:5] means the first 5)
for item in items[:5]:
    print("----this is an item------")

    # Create an empty row for our CSV file 
    row = {}
    
    # Fill in the 'name' and 'price' headers
    row['name'] = item.find(class_='item-heading').text.strip()
    row['price'] = item.find(class_='item-price').text.strip()

    # Now that we've filled in our row, add it to our list
    rows.append(row)
    
    # Print it out to double-check
    print(row)

print("------")
print("Final list:",rows)

----this is an item------
{'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'}
----this is an item------
{'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'}
----this is an item------
{'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Linen-blend Bedspread', 'price': '$54.99\n$99.00'}
------
Final list: [{'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'}, {'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'}, {'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'}, {'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\n$17.99'}, {'name': 'Linen-blend Bedspread', 'price': '$54.99\n$99.00'}]

Okay, cool, a list of dictionaries. But what are we going to do with it?

Convert it into a dataframe with pandas, of course! Pandas will easily take a list of dictionaries and save it right into a dataframe.

# Find each product block
items = doc.find_all(class_="hm-product-item")

# A list of rows. Each row will be a row in our final CSV
# We start without any!
rows = []

# Go through each of the blocks (all of them this time, no [:5])
for item in items:
    # Create an empty row for our CSV file 
    row = {}
    
    # Fill in the 'name' and 'price' headers
    row['name'] = item.find(class_='item-heading').text.strip()
    row['price'] = item.find(class_='item-price').text.strip()

    # Now that we've filled in our row, add it to our list
    rows.append(row)

df = pd.DataFrame(rows)
df.head()

	name	price
0	Knit Throw with Fringe	$29.99\n$59.99
1	Patterned Duvet Cover	$44.99\n$119.00
2	Cotton Pillowcase	$9.99\n$17.99
3	Pillowcase with Pin-tucks	$9.99\n$17.99
4	Linen-blend Bedspread	$54.99\n$99.00
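You might have noticed each price is really two prices stuck together with a \n. If you want them as separate numeric columns, one possible approach is sketched below. It rebuilds a small dataframe from the first three rows above so it runs on its own. The column names sale_price and original_price are my own invention, and it assumes the first price is the sale price and the second is the original price, which seems likely on a sale page but isn’t something the HTML told us.

```python
import pandas as pd

# The first three rows from the output above
sample_rows = [
    {'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'},
    {'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'},
    {'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'},
]
sample_df = pd.DataFrame(sample_rows)

# Split on the newline into two columns (assumed order: sale, then original)
parts = sample_df['price'].str.split('\n', n=1, expand=True)
sample_df['sale_price'] = parts[0]
sample_df['original_price'] = parts[1]

# Strip the dollar sign (and any thousands commas) and convert to numbers
for col in ['sale_price', 'original_price']:
    cleaned = sample_df[col].str.replace('$', '', regex=False)
    cleaned = cleaned.str.replace(',', '', regex=False)
    sample_df[col] = cleaned.astype(float)

# The combined column isn't needed any more
sample_df = sample_df.drop(columns='price')
print(sample_df)
```

Doing the cleanup in pandas after scraping keeps the scraping loop simple: grab the raw text first, tidy it up later.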

Now we just need to save it to a CSV. Just remember to do index=False so that it gets saved without the weird nameless index column!

df.to_csv("../scraped.csv", index=False)
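If you’re curious what that nameless index column looks like, here’s a quick sketch. When you call to_csv without a filename, pandas hands back the CSV as a string instead of writing a file, so we can compare the header line both ways. The two-row dataframe is just sample data with simplified prices.

```python
import pandas as pd

# Sample data only, with simplified single prices
sample_df = pd.DataFrame([
    {'name': 'Knit Throw with Fringe', 'price': '$29.99'},
    {'name': 'Patterned Duvet Cover', 'price': '$44.99'},
])

# No filename means to_csv returns the CSV text instead of saving it
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

# Compare just the header line of each version
print(with_index.splitlines()[0])
# → ,name,price
print(without_index.splitlines()[0])
# → name,price
```

That leading comma is the index column with no name. If you leave it in, reading the file back with pandas gives you an extra "Unnamed: 0" column.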