Let’s say we want to save a CSV of data from H&M.
Before we get started we’ll do all the normal imports. Notice we’re also importing pandas! We’re going to use pandas to save our content as a CSV once we’re done scraping. Why else would we scrape anything, if not to save it?
First, we’ll just visit the page as usual. In this case H&M is trying to protect itself from bots, so we’re pretending we’re a totally normal human being.
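The original imports and request code are not shown here, so this is only a rough sketch of the idea. It assumes the standard library's urllib plus BeautifulSoup, and both the User-Agent string and the fetch_doc name are made up for illustration.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Hypothetical browser-like header (an assumption, not the original value).
# Any current browser's User-Agent string would do the same job.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}


def fetch_doc(url):
    """Download a page while sending a browser-like User-Agent, then parse it."""
    request = Request(url, headers=HEADERS)
    with urlopen(request) as response:
        html = response.read()
    return BeautifulSoup(html, "html.parser")
```

You would call fetch_doc with the address of the listing page and get back a parsed document to search through.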
We then use the inspector to find out that the class of each product’s name is item-heading and the class of each product’s price is item-price.
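Here is a small sketch of grabbing every name and every price. It runs against a tiny stand-in snippet instead of the live page: only the class names item-heading and item-price come from the inspector, while the tags and the product-item wrapper are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
# The \n inside each price is a real line break, mimicking the two prices per item.
html = """
<div class="product-item">
  <h3 class="item-heading">Knit Throw with Fringe</h3>
  <span class="item-price">$29.99\n$59.99</span>
</div>
<div class="product-item">
  <h3 class="item-heading">Patterned Duvet Cover</h3>
  <span class="item-price">$44.99\n$119.00</span>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Two separate searches give two separate lists
names = doc.find_all(class_="item-heading")
prices = doc.find_all(class_="item-price")

for name in names:
    print(name.text.strip())

for price in prices:
    print(price.text.strip())
```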
Knit Throw with Fringe
Patterned Duvet Cover
Cotton Pillowcase
$29.99
$59.99
$44.99
$119.00
$9.99
$17.99
Converting to a list of dictionaries
The problem is that we want to keep the first name attached to the first price, the second name attached to the second price, and the third name attached to the third price. Right now they’re in two separate lists, when what we really want is one list, where each element has a name and a price. Like a list of dictionaries, right?
First, let’s work on building our dictionaries. Instead of selecting all of the names and all of the prices, we need to figure out the container that has both the name and the price inside.
Basically “find the thing that surrounds every item”. Now, instead of finding each name or each price or whatever, we’re going to find each one of these blocks.
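A sketch of looping over the blocks might look like this. The container's real class name isn't given here, so product-item is a hypothetical stand-in, and the HTML is a made-up miniature of the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the product-item class and the tags are assumptions
html = """
<div class="product-item">
  <button>SAVE AS FAVORITE</button>
  <h3 class="item-heading">Knit Throw with Fringe</h3>
  <span class="item-price">$29.99\n$59.99</span>
</div>
<div class="product-item">
  <button>SAVE AS FAVORITE</button>
  <h3 class="item-heading">Patterned Duvet Cover</h3>
  <span class="item-price">$44.99\n$119.00</span>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Find each block that surrounds one whole product
items = doc.find_all(class_="product-item")

for item in items:
    print("----this is an item------")
    # Everything inside the block, as one big string
    print(item.text.strip())
```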
----this is an item------
SAVE AS FAVORITE
Knit Throw with Fringe
$29.99
$59.99
Dark gray
----this is an item------
SAVE AS FAVORITE
CLASSIC COLLECTION
Patterned Duvet Cover
$44.99
$119.00
White/striped
See? It has all of the information inside of it! Name, price, even the collection and the colorways. But we need it organized, not just in a weird random string.
We’re going to change what we do in the loop. Right now we just print out everything inside of the block. Instead, we’re going to just find the name, and then just find the price. It’s just like what we were doing before when we found all of the names, but we’re only looking for the one inside of each block, not across the whole page.
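Roughly, the new loop could look like the sketch below. It uses a made-up miniature of the page, and the product-item container class is an assumption; item-heading and item-price are the classes found with the inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the product-item class and the tags are assumptions
html = """
<div class="product-item">
  <h3 class="item-heading">Knit Throw with Fringe</h3>
  <span class="item-price">$29.99\n$59.99</span>
</div>
<div class="product-item">
  <h3 class="item-heading">Patterned Duvet Cover</h3>
  <span class="item-price">$44.99\n$119.00</span>
</div>
"""
doc = BeautifulSoup(html, "html.parser")
items = doc.find_all(class_="product-item")

for item in items:
    print("----this is an item------")
    # Search inside this one block only, not the whole page
    name = item.find(class_="item-heading").text.strip()
    price = item.find(class_="item-price").text.strip()
    print(name, price)
```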
----this is an item------
Knit Throw with Fringe $29.99
$59.99
----this is an item------
Patterned Duvet Cover $44.99
$119.00
----this is an item------
Cotton Pillowcase $9.99
$17.99
----this is an item------
Pillowcase with Pin-tucks $9.99
$17.99
----this is an item------
Linen-blend Bedspread $54.99
$99.00
Notice we’re doing item.find, not doc.find! Just like .text gets the text of one particular element, .find on an element will only find the pieces inside of it.
If that doesn’t make sense, it’s ok to just memorize it! Use .find_all to find the big blocks, then use .find to find the individual pieces inside.
Now, we’re looking to put together some dictionaries. Each product will be a row in the CSV we want to create. What is each column? Oh, name and price - the same as the things we’re printing out! We’re going to make a dictionary out of them, where each key ends up being a column in our CSV.
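One way to sketch the dictionary-building step, again on a made-up miniature of the page (the product-item container class is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the product-item class and the tags are assumptions
html = """
<div class="product-item">
  <h3 class="item-heading">Knit Throw with Fringe</h3>
  <span class="item-price">$29.99\n$59.99</span>
</div>
<div class="product-item">
  <h3 class="item-heading">Patterned Duvet Cover</h3>
  <span class="item-price">$44.99\n$119.00</span>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

for item in doc.find_all(class_="product-item"):
    print("----this is an item------")
    # Each key here will end up as a column name in the CSV
    row = {}
    row["name"] = item.find(class_="item-heading").text.strip()
    row["price"] = item.find(class_="item-price").text.strip()
    print(row)
```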
----this is an item------
{'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'}
----this is an item------
{'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'}
----this is an item------
{'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Linen-blend Bedspread', 'price': '$54.99\n$99.00'}
Now that we’ve got these dictionaries, we need to save them as we go along. Let’s make an empty list, and every time we look at a new product we can save it to the list.
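A sketch of collecting the dictionaries as we go, using the same kind of stand-in HTML (the product-item class and the rows variable name are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the product-item class and the tags are assumptions
html = """
<div class="product-item">
  <h3 class="item-heading">Knit Throw with Fringe</h3>
  <span class="item-price">$29.99\n$59.99</span>
</div>
<div class="product-item">
  <h3 class="item-heading">Patterned Duvet Cover</h3>
  <span class="item-price">$44.99\n$119.00</span>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Start with an empty list before the loop begins
rows = []

for item in doc.find_all(class_="product-item"):
    print("----this is an item------")
    row = {
        "name": item.find(class_="item-heading").text.strip(),
        "price": item.find(class_="item-price").text.strip(),
    }
    print(row)
    # Save this product before moving on to the next one
    rows.append(row)

print("------")
print("Final list:", rows)
```

The list has to be created outside the loop. If it were created inside, it would be reset to empty for every product.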
----this is an item------
{'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'}
----this is an item------
{'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'}
----this is an item------
{'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\n$17.99'}
----this is an item------
{'name': 'Linen-blend Bedspread', 'price': '$54.99\n$99.00'}
------
Final list: [{'name': 'Knit Throw with Fringe', 'price': '$29.99\n$59.99'}, {'name': 'Patterned Duvet Cover', 'price': '$44.99\n$119.00'}, {'name': 'Cotton Pillowcase', 'price': '$9.99\n$17.99'}, {'name': 'Pillowcase with Pin-tucks', 'price': '$9.99\n$17.99'}, {'name': 'Linen-blend Bedspread', 'price': '$54.99\n$99.00'}]
Okay, cool, a list of dictionaries. But what are we going to do with it?
Convert it into a dataframe with pandas, of course! Pandas will easily take a list of dictionaries and save it right into a dataframe.
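As a quick sketch, here is that conversion with the first three dictionaries typed in by hand so the example runs on its own (the rows and df names are just illustrative):

```python
import pandas as pd

# A few of the scraped dictionaries, copied in so this runs without scraping
rows = [
    {"name": "Knit Throw with Fringe", "price": "$29.99\n$59.99"},
    {"name": "Patterned Duvet Cover", "price": "$44.99\n$119.00"},
    {"name": "Cotton Pillowcase", "price": "$9.99\n$17.99"},
]

# Each dictionary becomes a row, each key becomes a column
df = pd.DataFrame(rows)
print(df)
```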
| | name | price |
|---|---|---|
| 0 | Knit Throw with Fringe | $29.99\n$59.99 |
| 1 | Patterned Duvet Cover | $44.99\n$119.00 |
| 2 | Cotton Pillowcase | $9.99\n$17.99 |
| 3 | Pillowcase with Pin-tucks | $9.99\n$17.99 |
| 4 | Linen-blend Bedspread | $54.99\n$99.00 |
Now we just need to save it to a CSV. Just remember to do index=False so that it gets saved without the weird nameless index column!
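A minimal sketch of the save step. The filename products.csv is a placeholder, and the rows are typed in by hand so the example runs on its own.

```python
import pandas as pd

# A couple of the scraped dictionaries, copied in so this runs without scraping
rows = [
    {"name": "Knit Throw with Fringe", "price": "$29.99\n$59.99"},
    {"name": "Patterned Duvet Cover", "price": "$44.99\n$119.00"},
]
df = pd.DataFrame(rows)

# index=False leaves out the nameless 0, 1, 2... column
df.to_csv("products.csv", index=False)
```

Without index=False, reading the file back in would give you an extra column called something like Unnamed: 0.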