Saving data when scraping
When you save scraped data (especially from automated scrapers), there are two approaches:
- The replacement method, which replaces the previous dataset
- The append method, which adds to the previous dataset
The replacement method
This one is the easiest: you just overwrite the old data.
Let's say you have a dataframe called df that is your scraped data, and you want to save it as output.csv. To replace the old data, you just save it as normal.
Done!
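Here is a minimal sketch of that save, run twice to show the overwrite. The column names, the sample rows, and the temporary folder are made up for the demo; in your scraper, df is whatever you scraped and the path is just 'output.csv'.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical scraped data standing in for your real df
df = pd.DataFrame({"title": ["First story", "Second story"], "views": [10, 25]})

# A temporary folder keeps this demo away from your real files
out_path = Path(tempfile.mkdtemp()) / "output.csv"

# First run: creates the file
df.to_csv(out_path, index=False)

# Second run with fresh data: to_csv overwrites by default (mode='w')
new_df = pd.DataFrame({"title": ["Third story"], "views": [7]})
new_df.to_csv(out_path, index=False)

# Only the rows from the second run are left
saved = pd.read_csv(out_path)
print(saved)
```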
The append method
The append method adds to an existing CSV file.
Let's say you have a dataframe called df that is your scraped data, and you want to save it as output.csv. We have two approaches.
Prep work
For both options you'll need to have an existing CSV file to append to. The easiest solution is to just run the script first as a normal df.to_csv, then add in the append code.
Method 1: Literal appending
This approach uses two special options to .to_csv to make sure that the new rows are simply added to the bottom of the file.
import pandas as pd

# Add a timestamp to your new rows of data so you
# know when they were scraped (see below)
df['scraped_at'] = pd.Timestamp.today().to_period('D')
# mode='a' appends the new rows to the end of the file
# header=False leaves out the header row, since the file already has one
df.to_csv('output.csv', mode='a', header=False, index=False)
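If you'd rather not do the separate first run from the prep work, one possible variation is to write the header only when the file doesn't exist yet. This is a sketch, not part of the approach above: append_to_csv is a hypothetical helper name, and the sample rows and temporary folder are only there for the demo.

```python
import tempfile
from pathlib import Path

import pandas as pd


def append_to_csv(df, path):
    """Append df to the CSV at path, writing the header only for a new file."""
    path = Path(path)
    file_exists = path.exists()
    # mode='a' creates the file if it is missing, then appends on later runs
    df.to_csv(path, mode="a", header=not file_exists, index=False)


# Demo in a temporary folder with made-up rows
out_path = Path(tempfile.mkdtemp()) / "output.csv"
first = pd.DataFrame({"title": ["First story"], "views": [10]})
second = pd.DataFrame({"title": ["Second story"], "views": [25]})

append_to_csv(first, out_path)   # first run: header + 1 row
append_to_csv(second, out_path)  # second run: 1 more row, no header

result = pd.read_csv(out_path)
print(result)
```

The file ends up with one header line and both rows. Note that a file that exists but is empty would never get a header with this check.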
Method 2: Manually combining
I like this approach a bit more, even though it's more complicated! By manually combining the previous and new data, you have the opportunity to remove duplicate rows in case you accidentally ran it twice.
import pandas as pd

# Add a timestamp to your new rows of data so you
# know when they were scraped (see below).
# It is stored as text so it matches the values read back
# from the CSV; a Period never equals the string '2023-01-04',
# so drop_duplicates would miss the repeated rows otherwise
df['scraped_at'] = str(pd.Timestamp.today().to_period('D'))
# Read in your old data
old_df = pd.read_csv('output.csv')
# Combine the two
merged = pd.concat([old_df, df], ignore_index=True)
# You don't necessarily need this!
merged = merged.drop_duplicates()
# Overwrite the old data
merged.to_csv('output.csv', index=False)
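To see the duplicate removal at work, here is a small sketch that runs the combine step twice on the same data. combine_and_save is a hypothetical helper, and the rows, the fixed date, and the temporary folder are made up for the demo. It stores scraped_at as plain text so the new rows compare equal to the rows read back from the CSV.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_and_save(df, path, scraped_at):
    """Merge df into the CSV at path, dropping exact duplicate rows."""
    df = df.copy()
    # Plain text, the same form the value has after a CSV round trip
    df["scraped_at"] = scraped_at
    old_df = pd.read_csv(path)
    merged = pd.concat([old_df, df], ignore_index=True).drop_duplicates()
    merged.to_csv(path, index=False)
    return merged


out_path = Path(tempfile.mkdtemp()) / "output.csv"
day = str(pd.Timestamp("2023-01-04").to_period("D"))

# Prep work: the first run is a normal save
first = pd.DataFrame({"title": ["First story"], "views": [10]})
first["scraped_at"] = day
first.to_csv(out_path, index=False)

# Later run, then an accidental repeat of the same run
second = pd.DataFrame({"title": ["Second story"], "views": [25]})
combine_and_save(second, out_path, day)
merged = combine_and_save(second, out_path, day)

print(merged)
```

Even though the second scrape was saved twice, the file holds each row once. Rows scraped in different periods still count as different, because their scraped_at values differ.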
Other options for to_period
Adding a scraped_at column allows you to know when the data was scraped, and .to_period is a convenient way to format the time in a nice fashion. It's a lot simpler than using normal Python code like datetime.datetime and strftime.
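For comparison, this sketch builds the same daily label both ways. The fixed moment is a made-up value so the result is reproducible; in a scraper you would use the current time instead.

```python
import datetime

import pandas as pd

# A fixed, hypothetical moment instead of "now"
moment = datetime.datetime(2023, 1, 4, 15, 15, 30)

# Standard library: you spell out the format string yourself
plain = moment.strftime("%Y-%m-%d")

# pandas: you name the period and pandas picks the format
period = str(pd.Timestamp(moment).to_period("D"))

print(plain, period)
```

Both give the same text for a daily period. The pandas version pays off once you switch to weeks or months, where the format string gets fiddlier.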
If your code runs every minute, every hour, every week, or on any schedule other than daily, pandas' offset strings let you pick a period that matches.
period name | description | code | example |
---|---|---|---|
T | minute | pd.Timestamp.today().to_period('T') | 2023-01-04 15:15 |
H | hour | pd.Timestamp.today().to_period('H') | 2023-01-04 15:00 |
D | day | pd.Timestamp.today().to_period('D') | 2023-01-04 |
W | week | pd.Timestamp.today().to_period('W') | 2023-01-02/2023-01-08 |
M | month | pd.Timestamp.today().to_period('M') | 2023-01 |
Y | year | pd.Timestamp.today().to_period('Y') | 2023 |
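If you want to check these labels yourself, here is a small sketch that applies each period to one fixed timestamp (a made-up value matching the examples in the table). It uses 'min' and 'h' for minute and hour, on the assumption that your pandas version accepts them: recent releases deprecate the single-letter 'T' and 'H', so it's worth checking against the version you have installed.

```python
import pandas as pd

# A fixed, hypothetical moment so every run gives the same labels
moment = pd.Timestamp("2023-01-04 15:15:30")

# Period name -> the text that would land in the scraped_at column
labels = {
    freq: str(moment.to_period(freq))
    for freq in ["min", "h", "D", "W", "M", "Y"]
}

for freq, label in labels.items():
    print(f"{freq:>3}  {label}")
```

The weekly label covers Monday through Sunday of the week containing the timestamp, which is why it shows a start and an end date.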