Saving data when scraping

When saving scraped data (especially from automated scrapers), there are two approaches:

  • The replacement method, which replaces the previous dataset
  • The append method, which adds to the previous dataset

The replacement method

This one is the easiest: you just overwrite the old data.

Let's say you have a dataframe called df that is your scraped data, and you want to save it as output.csv. To replace the old data, you just save it as normal.

df.to_csv("output.csv", index=False)

Done!

The append method

The append method adds to an existing CSV file.

Let's say you have a dataframe called df that is your scraped data, and you want to save it as output.csv. There are two approaches.

Prep work

For both options you'll need an existing CSV file to append to. The easiest solution is to run the script once with a normal df.to_csv save, then add in the append code.
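
If you'd rather not do that manual first run, one alternative (my addition, not part of the original recipe) is to let the append call create the file itself, using os.path.exists to decide whether to write the header row. This covers Method 1 below; Method 2 reads the old file back in, so it still needs a real first run or a similar check.

import os

# Append if output.csv already exists; otherwise mode='a' creates it,
# and writing the header on that first run adds the column names
file_exists = os.path.exists('output.csv')
df.to_csv('output.csv', mode='a', header=not file_exists, index=False)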

Method 1: Literal appending

This approach uses two special options to .to_csv to make sure that the new rows are simply added to the bottom of the file.

# Add a timestamp to your new rows of data so you
# know when they were scraped (see below)
df['scraped_at'] = pd.Timestamp.today().to_period('D')

# mode='a' appends the new rows to the end of the file
# header=False skips writing the column names again
df.to_csv('output.csv', mode='a', header=False, index=False)
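
One thing to watch out for: mode='a' just glues rows onto the bottom of the file, so the new data needs its columns in the same order as the existing CSV. If you want to be careful about that, here's a small safeguard (not part of the original recipe) that borrows the column order from the file's header:

# Read just the header row of the existing file to get its column order
existing_cols = pd.read_csv('output.csv', nrows=0).columns

# Reorder the new rows to match before appending
df[existing_cols].to_csv('output.csv', mode='a', header=False, index=False)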

Method 2: Manually combining

I like this approach a bit more, even though it's more complicated! By manually combining the previous and new data, you have the opportunity to remove duplicate rows in case you accidentally ran it twice.

# Add a timestamp to your new rows of data so you
# know when they were scraped (see below)
df['scraped_at'] = pd.Timestamp.today().to_period('D')

# Read in your old data
old_df = pd.read_csv('output.csv')

# Combine the two
merged = pd.concat([old_df, df], ignore_index=True)

# You don't necessarily need this!
merged = merged.drop_duplicates()

# Overwrite the old data
merged.to_csv('output.csv', index=False)
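
One note on that drop_duplicates call: it only removes rows that match exactly, scraped_at included. If the same row shows up in scrapes from different days and you only want to keep the first copy, you could pass a subset that ignores the timestamp column (again, my addition rather than part of the original):

# Compare every column except scraped_at, keeping the earliest copy
data_cols = [col for col in merged.columns if col != 'scraped_at']
merged = merged.drop_duplicates(subset=data_cols, keep='first')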

Other options for to_period

Adding a scraped_at column lets you know when each row of data was scraped, and .to_period is a convenient way to format the timestamp. It's a lot simpler than building the string yourself with datetime.datetime and strftime.
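
As a rough illustration (this snippet isn't from the original, and it assumes df and pandas are already set up as above), here's the plain-Python version next to the .to_period one:

import datetime

# The "normal Python" way: build the date string by hand
df['scraped_at'] = datetime.datetime.now().strftime('%Y-%m-%d')

# The .to_period way: the same daily stamp, with less to remember
df['scraped_at'] = pd.Timestamp.today().to_period('D')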

If your scraper runs hourly, minutely, weekly, or anything other than daily, pandas' offset strings let you pick a different granularity:

code   period   example code                           example output
T      minute   pd.Timestamp.today().to_period('T')    2023-01-04 15:15
H      hour     pd.Timestamp.today().to_period('H')    2023-01-04 15:00
D      day      pd.Timestamp.today().to_period('D')    2023-01-04
W      week     pd.Timestamp.today().to_period('W')    2023-01-02/2023-01-08
M      month    pd.Timestamp.today().to_period('M')    2023-01
Y      year     pd.Timestamp.today().to_period('Y')    2023