Protecting automatic scrapers from duplicate content
A common situation for automatic scrapers is scraping new content every day. Maybe 5AM rolls around and your scraper wakes up, downloads the day's financial data, and goes back to sleep for 24 hours.
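In code, that kind of daily job might boil down to something like this sketch, where every run just tacks new rows onto a CSV (scrape_prices is a made-up stand-in for whatever your scraper actually does):

import pandas as pd

# Hypothetical scraper function that returns a list of dictionaries
data = scrape_prices()
new_df = pd.DataFrame(data)

# Tack today's rows onto the end of the running CSV
new_df.to_csv("finances.csv", mode='a', header=False, index=False)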
The problem: re-running scrapers
But maybe we're testing our script, or we click 'Run' on GitHub Actions, or something else happens and we accidentally run it a few times in the middle of the day, not just at 5AM when it's supposed to run.
Oh no! Every time we ran our script, it added a new row of data to our CSV file. It's supposed to only have one entry for each date, so that's bad, bad duplicate data!
If we wanted bad solutions, we might think about:
- We could filter the duplicates out when doing our analysis, but that seems like a lot to remember
- We could delete the extra rows by hand, but manually editing data seems kind of inappropriate for people who are programming
- We could somehow make the script refuse to run if it's already been run that day, but sometimes we really do need to re-run it manually
So what can we do to protect the integrity of our data? Don't worry, it's easy!
The solution: dropping duplicates
A simple fix is to make your data contain an updated_at column, then drop duplicate rows based on that column.
import pandas as pd

# Run your scraper, build a list
# of dictionaries called `data`
data = [
    { 'price': 3 },
    { 'price': 4 },
    { 'price': 5 }
]
new_df = pd.DataFrame(data)
# Timestamp should be daily; store it as a string so it matches
# the values read back out of the CSV on later runs
new_df['updated_at'] = str(pd.Timestamp.today().to_period('D'))
# Combine old data with new data
old_df = pd.read_csv("finances.csv")
df = pd.concat([old_df, new_df], ignore_index=True)
# Remove repeats based on the 'updated_at' column
df = df.drop_duplicates(subset='updated_at')
# Save
df.to_csv("finances.csv", index=False)
By default, drop_duplicates keeps the first (original) version of the row. If you'd like it to keep the more recent one, add keep='last'.
If your timestamp column is named something other than updated_at, change subset= to match: it's the argument that tells pandas which column decides what counts as duplicate data.
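For example, keeping the most recent row instead of the original one is just a matter of swapping the keep= argument:

# Keep the newest row for each updated_at value instead of the oldest
df = df.drop_duplicates(subset='updated_at', keep='last')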
Why not datetime.date.today() or datetime.now()?
If you try to use a specific time, things like minutes and seconds suddenly matter, and you won't be able to find matches with drop_duplicates.
For example, if my first job finishes about ten minutes before my second job, updated_at might be 2023-01-04 15:25:05 for the first job and 2023-01-04 15:35:12 for the second!
Yes, you could use strftime to manually put together the same thing as .to_period, but that feels like a little too much work.
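To make the difference concrete, here's a quick sketch; the exact timestamps are just for illustration:

from datetime import datetime
import pandas as pd

print(datetime.now())                        # 2023-01-04 15:25:05.123456 - different every run
print(pd.Timestamp.today().to_period('D'))   # 2023-01-04 - the same value all day long
print(datetime.now().strftime('%Y-%m-%d'))   # 2023-01-04 - strftime works too, just more typing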
Other options for to_period
If your code runs hourly, weekly, every minute, or anything other than daily, pandas' offset strings let you pick a different period (there's a quick example after the table):
offset string | description | code | example
---|---|---|---
T | minute | pd.Timestamp.today().to_period('T') | 2023-01-04 15:15
H | hour | pd.Timestamp.today().to_period('H') | 2023-01-04 15:00
D | day | pd.Timestamp.today().to_period('D') | 2023-01-04
W | week | pd.Timestamp.today().to_period('W') | 2023-01-02/2023-01-08
M | month | pd.Timestamp.today().to_period('M') | 2023-01
Y | year | pd.Timestamp.today().to_period('Y') | 2023
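For example, a scraper that runs every hour could tag its rows with the hourly period instead; it's the same pattern as above, just a different offset string:

# Hourly version: everything scraped within the same hour shares an updated_at
new_df['updated_at'] = str(pd.Timestamp.today().to_period('H'))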
Preserving missing data and NaN values
If you added this after the script had already been running for a while, maybe all of your old rows are missing their updated_at values. The code above would collapse all of those NaN rows down to a single row, since pandas considers them duplicates of each other!
We'll use df.duplicated in our solution below:
# Combine with old data
old_df = pd.read_csv("finances.csv")
df = pd.concat([old_df, new_df], ignore_index=True)
# Keep rows with a missing updated_at, plus the first row for each updated_at value
df = df[df.updated_at.isna() | ~df.duplicated(subset='updated_at')]
# Save
df.to_csv("finances.csv", index=False)
Using df.updated_at.isna() preserves all of our rows with missing update times, while ~df.duplicated(subset='updated_at') keeps only the first row for each updated_at value, since duplicated returns True for every repeat after the first.
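If you want to see that mask in action, here's a tiny made-up column; the values are purely illustrative:

import pandas as pd

df = pd.DataFrame({'updated_at': [None, None, '2023-01-04', '2023-01-04']})

print(df.duplicated(subset='updated_at'))
# 0    False
# 1     True   <- the second missing value counts as a duplicate...
# 2    False
# 3     True

print(df[df.updated_at.isna() | ~df.duplicated(subset='updated_at')])
# ...but the isna() check rescues it, so rows 0, 1 and 2 all survive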