
Protecting automatic scrapers from duplicate content

A common situation for automatic scrapers is scraping new content every day. Maybe 5AM rolls around and your scraper wakes up, downloads the day's financial data, and goes back to sleep for 24 hours.

The problem: re-running scrapers

But maybe we're testing our script, or we click 'Run' on GitHub Actions, or something else happens and we accidentally run it a few extra times in the middle of the day, not just at 5AM when it's supposed to run.

Oh no! Every time we ran our script, it added a new row of data to our CSV file. It's supposed to only have one entry for each date, so that's bad bad duplicate data!
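
For example, a finances.csv that's supposed to hold one row per day might end up looking something like this after a few accidental re-runs (invented columns and numbers, just to illustrate):

date,price
2023-01-03,4
2023-01-04,3
2023-01-04,3
2023-01-04,3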

If we wanted bad solutions, we might think about:

  • We could filter out the duplicates when doing our analysis, but that seems like a lot to remember
  • We could edit the CSV by hand, but manually fixing data files seems like the wrong move for people who are supposed to be programming
  • We could somehow stop the scraper from running if it's already run once that day, but sometimes we genuinely need to re-run it manually

So what can we do to protect the integrity of our data? Don't worry, it's easy!

The solution: dropping duplicates

A simple fix is to add an updated_at column to your data, then drop duplicate rows based on that column.

import pandas as pd

# Run your scraper, build a list
# of dictionaries called `data`
data = [
    { 'price': 3 },
    { 'price': 4 },
    { 'price': 5 }
]
new_df = pd.DataFrame(data)

# Timestamp should be daily: stored as a string so it matches
# the dates read back out of the CSV on later runs
new_df['updated_at'] = str(pd.Timestamp.today().to_period('D'))

# Combine old data with new data
old_df = pd.read_csv("finances.csv")
df = pd.concat([old_df, new_df], ignore_index=True)

# Remove repeats based on 'updated_at' column
df = df.drop_duplicates(subset='updated_at')

# Save
df.to_csv("finances.csv", index=False)

By default, drop_duplicates keeps the first (original) version of the row. If you'd like it to keep the more recent one, add keep='last'.

If your column is named something else, point subset= at whichever column tells pandas what counts as duplicate data.
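
For example, if your column were called scraped_date instead (a made-up name) and you wanted the newest row to win, the call might look like this:

# Keep the most recent row for each scraped_date
# ('scraped_date' is a hypothetical column name)
df = df.drop_duplicates(subset='scraped_date', keep='last')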

Why not datetime.date.today() or datetime.now()?

datetime.date.today() only helps if you're deduplicating by day, and if you use a full timestamp like datetime.now(), things like minutes and seconds suddenly matter and you won't be able to find matches with drop_duplicates.

For example, if my first job finishes a few minutes before my second job, updated_at might be 2023-01-04 15:25:05 for the first job and 2023-01-04 15:35:12 for the second!

Yes, you could use strftime to manually put together the same thing as .to_period, but that feels like a little too much work.
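
Here's a quick sketch of the difference, with the strftime format string being just one way to mimic a daily period:

from datetime import datetime

import pandas as pd

datetime.now()                         # 2023-01-04 15:25:05.123456 -- changes every run
datetime.now().strftime('%Y-%m-%d')    # '2023-01-04' -- works, but you write the format yourself
pd.Timestamp.today().to_period('D')    # Period('2023-01-04', 'D') -- same idea, no format string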

Other options for to_period

If your code is running hourly, weekly, every minute, or on any schedule other than daily, pandas' offset strings let you pick a different period size.

code   period   example code                          example output
T      minute   pd.Timestamp.today().to_period('T')   2023-01-04 15:15
H      hour     pd.Timestamp.today().to_period('H')   2023-01-04 15:00
D      day      pd.Timestamp.today().to_period('D')   2023-01-04
W      week     pd.Timestamp.today().to_period('W')   2023-01-02/2023-01-08
M      month    pd.Timestamp.today().to_period('M')   2023-01
Y      year     pd.Timestamp.today().to_period('Y')   2023
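
So a scraper that runs every hour, for instance, might timestamp its rows like this, with the rest of the script staying the same:

# One value per hour, so re-runs within the same hour get dropped
new_df['updated_at'] = str(pd.Timestamp.today().to_period('H'))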

Preserving missing data and NaN values

If you added the updated_at column after the script had already been running for a while, all of your old rows might be missing it. The code above throws away all but one of those NaN rows, since as far as drop_duplicates is concerned they're duplicates of each other!

We'll use df.duplicated in our solution below:

# Combine with old data
old_df = pd.read_csv("finances.csv")
df = pd.concat([old_df, new_df], ignore_index=True)

# Keep rows with a missing updated_at, plus the first
# row for each updated_at value that does appear
df = df[df.updated_at.isna() | ~df.duplicated(subset='updated_at')]

# Save
df.to_csv("finances.csv", index=False)

Using df.updated_at.isna() preserves all of our rows with missing update times, while df.duplicated(subset='updated_at') returns True for every repeat after the first, so the ~ flips it and keeps just one row per updated_at value.
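
Here's a tiny sketch of how that mask behaves, using made-up rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [1, 2, 3, 4],
    'updated_at': [np.nan, np.nan, '2023-01-04', '2023-01-04']
})

mask = df.updated_at.isna() | ~df.duplicated(subset='updated_at')
print(df[mask])
#    price  updated_at
# 0      1         NaN
# 1      2         NaN
# 2      3  2023-01-04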