If you wanted to run some code every hour - to scrape a web site, for example! - it would get rather boring rather quickly if you had to do it yourself. And besides the boredom, you’d never be able to sleep more than 45 minutes at a time!

If you want to do tasks on a regular schedule, cron jobs are your solution.

We don’t talk about doing this on Windows, because if you’re doing something regularly you presumably want it running when you’re asleep, or when your computer is off. That means a server, and servers mean Linux!

Updating our cron jobs

The program cron eats up a specially-formatted file that tells the server what to run and how often to run it. It’s not the easiest thing to use, so we’re going to have crontab help us out.
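By the way, if you ever want to see what’s already scheduled, crontab -l will list your current cron jobs (on a fresh server it will most likely just tell you there’s no crontab yet).

crontab -l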

First we need to tell crontab what editor we want to use, by setting the EDITOR environment variable. You’ll only need to do this once.

echo "export EDITOR=nano" >> ~/.bash_profile
source ~/.bash_profile
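To double-check that it stuck, print the variable - it should say nano.

echo $EDITOR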

Then, run crontab -e to open up the cron editor.

crontab -e

Once you’re in the editor, type (or paste) this in there:

*/2 * * * * curl http://www.bbc.com > ~/bbc-`date +"\%s"`.txt

That backslash in \%s isn’t a typo: cron treats a bare % as a newline, so percent signs have to be escaped inside a crontab.

Once you’re done, hit Ctrl+O and enter to save, then Ctrl+X to exit.

The * * * * * pattern is conveniently explained to you in the crontab editor! It’s how often the task gets run.
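For reference, here are a few example schedules - the five fields are minute, hour, day of month, month, and day of week, and some-command is just a stand-in for whatever you’d like to run.

# Every 2 minutes (what we're using above)
*/2 * * * * some-command
# At minute 0 of every hour
0 * * * * some-command
# At 6:30am every day
30 6 * * * some-command
# At midnight every Monday
0 0 * * 1 some-command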

How would we know if we had an error? Let’s wait a few minutes. If you’re getting antsy about when our task will run, type date. We’re looking for a divisible-by-two minute.

Hit enter if you get bored.

Checking your mail and silencing curl

Eventually the machine says “You have new mail.” Check it with… mail.

mail

To select the most recent message, just hit enter. Oh, look, the output of curl, how sweet.

Exit with x.

Whenever cron runs a command, any output gets mailed to you. We… don’t really want this. The simplest fix for now is to make curl silent by using the --silent flag.

crontab -e

Then change your line to add --silent to the curl command.

*/2 * * * * curl --silent http://www.bbc.com > ~/bbc-`date +"\%s"`.txt

Now you won’t be hassled any more.

This mail is also how you’ll know if you have an error in your code!
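If you’d like to skip the mail but still keep a record of problems, one variation (a sketch, not something we use below) is to add --show-error - which re-enables the error messages that --silent hides - and append stderr to a log file with 2>> (the name ~/curl-errors.log is just an example):

*/2 * * * * curl --silent --show-error http://www.bbc.com > ~/bbc-`date +"\%s"`.txt 2>> ~/curl-errors.log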

Scraping repeatedly

What we’ve done so far is pretty boring - we’ve just saved a file. What we really want is to run Python scripts!

Let’s use this file to scrape some headlines from the BBC. Create it on your own computer, and save it as scraper.py.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

response = requests.get("http://www.bbc.com")
doc = BeautifulSoup(response.text, 'html.parser')

stories = doc.find_all(class_='media-list__item')

rows = []
# Grab their headlines and bylines
for story in stories:
    row = {}
    # Grab the headline link inside of the story
    headline = story.find(class_='media__link')
    # If a headline exists, then process the rest!
    if headline:
        # They're COVERED in whitespace
        row['headline'] = headline.text.strip()
        # Get the URL
        row['link'] = headline['href']
        try:
            row['summary'] = story.find(class_='media__summary').text.strip()
        except AttributeError:
            # Not every story has a summary
            pass

        rows.append(row)

# Create our dataframe
df = pd.DataFrame(rows)
df.to_csv("bbc-data.csv")

# No wait, let's include the time in the filename
datestring = time.strftime("%Y-%m-%d-%H-%M")
filename = f"bbc-data-{datestring}.csv"

# Save it
df.to_csv(filename, index=False)

Once it’s saved, you need to send it to your server. We’ll use scp for that.

scp -i ~/.ssh/foundations_key scraper.py root@YOUR_IP:~/

Testing out the script

Now that the script is on our server, we can try to run it.

python3 scraper.py

If you need to install more packages (requests, etc), install them!
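If Python complains about missing modules, pip3 should cover it - the package names below match the imports in our script (note that BeautifulSoup installs as beautifulsoup4):

pip3 install requests beautifulsoup4 pandas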