If you wanted to run some code every hour - to scrape a web site, for example! - it would get rather boring rather quickly if you had to do it yourself. Boredom aside, you’d never be able to sleep more than 45 minutes at a time!

If you want to do tasks on a regular schedule, cron jobs are your solution.

We don’t talk about doing this on Windows, because if you’re doing something regularly you presumably want it running when you’re asleep, or when your computer is off. That means a server, and servers mean Linux!

Updating our cron jobs

The program cron eats up a specially-formatted file that tells the server what to run and how often to run it. It isn’t the easiest file to write by hand, so we’re going to have crontab help us out.

Run crontab -e to open up the cron editor.

crontab -e

If it asks you what editor you want, thank it for the politeness and hit enter: nano is the default. It’s kind of a weird editor, but at least it lists its commands at the bottom of the screen.

Once you’re in the editor, type (or paste) this in there:

*/2 * * * * curl http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt

Once you’re done, hit Ctrl+X to exit, and on your way out press Y to save changes and hit Enter to keep the same filename.
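A quick note on that filename: the backticks run the date command, and date +"%s" prints the current Unix timestamp - the number of seconds since January 1, 1970 - so every download gets its own unique name. The backslash in \%s is there because cron treats a bare % specially (it turns it into a newline), so it has to be escaped. You can try the date part yourself, no backslash needed outside of cron:

date +"%s"

You’ll get back one big number, something like 1489070917.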

The */2 * * * * pattern is how often the task gets run - it should be explained in the comments at the top of the crontab editor, I think!
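In case those comments aren’t there: the five fields are, from left to right, minute, hour, day of month, month, and day of week. A few illustrative schedules (your-command is just a placeholder, not a real program):

# Every two minutes
*/2 * * * * your-command

# At the top of every hour
0 * * * * your-command

# At 6:30am every Monday
30 6 * * 1 your-command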

How would we know if we had an error? Let’s wait a few minutes. If you’re getting antsy about when our task will run, type date - since we used */2, we’re waiting for a minute that’s divisible by two.
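If you haven’t met it before, date on its own just prints the server’s current date and time:

date

The output looks something like Thu Mar  9 14:32:01 UTC 2017 - keep an eye on the minutes.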

Hit enter if you get bored.

Checking your mail and silencing curl

Eventually the machine says You have new mail. Check it with… mail.

mail

The most recent message is already selected, so just hit enter to read it. Oh, look, the output of curl, how sweet.

Exit with x.

Whenever cron runs a command, any output it produces gets mailed to you. We… don’t really want this. The simplest fix for now is to make curl silent by using the --silent flag.

crontab -e

Then edit your line to add --silent to the curl command.

*/2 * * * * curl --silent http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt

Now you won’t be hassled any more.

This mail is also how you’ll know if you have an error in your code!
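One catch: --silent also hides curl’s own error messages, so a failed download won’t generate any mail either. If you’d like quiet success but noisy failure, curl’s --show-error flag brings the error messages back while keeping everything else silent:

*/2 * * * * curl --silent --show-error http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt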

Scraping repeatedly

What we’ve done so far is pretty boring - we’ve just saved a file. What we really want is to run Python scripts!

Let’s use a script to scrape some headlines from the New York Times. Create this file on your own computer, and save it as scraper.py.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

response = requests.get("http://www.nytimes.com")
doc = BeautifulSoup(response.text, 'html.parser')

stories = doc.find_all("article", { 'class': 'story' })

all_stories = []
# Grab their headlines and bylines
for story in stories:
    # Grab the story-heading h2 inside of the story
    headline = story.find('h2', {'class': 'story-heading'})
    # If a headline exists, then process the rest!
    if headline:
        # They're COVERED in whitespace
        headline_text = headline.text.strip()
        # Make a dictionary with the headline
        this_story = { 'headline': headline_text }
        byline = story.find('p', {'class': 'byline'})
        # Not all of them have a byline
        if byline:
            byline_text = byline.text.strip()
            this_story['byline'] = byline_text
        all_stories.append(this_story)

# Convert our list of dictionaries into a dataframe
stories_df = pd.DataFrame(all_stories)
# Save a "latest" copy that gets overwritten on every run
stories_df.to_csv("nyt-data.csv", index=False)

# Build a timestamped filename, e.g. nyt-data-2017-03-09-14-30.csv,
# so each run also gets saved separately
datestring = time.strftime("%Y-%m-%d-%H-%M")

filename = "nyt-data-" + datestring + ".csv"
stories_df.to_csv(filename, index=False)

Once it’s saved, you need to send it to your server. We’ll use scp for that - run it from your own computer, not the server. The -i flag points at your SSH key, and root@YOUR_IP:~/ means “root’s home directory on your server” (swap in your actual IP address).

scp -i ~/.ssh/algorithms_key scraper.py root@YOUR_IP:~/
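If you’d like to double-check that it arrived, you can list the server’s home directory over ssh, using the same key and IP as before:

ssh -i ~/.ssh/algorithms_key root@YOUR_IP ls ~/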

Testing out the script

Now that the script is on our server, we can try to run it with python3 scraper.py. If it complains about missing packages (requests, etc), install them!
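If you do, pip3 is the usual route - these are the standard PyPI package names, but your setup may vary:

pip3 install requests beautifulsoup4 pandas

Once the script runs cleanly by hand, you can point cron at it the same way we pointed it at curl. A sketch, assuming python3 is on cron’s path and scraper.py is sitting in root’s home directory:

*/2 * * * * python3 /root/scraper.py

One last gotcha: cron runs the script from your home directory, so relative filenames like nyt-data.csv will land there. Using full paths keeps the surprises down.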