Using cron jobs for repeating tasks
If you wanted to run some code every hour - to scrape a web site, for example! - it would get rather boring rather quickly if you had to do it yourself. And boredom aside, you’d never be able to sleep more than 45 minutes at a time!
If you want to do tasks on a regular schedule, cron jobs are your solution.
We don’t talk about doing this on Windows, because if you’re doing something regularly you presumably want it running when you’re asleep, or when your computer is off. That means a server, and servers mean Linux!
Updating our cron jobs
The program cron eats up a specially-formatted file that tells the server what to run and how often to run it. It isn’t the easiest thing to use, so we’re going to have crontab help us out.
Run crontab -e to open up the cron editor.
crontab -e
If it asks you what editor you want, thank it for the politeness and hit enter: nano is the default. It’s kind of a weird editor, but at least it lists the commands right there on the screen.
Once you’re in the editor, type (or paste) this in there:
*/2 * * * * curl http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt
Once you’re done, hit Ctrl+X to exit; on your way out, press Y to save your changes and hit Enter to keep the same filename.
The */2 * * * * pattern is explained to you in the crontab editor, I think! It’s how often the task gets run: the five fields are minute, hour, day of month, month, and day of week, so */2 in the minute slot means “every two minutes.” One gotcha: the % character is special inside a crontab (it means newline), which is why our date command escapes it as \%s - that %s is just the current Unix timestamp, used to give each file a unique name.
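In case those comments aren’t there, here are a few example schedules. These are standard cron patterns, and some-command is just a stand-in for whatever you want to run:

# Every two minutes
*/2 * * * * some-command
# At the top of every hour
0 * * * * some-command
# Every Monday at 6:30am
30 6 * * 1 some-command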
How would we know if we had an error? Let’s wait a few minutes. If you’re getting antsy about when our task will run, type date. We’re looking for a minute that’s divisible by two. Hit enter a few times if you get bored.
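If you’d rather not just stare at the clock, you can also watch for the output files to show up in your home directory. This just lists anything matching the filename pattern from our cron line:

ls -l ~/nyt-*.txt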
Checking your mail and silencing curl
Eventually the machine says You have new mail.
Check it with… mail.
mail
You’re selecting the most recent message; hit enter. Oh, look, the output of curl, how sweet.
Exit with x.
Whenever cron runs a command, the output gets mailed to you. We… don’t really want this. The simplest fix for now is to make curl quiet with the --silent flag.
crontab -e
Then change your line to add --silent to the curl command.
*/2 * * * * curl --silent http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt
Now you won’t be hassled any more.
This mail is also how you’ll know if you have an error in your code!
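By the way, if you’d ever rather collect errors in a file instead of mail, one common shell trick (not something we need here, just an alternative) is to redirect standard error to a log. Adding --show-error makes curl report problems even while it’s silent:

*/2 * * * * curl --silent --show-error http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt 2>> ~/cron-errors.log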
Scraping repeatedly
What we’ve done so far is pretty boring - we’ve just saved a file. What we really want is to run Python scripts!
Let’s use this file to scrape some headlines from the New York Times. Create it on your own computer, and save it as scraper.py.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

response = requests.get("http://www.nytimes.com")
doc = BeautifulSoup(response.text, 'html.parser')
stories = doc.find_all("article", {'class': 'story'})

all_stories = []
# Grab their headlines and bylines
for story in stories:
    # Grab the first story-heading h2 inside of the story
    headline = story.find('h2', {'class': 'story-heading'})
    # If a headline exists, then process the rest!
    if headline:
        # They're COVERED in whitespace
        headline_text = headline.text.strip()
        # Make a dictionary with the headline
        this_story = {'headline': headline_text}
        byline = story.find('p', {'class': 'byline'})
        # Not all of them have a byline
        if byline:
            byline_text = byline.text.strip()
            this_story['byline'] = byline_text
        all_stories.append(this_story)

stories_df = pd.DataFrame(all_stories)
# Save a copy under a fixed name...
stories_df.to_csv("nyt-data.csv", index=False)
# ...and a timestamped copy, so each run doesn't overwrite the last
datestring = time.strftime("%Y-%m-%d-%H-%M")
filename = "nyt-data-" + datestring + ".csv"
stories_df.to_csv(filename, index=False)
Once it’s saved, you need to send it to your server. We’ll use scp for that (the -i flag points at the same SSH key you use to log in).
scp -i ~/.ssh/algorithms_key scraper.py root@YOUR_IP:~/
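If you want to double-check that it arrived (totally optional, and assuming the same key and IP as above), you can list it over ssh:

ssh -i ~/.ssh/algorithms_key root@YOUR_IP "ls -l ~/scraper.py"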
Testing out the script
Now that the script is on our server, we can try to run it with python3 scraper.py. If you need to install more packages (requests, etc.), install them!
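For example, something like this should cover everything scraper.py imports (assuming your server uses pip3 for Python 3 - on some setups it’s just pip):

pip3 install requests beautifulsoup4 pandas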