To create a new server

  1. Open up DigitalOcean
  2. Create Droplet
  3. Ubuntu 14.04.4 x64
  4. $5/mo
  5. New York is fine
  6. New SSH Key
  7. Paste in our SSH public key, naming it do-droplet (key-creating details below)
  8. Make sure you’re only creating one
  9. You can give it a unique hostname if you’d like. Maybe cronmachine?
  10. Click Create and wait maybe 60 seconds for it to start up.

How to create an SSH key

We’re changing these GitHub directions a little bit. Run the following to start ssh-keygen, the SSH key generator.

ssh-keygen -t rsa -b 4096 -C "YOUR_EMAIL@EXAMPLE.COM"

It will say to you:

Enter a file in which to save the key (/Users/YOUR_USERNAME/.ssh/id_rsa):

WE DON’T WANT TO USE THIS. We want to make a new one, but in the same .ssh directory. We’re going to call it do-droplet. We can accomplish that by entering this as the filename:

~/.ssh/do-droplet

Don’t type anything when it asks you for a passphrase; just hit enter twice. It will create two files for you: your public and private keys.

Take a look at your private key with cat. Never give this away!

cat ~/.ssh/do-droplet

Now take a look at your public key, also with cat. This is the one we’ll paste.

cat ~/.ssh/do-droplet.pub
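If you’re on a Mac, you can send the public key straight to your clipboard instead of copying it out of the terminal by hand:

pbcopy < ~/.ssh/do-droplet.pub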

The next two commands will come in handy later, but don’t run them yet.

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/do-droplet

Connecting to our server

Open up the page for your droplet, and make a note of your IP Address. DO refers to it as IPv4. My IP is 107.170.91.48.

Now we need to connect to our server using ssh. We’ll be connecting as root, the A+ best killer cool user.

ssh root@YOUR_IP
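The first time you connect, the question will look something like this (the fingerprint will be different for your server):

The authenticity of host 'YOUR_IP (YOUR_IP)' can't be established.
ECDSA key fingerprint is SHA256:...
Are you sure you want to continue connecting (yes/no)?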

The answer is yes. It will then ask you for a password. You don’t have a password!

Instead of passwords, we’re using public and private keys. Hit Ctrl+C to quit logging in, and let’s try to log in again using our key (our identity).

ssh -i ~/.ssh/do-droplet root@107.170.91.48

Success!
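Typing -i and the full address every time gets old. If you’d like, you can save the connection details in ~/.ssh/config instead. The do-droplet alias below is just a name we’re picking, and you’ll want to swap in your own IP:

cat >> ~/.ssh/config <<'EOF'
Host do-droplet
    HostName YOUR_IP
    User root
    IdentityFile ~/.ssh/do-droplet
EOF

After that, ssh do-droplet is all it takes to connect.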

Running things on our server

Note the error-y message about packages needing updates when you first log in.
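It will be a couple of lines like these, although your counts will differ:

71 packages can be updated.
37 updates are security updates.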

Try a few commands to look around:

pwd
ls
curl http://www.nytimes.com
curl http://www.nytimes.com > nyt.txt
ls
cat nyt.txt
python --version
python3 --version

Let’s update those packages.

apt-get update
apt-get upgrade

Let’s also install a few other packages

apt-get install mailutils

When you get a pink screen asking you about mail setup, pick “Internet Site”.

Our scraper.py file

Save this file on your local machine:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

response = requests.get("http://www.nytimes.com")
doc = BeautifulSoup(response.text, 'html.parser')

stories = doc.find_all("article", { 'class': 'story' })

all_stories = []
# Grab their headlines and bylines
for story in stories:
    # Grab all of the h2's inside of the story
    headline = story.find('h2', {'class': 'story-heading'})
    # If a headline exists, then process the rest!
    if headline:
        # They're COVERED in whitespace
        headline_text = headline.text.strip()
        # Make a dictionary with the headline
        this_story = { 'headline': headline_text }
        byline = story.find('p', {'class': 'byline'})
        # Not all of them have a byline
        if byline:
            byline_text = byline.text.strip()
            this_story['byline'] = byline_text
        all_stories.append(this_story)

# all_stories is now a list of dictionaries, one per story

stories_df = pd.DataFrame(all_stories)
stories_df.to_csv("nyt-data.csv")

datestring = time.strftime("%Y-%m-%d-%H-%M")

filename = "nyt-data-" + datestring + ".csv"
stories_df.to_csv(filename, index=False)
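The time.strftime("%Y-%m-%d-%H-%M") call turns the current time into a string like 2016-06-08-14-30, so each run saves to a fresh filename along the lines of nyt-data-2016-06-08-14-30.csv instead of overwriting the last one.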

Crontab on our server

Now we want to run crontab.

crontab -e

It asks us what editor we want, how nice. Hit enter; nano is the default. Enter this somewhere:

*/2 * * * * curl http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt

There’s a \ before the % which wasn’t there before! That’s because cron treats a percent symbol as a newline (one of the little ways cron on Ubuntu differs from OS X), so we need to escape it to say: hey, really, we want a percent symbol here, not a newline.
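If you’re wondering what all those asterisks mean: the five fields before the command are minute, hour, day of month, month, and day of week. Here’s the same line again with comments, since cron ignores lines that start with #:

# minute hour day-of-month month day-of-week command
# */2 in the minute field means "every second minute"
*/2 * * * * curl http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt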

Save/exit.

How would we know if we had an error? Let’s wait a few minutes. If you’re getting antsy, remember that the job only fires on minutes divisible by two, so keep an eye on the clock.

Checking your mail and silencing curl

Eventually the machine says You have new mail. Check it with… mail.

mail

You’re selecting the most recent message, hit enter. Oh, look, the output of curl, how sweet.

Exit with x.

Whenever cron runs a command, the output gets mailed to you. We… don’t really want this. The simplest way for now is to make curl silent by using the --silent flag.

crontab -e

Then change your line to have --silent with curl.

*/2 * * * * curl --silent http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt

Now you won’t be hassled any more.

This mail is also how you’ll know if you have an error in your code.
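If you’d rather collect errors in a file instead of relying on mail, one option is to keep curl quiet but still print errors using its --show-error flag, and append them to a log. The ~/curl-errors.log path here is just a name we’re making up:

*/2 * * * * curl --silent --show-error http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt 2>> ~/curl-errors.log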

Transferring files

Now we want to take our scraper.py and transfer it from our local machine to our server. We’ll be copying it, which is usually cp, but because we’re doing it over ssh it’s scp.

From your local machine, run:

scp -i ~/.ssh/do-droplet scraper.py root@YOUR_IP:~/

This will send a file from your local machine to your remote machine, while logging in with your identity file (private key).
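scp works in both directions, and with the -r flag it can copy entire directories. If you had a whole project folder to send (myproject here is hypothetical), it would look like:

scp -i ~/.ssh/do-droplet -r myproject root@YOUR_IP:~/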

Setting up Python

Back on your server, try to run your script using python3.

python3 scraper.py

In return, you get a joyous error message!

Traceback (most recent call last):
  File "scraper.py", line 6, in <module>
    import requests
ImportError: No module named 'requests'

Okay, well then we’ll install it…

pip3 install requests

But then, another error!

The program 'pip3' is currently not installed. You can install it by typing:
apt-get install python3-pip

DO NOT INSTALL PIP WITH THAT. It will install a bad, old version. We can get the new, cool version with

curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py

Run pip --version to see if it talks about Python 3. That’s what we’re hoping.
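It should print something along these lines, with your own version number and path:

pip 8.1.2 from /usr/local/lib/python3.4/dist-packages (python 3.4)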

Before we spend forever running into errors, there are a lot more dependencies that need to be installed. Dependencies are pieces of code that other code… depends on, and it isn’t always Python code. Use the commands below to install a whole bushel of dependencies. Some are probably already installed along with python3-pip.

apt-get install build-essential
apt-get install python3-dev
apt-get install python3-numpy
apt-get install python3-scipy
apt-get install libatlas-dev
apt-get install ipython3
apt-get install python3-pandas
apt-get install libxml2-dev libxslt1-dev
apt-get install python3-matplotlib

So now let’s install requests

pip3 install requests

Now we’ll run the app again…

python3 scraper.py

But then another error…

Traceback (most recent call last):
  File "scraper.py", line 7, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named 'bs4'

This is where virtual environments get magical. Instead of installing packages one by one, skip down below to create a requirements.txt file from your local virtual environment, then come back up here.

Once you have a requirements.txt on your server, you can just tell pip to install everything in that list.

pip3 install -r requirements.txt

If you need to install packages for postgres or something, you might want to search and see what’s available using apt-cache search

apt-cache search psql
apt-cache search postgres

Other commands

To pull the CSV files your scraper saves back down to your local machine, run this from your local machine:

scp -i ~/.ssh/do-droplet root@YOUR_IP:~/*.csv .

Creating a requirements.txt file

On your local machine, inside the virtual environment that scraper.py runs in, run the following command

pip freeze -l

This lists every package in your virtual environment, along with its version number. Save the list to a file with the following command

pip freeze -l > requirements.txt
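The file is just a plain-text list of packages pinned to versions. For a script like ours it might look something like this, though your version numbers will differ:

beautifulsoup4==4.4.1
pandas==0.18.1
requests==2.10.0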

You can cat the file if you’d like. Now scp it to your server with

scp -i ~/.ssh/do-droplet requirements.txt root@YOUR_IP:~/

Now you can head back up to the server steps above and pick up where you left off.