Servers
To create a new server
- Open up DigitalOcean
- Create Droplet
- Ubuntu 14.04.4 x64
- $5/mo
- New York is fine
- New SSH Key
- Paste in our SSH public key, naming it do-droplet (key-creating details below)
- Make sure you’re only creating one
- You can give it a unique hostname if you’d like. Maybe cronmachine?
- Click Create and wait maybe 60 seconds for it to start up.
How to create an SSH key
We’re changing these GitHub directions a little bit. Run the following to start ssh-keygen, the SSH key generator.
ssh-keygen -t rsa -b 4096 -C "YOUR_EMAIL@EXAMPLE.COM"
It will say to you:
Enter a file in which to save the key (/Users/YOUR_USERNAME/.ssh/id_rsa):
WE DON’T WANT TO USE THIS. We want to make a new one, but in the same .ssh directory. We’re going to call it do-droplet. We can accomplish that by entering this as the filename:
~/.ssh/do-droplet
Don’t type anything when it asks you for a passphrase, just hit enter twice. It will create two files for you, your public and private keys.
Take a look at your private key with cat. Never give this away!
cat ~/.ssh/do-droplet
Now take a look at your public key, also with cat. This is the one we’ll paste.
cat ~/.ssh/do-droplet.pub
The next two commands will come in handy later, but don’t run them yet.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/do-droplet
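As a quick sanity check on the key pair described above, here is a minimal Python sketch that confirms both files exist and that the private key is readable only by its owner (ssh tends to refuse private keys with looser permissions). The function name check_keypair is made up for this example; the only path taken from these notes is ~/.ssh/do-droplet.

```python
import stat
from pathlib import Path


def check_keypair(private_path):
    """Report whether both key files exist and whether the private key is locked down."""
    private = Path(private_path).expanduser()
    # The public key sits next to the private key, with .pub appended
    public = private.with_name(private.name + ".pub")
    report = {
        "private_exists": private.is_file(),
        "public_exists": public.is_file(),
        "private_is_owner_only": False,
    }
    if report["private_exists"]:
        mode = stat.S_IMODE(private.stat().st_mode)
        # True when group and others have no permission bits at all (e.g. 0o600)
        report["private_is_owner_only"] = (mode & 0o077) == 0
    return report


if __name__ == "__main__":
    print(check_keypair("~/.ssh/do-droplet"))
```

If private_is_owner_only comes back False, chmod 600 on the private key is the usual fix.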
Connecting to our server
Open up the page for your droplet, and make a note of your IP Address. DO refers to it as IPv4. My IP is 107.170.91.48.
Now we need to connect to our server using ssh. We’ll be connecting as root, the A+ best killer cool user.
ssh root@YOUR_IP
There will be a question. The answer is yes. It will ask you for a password. You don’t have a password!
Instead of passwords, we’re using public and private keys. Ctrl+C to quit logging in, and let’s try to log in again using our key (our identity).
ssh -i ~/.ssh/do-droplet root@107.170.91.48
Success!
Running things on our server
Note the error-y message about packages needing updates.
pwd
ls
curl http://www.nytimes.com
curl http://www.nytimes.com > nyt.txt
ls
cat nyt.txt
python --version
python3 --version
Let’s update those packages.
apt-get update
apt-get upgrade
Let’s also install a few other packages
apt-get install mailutils
When you get a pink screen asking you about mail setup, pick ….internet.
Our scraper.py file
Save this file on your local machine:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

response = requests.get("http://www.nytimes.com")
doc = BeautifulSoup(response.text, 'html.parser')
stories = doc.find_all("article", { 'class': 'story' })

all_stories = []
# Grab their headlines and bylines
for story in stories:
    # Grab all of the h2's inside of the story
    headline = story.find('h2', {'class': 'story-heading'})
    # If a headline exists, then process the rest!
    if headline:
        # They're COVERED in whitespace
        headline_text = headline.text.strip()
        # Make a dictionary with the headline
        this_story = { 'headline': headline_text }
        byline = story.find('p', {'class': 'byline'})
        # Not all of them have a byline
        if byline:
            byline_text = byline.text.strip()
            this_story['byline'] = byline_text
        all_stories.append(this_story)

stories_df = pd.DataFrame(all_stories)
stories_df.to_csv("nyt-data.csv")

# Also save a timestamped copy so each run gets its own file
datestring = time.strftime("%Y-%m-%d-%H-%M")
filename = "nyt-data-" + datestring + ".csv"
stories_df.to_csv(filename, index=False)
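The last three lines of the script are what keep repeated runs from overwriting each other. A minimal sketch of that naming scheme, using the same format string but with fixed example times instead of the current time (the helper name timestamped_filename and the dates are invented for illustration):

```python
from datetime import datetime


def timestamped_filename(prefix, when, extension=".csv"):
    """Build a name like nyt-data-2016-07-01-09-30.csv from a datetime."""
    datestring = when.strftime("%Y-%m-%d-%H-%M")
    return prefix + datestring + extension


# Two runs two minutes apart land in two different files
first = timestamped_filename("nyt-data-", datetime(2016, 7, 1, 9, 30))
second = timestamped_filename("nyt-data-", datetime(2016, 7, 1, 9, 32))
print(first)   # nyt-data-2016-07-01-09-30.csv
print(second)  # nyt-data-2016-07-01-09-32.csv
```

Because the format only goes down to the minute, two runs inside the same minute would still share a filename.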
Crontab on our server
Now we want to run crontab.
crontab -e
It asks us what editor we want, how nice. Hit enter, nano is default. Enter this somewhere:
*/2 * * * * curl http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt
There’s a \ before the % which wasn’t there before! That’s because cron on Ubuntu treats an unescaped percent symbol as a newline. We need to escape it to say hey, really, we want to use a percent symbol here, not a newline.
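To see why the backslash matters, here is a simplified Python model of how cron reads the command part of a crontab line. It is a sketch of the documented behavior (a bare % ends the command and the rest is fed to it as input), not cron's actual code, and split_cron_command is a made-up name.

```python
def split_cron_command(command_field):
    """Split a crontab command the way cron does, in simplified form.

    Returns (command, stdin_text). A backslash-escaped percent sign stays a
    literal percent sign; the first bare one ends the command, and any later
    bare ones become newlines in the text piped to the command.
    """
    command, stdin_text = [], []
    target = command
    i = 0
    while i < len(command_field):
        if command_field[i:i + 2] == "\\%":
            target.append("%")        # escaped: keep a literal percent sign
            i += 2
        elif command_field[i] == "%":
            if target is command:
                target = stdin_text   # first bare %: everything after is input
            else:
                target.append("\n")   # later bare %: newline in the input
            i += 1
        else:
            target.append(command_field[i])
            i += 1
    return "".join(command), "".join(stdin_text)


escaped = r'curl http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt'
bare = 'curl http://www.nytimes.com > ~/nyt-`date +"%s"`.txt'
print(split_cron_command(escaped)[0])  # the whole command, with a plain %s
print(split_cron_command(bare)[0])     # cut off right after date +"
```

With the bare version the shell would receive a command that stops in the middle of the backticks, which is why the job fails without the escape.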
Save/exit.
How would we know if we had an error? Let’s wait a few minutes. If you’re getting antsy about when the next divisible-by-two minute will arrive, type date.
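The */2 in the first column means "every minute that is divisible by two". A minimal sketch of that rule in Python, handling only the three simplest forms of the minute field ("*", "*/N" and a plain number); real cron accepts more, and the function names here are invented for the example:

```python
from datetime import datetime, timedelta


def minute_matches(field, minute):
    """Return True if a cron minute field such as "*", "*/2" or "15" matches."""
    if field == "*":
        return True
    if field.startswith("*/"):
        step = int(field[2:])
        return minute % step == 0
    return int(field) == minute


def next_run(field, now):
    """Find the next whole minute after `now` that the field matches."""
    candidate = now.replace(second=0, microsecond=0)
    for _ in range(60):  # every possible minute value comes up within an hour
        candidate += timedelta(minutes=1)
        if minute_matches(field, candidate.minute):
            return candidate
    raise ValueError("minute field never matches: " + field)


print(next_run("*/2", datetime(2016, 7, 1, 9, 31, 20)))  # 2016-07-01 09:32:00
print(next_run("*/2", datetime(2016, 7, 1, 9, 32, 5)))   # 2016-07-01 09:34:00
```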
Checking your mail and silencing curl
Eventually the machine says You have new mail.
Check it with… mail.
mail
The most recent message is already selected, so just hit enter. Oh, look, the output of curl, how sweet.
Exit with x.
Whenever cron runs a command, any output gets mailed to you. We… don’t really want this. The simplest way for now is to make curl silent by using the --silent flag.
crontab -e
Then change your line to have --silent with curl.
*/2 * * * * curl --silent http://www.nytimes.com > ~/nyt-`date +"\%s"`.txt
Now you won’t be hassled any more.
This mail is also how you’ll know if you have an error in your code.
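The same idea carries over to a Python job: since cron mails whatever a job prints, a script that stays quiet on success and prints only on failure will only generate mail when something goes wrong. A minimal sketch, where run_quietly and pretend_scrape are hypothetical names and pretend_scrape stands in for real scraping code:

```python
import sys


def run_quietly(job):
    """Run job(); print nothing on success, one line on stderr on failure."""
    try:
        job()
    except Exception as error:
        # This line is the only output, so it is the only thing cron would mail
        print("scraper failed: {!r}".format(error), file=sys.stderr)
        return 1
    return 0


def pretend_scrape():
    # Hypothetical stand-in for the real scraping code; it always fails
    raise RuntimeError("could not reach the site")


if __name__ == "__main__":
    exit_code = run_quietly(pretend_scrape)  # prints one line to stderr, returns 1
```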
Transferring files
Now we want to take our scraper.py and transfer it from our local machine to our server. We’ll be copying it, which is usually cp, but because we’re doing it over ssh it’s scp.
From your local machine, run:
scp -i ~/.ssh/do-droplet scraper.py root@YOUR_IP:~/
This will send a file from your local machine to your remote machine, while logging in with your identity file (private key).
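If you end up copying files often, the pieces of that scp command can be assembled in Python. This is only a sketch: scp_upload_command is a made-up helper, YOUR_IP is the same placeholder used above, and nothing is actually run until you hand the list to something like subprocess.run.

```python
import os


def scp_upload_command(identity_file, local_path, host, remote_dir="~/", user="root"):
    """Build the argument list for: scp -i IDENTITY LOCAL USER@HOST:REMOTE_DIR"""
    return [
        "scp",
        "-i", os.path.expanduser(identity_file),  # private key (our identity)
        local_path,                               # file on the local machine
        "{}@{}:{}".format(user, host, remote_dir),  # destination on the server
    ]


command = scp_upload_command("~/.ssh/do-droplet", "scraper.py", "YOUR_IP")
print(" ".join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

The local ~ is expanded in Python because no shell is involved when a list is passed to subprocess.run; the ~ in the destination is left alone for the server to interpret.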
Setting up Python
Back on your server, try to run your script using python3.
python3 scraper.py
In return, you get a joyous error message!
Traceback (most recent call last):
  File "scraper.py", line 6, in <module>
    import requests
ImportError: No module named 'requests'
Okay, well then we’ll install it…
pip3 install requests
But then, another error!
The program 'pip3' is currently not installed. You can install it by typing:
apt-get install python3-pip
DO NOT INSTALL PIP WITH THAT. It will install a bad, old version. We can get the new, cool version with
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
Run pip --version to see if it talks about Python 3. That’s what we’re hoping.
Before we spend forever running into errors, there are a lot more dependencies that need to be installed. Dependencies are pieces of code that other code… depends on, and it isn’t always Python code. Use the commands below to install a whole bushel of dependencies. Some are probably already installed with python3-pip.
apt-get install build-essential
apt-get install python3-dev
apt-get install python3-numpy
apt-get install python3-scipy
apt-get install libatlas-dev
apt-get install ipython3
apt-get install python3-pandas
apt-get install libxml2-dev libxslt1-dev
apt-get install python3-matplotlib
So now let’s install requests.
pip3 install requests
Now we’ll run the app again…
python3 scraper.py
But then another error…
Traceback (most recent call last):
  File "scraper.py", line 7, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named 'bs4'
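Rather than discovering missing modules one traceback at a time, you can ask Python up front which imports would fail. A minimal sketch using the standard library; missing_modules is a made-up name, and the list passed in is simply the three third-party imports at the top of scraper.py:

```python
import importlib.util


def missing_modules(names):
    """Return the module names that Python cannot find on this machine."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The third-party imports from scraper.py
print(missing_modules(["requests", "bs4", "pandas"]))
```

Keep in mind that the name you import is not always the name you install: bs4, for example, comes from the beautifulsoup4 package.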
This is where virtual environments get magic. Skip down below to create a requirements.txt file, then come back up here.
Once you have a requirements.txt on your server, you can just tell pip to install everything in that list.
pip3 install -r requirements.txt
If you need to install packages for postgres or something, you might want to search and see what’s available using apt-cache search:
apt-cache search psql
apt-cache search postgres
Other commands
scp -i ~/.ssh/do-droplet root@107.170.60.133:~/*.csv .
Creating a requirements.txt file
In the virtual environment where scraper.py runs, run the following command:
pip freeze -l
This lists every package in your virtual environment, along with its version number. Save the list to a file with the following command:
pip freeze -l > requirements.txt
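The resulting file is plain text with one name==version pair per line, which is what lets pip recreate the same set of packages elsewhere. A minimal sketch of reading that format in Python; parse_requirements is a made-up helper, and the package versions in the sample are purely illustrative, not what your own freeze will print:

```python
def parse_requirements(text):
    """Turn name==version lines into a {name: version} dictionary."""
    pinned = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "==" in line:
            name, version = line.split("==", 1)
            pinned[name] = version
    return pinned


# Illustrative contents only; your versions will differ
sample = """
beautifulsoup4==4.4.1
pandas==0.18.1
requests==2.10.0
"""
for name, version in sorted(parse_requirements(sample).items()):
    print(name, version)
```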
You can cat the file if you’d like. Now scp it to your server with
scp -i ~/.ssh/do-droplet requirements.txt root@YOUR_IP:~/
Now you can return to your server up above.