Do-Now

1) What’s the output of the following lines?

name = "Tara"
print("hello", "name")
print("hello", name)

Output: hello name hello Tara

2) What is the output of the following code? Which lines give you errors?

borough_name = 'Manhattan'
z = [ 'Manhattan', 'Queens' ]
x = { 'borough_name': 'Manhattan', 'population': 500 }
y = {
'Manhattan': 500,
'Queens': 200
}

print(x['borough_name'])  # Manhattan
print(x[borough_name])    # KeyError; x has no key named 'Manhattan'
print(x[0])               # KeyError; x is a dictionary with no key 0
print(y['borough_name'])  # KeyError; y has no key named 'borough_name'
print(y[borough_name])    # 500
print(y[0])               # KeyError; y is a dictionary with no key 0
print(z['borough_name'])  # TypeError; z is a list, so its indices must be integers
print(z[borough_name])    # TypeError; z is a list, so its indices must be integers
print(z[0])               # Manhattan

Errors and output are noted in comments above.
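
Because the first error stops the script, one way to see every result in a single run is to wrap each lookup in try/except. This is only a sketch: try_lookup is a helper name made up for this example, not something from class.

```python
borough_name = 'Manhattan'
z = ['Manhattan', 'Queens']
y = {'Manhattan': 500, 'Queens': 200}


def try_lookup(container, key):
    """Return container[key], or the name of the error the lookup raises."""
    try:
        return container[key]
    except (KeyError, IndexError, TypeError) as error:
        return type(error).__name__


print(try_lookup(y, borough_name))    # 500
print(try_lookup(y, 'borough_name'))  # KeyError
print(try_lookup(y, 0))               # KeyError
print(try_lookup(z, 0))               # Manhattan
print(try_lookup(z, borough_name))    # TypeError
```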

3) We have a list of integers called numbers. Loop through them but instead of printing the number print a number 100 times as big as the original number.

numbers = [1, 2, 3, 4]
for number in numbers:
    print(number * 100)

Output: 100 200 300 400

4a) Given the following, write code to calculate how many murders we have in total.

murders = { 'Albany': 23, 'Kings County': 10, 'Rochester': 7, 'Yonkers': 9 }

Possible solution:

total_murders = murders['Albany'] + murders['Kings County'] + murders['Rochester'] + murders['Yonkers']

Another possible way:

total_murders = 0
cities = ['Albany', 'Kings County', 'Rochester', 'Yonkers']
for city in cities:
    # print(murders[city])
    # instead of printing, let's keep a running count
    total_murders = total_murders + murders[city]
print(total_murders)

Yet another way:

total_murders = 0
for city in murders.keys():
    total_murders = total_murders + murders[city]
print(total_murders)

Another: total_murders = sum(murders.values())

4b) Write code to calculate the percentage of the murders that happened in Kings County.

# Multiply the fraction by 100 to turn it into a percentage
print(murders['Kings County'] / total_murders * 100)
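
Putting 4a and 4b together, a compact version might look like the sketch below. The variable name kings_share and the one-decimal rounding are choices made for this example only.

```python
murders = {'Albany': 23, 'Kings County': 10, 'Rochester': 7, 'Yonkers': 9}

# Add up every value in the dictionary
total_murders = sum(murders.values())

# Fraction of the total, scaled to a percentage
kings_share = murders['Kings County'] / total_murders * 100

print(total_murders)          # 49
print(f"{kings_share:.1f}%")  # 20.4%
```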

APIs

Start a new Python script (name it as you please) and start a virtual environment in the directory where that file is located and call it ‘apis’ (mkvirtualenv apis).

  • An aside on why we’re using a virtual environment: it gives each project its own set of installed packages, so when you upgrade a package for a future project, it doesn’t break a project from the past.
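
If you are ever unsure whether the environment is active, a quick check from inside Python can help. This is a rough sketch, and in_virtual_environment is a name invented for it. It relies on sys.prefix differing from sys.base_prefix inside an environment, with a fallback for older virtualenv releases that set sys.real_prefix.

```python
import sys


def in_virtual_environment():
    """Return True when this interpreter is running inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment's folder,
    # while sys.base_prefix points at the Python it was created from.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print(in_virtual_environment())
print(sys.prefix)  # where packages for this interpreter get installed
```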

API stands for application programming interface.

Importing requests

We’re going to use a library called requests to work with APIs. You don’t have to write hundreds of lines of code to make a request to a website, because someone else has already written that code (you can even view the code for requests on GitHub).

(We’re going to be using the following 3 lines of code over and over again! Well, we’ll change the URL we pass, but mostly use variations of the lines below.)

import requests

response = requests.get('http://www.nytimes.com')
print(response.text)

But this won’t work until we install the requests package on our computer: run pip3 install requests (make sure you are in your virtual environment when you do this).

A scraper teaches your computer how to interpret what it gets back in response.text from a website. An API gives you data from a source in a format that your computer can already understand. APIs usually come with documentation that tells you how their data is structured.

Example: Let’s go check out the Spotify API

import requests
response = requests.get('https://api.spotify.com/v1/search?query=80s&type=playlist')
  • API Endpoint: This just means ‘URL that your computer can visit’
    • Everything to the right of the ? is the query string, which holds the parameters (here, query=80s and type=playlist)
    • To the left, we have the base URL of the endpoint
  • API Endpoint Reference: This tells us all the parameters we can use. It is usually a good first stop when you begin working with a particular API, so read the documentation!
data = response.json()
print(data.keys())
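
To see the two halves of the endpoint URL for yourself, you can take it apart with Python’s built-in urllib.parse module. This sketch makes no request; it only splits the string used above. The names endpoint, params and rebuilt are just labels chosen for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = 'https://api.spotify.com/v1/search?query=80s&type=playlist'

parts = urlsplit(url)

# Everything to the left of the ?
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
# Everything to the right of the ?, parsed into a dictionary of lists
params = parse_qs(parts.query)

print(endpoint)  # https://api.spotify.com/v1/search
print(params)    # {'query': ['80s'], 'type': ['playlist']}

# Going the other way: build the query string from a dictionary
rebuilt = endpoint + '?' + urlencode({'query': '80s', 'type': 'playlist'})
print(rebuilt == url)  # True
```

requests can do the rebuilding step for you: requests.get accepts a params argument, so you can pass the dictionary instead of typing the query string by hand.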

Note that we are assigning response.json() to data, rather than response.text. Given the format of the data we get back from Spotify, we know (now) that we need to interpret it as JSON (which stands for JavaScript Object Notation)!
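
The difference between the two is easy to see without touching the network. In the sketch below, sample_text is a made-up, heavily trimmed stand-in for what response.text might contain, and json.loads does roughly what response.json() does for us: it turns the text into Python dictionaries and lists.

```python
import json

# Hypothetical stand-in for response.text (real responses are much longer)
sample_text = '{"playlists": {"items": [{"name": "80s Hits", "href": "https://example.com/1"}]}}'

print(type(sample_text))  # <class 'str'>: just one long string

data = json.loads(sample_text)
print(type(data))         # <class 'dict'>: now we can use keys
print(list(data.keys()))  # ['playlists']
print(data['playlists']['items'][0]['name'])  # 80s Hits
```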

Whenever we get new “data”, we’re going to use our tried and tested methods to “poke through” it and get a better understanding of what that data actually looks like. A few of our favorite ways of investigating the data:

print(type(data['playlists']))  # Tells us it's a dictionary
# So we know we can print its keys
print(data['playlists'].keys())  # we know we want the items key!

playlists = data['playlists']['items']
print(type(playlists))  # it's a list

for playlist in playlists:
    print(playlist['name'], playlist['href'])

Print, print, print! And for APIs, you can always use the documentation to guide your investigations.
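
If you find yourself typing the same type() and keys() checks over and over, you could bundle them into a small helper. This is only a sketch: describe is a name invented here, and sample is a made-up stand-in shaped like the Spotify data above, not real API output.

```python
def describe(value, label="value"):
    """Print and return a one-line summary: type, plus keys or length."""
    if isinstance(value, dict):
        summary = f"{label}: dict with keys {sorted(value.keys())}"
    elif isinstance(value, list):
        summary = f"{label}: list of {len(value)} items"
    else:
        summary = f"{label}: {type(value).__name__} = {value!r}"
    print(summary)
    return summary


# Hypothetical sample shaped like the search response
sample = {
    "playlists": {
        "href": "https://example.com/search",
        "items": [
            {"name": "80s Hits", "href": "https://example.com/1"},
            {"name": "80s Rock", "href": "https://example.com/2"},
        ],
    }
}

describe(sample, "data")
describe(sample["playlists"], "data['playlists']")
describe(sample["playlists"]["items"], "items")
describe(sample["playlists"]["items"][0]["name"], "name")
```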

API authorization

  • Some APIs require authorization to access their data. This is the case for some of Spotify’s data, which uses an authorization protocol called OAuth. Basically, they want to know who you are.
  • Other APIs have simplified their authentication process. They’ll issue you an API key, which serves as a unique identifier that, again, lets them know who is accessing their data.
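
As a rough illustration of the two styles, the sketch below builds (but never sends) one request of each kind. Everything specific in it is an assumption: api.example.com is a placeholder, the api_key parameter name varies from API to API, and the key and token are dummy strings. Check each API’s documentation for the real details.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_KEY = "YOUR_API_KEY"            # placeholder, issued by the API provider
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder, obtained through an OAuth flow

# API-key style: the key often travels as one more query parameter
# (the parameter name 'api_key' is an assumption for this example)
key_url = "https://api.example.com/v1/search?" + urlencode(
    {"q": "80s", "api_key": API_KEY}
)
print(key_url)

# OAuth style: the token commonly travels in an Authorization header
oauth_request = Request(
    "https://api.example.com/v1/search?q=80s",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(oauth_request.get_header("Authorization"))
```

With requests, the same ideas would usually be expressed through the params and headers arguments of requests.get.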

Starting exploratory data analysis

We’re going to move to IPython/Jupyter notebooks to look through our Spotify data. Jupyter notebooks are particularly great for exploratory data analysis. Instead of running our entire script each time we make a small change, Jupyter notebooks allow us to run our code in smaller cells.

A note for exploratory data analysis: don’t just blindly poke around your data. Instead, form a question and try to get the answer out of your data.

Here’s what we did in class: