Multiple pages of data from APIs

Sometimes when you’re dealing with an API, it doesn’t give you all of the results it knows about at once.

For example, let’s use the Star Wars API to search for everyone with the letter a in their name.

import requests

response = requests.get("https://swapi.co/api/people/?search=a")
data = response.json()
data

{'count': 60,
 'next': 'https://swapi.co/api/people/?search=a&page=2',
 'previous': None,
 'results': [{'birth_year': '19BBY',
   'created': '2014-12-09T13:50:51.644000Z',
   'edited': '2014-12-20T21:17:56.891000Z',
   'eye_color': 'blue',
   'films': ['https://swapi.co/api/films/2/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/3/',
    'https://swapi.co/api/films/1/',
    'https://swapi.co/api/films/7/'],
   'gender': 'male',
   'hair_color': 'blond',
   'height': '172',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'mass': '77',
   'name': 'Luke Skywalker',
   'skin_color': 'fair',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': ['https://swapi.co/api/starships/12/',
    'https://swapi.co/api/starships/22/'],
   'url': 'https://swapi.co/api/people/1/',
   'vehicles': ['https://swapi.co/api/vehicles/14/',
    'https://swapi.co/api/vehicles/30/']},
  {'birth_year': '41.9BBY',
   'created': '2014-12-10T15:18:20.704000Z',
   'edited': '2014-12-20T21:17:50.313000Z',
   'eye_color': 'yellow',
   'films': ['https://swapi.co/api/films/2/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/3/',
    'https://swapi.co/api/films/1/'],
   'gender': 'male',
   'hair_color': 'none',
   'height': '202',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'mass': '136',
   'name': 'Darth Vader',
   'skin_color': 'white',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': ['https://swapi.co/api/starships/13/'],
   'url': 'https://swapi.co/api/people/4/',
   'vehicles': []},
  {'birth_year': '19BBY',
   'created': '2014-12-10T15:20:09.791000Z',
   'edited': '2014-12-20T21:17:50.315000Z',
   'eye_color': 'brown',
   'films': ['https://swapi.co/api/films/2/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/3/',
    'https://swapi.co/api/films/1/',
    'https://swapi.co/api/films/7/'],
   'gender': 'female',
   'hair_color': 'brown',
   'height': '150',
   'homeworld': 'https://swapi.co/api/planets/2/',
   'mass': '49',
   'name': 'Leia Organa',
   'skin_color': 'light',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': [],
   'url': 'https://swapi.co/api/people/5/',
   'vehicles': ['https://swapi.co/api/vehicles/30/']},
  {'birth_year': '52BBY',
   'created': '2014-12-10T15:52:14.024000Z',
   'edited': '2014-12-20T21:17:50.317000Z',
   'eye_color': 'blue',
   'films': ['https://swapi.co/api/films/5/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/1/'],
   'gender': 'male',
   'hair_color': 'brown, grey',
   'height': '178',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'mass': '120',
   'name': 'Owen Lars',
   'skin_color': 'light',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': [],
   'url': 'https://swapi.co/api/people/6/',
   'vehicles': []},
  {'birth_year': '47BBY',
   'created': '2014-12-10T15:53:41.121000Z',
   'edited': '2014-12-20T21:17:50.319000Z',
   'eye_color': 'blue',
   'films': ['https://swapi.co/api/films/5/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/1/'],
   'gender': 'female',
   'hair_color': 'brown',
   'height': '165',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'mass': '75',
   'name': 'Beru Whitesun lars',
   'skin_color': 'light',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': [],
   'url': 'https://swapi.co/api/people/7/',
   'vehicles': []},
  {'birth_year': '24BBY',
   'created': '2014-12-10T15:59:50.509000Z',
   'edited': '2014-12-20T21:17:50.323000Z',
   'eye_color': 'brown',
   'films': ['https://swapi.co/api/films/1/'],
   'gender': 'male',
   'hair_color': 'black',
   'height': '183',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'mass': '84',
   'name': 'Biggs Darklighter',
   'skin_color': 'light',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': ['https://swapi.co/api/starships/12/'],
   'url': 'https://swapi.co/api/people/9/',
   'vehicles': []},
  {'birth_year': '57BBY',
   'created': '2014-12-10T16:16:29.192000Z',
   'edited': '2014-12-20T21:17:50.325000Z',
   'eye_color': 'blue-gray',
   'films': ['https://swapi.co/api/films/2/',
    'https://swapi.co/api/films/5/',
    'https://swapi.co/api/films/4/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/3/',
    'https://swapi.co/api/films/1/'],
   'gender': 'male',
   'hair_color': 'auburn, white',
   'height': '182',
   'homeworld': 'https://swapi.co/api/planets/20/',
   'mass': '77',
   'name': 'Obi-Wan Kenobi',
   'skin_color': 'fair',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': ['https://swapi.co/api/starships/48/',
    'https://swapi.co/api/starships/59/',
    'https://swapi.co/api/starships/64/',
    'https://swapi.co/api/starships/65/',
    'https://swapi.co/api/starships/74/'],
   'url': 'https://swapi.co/api/people/10/',
   'vehicles': ['https://swapi.co/api/vehicles/38/']},
  {'birth_year': '41.9BBY',
   'created': '2014-12-10T16:20:44.310000Z',
   'edited': '2014-12-20T21:17:50.327000Z',
   'eye_color': 'blue',
   'films': ['https://swapi.co/api/films/5/',
    'https://swapi.co/api/films/4/',
    'https://swapi.co/api/films/6/'],
   'gender': 'male',
   'hair_color': 'blond',
   'height': '188',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'mass': '84',
   'name': 'Anakin Skywalker',
   'skin_color': 'fair',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': ['https://swapi.co/api/starships/59/',
    'https://swapi.co/api/starships/65/',
    'https://swapi.co/api/starships/39/'],
   'url': 'https://swapi.co/api/people/11/',
   'vehicles': ['https://swapi.co/api/vehicles/44/',
    'https://swapi.co/api/vehicles/46/']},
  {'birth_year': '64BBY',
   'created': '2014-12-10T16:26:56.138000Z',
   'edited': '2014-12-20T21:17:50.330000Z',
   'eye_color': 'blue',
   'films': ['https://swapi.co/api/films/6/', 'https://swapi.co/api/films/1/'],
   'gender': 'male',
   'hair_color': 'auburn, grey',
   'height': '180',
   'homeworld': 'https://swapi.co/api/planets/21/',
   'mass': 'unknown',
   'name': 'Wilhuff Tarkin',
   'skin_color': 'fair',
   'species': ['https://swapi.co/api/species/1/'],
   'starships': [],
   'url': 'https://swapi.co/api/people/12/',
   'vehicles': []},
  {'birth_year': '200BBY',
   'created': '2014-12-10T16:42:45.066000Z',
   'edited': '2014-12-20T21:17:50.332000Z',
   'eye_color': 'blue',
   'films': ['https://swapi.co/api/films/2/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/3/',
    'https://swapi.co/api/films/1/',
    'https://swapi.co/api/films/7/'],
   'gender': 'male',
   'hair_color': 'brown',
   'height': '228',
   'homeworld': 'https://swapi.co/api/planets/14/',
   'mass': '112',
   'name': 'Chewbacca',
   'skin_color': 'unknown',
   'species': ['https://swapi.co/api/species/3/'],
   'starships': ['https://swapi.co/api/starships/10/',
    'https://swapi.co/api/starships/22/'],
   'url': 'https://swapi.co/api/people/13/',
   'vehicles': ['https://swapi.co/api/vehicles/19/']}]}

It looks like a lot of stuff, but let’s examine it a little more closely. How many results are there, really?

data['count']

Okay, cool, 60 results! Let’s loop through them.

for person in data['results']:
    print(person['name'])

Luke Skywalker
Darth Vader
Leia Organa
Owen Lars
Beru Whitesun lars
Biggs Darklighter
Obi-Wan Kenobi
Anakin Skywalker
Wilhuff Tarkin
Chewbacca

Wait a second, that’s not 60 people! It’s… a lot less.

len(data['results'])

It’s… it’s 10! How do we only have 10 results if data['count'] says we should have 60?

Pagination in an API

Most APIs that allow you to search only return some of the results at a time. In this case, you get 10 results at a time, even though there are 60 total. But, to be helpful, each response comes with a next key that tells you where to find more.

print(data['next'])

https://swapi.co/api/people/?search=a&page=2

All we need to do to get page 2 is to make a request to that page…

response = requests.get("https://swapi.co/api/people/?search=a&page=2")
data = response.json()

for person in data['results']:
    print(person['name'])

Han Solo
Jabba Desilijic Tiure
Wedge Antilles
Yoda
Palpatine
Boba Fett
Lando Calrissian
Ackbar
Mon Mothma
Arvel Crynyd

…and we get everyone who is on that second page.

Remember how our data['next'] on page 1 gave us the URL to page 2? On page 2, data['next'] will also point to the next page, page 3.

print(data['next'])

https://swapi.co/api/people/?search=a&page=3

If we keep going and going and going, eventually the next page doesn’t exist any more. In this case, it happens on page 6.

response = requests.get("https://swapi.co/api/people/?search=a&page=6")
data = response.json()

print(data['next'])

None

When data['next'] is None, we’re finally at the end.

How does this work when getting data from an API, though? Are we supposed to keep changing the page number time after time by hand?

No!

There’s an easier way.

Scraping all of the pages at once

Technically, there are two easier ways to do this, not just one. The first way involves a cool new kind of loop called a while loop, while the second uses a normal for loop.

METHOD ONE: `while` loop

A while loop is kind of like an if statement. For example, maybe we’re wondering if we need to get a second page of results:

# Grab the search results
print("Downloading the original search results")
response = requests.get("https://swapi.co/api/people/?search=a")
data = response.json()

# If data['next'] isn't empty, let's download the next page, too
if data['next'] is not None:
    print("Next page found, downloading", data['next'])
    response = requests.get(data['next'])
    data = response.json()

Downloading the original search results
Next page found, downloading https://swapi.co/api/people/?search=a&page=2

The way a while loop works is that it keeps repeating its block for as long as the condition is True, and stops once the condition becomes False. if does something once at most, and while does something over and over, maybe forever.

So in this case, it’s going to keep downloading pages as long as data['next'] is not None. In other words, it will only stop when data['next'] is empty.
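To see the idea without touching the network, here is a minimal sketch that follows next links through a made-up set of pages. The pages dictionary and its page names are hypothetical stand-ins for real responses, not something the API returns.

```python
# Hypothetical stand-in for paginated responses: each page points at the
# next one, and the last page has 'next' set to None
pages = {
    "page1": {"next": "page2"},
    "page2": {"next": "page3"},
    "page3": {"next": None},
}

data = pages["page1"]
visited = ["page1"]

# Keep following 'next' for as long as there is one
while data["next"] is not None:
    visited.append(data["next"])
    data = pages[data["next"]]

print(visited)
# ['page1', 'page2', 'page3']
```

The loop stops by itself on page3, because that is the first page where next is None.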

Let’s change our if to while:

# Grab the search results
print("Downloading the original search results")
response = requests.get("https://swapi.co/api/people/?search=a")
data = response.json()

# While data['next'] isn't empty, let's download the next page, too
while data['next'] is not None:
    print("Next page found, downloading", data['next'])
    response = requests.get(data['next'])
    data = response.json()

Downloading the original search results
Next page found, downloading https://swapi.co/api/people/?search=a&page=2
Next page found, downloading https://swapi.co/api/people/?search=a&page=3
Next page found, downloading https://swapi.co/api/people/?search=a&page=4
Next page found, downloading https://swapi.co/api/people/?search=a&page=5
Next page found, downloading https://swapi.co/api/people/?search=a&page=6

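There’s one catch: every new page overwrites data, so we’re throwing away the results from the earlier pages. We just need one small change - let’s make an empty list called total_results and keep adding data['results'] to it each time.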

# Start with an empty list
total_results = []

# Grab the search results
print("Downloading the original search results")
response = requests.get("https://swapi.co/api/people/?search=a")
data = response.json()

# Store the first page of results
total_results = total_results + data['results']

# While data['next'] isn't empty, let's download the next page, too
while data['next'] is not None:
    print("Next page found, downloading", data['next'])
    response = requests.get(data['next'])
    data = response.json()
    # Store the current page of results
    total_results = total_results + data['results']

print("We have", len(total_results), "total results")

Downloading the original search results
Next page found, downloading https://swapi.co/api/people/?search=a&page=2
Next page found, downloading https://swapi.co/api/people/?search=a&page=3
Next page found, downloading https://swapi.co/api/people/?search=a&page=4
Next page found, downloading https://swapi.co/api/people/?search=a&page=5
Next page found, downloading https://swapi.co/api/people/?search=a&page=6
We have 60 total results
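If you find yourself doing this a lot, the loop above could be wrapped in a function. This is only a sketch under a few assumptions: collect_all_pages, get_json and max_pages are names made up for this example, and fake_pages is a hypothetical stand-in for the API so the code runs without a network connection. The max_pages cap is there so the loop gives up eventually, even if next never becomes None.

```python
def collect_all_pages(first_url, get_json, max_pages=100):
    """Follow 'next' links starting at first_url and return every result.

    get_json is any function that takes a URL and returns the parsed JSON
    dictionary for that page. max_pages is a safety cap on page downloads.
    """
    total_results = []
    url = first_url
    pages_seen = 0

    # Stop when there is no next page, or when we hit the safety cap
    while url is not None and pages_seen < max_pages:
        data = get_json(url)
        total_results.extend(data['results'])
        url = data['next']
        pages_seen += 1

    return total_results


# Hypothetical stand-in for the API: two tiny "pages" stored in a dictionary
fake_pages = {
    "page-1": {"results": ["Luke", "Leia"], "next": "page-2"},
    "page-2": {"results": ["Han"], "next": None},
}

everyone = collect_all_pages("page-1", fake_pages.get)
print(everyone)
# ['Luke', 'Leia', 'Han']
```

With the real API you would pass in a function that downloads the page, something like lambda url: requests.get(url).json(), instead of fake_pages.get.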

METHOD TWO: `for` loop and `range`

I think while loops can be trouble because if you write them wrong, your program might run forever! This is pretty bad!

If you know how many pages you need to go through, though, you can use a for loop instead.

In this case, we know we need to get everything between page 1 and page 6.

https://swapi.co/api/people/?search=a&page=1
https://swapi.co/api/people/?search=a&page=2
https://swapi.co/api/people/?search=a&page=3
https://swapi.co/api/people/?search=a&page=4
https://swapi.co/api/people/?search=a&page=5
https://swapi.co/api/people/?search=a&page=6

A boring way to do this is to make a list of numbers, and loop through it.

for page_num in [1, 2, 3, 4, 5, 6]:
    url = f"https://swapi.co/api/people/?search=a&page={page_num}"
    print(url)

https://swapi.co/api/people/?search=a&page=1
https://swapi.co/api/people/?search=a&page=2
https://swapi.co/api/people/?search=a&page=3
https://swapi.co/api/people/?search=a&page=4
https://swapi.co/api/people/?search=a&page=5
https://swapi.co/api/people/?search=a&page=6

If that’s too much typing, Python can also help out. The range function will automatically generate the numbers for you.

range(6) counts 0, 1, 2, 3, 4, 5, so you can either do + 1 on each number or use range(1, 7) to count 1, 2, 3, 4, 5, 6. The second number is where range stops, and it isn’t included.
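In Python 3, range hands back a range object rather than an actual list, so printing it directly won’t show the numbers. A quick way to peek inside is to wrap it in list(). The variable names here are just for illustration.

```python
# Wrap range in list() to see the numbers it produces
zero_based = list(range(6))
one_based = list(range(1, 7))

print(zero_based)
# [0, 1, 2, 3, 4, 5]
print(one_based)
# [1, 2, 3, 4, 5, 6]
```

A for loop doesn’t need the list() part, it can loop over the range directly.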

for page_num in range(1, 7):
    url = f"https://swapi.co/api/people/?search=a&page={page_num}"
    print(url)

https://swapi.co/api/people/?search=a&page=1
https://swapi.co/api/people/?search=a&page=2
https://swapi.co/api/people/?search=a&page=3
https://swapi.co/api/people/?search=a&page=4
https://swapi.co/api/people/?search=a&page=5
https://swapi.co/api/people/?search=a&page=6

Once you have all of the page URLs, you can do what we did before - each time through the loop, request the page and take the results.

# Start with an empty list
total_results = []

# Loop through from pages 1 to 6
for page_num in range(1, 7):
    # Build the URL and download the results
    url = f"https://swapi.co/api/people/?search=a&page={page_num}"
    print("Downloading", url)
    response = requests.get(url)
    data = response.json()
    # Store the current page of results
    total_results = total_results + data['results']

print("We have", len(total_results), "total results")

Downloading https://swapi.co/api/people/?search=a&page=1
Downloading https://swapi.co/api/people/?search=a&page=2
Downloading https://swapi.co/api/people/?search=a&page=3
Downloading https://swapi.co/api/people/?search=a&page=4
Downloading https://swapi.co/api/people/?search=a&page=5
Downloading https://swapi.co/api/people/?search=a&page=6
We have 60 total results

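This might be easier to read, but there’s one problem: how do you know you have 6 pages? Honestly, nothing tells you automatically - you request the first page yourself, then calculate the number of pages from data['count'] and the number of results on each page. Here that’s 60 results at 10 per page, which makes 6 pages. It’s a little more work, but if it makes more sense to you, go for it.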
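That calculation can be sketched in a few lines. The numbers are the ones from the search above (60 total results, 10 on the first page), and the sketch assumes every page except possibly the last one holds the same number of results as the first. The variable names are made up for this example.

```python
import math

# Numbers from the first page of the search above
total_count = 60   # data['count']
page_size = 10     # len(data['results']) on the first page

# Round up so that a partly filled last page still counts as a page
page_count = math.ceil(total_count / page_size)
print(page_count)
# 6

# range stops one short of its second number, so add 1 to include the last page
page_numbers = list(range(1, page_count + 1))
print(page_numbers)
# [1, 2, 3, 4, 5, 6]
```

Rounding up matters when the count doesn’t divide evenly: 61 results at 10 per page would need 7 pages, not 6.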
