Caching geocoding results with joblib¶

Sometimes you have a long long list of addresses to geocode, and some addresses show up more than once. It would be a waste of time and money to geocode the same address multiple times! There are a few ways to only geocode unique addresses and skip the duplicates.

A simple, built-in one is lru_cache. You can read more about how that works on our lru_cache walkthrough. The only downside to lru_cache is that it's stored in memory: if you restart your notebook, you lose everything it memorized!
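As a rough illustration of the in-memory approach, here is a minimal sketch using functools.lru_cache with a fake geocoder. The calls list is only there for illustration so we can count how often the function body really runs, and the function returns a tuple rather than a dictionary so the cached value can't be changed by accident.

```python
from functools import lru_cache
import random

# Illustration only: records every address that actually reaches the function body
calls = []

@lru_cache(maxsize=None)
def geocode_address(address):
    calls.append(address)
    # Fake coordinates instead of a real geocoding request
    return (
        round(random.random() * 180 - 90, 4),
        round(random.random() * 360 - 180, 4),
    )

first = geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
again = geocode_address('123 Peanut Street, Philadelphia, PA')

print("Function body ran", len(calls), "times")
print("Repeat call returned the remembered answer:", first == again)
```

Everything lru_cache remembers lives in the running Python process, so it disappears when the kernel restarts.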

A more robust approach to saving geocoded responses uses joblib.Memory, which can store the API's results to your hard drive.

In [81]:

            
import random
import pandas as pd

Our sample geocoder¶

Our sample geocoder is called geocode_address. It takes an address, then returns a dictionary of latitude and longitude. Since I don't want to actually talk to a geocoding service, I'm just having it return a random number for both latitude and longitude.

In [82]:

            
def geocode_address(address):
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return data

We can tell when our function runs by seeing it print the address.

In [83]:

            
geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')

print("Done geocoding")

Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Geocoding 123 Peanut Street, Philadelphia, PA
Done geocoding

In the example above, we see that it geocoded 123 Peanut Street twice, because it printed it out twice.

Using `joblib.Memory` to cache responses¶

The cache decorator on a joblib.Memory object tells the function to remember its responses, so if you call it with the same arguments twice it won't have to re-run the code. The benefit over lru_cache is that the results are saved to your hard drive, so the cache still works even if you stop and restart your process.

We'll now adjust the code above to use this new setup! We need to do three things to make this work:

Import joblib
Create a folder to save the geocoded results
Attach the cache to the function

You might need to run pip install joblib first. If you get a ModuleNotFoundError, that is probably your solution!

In [85]:

            
from joblib import Memory

# I'm also saving this in a hidden directory called
# .cache so that you (probably) won't accidentally
# send it up to GitHub
memory = Memory('./.cache', verbose=0)

@memory.cache
def geocode_address(address):
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return data

Let's run our code and see how the geocoding works.

In [86]:

            
geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')

print("Done geocoding")

Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Done geocoding

Notice this time our geocoder only printed two addresses. This is because the second time the function sees 123 Peanut Street, Philadelphia, PA it remembers the previous answer.
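To see the hard-drive part in action, here is a small sketch that imitates a restart by building a second, brand-new Memory object that points at the same folder. The names fake_geocode, calls, before_restart and after_restart are made up for this illustration, and a temporary folder stands in for ./.cache so the example starts with an empty cache.

```python
import random
import tempfile
from joblib import Memory

# Throwaway folder standing in for './.cache'
cache_dir = tempfile.mkdtemp()

# Illustration only: records each time the function body really runs
calls = []

def fake_geocode(address):
    calls.append(address)
    return {
        'lat': round(random.random() * 180 - 90, 4),
        'lng': round(random.random() * 360 - 180, 4),
    }

# First "session": the result is computed and written to disk
before_restart = Memory(cache_dir, verbose=0).cache(fake_geocode)
first = before_restart('123 Peanut Street, Philadelphia, PA')

# Second "session": a new Memory object reading the same folder
after_restart = Memory(cache_dir, verbose=0).cache(fake_geocode)
second = after_restart('123 Peanut Street, Philadelphia, PA')

print("Function body ran", len(calls), "time(s)")
print("Same answer both times:", first == second)
```

If you ever want to start fresh, Memory objects have a clear() method that empties the folder.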

Using `joblib.Memory` with pandas dataframes¶

Sometimes your pandas dataframe has separate columns for street address, city, and state. Then, when you're geocoding, your function has to combine those columns into the address that you send to the geocoding service.

In this situation, joblib has yet another benefit over lru_cache! While lru_cache doesn't work easily with row-by-row versions of .apply, joblib only needs one tiny change.
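For comparison, this short sketch shows what goes wrong when an lru_cache function is handed a row. lru_cache needs to hash its arguments, and a pandas Series refuses to be hashed, so the call raises a TypeError. The function name build_address and the variable failed are invented for this example.

```python
from functools import lru_cache
import pandas as pd

@lru_cache(maxsize=None)
def build_address(row):
    return "{street} {city} {state}".format(**row)

# One row, shaped like what df.apply(..., axis=1) would pass along
row = pd.Series({'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'})

failed = False
try:
    build_address(row)
except TypeError as err:
    failed = True
    print("lru_cache could not handle the row:", err)
```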

Here is our sample dataframe:

In [87]:

            
df = pd.DataFrame([
    { 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    { 'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    { 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])
df

Out[87]:

	street	city	state
0	123 Peanut Street	Philadelphia	PA
1	3400 Walnut Road	Phoenix	AZ
2	123 Peanut Street	Philadelphia	PA

The problem with `memory.cache` and `df.apply`¶

Now we'll build our geocoder. It's similar to our original one, but this one takes a selection of columns and then formats it into an address suitable for geocoding. It then returns a pd.Series so that we can easily combine it with our original dataframe.

In [88]:

            
from joblib import Memory

memory = Memory('./.cache', verbose=0)

@memory.cache
def geocode_address(row):
    # Combine the columns into a string
    address = "{street} {city} {state}".format(**row)
    
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return pd.Series(data)

In [89]:

            
results = df.apply(geocode_address, axis=1)
df.join(results)

Geocoding 123 Peanut Street Philadelphia PA
Geocoding 3400 Walnut Road Phoenix AZ
Geocoding 123 Peanut Street Philadelphia PA

Out[89]:

	street	city	state	lat	lng
0	123 Peanut Street	Philadelphia	PA	-88.8052	49.5838
1	3400 Walnut Road	Phoenix	AZ	-56.6167	67.7750
2	123 Peanut Street	Philadelphia	PA	-23.7544	94.9965

Notice how it still printed out 123 Peanut Street twice, even though we used @memory.cache? Unlike lru_cache, which gives an error, this one fails silently. The cache is working, just not in the way we intended! There are two issues:

When you use .apply, you're passing the row data along with the index. The fact that the first address is on row 0 and the third one is on row 2 ends up mattering! To fix this, we just need to remove the index before we call the function.
Along with the columns you need for the address, you're also passing every other column in the row. If anything in those columns is different, then it won't reuse the cached result!
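You can get a feel for the first issue with joblib.hash. It is used here as a stand-in for the fingerprint joblib computes from a function's arguments, which is an approximation of what happens inside Memory rather than an exact reproduction. The variable names are made up for this sketch.

```python
import joblib
import pandas as pd

df = pd.DataFrame([
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    {'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])

row_0 = df.iloc[0]
row_2 = df.iloc[2]

# Each row remembers which index label it came from
print("Row labels:", row_0.name, "and", row_2.name)

# Hashing the whole row includes that label
rows_match = joblib.hash(row_0) == joblib.hash(row_2)

# Hashing only the plain values leaves the label out
values_match = joblib.hash(tuple(row_0)) == joblib.hash(tuple(row_2))

print("Whole rows hash the same:", rows_match)
print("Plain values hash the same:", values_match)
```

The two rows hold identical addresses, but as far as the cache is concerned they are different inputs until the index is out of the picture.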

Solution #1 to `joblib.Memory` and pandas dataframes¶

One fix involves specifying the columns we're interested in. This means changing both the def part of the function along with the way we use .apply.

In [90]:

            
                Copied!
                
                    
                    
                
                

        
from joblib import Memory

memory = Memory('./.cache', verbose=0)

# Changing geocode_address to accept three parameters
@memory.cache
def geocode_address(street, city, state):
    # Combine the columns into a string
    address = f"{street} {city} {state}"
    
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return pd.Series(data)

In [91]:

            
# Only send the necessary columns
cols = ['street', 'city', 'state']
results = df[cols].apply(lambda row: geocode_address(**row), axis=1)
df.join(results)

Geocoding 123 Peanut Street Philadelphia PA
Geocoding 3400 Walnut Road Phoenix AZ

Out[91]:

	street	city	state	lat	lng
0	123 Peanut Street	Philadelphia	PA	-0.4089	163.3373
1	3400 Walnut Road	Phoenix	AZ	-13.5517	-123.3103
2	123 Peanut Street	Philadelphia	PA	-0.4089	163.3373

And there we go! It's a little ugly, but it works perfectly.

Solution #2 to `joblib.Memory` and pandas dataframes¶

Alternatively, don't create the full address in the function. Instead, create the address as a new column, then send that to the function.
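A minimal sketch of that idea might look like the code below. It assumes the same fake geocoder as before, uses an address column name chosen for this example, and saves to a temporary folder instead of ./.cache so it starts with an empty cache. Because the function only ever receives a plain string, the row index and any extra columns never reach the cache.

```python
import random
import tempfile
import pandas as pd
from joblib import Memory

# Temporary folder for illustration; swap in './.cache' to keep results around
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def geocode_address(address):
    print("Geocoding", address)
    # Fake coordinates instead of a real geocoding request
    return {
        'lat': round(random.random() * 180 - 90, 4),
        'lng': round(random.random() * 360 - 180, 4),
    }

df = pd.DataFrame([
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    {'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])

# Build the full address as its own column
df['address'] = df['street'] + ' ' + df['city'] + ' ' + df['state']

# Send just that one column, value by value, to the cached function
geocoded = df['address'].apply(geocode_address)

# Turn the dictionaries into lat/lng columns that line up with df
results = pd.DataFrame(geocoded.tolist(), index=df.index)
combined = df.join(results)
print(combined)
```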
