Caching geocoding results with joblib¶
Sometimes you have a very long list of addresses to geocode, and some addresses show up more than once. It would be a waste of time and money to geocode the same address multiple times! There are a few ways to geocode only the unique addresses and skip the duplicates.
A simple, built-in one is lru_cache. You can read more about how that works on our lru_cache walkthrough. The only downside to lru_cache is that it's stored in memory: if you restart your notebook, you lose everything it memorized!
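For reference, here is a minimal sketch of the in-memory approach using functools.lru_cache from the standard library. The function name geocode_in_memory and the calls list are made up for this illustration, and the coordinates are random placeholders rather than real geocoding results.

```python
import random
from functools import lru_cache

calls = []  # hypothetical log of every address that actually reaches the "geocoder"

@lru_cache(maxsize=None)
def geocode_in_memory(address):
    # Stand-in for a real geocoding request (assumption: no real service is called)
    calls.append(address)
    return (round(random.uniform(-90, 90), 4), round(random.uniform(-180, 180), 4))

first = geocode_in_memory('123 Peanut Street, Philadelphia, PA')
second = geocode_in_memory('123 Peanut Street, Philadelphia, PA')

print(len(calls))       # 1: the function body only ran once
print(first == second)  # True: the repeat call reused the stored result
```

The stored results live inside the running Python process, which is why they vanish when the notebook restarts.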
A more robust approach to saving geocoded responses uses joblib.Memory, which can store the API's results to your hard drive.
import random
import pandas as pd
Our sample geocoder¶
Our sample geocoder is called geocode_address. It takes an address, then returns a dictionary of latitude and longitude. Since I don't want to actually talk to a geocoding service, I'm just having it return a random number for both latitude and longitude.
def geocode_address(address):
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return data
We can tell when our function runs by seeing it print the address.
geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')
print("Done geocoding")
Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Geocoding 123 Peanut Street, Philadelphia, PA
Done geocoding
In the example above, we see that it geocoded 123 Peanut Street twice, because it printed it out twice.
Using joblib.Memory to cache responses¶
The cache decorator on a joblib.Memory object (used as @memory.cache) tells the function to remember its responses, so if you ask it for the same thing twice it won't have to re-run the code. The benefit over lru_cache is that it saves the results to your hard drive, so the cache survives even if you stop and restart your process.
We'll now adjust the code above to use this new setup! We need to do three things to make this work:
- Import joblib
- Create a folder to save the geocoded results
- Attach the cache to the function
You might need to run pip install joblib to install joblib. If you get a ModuleNotFoundError, that is probably your solution!
from joblib import Memory

# I'm also saving this in a hidden directory called
# .cache so that you (probably) won't accidentally
# send it up to GitHub
memory = Memory('./.cache', verbose=0)

@memory.cache
def geocode_address(address):
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return data
Let's run our code and see how the geocoding works.
geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')
print("Done geocoding")
Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Done geocoding
Notice this time our geocoder only printed two addresses. This is because the second time the function sees 123 Peanut Street, Philadelphia, PA it remembers the previous answer.
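To see the on-disk part at work, here is a small sketch that builds two separate Memory objects pointing at the same folder, which roughly imitates stopping and restarting your process. The names fake_geocode, cached_before and cached_after are hypothetical, and the demo uses a throwaway temporary folder instead of ./.cache so it doesn't touch your real cache.

```python
import random
import tempfile
from joblib import Memory

cache_dir = tempfile.mkdtemp()  # throwaway folder, only for this demo
calls = []                      # hypothetical log of real "geocoder" runs

def fake_geocode(address):
    # Stand-in for a real geocoding request
    calls.append(address)
    return {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }

# First "session": the result is computed and written to cache_dir
cached_before = Memory(cache_dir, verbose=0).cache(fake_geocode)
first = cached_before('123 Peanut Street, Philadelphia, PA')

# Second "session": a brand-new Memory object reading the same folder
cached_after = Memory(cache_dir, verbose=0).cache(fake_geocode)
second = cached_after('123 Peanut Street, Philadelphia, PA')

print(len(calls))       # how many times the function body actually ran
print(first == second)  # whether the second answer came from the saved result
```

If the cache folder is doing its job, the function body runs once and the second call returns the saved coordinates.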
Using joblib.Memory with pandas dataframes¶
Sometimes your pandas dataframe has separate columns for street address, city, and state. Then when you're geocoding, your function has to combine them into the address that you send to the geocoding service.
In this situation, joblib has yet another benefit over lru_cache! While lru_cache doesn't work easily with row-by-row versions of .apply, joblib only needs one tiny change.
Here is our sample dataframe:
df = pd.DataFrame([
{ 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
{ 'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
{ 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])
df
| | street | city | state |
|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA |
| 1 | 3400 Walnut Road | Phoenix | AZ |
| 2 | 123 Peanut Street | Philadelphia | PA |
The problem with memory.cache and df.apply¶
Now we'll build our geocoder. It's similar to our original one, but this one takes a whole row and formats its columns into an address suitable for geocoding. It then returns a pd.Series so that we can easily combine it with our original dataframe.
from joblib import Memory

memory = Memory('./.cache', verbose=0)

@memory.cache
def geocode_address(row):
    # Combine the columns into a string
    address = "{street} {city} {state}".format(**row)
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return pd.Series(data)

results = df.apply(geocode_address, axis=1)
df.join(results)
Geocoding 123 Peanut Street Philadelphia PA
Geocoding 3400 Walnut Road Phoenix AZ
Geocoding 123 Peanut Street Philadelphia PA
| | street | city | state | lat | lng |
|---|---|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA | -88.8052 | 49.5838 |
| 1 | 3400 Walnut Road | Phoenix | AZ | -56.6167 | 67.7750 |
| 2 | 123 Peanut Street | Philadelphia | PA | -23.7544 | 94.9965 |
Notice how even though we used @memory.cache, it still printed out 123 Peanut Street twice? Unlike lru_cache, which gives an error, this one just silently fails. The cache is working, just not in the way we intended: it treats the two rows as different inputs. There are two issues:
- When you use .apply with axis=1, you're passing the row data along with the index. The fact that the first address is on row 0 and the third one is on row 2 ends up mattering! To fix this, we just need to remove the index before we call the function.
- In a real dataframe, along with all of the useful columns, you're usually also passing a bunch of other extra columns. If anything about those columns is different, then it won't cache the result!
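The first issue can be looked at in isolation with joblib.hash, used here only as a rough proxy for how Memory decides whether two inputs are "the same" (that framing is a simplification on my part, not a description of joblib's internals). The sketch rebuilds the sample dataframe and compares row 0 with row 2, first as full rows and then as plain values.

```python
import pandas as pd
from joblib import hash as joblib_hash

df = pd.DataFrame([
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    {'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])

row0 = df.iloc[0]
row2 = df.iloc[2]

# Each row carries its index label along as .name
print(row0.name, row2.name)

# The values themselves are identical...
print(list(row0) == list(row2))

# ...but the rows as whole objects don't hash the same
rows_match = joblib_hash(row0) == joblib_hash(row2)
print(rows_match)

# Stripping the rows down to plain values removes the difference
values_match = joblib_hash(tuple(row0)) == joblib_hash(tuple(row2))
print(values_match)
```

This is why both fixes come down to the same idea: hand the cached function plain values instead of a pandas row.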
Solution #1 to joblib.Memory and pandas dataframes¶
One fix involves specifying the columns we're interested in. This means changing both the def line of the function and the way we use .apply.
from joblib import Memory

memory = Memory('./.cache', verbose=0)

# Changing geocode_address to accept three parameters
@memory.cache
def geocode_address(street, city, state):
    # Combine the columns into a string
    address = f"{street} {city} {state}"
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return pd.Series(data)

# Only send the necessary columns
cols = ['street', 'city', 'state']
results = df[cols].apply(lambda row: geocode_address(**row), axis=1)
df.join(results)
Geocoding 123 Peanut Street Philadelphia PA
Geocoding 3400 Walnut Road Phoenix AZ
| | street | city | state | lat | lng |
|---|---|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA | -0.4089 | 163.3373 |
| 1 | 3400 Walnut Road | Phoenix | AZ | -13.5517 | -123.3103 |
| 2 | 123 Peanut Street | Philadelphia | PA | -0.4089 | 163.3373 |
And there we go! It's a little ugly, but it works perfectly.
Solution #2 to joblib.Memory
and pandas dataframes¶
Alternatively, don't create the full address in the function. Instead, create the address as a new column, then send that to the function.
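One possible sketch of this approach, under a few assumptions: the new column is named address (my choice), the cache goes to a temporary folder so the demo doesn't mix with ./.cache, and the results are expanded into lat and lng columns with pd.DataFrame. The geocoder is the same random placeholder as before, not a real service.

```python
import random
import tempfile
import pandas as pd
from joblib import Memory

# Temporary cache folder for the demo; a real project might use './.cache'
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def geocode_address(address):
    # Placeholder geocoder: random coordinates instead of a real API call
    print("Geocoding", address)
    return {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }

df = pd.DataFrame([
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    {'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])

# Build the full address as its own column (hypothetical column name)
df['address'] = df['street'] + ' ' + df['city'] + ' ' + df['state']

# The cached function now only ever sees a plain string
coords = df['address'].map(geocode_address).tolist()
results = pd.DataFrame(coords, index=df.index)
geocoded = df.join(results)
print(geocoded)
```

Because the function receives only the address string, neither the row index nor any extra columns can interfere with the cache, so rows 0 and 2 should end up with identical coordinates.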