Using lru_cache to avoid geocoding duplicate addresses
Sometimes you have a long list of addresses to geocode, and some addresses show up more than once. It would be a waste of time and money to geocode the same address multiple times! There are a few ways to geocode only the unique addresses and skip the duplicates, but one of the easiest involves a tool called lru_cache.
Another useful approach uses a library called joblib. It saves your cache to disk, so it even works across multiple runs! You can read about geocoding caching with joblib here.
Let's examine the problem and see why lru_cache is a great solution! We start by looking at how it works with simple functions, then use a pandas dataframe later on.
import random
import pandas as pd
Our sample geocoder
Our sample geocoder is called geocode_address. It takes an address, then returns a dictionary of latitude and longitude. Since I don't want to actually talk to a geocoding service, I'm just having it return a random number in the valid range for both latitude and longitude.
def geocode_address(address):
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return data
We can tell when our function runs by seeing it print the address.
geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')
print("Done geocoding")
Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Geocoding 123 Peanut Street, Philadelphia, PA
Done geocoding
In the example above, we see that it geocoded 123 Peanut Street twice, because it printed it out twice.
Using @functools.lru_cache to cache responses
@functools.lru_cache is a "decorator" that can be used with Python functions. It tells the function to remember its responses, so if you ask it for the same thing twice it won't have to re-run the code.
We will adjust the code above to include import functools at the top, and @functools.lru_cache before we declare our function. (Writing the decorator without parentheses requires Python 3.8 or newer; on older versions, use @functools.lru_cache() instead.)
import functools

@functools.lru_cache
def geocode_address(address):
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return data
Let's run our code and see how the geocoding works.
geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')
print("Done geocoding")
Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Done geocoding
Notice this time our geocoder only printed two addresses. This is because the second time the function sees 123 Peanut Street, Philadelphia, PA it remembers the previous answer.
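If you'd rather not rely on print statements to confirm the cache is working, functions wrapped by lru_cache also expose a cache_info() method. The sketch below is a minimal illustration using the same fake geocoder; passing maxsize=None (an unbounded cache) is my own choice here, not something the original code does. Without it, lru_cache keeps only the 128 most recently used results.

```python
import functools
import random


@functools.lru_cache(maxsize=None)  # maxsize=None: never evict cached results
def geocode_address(address):
    # Stand-in geocoder: random coordinates instead of a real service call
    return {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180,
    }


first = geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
again = geocode_address('123 Peanut Street, Philadelphia, PA')

# hits = calls answered from the cache, misses = calls that ran the function
info = geocode_address.cache_info()
print(info.hits, info.misses, info.currsize)  # prints: 1 2 2
```

One thing to keep in mind: the cache hands back the very same dictionary object on every hit, so modifying a returned result also changes what later callers receive.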
Using lru_cache with dataframes
When you get into using pandas and dataframes, there's one thing to watch out for with lru_cache.
Sometimes your dataframe has separate columns for street address, city, and state. Then when you're geocoding, your function builds the address that you send to the geocoding service. Here is our sample dataframe:
df = pd.DataFrame([
{ 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
{ 'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
{ 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])
df
| | street | city | state |
|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA |
| 1 | 3400 Walnut Road | Phoenix | AZ |
| 2 | 123 Peanut Street | Philadelphia | PA |
Now we'll build our geocoder. It's similar to our original one, but this one takes an entire row of data and then formats it into an address suitable for geocoding. It then returns a pd.Series so that we can easily combine it with our original dataframe.
def geocode_address(row):
    # Combine the columns into a string
    address = "{street} {city} {state}".format(**row)
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return pd.Series(data)

# Geocode the addresses and merge it with the original data
results = df.apply(geocode_address, axis=1)
df.join(results)
Geocoding 123 Peanut Street Philadelphia PA
Geocoding 3400 Walnut Road Phoenix AZ
Geocoding 123 Peanut Street Philadelphia PA
| | street | city | state | lat | lng |
|---|---|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA | 8.4024 | -19.7705 |
| 1 | 3400 Walnut Road | Phoenix | AZ | -70.1524 | -175.0852 |
| 2 | 123 Peanut Street | Philadelphia | PA | 61.6642 | 166.9208 |
This works the same as before - 123 Peanut Street is geocoded twice because we aren't caching the results.
The problem with lru_cache and df.apply
You might think we can add @functools.lru_cache just like we did previously:
# This won't work!!!
@functools.lru_cache
def geocode_address(row):
    # Combine the columns into a string
    address = "{street} {city} {state}".format(**row)
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return pd.Series(data)

# Geocode the addresses and merge it with the original data
results = df.apply(geocode_address, axis=1)
df.join(results)
Unfortunately, that won't work! If we try to run this code, we end up with a TypeError: unhashable type: 'Series' error that looks like this:
TypeError Traceback (most recent call last)
1 # Geocode the addresses and merge it with the original data
2 results = df.apply(geocode_address, axis=1)
3 df.join(results)
.....
871 for i, v in enumerate(series_gen):
872 # ignore SettingWithCopy here in case the user mutates
873 results[i] = self.f(v)
874 if isinstance(results[i], ABCSeries):
875 # If we have a view on v, we need to make a copy because
TypeError: unhashable type: 'Series'
This is because lru_cache stores its results in a dictionary keyed on the function's arguments, so every argument has to be hashable. Strings and numbers are hashable, but a pandas Series is not. Instead of just asking the geocoder to remember the address we're geocoding, we're asking it to remember every column of our row! It can't do that, so it raises an error.
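To see the underlying restriction without any geocoding involved, you can call Python's built-in hash() directly. This is a small sketch with a hand-built row standing in for one row of the dataframe. The exact wording of the error message depends on your pandas version, so the code only records whether a TypeError was raised.

```python
import pandas as pd

# A hand-built stand-in for one row of the dataframe
row = pd.Series({
    'street': '123 Peanut Street',
    'city': 'Philadelphia',
    'state': 'PA',
})

# A plain string is hashable, so lru_cache can use it as a cache key
string_is_hashable = isinstance(hash('123 Peanut Street, Philadelphia, PA'), int)

# A Series is not hashable: hash() raises the same kind of TypeError
# that lru_cache runs into when it receives a whole row
try:
    hash(row)
    series_is_hashable = True
except TypeError:
    series_is_hashable = False

print(string_is_hashable, series_is_hashable)  # prints: True False
```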
The solution for lru_cache and df.apply
The solution is simple: only use lru_cache with single columns of data. If you have to build an address in your geocoder, build it outside of the function!
# Build an address column
df['address'] = df['street'] + ', ' + df['city'] + ', ' + df['state']
df
| | street | city | state | address |
|---|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA | 123 Peanut Street, Philadelphia, PA |
| 1 | 3400 Walnut Road | Phoenix | AZ | 3400 Walnut Road, Phoenix, AZ |
| 2 | 123 Peanut Street | Philadelphia | PA | 123 Peanut Street, Philadelphia, PA |
Now that we have an address column, that's all we need to send to our geocoding function.
@functools.lru_cache
def geocode_address(address):
    print("Geocoding", address)
    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    # Return the result
    return pd.Series(data)

results = df.address.apply(geocode_address)
df.join(results)
Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
| | street | city | state | address | lat | lng |
|---|---|---|---|---|---|---|
| 0 | 123 Peanut Street | Philadelphia | PA | 123 Peanut Street, Philadelphia, PA | 56.1181 | 2.5285 |
| 1 | 3400 Walnut Road | Phoenix | AZ | 3400 Walnut Road, Phoenix, AZ | 3.4972 | 57.8574 |
| 2 | 123 Peanut Street | Philadelphia | PA | 123 Peanut Street, Philadelphia, PA | 56.1181 | 2.5285 |
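Notice how it only prints out two addresses, even though we have three results. Rows 0 and 2 also share the same coordinates, because the repeated address was answered from the cache. @functools.lru_cache did a great job!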
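If adding an extra column isn't convenient, one possible variation is to keep a row-wise function but leave it uncached, and have it pass a plain string to a small cached helper. The names geocode_cached, geocode_row, and calls below are hypothetical, made up for this sketch, and the geocoder is still the fake random one. The helper returns a tuple rather than a Series so the cached value can't be modified by accident.

```python
import functools
import random

import pandas as pd

calls = []  # every address that actually reaches the fake geocoder


@functools.lru_cache(maxsize=None)
def geocode_cached(address):
    # Cached on the address string, which is hashable
    calls.append(address)
    lat = round(random.random() * 180, 4) - 90
    lng = round(random.random() * 360, 4) - 180
    return (lat, lng)


def geocode_row(row):
    # Not cached: it only builds the string and delegates to the helper
    address = "{street}, {city}, {state}".format(**row)
    lat, lng = geocode_cached(address)
    return pd.Series({'lat': lat, 'lng': lng})


df = pd.DataFrame([
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    {'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])

results = df.apply(geocode_row, axis=1)
print(len(calls))  # prints: 2
```

The result can be joined back with df.join(results) exactly as before; only the two unique addresses ever reach the geocoder.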