Using `lru_cache` to avoid geocoding duplicate addresses

Sometimes you have a long, long list of addresses to geocode, and some addresses show up more than once. It would be a waste of time and money to geocode the same address multiple times! There are a few ways to geocode only the unique addresses and skip the duplicates, but one of the easiest involves a tool called lru_cache.

Another useful approach leverages a library called joblib. It saves your cache to disk, so it even works across multiple runs! You can read about geocoding caching with joblib here.

Let's examine the problem, and how lru_cache is a great solution! We start by looking at how it works with simple functions, then use a pandas dataframe later on in the process.

import random
import pandas as pd

Our sample geocoder

Our sample geocoder is called geocode_address. It takes an address, then returns a dictionary of latitude and longitude. Since I don't want to actually talk to a geocoding service, I'm just having it return a random number for both latitude and longitude.

def geocode_address(address):
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return data

We can tell when our function runs by seeing it print the address.

geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')

print("Done geocoding")

Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Geocoding 123 Peanut Street, Philadelphia, PA
Done geocoding

In the example above, we see that it geocoded 123 Peanut Street twice, because it printed it out twice.

Using `@functools.lru_cache` to cache responses

@functools.lru_cache is a "decorator" that can be used with Python functions. It tells the function to remember its responses, so if you ask it for the same thing twice it won't have to re-run the code.

We will adjust the code above to include import functools at the top, and @functools.lru_cache before we declare our function.

import functools

@functools.lru_cache
def geocode_address(address):
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return data

Let's run our code and see how the geocoding works.

geocode_address('123 Peanut Street, Philadelphia, PA')
geocode_address('3400 Walnut Road, Phoenix, AZ')
geocode_address('123 Peanut Street, Philadelphia, PA')

print("Done geocoding")

Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ
Done geocoding

Notice this time our geocoder only printed two addresses. This is because the second time the function sees 123 Peanut Street, Philadelphia, PA it remembers the previous answer.
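
If you'd rather not rely on print statements, one way to double-check the cache is cache_info(), which lru_cache attaches to the decorated function. The sketch below uses a made-up stand-in called fake_geocode (it is not a real geocoder and just returns fixed values) so the hit and miss counts are easy to follow.

```python
import functools

# Using @functools.lru_cache without parentheses needs Python 3.8 or newer
@functools.lru_cache
def fake_geocode(address):
    # Hypothetical stand-in: a real geocoder would call a service here
    return (len(address), 0.0)

fake_geocode('123 Peanut Street, Philadelphia, PA')
fake_geocode('3400 Walnut Road, Phoenix, AZ')
fake_geocode('123 Peanut Street, Philadelphia, PA')

# hits = answers served from the cache, misses = times the function really ran
info = fake_geocode.cache_info()
print(info.hits, info.misses, info.currsize)
```

With the three calls above, that should report one hit (the repeated Philadelphia address) and two misses (the two unique addresses).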

Using `lru_cache` with dataframes

When you get into using pandas and dataframes, there's one thing to watch out for with lru_cache.

Sometimes your dataframe has separate columns for street address, city, and state. Then when you're geocoding, your function has to combine them into the address that you send to the geocoding service. Here is our sample dataframe:

df = pd.DataFrame([
    { 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
    { 'street': '3400 Walnut Road', 'city': 'Phoenix', 'state': 'AZ'},
    { 'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'},
])
df

Out[71]:

	street	city	state
0	123 Peanut Street	Philadelphia	PA
1	3400 Walnut Road	Phoenix	AZ
2	123 Peanut Street	Philadelphia	PA

Now we'll build our geocoder. It's similar to our original one, but this one takes an entire row of data and then formats it into an address suitable for geocoding. It then returns a pd.Series so that we can easily combine it with our original dataframe.

def geocode_address(row):
    # Combine the columns into a string
    address = "{street} {city} {state}".format(**row)
    
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return pd.Series(data)

# Geocode the addresses and merge it with the original data
results = df.apply(geocode_address, axis=1)
df.join(results)

Geocoding 123 Peanut Street Philadelphia PA
Geocoding 3400 Walnut Road Phoenix AZ
Geocoding 123 Peanut Street Philadelphia PA

Out[73]:

	street	city	state	lat	lng
0	123 Peanut Street	Philadelphia	PA	8.4024	-19.7705
1	3400 Walnut Road	Phoenix	AZ	-70.1524	-175.0852
2	123 Peanut Street	Philadelphia	PA	61.6642	166.9208

This works the same as before - 123 Peanut Street is geocoded twice because we aren't caching the results.

The problem with `lru_cache` and `df.apply`

You might think we can add @functools.lru_cache just like we did previously:

# This won't work!!!
@functools.lru_cache
def geocode_address(row):
    # Combine the columns into a string
    address = "{street} {city} {state}".format(**row)
    
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return pd.Series(data)

# Geocode the addresses and merge it with the original data
results = df.apply(geocode_address, axis=1)
df.join(results)

Unfortunately, that won't work! If we try to run this code, we end up with a TypeError: unhashable type: 'Series' error that looks like this:

TypeError                                 Traceback (most recent call last)
      1 # Geocode the addresses and merge it with the original data
      2 results = df.apply(geocode_address, axis=1)
      3 df.join(results)
    .....
    871             for i, v in enumerate(series_gen):
    872                 # ignore SettingWithCopy here in case the user mutates
    873                 results[i] = self.f(v)
    874                 if isinstance(results[i], ABCSeries):
    875                     # If we have a view on v, we need to make a copy because

TypeError: unhashable type: 'Series'

This is because lru_cache can only remember arguments that are hashable: simple, unchanging values like strings, numbers, and tuples. Instead of just asking the geocoder to remember the address we're geocoding, we're asking it to remember an entire row, which pandas passes in as a Series. A Series can't be hashed, so lru_cache gives an error.
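
To see the same restriction without pandas in the way, here is a small sketch. The describe function and the row dictionary are made up for illustration; the dictionary stands in for a dataframe row because, like a Series, it can't be hashed.

```python
import functools

@functools.lru_cache
def describe(value):
    # Hypothetical function, only here to exercise the cache
    return "cached " + str(value)

# Strings and tuples are hashable, so lru_cache accepts them
describe('123 Peanut Street, Philadelphia, PA')
describe(('123 Peanut Street', 'Philadelphia', 'PA'))

# A dict is unhashable, much like a pandas Series
row = {'street': '123 Peanut Street', 'city': 'Philadelphia', 'state': 'PA'}
try:
    describe(row)
    failed = False
except TypeError as error:
    failed = True
    print("lru_cache refused the row:", error)
```

The string and the tuple are cached without complaint, while the dictionary triggers a TypeError before the function body even runs.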

The solution for `lru_cache` and `df.apply`

The solution is simple: only use lru_cache with single columns of data. If you have to build an address in your geocoder, build it outside of the function!

# Build an address column
df['address'] = df['street'] + ', ' + df['city'] + ', ' + df['state']
df

Out[76]:

	street	city	state	address
0	123 Peanut Street	Philadelphia	PA	123 Peanut Street, Philadelphia, PA
1	3400 Walnut Road	Phoenix	AZ	3400 Walnut Road, Phoenix, AZ
2	123 Peanut Street	Philadelphia	PA	123 Peanut Street, Philadelphia, PA

Now that we have an address column, that's all we need to send to our geocoding function.

@functools.lru_cache
def geocode_address(address):    
    print("Geocoding", address)

    data = {
        'lat': round(random.random() * 180, 4) - 90,
        'lng': round(random.random() * 360, 4) - 180
    }
    
    # Return the result
    return pd.Series(data)

results = df.address.apply(geocode_address)
df.join(results)

Geocoding 123 Peanut Street, Philadelphia, PA
Geocoding 3400 Walnut Road, Phoenix, AZ

Out[80]:

	street	city	state	address	lat	lng
0	123 Peanut Street	Philadelphia	PA	123 Peanut Street, Philadelphia, PA	56.1181	2.5285
1	3400 Walnut Road	Phoenix	AZ	3400 Walnut Road, Phoenix, AZ	3.4972	57.8574
2	123 Peanut Street	Philadelphia	PA	123 Peanut Street, Philadelphia, PA	56.1181	2.5285

Notice how it only prints out two addresses, even though we have three results. @functools.lru_cache did a great job!
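
One caveat if your list really is long: by default lru_cache only keeps the 128 most recently used results (maxsize=128). Once you have more unique addresses than that, older answers get thrown away and might be geocoded again. Passing maxsize=None removes the limit. The sketch below compares the two using made-up stand-in functions (geocode_default, geocode_unbounded) and a calls counter instead of a real geocoder.

```python
import functools

# Count how many times each function body actually runs
calls = {'default': 0, 'unbounded': 0}

@functools.lru_cache  # default maxsize is 128
def geocode_default(address):
    calls['default'] += 1
    return address.upper()

@functools.lru_cache(maxsize=None)  # no limit on the cache size
def geocode_unbounded(address):
    calls['unbounded'] += 1
    return address.upper()

# 200 unique made-up addresses, looped over twice
addresses = [f"{n} Peanut Street" for n in range(200)] * 2

for address in addresses:
    geocode_default(address)
    geocode_unbounded(address)

print(calls)
```

In this deliberately unlucky ordering the default cache never gets a hit, so its function runs all 400 times, while the unbounded version runs only 200 times, once per unique address. The trade-off is memory: an unbounded cache keeps every result until the program ends.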
