Fuzzy matching between datasets with large language models¶

Let's say we have two datasets: a list of NBA players with schools, and a list of college rankings. It might be nice to join the two, but sometimes the schools have slightly different names!

For example, the first few players went to schools like Northwestern Oklahoma and UCLA:

In [25]:

            
                Copied!
                
import pandas as pd

players = pd.read_csv("players.csv")
players.head()
import pandas as pd

players = pd.read_csv("players.csv")
players.head()

Out[25]:

	player_name	college
0	Randy Livingston	Louisiana State
1	Gaylon Nickerson	Northwestern Oklahoma
2	George Lynch	North Carolina
3	George McCloud	Florida State
4	George Zidek	UCLA

Northwestern Oklahoma in the player dataset might map to Northwestern Oklahoma State University...

In [28]:

            
                Copied!
                
schools[schools['institution.displayName'].str.contains("Northwestern Oklahoma")]
schools[schools['institution.displayName'].str.contains("Northwestern Oklahoma")]

Out[28]:

	institution.displayName	institution.schoolType	institution.state	institution.region	institution.isPublic	ranking.displayRank
990	Northwestern Oklahoma State University	regional-universities-west	OK	west	True	#86

...while UCLA becomes the much much much longer University of California, Los Angeles.

In [27]:

            
                Copied!
                
schools[schools['institution.displayName'].str.contains("Los Angeles")]
schools[schools['institution.displayName'].str.contains("Los Angeles")]

Out[27]:

	institution.displayName	institution.schoolType	institution.state	institution.region	institution.isPublic	ranking.displayRank
210	California State University--Los Angeles	regional-universities-west	CA	west	True	#21
268	Chicago School of Professional Psychology at L...	health	CA	NaN	True	Unranked
1466	University of California, Los Angeles	national-universities	CA	NaN	True	#15
1770	West Los Angeles College	regional-colleges-west	CA	west	True	Unranked

While Miami might be broken up into Ohio and Florida counterparts in the players database...

In [32]:

            
                Copied!
                
players[players['college'].str.contains("Miami", na=False)].head(4)
players[players['college'].str.contains("Miami", na=False)].head(4)

Out[32]:

	player_name	college
234	Ron Harper	Miami (OH)
605	Ron Harper	Miami (OH)
1084	Ron Harper	Miami (OH)
1439	Tim James	Miami (FL)

...which could map to about six thousand different options in the schools dataset!

In [16]:

            
                Copied!
                
schools[schools['institution.displayName'].str.contains("Miami")]
schools[schools['institution.displayName'].str.contains("Miami")]

Out[16]:

	institution.displayName	institution.schoolType	institution.state	institution.region	institution.isPublic	ranking.displayRank
16	Albizu University--Miami	national-universities	FL	NaN	True	Unranked
853	Miami Dade College	regional-colleges-south	FL	south	True	Unranked
854	Miami International University of Art & Design	regional-colleges-south	FL	south	True	#81
855	Miami Regional University	health	FL	NaN	True	Unranked
856	Miami University--Hamilton	regional-colleges-midwest	OH	midwest	True	#58
857	Miami University--Middletown	regional-colleges-midwest	OH	midwest	True	#58
858	Miami University--Oxford	national-universities	OH	NaN	True	#133
1538	University of Miami	national-universities	FL	NaN	True	#67

You could naively try to do fuzzy matches with a library like csvmatch or attempt to manually clean by standardizing "University of" or "X State College," but in the end school names are just too similar and requires background knoweldge about the world to actually match them with the correct option.

So you're going to do it manually, right?

No thanks! Let's accept a little margin of error and try it with an LLM. And not with just ChatGPT, either - let's automate this!

Get our list of options¶

We'll start by building a list of all of the options that the LLM can select from. We'll build it by creating getting every possible value of institution.displayName.

In [54]:

            
                Copied!
                
# Get a list of unique 
options = schools['institution.displayName'].unique().tolist()
options = tuple(options)

# Look at the last five
options[-5:]
# Get a list of unique 
options = schools['institution.displayName'].unique().tolist()
options = tuple(options)

# Look at the last five
options[-5:]

Out[54]:

('York College--CUNY',
 'York College of Pennsylvania',
 'York University',
 'Young Harris College',
 'Youngstown State University')

If we wanted to be more specific and provide more details, we could make a new column that includes the state name, or city, or any other additional information, like the code below.

schools['match_column'] = schools['institution.displayName'] + " in " + schools['institution.state']

You'd want to make sure it's well-formatted and looks like somewhat of a sentence. In the above case, we would end up with options like Miami University--Oxford in OH or University of Miami in FL.

I'm feeling a little free-wheeling, though, and trust that GPT will be able to figure it out without those extra details!

Designing the response¶

Now we're going to do install Instructor, a Python library that allows us to demand structured data from LLMs like ChatGPT or Claude. I looooove it! Instructor uses (simple?) Pydantic descriptions to detail what the LLM response should look like.

Since the process of matching all those schools might take a while we're also going to install tqdm for a nice pretty progress bar.

In [17]:

            
                Copied!
                
%pip install --quiet instructor pydantic tqdm
%pip install --quiet instructor pydantic tqdm

[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

Now we'll use Pydantic to describe what we want back from the LLM.

In [93]:

            
                Copied!
                
                    
                    
                
                

        
from pydantic import BaseModel, Field
from openai import OpenAI
from typing_extensions import Literal
from typing import Optional

class School(BaseModel):
    provided_name: str = Field(description="Verbatim provided name of school")
    cleaned_name: Optional[Literal[options]] = Field(description="Most likely or best guess of official name of school. When unsure, it's reasonable to guess large state universities.")
from pydantic import BaseModel, Field
from openai import OpenAI
from typing_extensions import Literal
from typing import Optional

class School(BaseModel):
    provided_name: str = Field(description="Verbatim provided name of school")
    cleaned_name: Optional[Literal[options]] = Field(description="Most likely or best guess of official name of school. When unsure, it's reasonable to guess large state universities.")

According to the above, for each school we'll be looking for two things:

the original name of the school
the match from the list of "official" school names we made above

When you adjust this code for yourself, you'll want to be sure to change both the names on the left and the descriptions on the right. Both are used by the LLM to understand what exactly you're looking for. In this case had to add a lot of encouragement in the description to get GPT to guess large, popular state schools when given names of states or cities.

Literal[options] means "pick one from the list of options," while the Optional[...] aorund it means if it really doesn't know it can skip it.

Now we'll connect Instructor to our LLM...

In [99]:

            
                Copied!
                
import instructor

# You'll need your own API key from https://platform.openai.com/account/api-keys
# (or just email me???)
client = instructor.from_openai(OpenAI(api_key='sk-proj-3N455VprkMyfPa0gYNKKT3BlbkFJ1vG6pSrJMXJub2adPpqy'))
import instructor

# You'll need your own API key from https://platform.openai.com/account/api-keys
# (or just email me???)
client = instructor.from_openai(OpenAI(api_key='sk-proj-3N455VprkMyfPa0gYNKKT3BlbkFJ1vG6pSrJMXJub2adPpqy'))

...and feed it an example to try it out.

In [101]:

            
                Copied!
                
                    
                    
                
                

        
school = "UCLA"

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=School,
    messages=[{"role": "user", "content": school}],
    max_retries=3,
    temperature=0
)

result.model_dump()
school = "UCLA"

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=School,
    messages=[{"role": "user", "content": school}],
    max_retries=3,
    temperature=0
)

result.model_dump()

Out[101]:

{'provided_name': 'UCLA',
 'cleaned_name': 'University of California, Los Angeles'}

Looks perfect!

We added max_retries=3 so that if the LLM attempts to respond with something that isn't in the list Instructor can give it a nudge (or two) to demand a better response.

In reality I ran about a half-dozen weird school names through this to make sure it got the tricky ones. We started with gpt-3.5-turbo (cheaper, dumber) but it was missing matches pretty frequently so we upgraded to gpt-4o (more expensive, slower) instead.

Building the pairing dataframe¶

While we could loop through every single player and get GPT's guess for the school match, a more cost-efficient approach is to take a unique list of player colleges. That way if we have sixty people who went to UCLA we'll only be asking about UCLA once!

This will require two merges later on - first players to pairing dataframe, then to the school rankings - but I think we'll survive.

In [86]:

            
                Copied!
                
# We'll also drop any missing/NaN schools while we're at it
colleges = players.college.dropna().unique()

print(len(players), "total players")
print(len(colleges), "unique schools")
# We'll also drop any missing/NaN schools while we're at it
colleges = players.college.dropna().unique()

print(len(players), "total players")
print(len(colleges), "unique schools")

12844 total players
356 unique schools

By only doing unique schools, we've removed about 97% of our potential requests! That's a lot of savings.

Now we'll loop through each one of those 356 schools and ask the LLM what it thinks the match should be. We'll then toss the results into a pretty little dataframe.

In [103]:

            
                Copied!
                
                    
                    
                
                

        
from tqdm import tqdm

# Keep track of all of our responses
results = []

for college in tqdm(colleges):

    # Ask for a response
    # gpt-3.5-turbo would have been much cheaper but it wasn't smart enough
    # temperature=0 so it doesn't get 'inventive'
    result = client.chat.completions.create(
        model="gpt-4o",
        response_model=School,
        messages=[{"role": "user", "content": college}],
        max_retries=3,
        temperature=0
    )
    
    # Save the original column just in case the
    # LLM rudely changes it before it sends it back
    match = result.model_dump()
    match['original'] = college

    # Save the response
    results.append(match)

matches = pd.json_normalize(results)
matches.head()
from tqdm import tqdm

# Keep track of all of our responses
results = []

for college in tqdm(colleges):

    # Ask for a response
    # gpt-3.5-turbo would have been much cheaper but it wasn't smart enough
    # temperature=0 so it doesn't get 'inventive'
    result = client.chat.completions.create(
        model="gpt-4o",
        response_model=School,
        messages=[{"role": "user", "content": college}],
        max_retries=3,
        temperature=0
    )
    
    # Save the original column just in case the
    # LLM rudely changes it before it sends it back
    match = result.model_dump()
    match['original'] = college

    # Save the response
    results.append(match)

matches = pd.json_normalize(results)
matches.head()

100%|██████████| 356/356 [06:22<00:00,  1.07s/it]

Out[103]:

	provided_name	cleaned_name	original
0	Louisiana State	Louisiana State University--Baton Rouge	Louisiana State
1	Northwestern Oklahoma	Northwestern Oklahoma State University	Northwestern Oklahoma
2	North Carolina	None	North Carolina
3	Florida State	Florida State University	Florida State
4	UCLA	University of California, Los Angeles	UCLA

If I didn't want to spend SO MUCH MONEY it would probably make more sense to filter down the list to potential matches, but... I didn't want to write more code.

Checking our results¶

If we want to see where it might have completely given up, we can look for places where the cleaned name - the match - is missing.

In [105]:

            
                Copied!
                
matches[matches['cleaned_name'].isna()]
matches[matches['cleaned_name'].isna()]

Out[105]:

	provided_name	cleaned_name	original
2	North Carolina	None	North Carolina
9	Ohio	None	Ohio
14	Indiana	None	Indiana
17	Oklahoma	None	Oklahoma
21	Washington	None	Washington
26	Wyoming	None	Wyoming
47	Virginia	None	Virginia
55	Pennsylvania	None	Pennsylvania
60	Kentucky	None	Kentucky
63	California	None	California
76	Tennessee	None	Tennessee
94	Florida	None	Florida
107	Trinity Valley Community College	None	Trinity Valley Community College
120	Minnesota	None	Minnesota
133	Massachusetts	None	Massachusetts
142	Texas	None	Texas
148	Seward County Community College	None	Seward County Community College
150	Missouri	None	Missouri
153	Wisconsin	None	Wisconsin
156	Mt. San Antonio	None	Mt. San Antonio
157	Southern	None	Southern
193	Hawaii	None	Hawaii
194	Barton Community College	None	Barton Community College
207	Butler Community College	None	Butler Community College
209	Venezuela	None	Venezuela
211	Indian Hills Community College	None	Indian Hills Community College
214	Blinn	None	Blinn
218	Northeast Mississippi Community College	None	Northeast Mississippi Community College
224	Idaho	None	Idaho
225	North Dakota	None	North Dakota
234	Southeastern Illinois	None	Southeastern Illinois
235	Yonsei (KOR)	None	Yonsei (KOR)
246	Meridian Community College	None	Meridian Community College
254	Robert Morris (IL)	None	Robert Morris (IL)
319	No College	None	No College
333	Delaware	None	Delaware
353	San Francisco	None	San Francisco
354	John A. Logan	None	John A. Logan

While we could provide more encouragement to have it guess, just think of this as the tiny tiny bit of manual labor you need to do to be a responsible adult!

You need to go through them manually to proofread anyway, so we'll save it to a CSV to open up in Excel or Google Sheets.

In [106]:

            
                Copied!
                
matches.to_csv("matches.csv", index=False)
matches.to_csv("matches.csv", index=False)

Combining our datasets¶

Let's assume I opened up matches.csv, found the matches for the missing schools and proofread the ones that did get matches (I did not do that. But assume!). Now it's time to join everything together!

We need to re-open matches.csv since we definitely edited the version we saved before. It was full of tiny mistakes!

In [107]:

            
                Copied!
                
matches = pd.read_csv("matches.csv")
matches.head()
matches = pd.read_csv("matches.csv")
matches.head()

Out[107]:

	provided_name	cleaned_name	original
0	Louisiana State	Louisiana State University--Baton Rouge	Louisiana State
1	Northwestern Oklahoma	Northwestern Oklahoma State University	Northwestern Oklahoma
2	North Carolina	NaN	North Carolina
3	Florida State	Florida State University	Florida State
4	UCLA	University of California, Los Angeles	UCLA

Now we merge! left_on and right_on are the column names in each dataframe, while how='left' makes sure to keep everything that doesn't have a match.

In [108]:

            
                Copied!
                
                    
                    
                
                

        
merged = players.merge(matches,
                       left_on='college',
                       right_on='original',
                       how='left')
merged = merged.merge(schools,
                      left_on='cleaned_name',
                      right_on='institution.displayName',
                      how='left')
merged.head()
merged = players.merge(matches,
                       left_on='college',
                       right_on='original',
                       how='left')
merged = merged.merge(schools,
                      left_on='cleaned_name',
                      right_on='institution.displayName',
                      how='left')
merged.head()

Out[108]:

	player_name	college	provided_name	cleaned_name	original	institution.displayName	institution.schoolType	institution.state	institution.region	institution.isPublic	ranking.displayRank	match_column
0	Randy Livingston	Louisiana State	Louisiana State	Louisiana State University--Baton Rouge	Louisiana State	Louisiana State University--Baton Rouge	national-universities	LA	NaN	True	#185	Louisiana State University--Baton Rouge in LA
1	Gaylon Nickerson	Northwestern Oklahoma	Northwestern Oklahoma	Northwestern Oklahoma State University	Northwestern Oklahoma	Northwestern Oklahoma State University	regional-universities-west	OK	west	True	#86	Northwestern Oklahoma State University in OK
2	George Lynch	North Carolina	North Carolina	NaN	North Carolina	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	George McCloud	Florida State	Florida State	Florida State University	Florida State	Florida State University	national-universities	FL	NaN	True	#53	Florida State University in FL
4	George Zidek	UCLA	UCLA	University of California, Los Angeles	UCLA	University of California, Los Angeles	national-universities	CA	NaN	True	#15	University of California, Los Angeles in CA

Now we can find anyone who is still missing a school, just in case something slipped through the cracks.

In [110]:

            
                Copied!
                
merged[merged['institution.displayName'].isna()].head()
merged[merged['institution.displayName'].isna()].head()

Out[110]:

	player_name	college	provided_name	cleaned_name	original	institution.displayName	institution.schoolType	institution.state	institution.region	institution.isPublic	ranking.displayRank	match_column
2	George Lynch	North Carolina	North Carolina	NaN	North Carolina	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	Gheorghe Muresan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	Gary Trent	Ohio	Ohio	NaN	Ohio	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	Greg Graham	Indiana	Indiana	NaN	Indiana	NaN	NaN	NaN	NaN	NaN	NaN	NaN
19	Harvey Grant	Oklahoma	Oklahoma	NaN	Oklahoma	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Some of those are fine - Gheorghe Muresan didn't go to college - but it looks like we need to go back in and edit our merged.csv to pop in Indiana, Ohio and a few other state.

While we can't solve everything with automation, but it did save us a bit of work!