Fuzzy matching between datasets with large language models¶
Let's say we have two datasets: a list of NBA players with schools, and a list of college rankings. It might be nice to join the two, but sometimes the schools have slightly different names!
For example, the first few players went to schools like Northwestern Oklahoma and UCLA:
import pandas as pd
players = pd.read_csv("players.csv")
players.head()
player_name | college | |
---|---|---|
0 | Randy Livingston | Louisiana State |
1 | Gaylon Nickerson | Northwestern Oklahoma |
2 | George Lynch | North Carolina |
3 | George McCloud | Florida State |
4 | George Zidek | UCLA |
Northwestern Oklahoma in the player dataset might map to Northwestern Oklahoma State University...
schools[schools['institution.displayName'].str.contains("Northwestern Oklahoma")]
institution.displayName | institution.schoolType | institution.state | institution.region | institution.isPublic | ranking.displayRank | |
---|---|---|---|---|---|---|
990 | Northwestern Oklahoma State University | regional-universities-west | OK | west | True | #86 |
...while UCLA becomes the much much much longer University of California, Los Angeles.
schools[schools['institution.displayName'].str.contains("Los Angeles")]
institution.displayName | institution.schoolType | institution.state | institution.region | institution.isPublic | ranking.displayRank | |
---|---|---|---|---|---|---|
210 | California State University--Los Angeles | regional-universities-west | CA | west | True | #21 |
268 | Chicago School of Professional Psychology at L... | health | CA | NaN | True | Unranked |
1466 | University of California, Los Angeles | national-universities | CA | NaN | True | #15 |
1770 | West Los Angeles College | regional-colleges-west | CA | west | True | Unranked |
While Miami might be broken up into Ohio and Florida counterparts in the players database...
players[players['college'].str.contains("Miami", na=False)].head(4)
player_name | college | |
---|---|---|
234 | Ron Harper | Miami (OH) |
605 | Ron Harper | Miami (OH) |
1084 | Ron Harper | Miami (OH) |
1439 | Tim James | Miami (FL) |
...which could map to about six thousand different options in the schools dataset!
schools[schools['institution.displayName'].str.contains("Miami")]
institution.displayName | institution.schoolType | institution.state | institution.region | institution.isPublic | ranking.displayRank | |
---|---|---|---|---|---|---|
16 | Albizu University--Miami | national-universities | FL | NaN | True | Unranked |
853 | Miami Dade College | regional-colleges-south | FL | south | True | Unranked |
854 | Miami International University of Art & Design | regional-colleges-south | FL | south | True | #81 |
855 | Miami Regional University | health | FL | NaN | True | Unranked |
856 | Miami University--Hamilton | regional-colleges-midwest | OH | midwest | True | #58 |
857 | Miami University--Middletown | regional-colleges-midwest | OH | midwest | True | #58 |
858 | Miami University--Oxford | national-universities | OH | NaN | True | #133 |
1538 | University of Miami | national-universities | FL | NaN | True | #67 |
You could naively try to do fuzzy matches with a library like csvmatch or attempt to manually clean by standardizing "University of" or "X State College," but in the end school names are just too similar and requires background knoweldge about the world to actually match them with the correct option.
So you're going to do it manually, right?
No thanks! Let's accept a little margin of error and try it with an LLM. And not with just ChatGPT, either - let's automate this!
Get our list of options¶
We'll start by building a list of all of the options that the LLM can select from. We'll build it by creating getting every possible value of institution.displayName
.
# Get a list of unique
options = schools['institution.displayName'].unique().tolist()
options = tuple(options)
# Look at the last five
options[-5:]
('York College--CUNY', 'York College of Pennsylvania', 'York University', 'Young Harris College', 'Youngstown State University')
If we wanted to be more specific and provide more details, we could make a new column that includes the state name, or city, or any other additional information, like the code below.
schools['match_column'] = schools['institution.displayName'] + " in " + schools['institution.state']
You'd want to make sure it's well-formatted and looks like somewhat of a sentence. In the above case, we would end up with options like Miami University--Oxford in OH or University of Miami in FL.
I'm feeling a little free-wheeling, though, and trust that GPT will be able to figure it out without those extra details!
Designing the response¶
Now we're going to do install Instructor, a Python library that allows us to demand structured data from LLMs like ChatGPT or Claude. I looooove it! Instructor uses (simple?) Pydantic descriptions to detail what the LLM response should look like.
Since the process of matching all those schools might take a while we're also going to install tqdm for a nice pretty progress bar.
%pip install --quiet instructor pydantic tqdm
[notice] A new release of pip is available: 23.0.1 -> 24.0 [notice] To update, run: pip install --upgrade pip Note: you may need to restart the kernel to use updated packages.
Now we'll use Pydantic to describe what we want back from the LLM.
from pydantic import BaseModel, Field
from openai import OpenAI
from typing_extensions import Literal
from typing import Optional
class School(BaseModel):
provided_name: str = Field(description="Verbatim provided name of school")
cleaned_name: Optional[Literal[options]] = Field(description="Most likely or best guess of official name of school. When unsure, it's reasonable to guess large state universities.")
According to the above, for each school we'll be looking for two things:
- the original name of the school
- the match from the list of "official" school names we made above
When you adjust this code for yourself, you'll want to be sure to change both the names on the left and the descriptions on the right. Both are used by the LLM to understand what exactly you're looking for. In this case had to add a lot of encouragement in the description to get GPT to guess large, popular state schools when given names of states or cities.
Literal[options]
means "pick one from the list of options
," while the Optional[...]
aorund it means if it really doesn't know it can skip it.
Now we'll connect Instructor to our LLM...
import instructor
# You'll need your own API key from https://platform.openai.com/account/api-keys
# (or just email me???)
client = instructor.from_openai(OpenAI(api_key='sk-proj-3N455VprkMyfPa0gYNKKT3BlbkFJ1vG6pSrJMXJub2adPpqy'))
...and feed it an example to try it out.
school = "UCLA"
result = client.chat.completions.create(
model="gpt-4o",
response_model=School,
messages=[{"role": "user", "content": school}],
max_retries=3,
temperature=0
)
result.model_dump()
{'provided_name': 'UCLA', 'cleaned_name': 'University of California, Los Angeles'}
Looks perfect!
We added max_retries=3
so that if the LLM attempts to respond with something that isn't in the list Instructor can give it a nudge (or two) to demand a better response.
In reality I ran about a half-dozen weird school names through this to make sure it got the tricky ones. We started with gpt-3.5-turbo (cheaper, dumber) but it was missing matches pretty frequently so we upgraded to gpt-4o (more expensive, slower) instead.
Building the pairing dataframe¶
While we could loop through every single player and get GPT's guess for the school match, a more cost-efficient approach is to take a unique list of player colleges. That way if we have sixty people who went to UCLA we'll only be asking about UCLA once!
This will require two merges later on - first players to pairing dataframe, then to the school rankings - but I think we'll survive.
# We'll also drop any missing/NaN schools while we're at it
colleges = players.college.dropna().unique()
print(len(players), "total players")
print(len(colleges), "unique schools")
12844 total players 356 unique schools
By only doing unique schools, we've removed about 97% of our potential requests! That's a lot of savings.
Now we'll loop through each one of those 356 schools and ask the LLM what it thinks the match should be. We'll then toss the results into a pretty little dataframe.
from tqdm import tqdm
# Keep track of all of our responses
results = []
for college in tqdm(colleges):
# Ask for a response
# gpt-3.5-turbo would have been much cheaper but it wasn't smart enough
# temperature=0 so it doesn't get 'inventive'
result = client.chat.completions.create(
model="gpt-4o",
response_model=School,
messages=[{"role": "user", "content": college}],
max_retries=3,
temperature=0
)
# Save the original column just in case the
# LLM rudely changes it before it sends it back
match = result.model_dump()
match['original'] = college
# Save the response
results.append(match)
matches = pd.json_normalize(results)
matches.head()
100%|██████████| 356/356 [06:22<00:00, 1.07s/it]
provided_name | cleaned_name | original | |
---|---|---|---|
0 | Louisiana State | Louisiana State University--Baton Rouge | Louisiana State |
1 | Northwestern Oklahoma | Northwestern Oklahoma State University | Northwestern Oklahoma |
2 | North Carolina | None | North Carolina |
3 | Florida State | Florida State University | Florida State |
4 | UCLA | University of California, Los Angeles | UCLA |
If I didn't want to spend SO MUCH MONEY it would probably make more sense to filter down the list to potential matches, but... I didn't want to write more code.
Checking our results¶
If we want to see where it might have completely given up, we can look for places where the cleaned name - the match - is missing.
matches[matches['cleaned_name'].isna()]
provided_name | cleaned_name | original | |
---|---|---|---|
2 | North Carolina | None | North Carolina |
9 | Ohio | None | Ohio |
14 | Indiana | None | Indiana |
17 | Oklahoma | None | Oklahoma |
21 | Washington | None | Washington |
26 | Wyoming | None | Wyoming |
47 | Virginia | None | Virginia |
55 | Pennsylvania | None | Pennsylvania |
60 | Kentucky | None | Kentucky |
63 | California | None | California |
76 | Tennessee | None | Tennessee |
94 | Florida | None | Florida |
107 | Trinity Valley Community College | None | Trinity Valley Community College |
120 | Minnesota | None | Minnesota |
133 | Massachusetts | None | Massachusetts |
142 | Texas | None | Texas |
148 | Seward County Community College | None | Seward County Community College |
150 | Missouri | None | Missouri |
153 | Wisconsin | None | Wisconsin |
156 | Mt. San Antonio | None | Mt. San Antonio |
157 | Southern | None | Southern |
193 | Hawaii | None | Hawaii |
194 | Barton Community College | None | Barton Community College |
207 | Butler Community College | None | Butler Community College |
209 | Venezuela | None | Venezuela |
211 | Indian Hills Community College | None | Indian Hills Community College |
214 | Blinn | None | Blinn |
218 | Northeast Mississippi Community College | None | Northeast Mississippi Community College |
224 | Idaho | None | Idaho |
225 | North Dakota | None | North Dakota |
234 | Southeastern Illinois | None | Southeastern Illinois |
235 | Yonsei (KOR) | None | Yonsei (KOR) |
246 | Meridian Community College | None | Meridian Community College |
254 | Robert Morris (IL) | None | Robert Morris (IL) |
319 | No College | None | No College |
333 | Delaware | None | Delaware |
353 | San Francisco | None | San Francisco |
354 | John A. Logan | None | John A. Logan |
While we could provide more encouragement to have it guess, just think of this as the tiny tiny bit of manual labor you need to do to be a responsible adult!
You need to go through them manually to proofread anyway, so we'll save it to a CSV to open up in Excel or Google Sheets.
matches.to_csv("matches.csv", index=False)
Combining our datasets¶
Let's assume I opened up matches.csv
, found the matches for the missing schools and proofread the ones that did get matches (I did not do that. But assume!). Now it's time to join everything together!
We need to re-open matches.csv
since we definitely edited the version we saved before. It was full of tiny mistakes!
matches = pd.read_csv("matches.csv")
matches.head()
provided_name | cleaned_name | original | |
---|---|---|---|
0 | Louisiana State | Louisiana State University--Baton Rouge | Louisiana State |
1 | Northwestern Oklahoma | Northwestern Oklahoma State University | Northwestern Oklahoma |
2 | North Carolina | NaN | North Carolina |
3 | Florida State | Florida State University | Florida State |
4 | UCLA | University of California, Los Angeles | UCLA |
Now we merge! left_on
and right_on
are the column names in each dataframe, while how='left'
makes sure to keep everything that doesn't have a match.
merged = players.merge(matches,
left_on='college',
right_on='original',
how='left')
merged = merged.merge(schools,
left_on='cleaned_name',
right_on='institution.displayName',
how='left')
merged.head()
player_name | college | provided_name | cleaned_name | original | institution.displayName | institution.schoolType | institution.state | institution.region | institution.isPublic | ranking.displayRank | match_column | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Randy Livingston | Louisiana State | Louisiana State | Louisiana State University--Baton Rouge | Louisiana State | Louisiana State University--Baton Rouge | national-universities | LA | NaN | True | #185 | Louisiana State University--Baton Rouge in LA |
1 | Gaylon Nickerson | Northwestern Oklahoma | Northwestern Oklahoma | Northwestern Oklahoma State University | Northwestern Oklahoma | Northwestern Oklahoma State University | regional-universities-west | OK | west | True | #86 | Northwestern Oklahoma State University in OK |
2 | George Lynch | North Carolina | North Carolina | NaN | North Carolina | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | George McCloud | Florida State | Florida State | Florida State University | Florida State | Florida State University | national-universities | FL | NaN | True | #53 | Florida State University in FL |
4 | George Zidek | UCLA | UCLA | University of California, Los Angeles | UCLA | University of California, Los Angeles | national-universities | CA | NaN | True | #15 | University of California, Los Angeles in CA |
Now we can find anyone who is still missing a school, just in case something slipped through the cracks.
merged[merged['institution.displayName'].isna()].head()
player_name | college | provided_name | cleaned_name | original | institution.displayName | institution.schoolType | institution.state | institution.region | institution.isPublic | ranking.displayRank | match_column | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | George Lynch | North Carolina | North Carolina | NaN | North Carolina | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | Gheorghe Muresan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | Gary Trent | Ohio | Ohio | NaN | Ohio | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | Greg Graham | Indiana | Indiana | NaN | Indiana | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
19 | Harvey Grant | Oklahoma | Oklahoma | NaN | Oklahoma | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Some of those are fine - Gheorghe Muresan didn't go to college - but it looks like we need to go back in and edit our merged.csv
to pop in Indiana, Ohio and a few other state.
While we can't solve everything with automation, but it did save us a bit of work!