import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
pd.options.display.max_columns = 30
%matplotlib inline
Analyzing text!
Text analysis has a few parts. We are going to use bag of words analysis, which just treats a sentence like a bag of words - no particular order or anything. It’s simple but it usually gets the job done adequately.
Here is our text.
texts = [
"Penny bought bright blue fishes.",
"Penny bought bright blue and orange fish.",
"The cat ate a fish at the store.",
"Penny went to the store. Penny ate a bug. Penny saw a fish.",
"It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.",
"The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
"Penny is a fish"
]
When you process text you have a nice long series of steps, but if we want to be general we’re interested in three things:
- Tokenizing converts all of the sentences/phrases/etc. into a series of words, and then it might also include converting them into a series of numbers - math stuff only works with numbers, not words. So maybe 'cat' is 2 and 'rug' is 4 and stuff like that.
- Counting takes those words and sees how many there are (obviously) - how many times does `meow` appear?
- Normalizing takes the count and makes new numbers - maybe it's how many times `meow` appears vs. how many total words there are, or maybe you're seeing how often `meow` comes up to see whether it's important.
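Just to make those three steps concrete, here's a rough, by-hand sketch on one sentence (using the `re` module we imported up top - this is only an illustration, the libraries below do it better):
# A bare-bones, by-hand version of tokenize -> count -> normalize
sentence = "Penny bought bright blue fishes."
# Tokenizing: lowercase it, strip the punctuation, split on spaces
tokens = re.sub(r"[^\w\s]", "", sentence.lower()).split()
# Counting: how many times does each word show up?
counts = {word: tokens.count(word) for word in set(tokens)}
# Normalizing: what share of the sentence is each word?
shares = {word: count / len(tokens) for word, count in counts.items()}
print(tokens)
print(shares)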
Why tokenizing and counting are difficult
Tokenizing and counting are kind of funny, because they seem like they'd be simple. If we have a sentence and want to figure out how many times some word appears in it, why do we need some fancy library? Can't we just use `.count` on our strings?
Let’s count the number of times “store” appears in the following sentence.
"I went to the store today, but the store was closed".count("store")
2
Seems simple enough, and it seems like it worked, but it’s a trick. Let’s find out how many times “can” shows up in the following sentence.
"The toucan doesn't like pelicans".count("can")
2
Oof! `.count` counts every time that string of characters shows up, not just the word on its own. So if "can" is inside "toucan" or "pelicans", `.count` still counts it.
"But!" you exclaim. "What if we split the string and use `.count` on the resulting list?" You'd be right! …kind of.
# .split() or .split(" ") will separate a string into a list of words
"The toucan doesn't like pelicans".split(" ")
['The', 'toucan', "doesn't", 'like', 'pelicans']
# Then we can count how many times 'can' appears in that list
"The toucan doesn't like pelicans".split(" ").count("can")
0
# Or 'toucan'
"The toucan doesn't like pelicans".split(" ").count("toucan")
1
This looks better and works great for that example, but once I start throwing punctuation in things get hairy.
"What about the mouse!!!!!".split(" ")
['What', 'about', 'the', 'mouse!!!!!']
Notice how instead of `"mouse"` we have `"mouse!!!!!"`? That's going to be a problem, since we won't be able to get an exact match.
"What about the mouse!!!!!".split(" ").count("mouse")
0
"What about the mouse!!!!!".split(" ").count("mouse!!!!!")
1
Sure, we might be able to get rid of the punctuation… and then later, make everything lowercase so "Mouse" and "mouse" are the same… and this, and that, and a thousand other things. That's why we rely on a library!
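Just to show how quickly the hand-rolled approach turns into a pile of special cases, here's a sketch of doing that cleanup ourselves with the `re` module - it works here, but you'd be adding rules forever:
# Hand-rolled cleanup: lowercase everything, strip punctuation, then split and count
text = "What about the Mouse!!!!!"
cleaned = re.sub(r"[^\w\s]", "", text.lower())
cleaned.split().count("mouse")
1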
Unfortunately there are about ten thousand libraries, each with their own strengths and weaknesses. For now we're going to be using scikit-learn, which is a machine learning library. We like it because it plugs into pandas very easily.
Penny and the fishes
texts = [
"Penny bought bright blue fishes.",
"Penny bought bright blue and orange fish.",
"The cat ate a fish at the store.",
"Penny went to the store. Penny ate a bug. Penny saw a fish.",
"It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.",
"The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
"Penny is a fish"
]
"Penny bought bright blue fishes".split()
['Penny', 'bought', 'bright', 'blue', 'fishes']
Penny bought bright blue fishes.
If we want to tokenize that sentence, we'd like to lowercase it, remove the punctuation and split on spaces - `penny bought bright blue fishes`. The same recipe also works for other languages:
у меня зазвонил телефон
That’s Russian for “my phone is ringing.” It works just as well with the tokenizer we used for English - lowercase it, remove punctuation, split on spaces. No big deal!
私はえんぴつです。
This is Japanese for "I am a pencil." It doesn't work with our tokenizer, since it doesn't have spaces. You can't just treat every character separately, either - 私 and は are their own thing, but えんぴつ means "pencil" and です is "to be."
“Eastern” languages need special tokenizing (and usually other treatment) when doing text analysis, mostly because they don’t have spaces. They’re collectively referred to as “CJK” languages, for Chinese, Japanese and Korean. It includes languages outside of those three, too, as long as they don’t adhere to the “just make it lowercase and split on spaces” rules. You’ll need to track down special tokenizers if you’re working with those languages.
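You can see the problem right away if you try our whitespace trick on the Japanese sentence above - with no spaces to split on, you get the entire sentence back as one giant "word," which is why you'd reach for a dedicated tokenizer for those languages instead:
# Splitting on spaces gives us back the whole sentence as a single token
"私はえんぴつです。".split(" ")
['私はえんぴつです。']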
Getting to work
The `scikit-learn` package does a ton of stuff, some of which includes the above. We're going to start by playing with the `CountVectorizer`, which helps us tokenize and count.
"What is vectorizing?!?!" you might (understandably) exclaim, hands formed into claws. It's just a stupid technical word that means "turning words into numbers." If you're thinking about worrying about it, don't worry about it.
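If it helps demystify the word, here's a tiny hand-made version of "turning words into numbers" - just assigning each unique word an id. The real vectorizer does this for us (plus the counting):
# "Vectorizing" by hand: give every unique word a number
words = "penny bought bright blue fishes".split()
vocabulary = {word: number for number, word in enumerate(sorted(set(words)))}
vocabulary
{'blue': 0, 'bought': 1, 'bright': 2, 'fishes': 3, 'penny': 4}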
First we need to create a `CountVectorizer` to do the work for us. It seems like a kind of useless line of code, but that's just because we aren't giving `CountVectorizer` any options yet.
# Import the CountVectorizer code from scikit-learn
# And create a new vectorizer that will do our counting later
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
Now we'll put it to use with the `.fit_transform` method. This basically means "figure out the words in the texts and count them all."
# .fit_transform TOKENIZES and COUNTS
matrix = vectorizer.fit_transform(texts)
Let’s take a look at what it found out!
matrix
<7x23 sparse matrix of type '<class 'numpy.int64'>'
with 49 stored elements in Compressed Sparse Row format>
Okay, that looks like trash and garbage. What's a "sparse matrix"??????
matrix.toarray()
array([[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 1, 0, 1, 1, 1, 1],
[1, 3, 0, 0, 0, 0, 1, 0, 3, 0, 1, 3, 2, 1, 1, 0, 0, 0, 1, 0, 4, 0, 0],
[0, 2, 0, 0, 0, 0, 0, 3, 2, 0, 3, 0, 0, 1, 0, 1, 0, 0, 0, 1, 5, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]], dtype=int64)
This is going to take a little imagination, but bear with me!
Each one of those rows is one of our texts. The first row represents our first sentence, the second row represents our second sentence, and so on.
sentence | representation |
---|---|
Penny bought bright blue fishes. | `[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]` |
Penny bought bright blue and orange fish. | `[1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]` |
The cat ate a fish at the store. | `[0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]` |
…etc | …etc |
Each number is the number of appearances of a possible word. We don’t know what the words they’re counting are just yet, but we know the first sentence has 0 of the first word, 0 of the second word, 0 of the third word, and 1 of words four, five and six. If we compare it to the second sentence, we see they have a lot of the same numbers in the columns - that means they’re similar sentences!
But let’s be honest: we can’t read that. It would look nicer as a dataframe.
So, well, let's take the `matrix` and make it a dataframe.
# You get an error doing pd.DataFrame(matrix), you need the .toarray() part
pd.DataFrame(matrix.toarray())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 1 | 1 | 1 |
4 | 1 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 3 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 4 | 0 | 0 |
5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Each row is one of our sentences, sure, but what do all of those numbers mean???? I want to know what word number 0 is, and word number 1, and word number 2, and so on!
Luckily when we used our vectorizer it remembered all of the words, so we can take a peek at them. When you’re doing machine learning, anything interesting about an object is called a feature. It’s interesting that these sentences have this word or that word, so they’re called features.
print(vectorizer.get_feature_names())
['and', 'at', 'ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'fishes', 'is', 'it', 'meowed', 'meowing', 'once', 'orange', 'penny', 'saw', 'still', 'store', 'the', 'to', 'went']
Because we’re excellent programmers, we know that we can combine the last two snippets of code, so we’ll
- create a dataframe of our matrix, while also
- specifying column names using our vocabulary list
and then we’ll have a beautiful dataframe of all of our word counts, with very nice titles.
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
and | at | ate | blue | bought | bright | bug | cat | fish | fishes | is | it | meowed | meowing | once | orange | penny | saw | still | store | the | to | went | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 1 | 1 | 1 |
4 | 1 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 3 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 4 | 0 | 0 |
5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
So the sentence in row 4 has "at" three times, the one in row 0 has "bought" once, and the one in row 5 has "the" five times.
Hmmm, ‘bought’ might be interesting, but “the” sure isn’t. Neither is “and” or “to,” I think. Who cares about boring words like that? Not me! I just think they’re cluttering up our dataframe.
We can actually have the vectorizer totally ignore them. Words you ignore are called stopwords, and they're a common way to simplify your text analysis. Most text analysis software comes with prebuilt lists of stopwords - we're just going to tell our vectorizer to use the standard list for English.
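If you're curious what's actually on that standard list, scikit-learn ships it as a frozenset you can peek at (a quick aside - you don't need to do this):
# Peek at a few of the words that stop_words='english' will throw away
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sorted(ENGLISH_STOP_WORDS)[:10]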
# We'll make a new vectorizer
vectorizer = CountVectorizer(stop_words='english')
# Find all the words and count them
matrix = vectorizer.fit_transform(texts)
# And let's look at the feature names
print(vectorizer.get_feature_names())
['ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'fishes', 'meowed', 'meowing', 'orange', 'penny', 'saw', 'store', 'went']
Looks good! See, no more “the” or “and” - life is perfect now. Let’s make another dataframe!
# We've done this before!
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
ate | blue | bought | bright | bug | cat | fish | fishes | meowed | meowing | orange | penny | saw | store | went | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 1 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Lemmatizing and stemming
I do have an issue with this, though - I see separate entries for `meowed` and `meowing`, and for `fish` and `fishes`, even though they kind of seem like the same words to me! I mean, they kind of are the same words: they're just different forms.
To do that kind of simplification, you can either stem or lemmatize your words. Stemming is a lazy way of chopping off endings, while lemmatizing means looking the word up in the dictionary to find out what the simple form of the word should be.
You can tell the difference with something like “running” vs “ran.”
word | stem | lemma |
---|---|---|
runner | run | run |
running | run | run |
ran | ran | run |
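To see that gap in action, here's a quick sketch using NLTK's `PorterStemmer` (which we imported at the very top of this notebook). It happily chops "running" and "fishes" down, but "ran" stays "ran" - that's the stem-vs-lemma difference:
# Stemming chops endings off by rule - fast, but "ran" never becomes "run"
stemmer = PorterStemmer()
[stemmer.stem(word) for word in ['running', 'ran', 'fishes', 'meowed', 'meowing']]
['run', 'ran', 'fish', 'meow', 'meow']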
Lemmas are better than stems, but they also take a lot more work and time! So how are we gonna do this? We pick another library!
TextBlob: Some people use TextBlob for lemmatizing. We ain’t gonna do it! It’s a nice library, but by default it thinks everything is a noun. That makes for bad lemmatizing, because it needs to understand that “running” is a verb to make it into “run.”
spaCy: We’re going to use spaCy instead. It’s like a cool new hip NLTK. It’s kind of weird until you get used to it, though.
spaCy: A weird but good text analysis thingie
spaCy is a dream, but a dream where sometimes your legs won’t move right and you can’t read text. But sometimes you can fly! So yes, as always, ups and downs.
import spacy
# Create a spacy natural-language processor for English
nlp = spacy.load('en')
Didn't that take forever to load? I'd like to think it's because it's SO SMART. spaCy knows an insane amount of information about each token in your text: the lemma, the detailed part of speech, whether it's on a stopword list, whether it's punctuation, and more. spaCy's documentation has the full list - here's an example of a few below!
doc = nlp("It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.")
print("Sentence: ", doc, "\n")
for token in doc[:7]:
    print(" original |", token.orth_)
    print(" lemma |", token.lemma_)
    print(" is punctuation? |", token.is_punct)
    print(" is numeric? |", token.is_digit)
    print(" is a stopword? |", token.is_stop)
    print(" part of speech |", token.pos_)
    print(" detailed POS |", token.tag_)
    print("")
Sentence: It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.
original | It
lemma | -PRON-
is punctuation? | False
is numeric? | False
is a stopword? | True
part of speech | PRON
detailed POS | PRP
original | meowed
lemma | meow
is punctuation? | False
is numeric? | False
is a stopword? | False
part of speech | VERB
detailed POS | VBD
original | once
lemma | once
is punctuation? | False
is numeric? | False
is a stopword? | True
part of speech | ADV
detailed POS | RB
original | at
lemma | at
is punctuation? | False
is numeric? | False
is a stopword? | True
part of speech | ADP
detailed POS | IN
original | the
lemma | the
is punctuation? | False
is numeric? | False
is a stopword? | True
part of speech | DET
detailed POS | DT
original | fish
lemma | fish
is punctuation? | False
is numeric? | False
is a stopword? | False
part of speech | NOUN
detailed POS | NN
original | ,
lemma | ,
is punctuation? | True
is numeric? | False
is a stopword? | False
part of speech | PUNCT
detailed POS | ,
Remember tokenization, where you break the sentence into words? By default spaCy includes punctuation in its list of tokens - notice the commas and periods that are included in the following.
tokens = [token for token in doc]
print(tokens)
[It, meowed, once, at, the, fish, ,, it, is, still, meowing, at, the, fish, ., It, meowed, at, the, bug, and, the, fish, .]
Aside: Why are we using a list comprehension? Because we have to. Because spaCy is weird.
Since we don't like the punctuation in our list of tokens, we can use an `if` condition in a list comprehension to say hey, we don't want things that are punctuation!
tokens = [token for token in doc if not token.is_punct]
print(tokens)
[It, meowed, once, at, the, fish, it, is, still, meowing, at, the, fish, It, meowed, at, the, bug, and, the, fish]
And then maybe we can convert them to lemmas…
lemmas = [token.lemma_ for token in tokens]
print(lemmas)
['-PRON-', 'meow', 'once', 'at', 'the', 'fish', '-PRON-', 'be', 'still', 'meow', 'at', 'the', 'fish', '-PRON-', 'meow', 'at', 'the', 'bug', 'and', 'the', 'fish']
spaCy decided that there's no basic form for pronouns, so it just replaced them all with `-PRON-`. I don't know how I feel about this, so I'm going to go ahead and make an insane list comprehension to keep the original version if it's a pronoun.
lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
print(lemmas)
['It', 'meow', 'once', 'at', 'the', 'fish', 'it', 'be', 'still', 'meow', 'at', 'the', 'fish', 'It', 'meow', 'at', 'the', 'bug', 'and', 'the', 'fish']
Since we’ve made a lot of changes, let’s put all of the code in one place to see how spaCy works.
import spacy
# Create a spacy natural-language processor for English
# This usually takes a little bit
nlp = spacy.load('en')
# Process our text with the spaCy processor
doc = nlp("It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.")
# Turn it into tokens, ignoring the punctuation
tokens = [token for token in doc if not token.is_punct]
# Convert those tokens into lemmas, EXCEPT the pronouns, we'll keep those.
lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
print(lemmas)
['It', 'meow', 'once', 'at', 'the', 'fish', 'it', 'be', 'still', 'meow', 'at', 'the', 'fish', 'It', 'meow', 'at', 'the', 'bug', 'and', 'the', 'fish']
Lemmatizing with a vectorizer
Now that we've figured out how to turn a sentence into lemmas, we need to put it to use with our `CountVectorizer`. Luckily we can just say "hey, vectorizer, we made a thingie that breaks the text into pieces, please use it instead of whatever you already use" and it will say "sounds great, I will use that."
def lemmatize(text):
    doc = nlp(text)
    # Turn it into tokens, ignoring the punctuation
    tokens = [token for token in doc if not token.is_punct]
    # Convert those tokens into lemmas, EXCEPT the pronouns, we'll keep those.
    lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
    return lemmas
We’ll make a new vectorizer, but now anytime it wants to break down text, it will send the text to our ‘lemmatize’ function that we wrote above.
vectorizer = CountVectorizer(stop_words='english', tokenizer=lemmatize)
# Find all the words and count them
matrix = vectorizer.fit_transform(texts)
matrix
<7x11 sparse matrix of type '<class 'numpy.int64'>'
with 30 stored elements in Compressed Sparse Row format>
What words did we get back?
print(vectorizer.get_feature_names())
['blue', 'bright', 'bug', 'buy', 'cat', 'eat', 'fish', 'meow', 'orange', 'penny', 'store']
Looks great! While we’re at it, can we see it as a dataframe?
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
blue | bright | bug | buy | cat | eat | fish | meow | orange | penny | store | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 1 | 1 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Beautiful.
TF-IDF
Part One: Term Frequency
TF-IDF? What? It means term frequency inverse document frequency! (of course!) TF-IDF is the most important thing in history, so we should probably learn it.
Let’s look at our list of phrases:
- Penny bought bright blue fishes.
- Penny bought bright blue and orange fish.
- The cat ate a fish at the store.
- Penny went to the store. Penny ate a bug. Penny saw a fish.
- It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.
- The cat is at the fish store. The cat is orange. The cat is meowing at the fish.
- Penny is a fish
If we're searching for the word `fish`, which is the most helpful phrase?
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
blue | bright | bug | buy | cat | eat | fish | meow | orange | penny | store | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 1 | 1 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Probably the one where `fish` appears three times.
It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.
So yeah, that clearly wins. But what about other sentences that talk about fish? Which one is more about fish?
Penny is a fish.
Penny went to the store. Penny ate a bug. Penny saw a fish.
Hmmm! They both have “fish” once, but the first one only talks about fish, while the second one talks about a bunch of other stuff, too. I would probably say the first one is more about fish!
Think about a huge long document where they say your name once, versus a tweet where they say your name once. Which one are you more important in? Probably the tweet, since you take up a larger percentage of the text!
Tada: this idea is called term frequency - taking into account how often a term shows up vs how big the text is.
Instead of just COUNTing the number of times a word shows up in a text, we’re now going to divide it by the total number of words. So with “Penny is a fish,” “fish” only shows up once but it’s 25% of the text.
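Here's that arithmetic by hand, just so the percentages in the next table aren't mysterious (the vectorizer below also strips stopwords and lemmatizes first, so its denominators end up a bit smaller):
# Term frequency by hand: times the word appears, divided by total words
sentence = "Penny is a fish"
words = sentence.lower().split()
words.count("fish") / len(words)
0.25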
To do this we're going to use scikit-learn's `TfidfVectorizer`. Don't fret, it's almost exactly the same as the `CountVectorizer` - it just does that division for you!
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=lemmatize, use_idf=False, norm='l1')
In terms of options, we're giving our TfidfVectorizer a handful:
- `stop_words='english'` to ignore words like 'and' and 'the'
- `tokenizer=lemmatize` to have it lemmatize the words using the function we wrote up above
- `use_idf=False` to keep it only using TF (term frequency), not IDF (inverse document frequency), which we'll talk about later
- `norm='l1'` so it only does simple division - X appearances divided by Y words
Now let’s use it!
matrix = tfidf_vectorizer.fit_transform(texts)
pd.DataFrame(matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())
blue | bright | bug | buy | cat | eat | fish | meow | orange | penny | store | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.200000 | 0.200000 | 0.000000 | 0.200000 | 0.000 | 0.000000 | 0.200000 | 0.000000 | 0.000000 | 0.200000 | 0.000000 |
1 | 0.166667 | 0.166667 | 0.000000 | 0.166667 | 0.000 | 0.000000 | 0.166667 | 0.000000 | 0.166667 | 0.166667 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.250 | 0.250000 | 0.250000 | 0.000000 | 0.000000 | 0.000000 | 0.250000 |
3 | 0.000000 | 0.000000 | 0.142857 | 0.000000 | 0.000 | 0.142857 | 0.142857 | 0.000000 | 0.000000 | 0.428571 | 0.142857 |
4 | 0.000000 | 0.000000 | 0.142857 | 0.000000 | 0.000 | 0.000000 | 0.428571 | 0.428571 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.375 | 0.000000 | 0.250000 | 0.125000 | 0.125000 | 0.000000 | 0.125000 |
6 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 |
Now our numbers have shifted a little bit. Instead of just being a count, it’s the percentage of the words.
value = (number of times word appears in sentence) / (number of words in sentence)
After we remove the stopwords and lemmatize, the term `fish` is 50% of the words in "Penny is a fish" vs. about 43% of the words in "It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish."
Note: We made it be the percentage of the words by passing in `norm='l1'` - by default it's an L2 (Euclidean) norm, which is actually better, but I thought it would make more sense to use the L1 norm - a.k.a. terms divided by total words.
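One nice side effect of `norm='l1'` that you can check for yourself: every row of the matrix adds up to 1, which is exactly what lets us read each value as "share of the (lemmatized, non-stopword) words."
# With norm='l1', each sentence's values sum to 1
pd.DataFrame(matrix.toarray(), columns=tfidf_vectorizer.get_feature_names()).sum(axis=1)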
So now when we search we'll get more relevant results, because it takes into account whether half of our words are `fish` or whether `fish` is 1% of millions upon millions of words. But we aren't done yet!
Part Two: Inverse document frequency
Let’s say we’re searching for “fish meow orange,” and want to find the most relevant document.
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=lemmatize, use_idf=False, norm='l1')
matrix = tfidf_vectorizer.fit_transform(texts)
df = pd.DataFrame(matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())
df
blue | bright | bug | buy | cat | eat | fish | meow | orange | penny | store | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.200000 | 0.200000 | 0.000000 | 0.200000 | 0.000 | 0.000000 | 0.200000 | 0.000000 | 0.000000 | 0.200000 | 0.000000 |
1 | 0.166667 | 0.166667 | 0.000000 | 0.166667 | 0.000 | 0.000000 | 0.166667 | 0.000000 | 0.166667 | 0.166667 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.250 | 0.250000 | 0.250000 | 0.000000 | 0.000000 | 0.000000 | 0.250000 |
3 | 0.000000 | 0.000000 | 0.142857 | 0.000000 | 0.000 | 0.142857 | 0.142857 | 0.000000 | 0.000000 | 0.428571 | 0.142857 |
4 | 0.000000 | 0.000000 | 0.142857 | 0.000000 | 0.000 | 0.000000 | 0.428571 | 0.428571 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.375 | 0.000000 | 0.250000 | 0.125000 | 0.125000 | 0.000000 | 0.125000 |
6 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 |
How do we know which is the best for "fish meow orange"? Should we just add up the scores and see who has the highest combined "fish" + "meow" + "orange"?
# Just add the columns together
added = pd.DataFrame([df['fish'], df['meow'], df['orange'], df['fish'] + df['meow'] + df['orange']], index=["fish", "meow", "orange", "fish + meow + orange"]).T
added['text'] = texts
added
fish | meow | orange | fish + meow + orange | text | |
---|---|---|---|---|---|
0 | 0.200000 | 0.000000 | 0.000000 | 0.200000 | Penny bought bright blue fishes. |
1 | 0.166667 | 0.000000 | 0.166667 | 0.333333 | Penny bought bright blue and orange fish. |
2 | 0.250000 | 0.000000 | 0.000000 | 0.250000 | The cat ate a fish at the store. |
3 | 0.142857 | 0.000000 | 0.000000 | 0.142857 | Penny went to the store. Penny ate a bug. Penn... |
4 | 0.428571 | 0.428571 | 0.000000 | 0.857143 | It meowed once at the fish, it is still meowin... |
5 | 0.250000 | 0.125000 | 0.125000 | 0.500000 | The cat is at the fish store. The cat is orang... |
6 | 0.500000 | 0.000000 | 0.000000 | 0.500000 | Penny is a fish |
Number 4 wins by far, but 5 and 6 are tied, both at 0.5. Those sentences are:
The cat is at the fish store. The cat is orange. The cat is meowing at the fish.
Penny is a fish.
Seems like BS to me! The second sentence doesn’t even have the words meow or orange.
It seems like since `fish` shows up again and again it should be weighted a little less - not like it's a stopword, but just… it's kind of cliche to have it show up in the text, so we want to make it less important. And since "orange" is super rare, the computer should be really excited about finding it somewhere.
This idea is called inverse document frequency - the more often a term shows up across all documents, the less important it is in our matrix. So TFIDF is:
- Term frequency: Does it show up a lot in our document? It’s important!
- Inverse document frequency: Does it show up a lot in all our documents? It’s boring!
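Here's a rough sketch of the arithmetic behind IDF, using the plain log(total documents / documents containing the word). scikit-learn's actual formula adds some smoothing, so these numbers won't match the table below, but the idea is the same: a word that's everywhere gets a weight near zero, a rare word gets a big one.
import numpy as np
# After lemmatizing, "fish" shows up in all 7 texts, "orange" in only 2
number_of_documents = 7
print(np.log(number_of_documents / 7))   # fish: 0.0 - it's everywhere, boring
print(np.log(number_of_documents / 2))   # orange: ~1.25 - it's rare, exciting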
# use_idf=True is default, but I'll leave it in
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=lemmatize, use_idf=True, norm='l1')
matrix = tfidf_vectorizer.fit_transform(texts)
tfidf_df = pd.DataFrame(matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())
tfidf_df
blue | bright | bug | buy | cat | eat | fish | meow | orange | penny | store | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.235463 | 0.235463 | 0.000000 | 0.235463 | 0.000000 | 0.000000 | 0.118871 | 0.000000 | 0.000000 | 0.174741 | 0.000000 |
1 | 0.190587 | 0.190587 | 0.000000 | 0.190587 | 0.000000 | 0.000000 | 0.096216 | 0.000000 | 0.190587 | 0.141437 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.297654 | 0.297654 | 0.150267 | 0.000000 | 0.000000 | 0.000000 | 0.254425 |
3 | 0.000000 | 0.000000 | 0.179021 | 0.000000 | 0.000000 | 0.179021 | 0.090377 | 0.000000 | 0.000000 | 0.398562 | 0.153021 |
4 | 0.000000 | 0.000000 | 0.181340 | 0.000000 | 0.000000 | 0.000000 | 0.274642 | 0.544019 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.437035 | 0.000000 | 0.147088 | 0.145678 | 0.145678 | 0.000000 | 0.124521 |
6 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.404858 | 0.000000 | 0.000000 | 0.595142 | 0.000000 |
Let’s compare our OLD values with our NEW values
# OLD: Just TF
added = pd.DataFrame([df['fish'], df['meow'], df['orange'], df['fish'] + df['meow'] + df['orange']], index=["fish", "meow", "orange", "fish + meow + orange"]).T
added['text'] = texts
added
fish | meow | orange | fish + meow + orange | text | |
---|---|---|---|---|---|
0 | 0.200000 | 0.000000 | 0.000000 | 0.200000 | Penny bought bright blue fishes. |
1 | 0.166667 | 0.000000 | 0.166667 | 0.333333 | Penny bought bright blue and orange fish. |
2 | 0.250000 | 0.000000 | 0.000000 | 0.250000 | The cat ate a fish at the store. |
3 | 0.142857 | 0.000000 | 0.000000 | 0.142857 | Penny went to the store. Penny ate a bug. Penn... |
4 | 0.428571 | 0.428571 | 0.000000 | 0.857143 | It meowed once at the fish, it is still meowin... |
5 | 0.250000 | 0.125000 | 0.125000 | 0.500000 | The cat is at the fish store. The cat is orang... |
6 | 0.500000 | 0.000000 | 0.000000 | 0.500000 | Penny is a fish |
# NEW: TF-IDF
added = pd.DataFrame([tfidf_df['fish'], tfidf_df['meow'], tfidf_df['orange'], tfidf_df['fish'] + tfidf_df['meow'] + tfidf_df['orange']], index=["fish", "meow", "orange", "fish + meow + orange"]).T
added['text'] = texts
added
fish | meow | orange | fish + meow + orange | text | |
---|---|---|---|---|---|
0 | 0.118871 | 0.000000 | 0.000000 | 0.118871 | Penny bought bright blue fishes. |
1 | 0.096216 | 0.000000 | 0.190587 | 0.286802 | Penny bought bright blue and orange fish. |
2 | 0.150267 | 0.000000 | 0.000000 | 0.150267 | The cat ate a fish at the store. |
3 | 0.090377 | 0.000000 | 0.000000 | 0.090377 | Penny went to the store. Penny ate a bug. Penn... |
4 | 0.274642 | 0.544019 | 0.000000 | 0.818660 | It meowed once at the fish, it is still meowin... |
5 | 0.147088 | 0.145678 | 0.145678 | 0.438444 | The cat is at the fish store. The cat is orang... |
6 | 0.404858 | 0.000000 | 0.000000 | 0.404858 | Penny is a fish |
If we look at sentence #4 specifically, "It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish."
word | before | after |
---|---|---|
fish | 0.428571 | 0.274642 |
meow | 0.428571 | 0.544019 |
Notice how `meow` increased in value because it's an infrequent term across all of the documents, and `fish` dropped in value because it's used all over the place.
If we want a more wild difference, let's try changing from simple percentage normalization to the default `l2` norm with `norm='l2'` (or by just removing `norm` completely).
# use_idf=True is default, but I'll leave it in
l2_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=lemmatize, use_idf=True)
matrix = l2_vectorizer.fit_transform(texts)
l2_df = pd.DataFrame(matrix.toarray(), columns=l2_vectorizer.get_feature_names())
l2_df
blue | bright | bug | buy | cat | eat | fish | meow | orange | penny | store | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.512612 | 0.512612 | 0.000000 | 0.512612 | 0.000000 | 0.000000 | 0.258786 | 0.000000 | 0.000000 | 0.380417 | 0.000000 |
1 | 0.456170 | 0.456170 | 0.000000 | 0.456170 | 0.000000 | 0.000000 | 0.230292 | 0.000000 | 0.456170 | 0.338530 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.578752 | 0.578752 | 0.292176 | 0.000000 | 0.000000 | 0.000000 | 0.494698 |
3 | 0.000000 | 0.000000 | 0.354840 | 0.000000 | 0.000000 | 0.354840 | 0.179137 | 0.000000 | 0.000000 | 0.789996 | 0.303305 |
4 | 0.000000 | 0.000000 | 0.285205 | 0.000000 | 0.000000 | 0.000000 | 0.431948 | 0.855616 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.840166 | 0.000000 | 0.282766 | 0.280055 | 0.280055 | 0.000000 | 0.239382 |
6 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.562463 | 0.000000 | 0.000000 | 0.826823 | 0.000000 |
# TF-IDF with the L1 norm (the same table as above, repeated for comparison)
added = pd.DataFrame([tfidf_df['fish'], tfidf_df['meow'], tfidf_df['orange'], tfidf_df['fish'] + tfidf_df['meow'] + tfidf_df['orange']], index=["fish", "meow", "orange", "fish + meow + orange"]).T
added['text'] = texts
added
fish | meow | orange | fish + meow + orange | text | |
---|---|---|---|---|---|
0 | 0.118871 | 0.000000 | 0.000000 | 0.118871 | Penny bought bright blue fishes. |
1 | 0.096216 | 0.000000 | 0.190587 | 0.286802 | Penny bought bright blue and orange fish. |
2 | 0.150267 | 0.000000 | 0.000000 | 0.150267 | The cat ate a fish at the store. |
3 | 0.090377 | 0.000000 | 0.000000 | 0.090377 | Penny went to the store. Penny ate a bug. Penn... |
4 | 0.274642 | 0.544019 | 0.000000 | 0.818660 | It meowed once at the fish, it is still meowin... |
5 | 0.147088 | 0.145678 | 0.145678 | 0.438444 | The cat is at the fish store. The cat is orang... |
6 | 0.404858 | 0.000000 | 0.000000 | 0.404858 | Penny is a fish |
# TF-IDF with the default L2 norm
added = pd.DataFrame([l2_df['fish'], l2_df['meow'], l2_df['orange'], l2_df['fish'] + l2_df['meow'] + l2_df['orange']], index=["fish", "meow", "orange", "fish + meow + orange"]).T
added['text'] = texts
added
fish | meow | orange | fish + meow + orange | text | |
---|---|---|---|---|---|
0 | 0.258786 | 0.000000 | 0.000000 | 0.258786 | Penny bought bright blue fishes. |
1 | 0.230292 | 0.000000 | 0.456170 | 0.686462 | Penny bought bright blue and orange fish. |
2 | 0.292176 | 0.000000 | 0.000000 | 0.292176 | The cat ate a fish at the store. |
3 | 0.179137 | 0.000000 | 0.000000 | 0.179137 | Penny went to the store. Penny ate a bug. Penn... |
4 | 0.431948 | 0.855616 | 0.000000 | 1.287564 | It meowed once at the fish, it is still meowin... |
5 | 0.282766 | 0.280055 | 0.280055 | 0.842876 | The cat is at the fish store. The cat is orang... |
6 | 0.562463 | 0.000000 | 0.000000 | 0.562463 | Penny is a fish |
LOOK AT HOW IMPORTANT MEOW IS. Meowing is out of this world important, because no one ever meows.
Who cares? Why do we need to know this?
When someone dumps 100,000 documents on your desk in response to FOIA, you’ll start to care! One of the reasons understanding TF-IDF is important is because of document similarity. By knowing what documents are similar you’re able to find related documents and automatically group documents into clusters.
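As a quick taste of what document similarity means before we get to clustering: you can compare the TF-IDF rows we already built (the `tfidf_df` from above) pair by pair with cosine similarity - the diagonal is 1.0 because every sentence matches itself perfectly.
# Cosine similarity between every pair of sentences, using their TF-IDF rows
from sklearn.metrics.pairwise import cosine_similarity
pd.DataFrame(cosine_similarity(tfidf_df))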
K-Means clustering
For example! Let's cluster these sentences using K-Means clustering (animated gifs of the algorithm are worth a web search). It's a magic way to say "these documents are similar to each other" that I'm not really going to explain very much of.
texts
['Penny bought bright blue fishes.',
'Penny bought bright blue and orange fish.',
'The cat ate a fish at the store.',
'Penny went to the store. Penny ate a bug. Penny saw a fish.',
'It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.',
'The cat is at the fish store. The cat is orange. The cat is meowing at the fish.',
'Penny is a fish']
# Initialize a vectorizer
# Not including use_idf=True because it's true by default
vectorizer = TfidfVectorizer(tokenizer=lemmatize, stop_words='english')
matrix = vectorizer.fit_transform(texts)
matrix
<7x11 sparse matrix of type '<class 'numpy.float64'>'
with 30 stored elements in Compressed Sparse Row format>
pd.DataFrame(matrix.toarray())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.512612 | 0.512612 | 0.000000 | 0.512612 | 0.000000 | 0.000000 | 0.258786 | 0.000000 | 0.000000 | 0.380417 | 0.000000 |
1 | 0.456170 | 0.456170 | 0.000000 | 0.456170 | 0.000000 | 0.000000 | 0.230292 | 0.000000 | 0.456170 | 0.338530 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.578752 | 0.578752 | 0.292176 | 0.000000 | 0.000000 | 0.000000 | 0.494698 |
3 | 0.000000 | 0.000000 | 0.354840 | 0.000000 | 0.000000 | 0.354840 | 0.179137 | 0.000000 | 0.000000 | 0.789996 | 0.303305 |
4 | 0.000000 | 0.000000 | 0.285205 | 0.000000 | 0.000000 | 0.000000 | 0.431948 | 0.855616 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.840166 | 0.000000 | 0.282766 | 0.280055 | 0.280055 | 0.000000 | 0.239382 |
6 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.562463 | 0.000000 | 0.000000 | 0.826823 | 0.000000 |
STEP ONE: “Hey, k-means, cluster my documents into two categories”
# KMeans clustering is a method of clustering.
from sklearn.cluster import KMeans
number_of_clusters = 2
# Create the classifier
km = KMeans(n_clusters=number_of_clusters)
# Put them into categories
km.fit(matrix)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
STEP TWO: “Hey, k-means, tell me about the categories”
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))
Top terms per cluster:
Cluster 0: cat meow fish store eat
Cluster 1: penny fish buy bright blue
So we have a cluster about cats and fish and a cluster about Penny and fish. You can see the categories by looking at `km.labels_` - it goes in the same order as your original text documents.
km.labels_
array([1, 1, 0, 1, 0, 0, 1], dtype=int32)
texts
['Penny bought bright blue fishes.',
'Penny bought bright blue and orange fish.',
'The cat ate a fish at the store.',
'Penny went to the store. Penny ate a bug. Penny saw a fish.',
'It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.',
'The cat is at the fish store. The cat is orange. The cat is meowing at the fish.',
'Penny is a fish']
But obviously we’re going to make a dataframe out of it because we love pandas and readability.
results = pd.DataFrame()
results['text'] = texts
results['category'] = km.labels_
results
text | category | |
---|---|---|
0 | Penny bought bright blue fishes. | 1 |
1 | Penny bought bright blue and orange fish. | 1 |
2 | The cat ate a fish at the store. | 0 |
3 | Penny went to the store. Penny ate a bug. Penn... | 1 |
4 | It meowed once at the fish, it is still meowin... | 0 |
5 | The cat is at the fish store. The cat is orang... | 0 |
6 | Penny is a fish | 1 |
Seems like it makes sense! The fun thing about k-means is you can demand however many categories you want, and it has no choice but to comply.
3 categories of documents
from sklearn.cluster import KMeans
number_of_clusters = 3
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))
Top terms per cluster:
Cluster 0: cat meow fish store eat
Cluster 1: buy bright blue penny fish
Cluster 2: penny fish eat bug store
results = pd.DataFrame()
results['text'] = texts
results['category'] = km.labels_
results
text | category | |
---|---|---|
0 | Penny bought bright blue fishes. | 1 |
1 | Penny bought bright blue and orange fish. | 1 |
2 | The cat ate a fish at the store. | 0 |
3 | Penny went to the store. Penny ate a bug. Penn... | 2 |
4 | It meowed once at the fish, it is still meowin... | 0 |
5 | The cat is at the fish store. The cat is orang... | 0 |
6 | Penny is a fish | 2 |
Let’s try to graph the similarity
texts = ['Penny bought bright blue fishes.',
'Penny bought bright blue and orange bowl.',
'The cat ate a fish at the store.',
'Penny went to the store. Penny ate a bug. Penny saw a fish.',
'It meowed once at the bug, it is still meowing at the bug and the fish',
'The cat is at the fish store. The cat is orange. The cat is meowing at the fish.',
'Penny is a fish.',
'Penny Penny she loves fishes Penny Penny is no cat.',
'The store is closed now.',
'How old is that tree?',
'I do not eat fish I do not eat cats I only eat bugs']
# Initialize a vectorizer
# "max_features" means "only figure out 2 words to use"
vectorizer = TfidfVectorizer(max_features=2, tokenizer=lemmatize, stop_words='english')
matrix = vectorizer.fit_transform(texts)
What are our two features it decided were important?
vectorizer.get_feature_names()
['fish', 'penny']
So every single row has a rating for how much about fish it is, and a rating for how much about Penny it is. We can look at it in a dataframe!
df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
df
fish | penny | |
---|---|---|
0 | 0.605349 | 0.795961 |
1 | 0.000000 | 1.000000 |
2 | 1.000000 | 0.000000 |
3 | 0.245735 | 0.969337 |
4 | 1.000000 | 0.000000 |
5 | 1.000000 | 0.000000 |
6 | 0.605349 | 0.795961 |
7 | 0.186785 | 0.982401 |
8 | 0.000000 | 0.000000 |
9 | 0.000000 | 0.000000 |
10 | 1.000000 | 0.000000 |
And hey, if we have numbers, we can make a graph!
ax = df.plot(kind='scatter', x='fish', y='penny', alpha=0.1, s=300)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")
(Scatter plot: each sentence plotted by how much it's about fish on the x-axis vs. how much it's about Penny on the y-axis.)
Sentences in the bottom right are not about Penny and are a lot about fish, ones on the top left are about Penny and not fish, and ones in the bottom left aren’t about either.
What does k-means have to say about this? It’ll hopefully group sentences that are similar to each other.
from sklearn.cluster import KMeans
number_of_clusters = 3
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)
df['category'] = km.labels_
df
fish | penny | category | |
---|---|---|---|
0 | 0.605349 | 0.795961 | 0 |
1 | 0.000000 | 1.000000 | 0 |
2 | 1.000000 | 0.000000 | 1 |
3 | 0.245735 | 0.969337 | 0 |
4 | 1.000000 | 0.000000 | 1 |
5 | 1.000000 | 0.000000 | 1 |
6 | 0.605349 | 0.795961 | 0 |
7 | 0.186785 | 0.982401 | 0 |
8 | 0.000000 | 0.000000 | 2 |
9 | 0.000000 | 0.000000 | 2 |
10 | 1.000000 | 0.000000 | 1 |
color_list = ['r', 'b', 'g', 'y']
colors = [color_list[i] for i in df['category']]
ax = df.plot(kind='scatter', x='fish', y='penny', alpha=0.1, s=300, c=colors)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")
(The same scatter plot, now with each point colored by its k-means category.)
AW YEAH IT WORKED!
Let's go further!!! Who cares about two dimensions, what about THREE? We'll pick the top three predictive words, graph them in a 3d-ish chart, and color them according to the category k-means assigns.
# Initialize a vectorizer
# use_idf=True by default
vectorizer = TfidfVectorizer(max_features=3, tokenizer=lemmatize, stop_words='english')
matrix = vectorizer.fit_transform(texts)
vectorizer.get_feature_names()
['cat', 'fish', 'penny']
df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
df
cat | fish | penny | |
---|---|---|---|
0 | 0.000000 | 0.605349 | 0.795961 |
1 | 0.000000 | 0.000000 | 1.000000 |
2 | 0.824391 | 0.566020 | 0.000000 |
3 | 0.000000 | 0.245735 | 0.969337 |
4 | 0.000000 | 1.000000 | 0.000000 |
5 | 0.909273 | 0.416200 | 0.000000 |
6 | 0.000000 | 0.605349 | 0.795961 |
7 | 0.262506 | 0.180235 | 0.947948 |
8 | 0.000000 | 0.000000 | 0.000000 |
9 | 0.000000 | 0.000000 | 0.000000 |
10 | 0.824391 | 0.566020 | 0.000000 |
from sklearn.cluster import KMeans
number_of_clusters = 4
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)
# Set the category using the k-means labels
df['category'] = km.labels_
# Set the text column from the sentences
df['text'] = texts
df
cat | fish | penny | category | text | |
---|---|---|---|---|---|
0 | 0.000000 | 0.605349 | 0.795961 | 1 | Penny bought bright blue fishes. |
1 | 0.000000 | 0.000000 | 1.000000 | 1 | Penny bought bright blue and orange bowl. |
2 | 0.824391 | 0.566020 | 0.000000 | 2 | The cat ate a fish at the store. |
3 | 0.000000 | 0.245735 | 0.969337 | 1 | Penny went to the store. Penny ate a bug. Penn... |
4 | 0.000000 | 1.000000 | 0.000000 | 0 | It meowed once at the bug, it is still meowing... |
5 | 0.909273 | 0.416200 | 0.000000 | 2 | The cat is at the fish store. The cat is orang... |
6 | 0.000000 | 0.605349 | 0.795961 | 1 | Penny is a fish. |
7 | 0.262506 | 0.180235 | 0.947948 | 1 | Penny Penny she loves fishes Penny Penny is no... |
8 | 0.000000 | 0.000000 | 0.000000 | 3 | The store is closed now. |
9 | 0.000000 | 0.000000 | 0.000000 | 3 | How old is that tree? |
10 | 0.824391 | 0.566020 | 0.000000 | 2 | I do not eat fish I do not eat cats I only eat... |
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def draw(ax, df):
    color_list = ['r', 'b', 'g', 'y']
    colors = [color_list[i] for i in df['category']]
    # (markers are set up here but never passed to scatter - the colors alone show the clusters)
    marker_list = ['o', 'x', 'v', 'X']
    markers = [marker_list[i] for i in df['category']]
    ax.scatter(df['fish'], df['penny'], df['cat'], c=colors, s=100, alpha=0.5)
    ax.set_xlabel('Fish')
    ax.set_ylabel('Penny')
    ax.set_zlabel('Cat')
chart_count_vert = 5
chart_count_horiz = 5
number_of_graphs = chart_count_vert * chart_count_horiz
fig = plt.figure(figsize=(3 * chart_count_horiz, 3 * chart_count_vert))
for i in range(number_of_graphs):
    ax = fig.add_subplot(chart_count_horiz, chart_count_vert, i + 1, projection='3d', azim=(-360 / number_of_graphs) * i)
    draw(ax, df)
Weeeeee