Installing some packages
You’ll need to install a handful of packages.
```bash
pip3 install textblob
python3 -m textblob.download_corpora
pip3 install spacy
python3 -m spacy download en
pip3 install scikit-learn
```
The libraries we’ll be using
TextBlob: “Simplified text processing”
Super simple, super easy, super misleading.
https://textblob.readthedocs.io/en/dev/
spaCy: “Industrial-Strength Natural Language Processing”
Super powerful, super incomplete (kind of), super nice website, super full of gotchas.
https://spacy.io/
scikit-learn: “Machine Learning in Python”
Super powerful (in a different way than spaCy), super popular, super not focused on text analysis.
http://scikit-learn.org/
Natural Language Toolkit (NLTK): “A leading platform for building Python programs to work with human language data”
Super old, super popular, super difficult to work with.
http://www.nltk.org/
Hey look we’re sentiment analysis experts now
With practically no knowledge and absolutely no respect for truth or accuracy, we can now analyze sentiment using TextBlob!
Sentiment(polarity=-0.8, subjectivity=0.9)
Sentiment(polarity=0.35, subjectivity=0.4)
Sentiment(polarity=-0.22777777777777777, subjectivity=0.6861111111111112)
Don’t ever do this unless you are reading movie reviews! You can use the documentation to build your own easily enough. We’re going to talk about how to do it step-by-step in other ways, too.
Counting words with .count
When you first start counting words in Python, you’ll probably just try out .count. Let’s take a sentence and see how many times “fish” is used in it.
3
Pretty simple, right? They don’t always need to be sentences, either; you can do this on longer text, too. Paragraphs, multiple lines, even whole books!
The big downside is that it’s going to count any instance of your word, even if it’s in the middle of a word.
7
That sentence has "can" in it at most once, so we’re going to need to do something better.
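Here’s a sketch of both behaviors. The fish sentence matches the token list in the next section; the second sentence is my own stand-in for the one that produced the 7.

```python
sentence = "I tried to fish for fish but I didn't catch any fish"
print(sentence.count("fish"))  # 3 — counts every occurrence of the substring

# .count matches substrings, so "can" gets found inside other words too
tricky = "A toucan can scan cans, but the pelican cannot"
print(tricky.count("can"))  # 6, even though "can" is its own word only once
```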
Tokenization
So I guess that isn’t going to work! Luckily for us, people worked for billions of years to solve this problem using something called tokenization. Tokenization in this situation means “splitting up a sentence into the parts that matter.”
You can think of it as breaking the sentence apart into words.
['I',
'tried',
'to',
'fish',
'for',
'fish',
'but',
'I',
"didn't",
'catch',
'any',
'fish']
3
0
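The token list above can be produced with plain old .split(). Counting items in the list matches whole tokens only; the second count below is my guess at what produced the 0.

```python
sentence = "I tried to fish for fish but I didn't catch any fish"
tokens = sentence.split()
print(tokens)
print(tokens.count("fish"))  # 3 — whole tokens only now
print(tokens.count("Fish"))  # 0 — my guess at what produced the 0 above
```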
Improving our tokenization
Splitting on spaces and counting words is fiiiine, but problems start to show up with capital letters and punctuation.
['Dinner', 'was', 'great', 'tonight,', 'I', 'enjoyed', 'the', 'potatoes.']
0
0
['dinner', 'was', 'great', 'tonight,', 'i', 'enjoyed', 'the', 'potatoes']
1
1
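A sketch of the before-and-after, assuming .lower() plus stripping the period is how the second list was made:

```python
sentence = "Dinner was great tonight, I enjoyed the potatoes."
words = sentence.split()
print(words)
print(words.count("dinner"))    # 0 — "Dinner" is capitalized
print(words.count("potatoes"))  # 0 — the token is "potatoes."

# Lowercase everything and strip the period before splitting
cleaned = sentence.lower().replace(".", "").split()
print(cleaned)
print(cleaned.count("dinner"))    # 1
print(cleaned.count("potatoes"))  # 1
```

Notice the comma in 'tonight,' is still hanging around, though.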
Tokenizing with TextBlob
TextBlob is easy to use, but not as full-featured as spaCy.
The
dangerous
cats
ran
dangerously
toward
dangers
.
The
dangerous
cats
ran
dangerously
toward
dangers
WordList(['I', 'am', 'a', 'sentence'])
WordList(['у', 'меня', 'зазвонил', 'телефон'])
WordList(['私は鉛筆です'])
Notice how it keeps the period when you use .tokens?
Tokenizing with spaCy
spaCy can do anything, but is a bit more difficult to use than TextBlob.
[The, dangerous, cats, ran, dangerously, toward, dangers]
Other magic with TextBlob
You’ll probably wind up using TextBlob for things, so here’s some other fun stuff it can do!
All of the words
This is like tokens, but gets rid of the punctuation.
WordList(['Today', 'I', 'went', 'driving', 'to', 'the', 'grocery', 'store', 'I', 'hate', 'to', 'drive', 'the', 'car', 'but', 'I', 'love', 'visiting', 'the', 'gas', 'station'])
Noun phrases
WordList(['grocery store', 'gas station'])
All of the sentences
[Sentence("Today I went driving to the grocery store."), Sentence("I hate to drive the car, but I love visiting the gas station!")]
[8, 13]
n-grams
[WordList(['I', 'went']),
WordList(['went', 'to']),
WordList(['to', 'the']),
WordList(['the', 'pet']),
WordList(['pet', 'store']),
WordList(['store', 'to']),
WordList(['to', 'buy']),
WordList(['buy', 'a']),
WordList(['a', 'fish'])]
['he', 'really']
['he', 'screamed']
What can a word do?
WordList(['The', 'dangerous', 'cats', 'ran', 'dangerously', 'toward', 'dangers'])
'cats'
'catss'
'catssesses'
Stems and lemmas with TextBlob
This is something a word can do, but it’s pretty important.
ORIGINAL: The | LEMMA: The | STEM: the
ORIGINAL: dangerous | LEMMA: dangerous | STEM: danger
ORIGINAL: cats | LEMMA: cat | STEM: cat
ORIGINAL: ran | LEMMA: ran | STEM: ran
ORIGINAL: dangerously | LEMMA: dangerously | STEM: danger
ORIGINAL: toward | LEMMA: toward | STEM: toward
ORIGINAL: dangers | LEMMA: danger | STEM: danger
The lemmatizing in TextBlob is terrible. It thinks everything is a noun, and you have to specifically tell it otherwise.
'fairy'
'fairi'
'cat'
'running'
'run'
'ran'
'run'
'run'
Tokenizing with spaCy
spaCy is much more powerful than TextBlob, but it’s a little more difficult to use.
[The, dangerous, cats, ran, dangerously, toward, dangers, .]
[501, 2321, 2481, 1022, 18671, 3185, 4541, 453]
Sentiment with spaCy???
The sentiment for i love cars is 0.0
The sentiment for i hate cars is 0.0
The sentiment for i butter cars is 0.0
The sentiment for misery and gloomy pain cars is 0.0
The sentiment for love is Sentiment(polarity=0.5, subjectivity=0.6)
The sentiment for hate is Sentiment(polarity=-0.8, subjectivity=0.9)
The sentiment for butter is Sentiment(polarity=0.0, subjectivity=0.0)
The sentiment for misery and gloomy pain is Sentiment(polarity=0.0, subjectivity=0.0)
Intro to scikit-learn (sklearn)
Scikit-learn is for machine learning, which it turns out is kind of what we’re doing.
Words into numbers: Vectorization
The process of converting words (which computers can’t understand) into numbers (which computers can understand) is called vectorization. So what we’re about to do is… vectorization!
We take our vectorizer - the CountVectorizer - from a part of scikit-learn called feature_extraction.text. In machine learning, “features” are things that make an object unique. It’s how you compare things and classify things and know how each thing is different.
In this case, our things are sentences, and you know how they’re different because they have different words: our “features” are word counts. That’s why this next line reads like it does!
Now that we’ve imported the vectorizer - which will convert words to numbers - we need to use the vectorizer to count all of the words in our sentences. CountVectorizer always takes a list of documents, never just one! It’s kind of boring to compare one document to nothing.
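A sketch of that step. The original five sentences aren’t shown, so these are reconstructed from the earlier examples to match the counts that follow:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Reconstructed stand-ins for the original five sentences
texts = [
    "i love cars",
    "i hate cars",
    "i butter cars cars",
    "misery and gloomy pain cars",
    "i hate the butter cars",
]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(texts)
print(repr(matrix))
```

Note that CountVectorizer lowercases everything and drops one-letter words like “i” by default.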
<5x9 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in Compressed Sparse Row format>
A sparse matrix is just a Python way of saying “we have a list of lists, but there isn’t much stuff in it, so I don’t want to show it to you.” We’ll use .toarray() to see the result.
Each row is a sentence, and each column is a word. The numbers are how many times that word appears in that sentence.
array([[0, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 1, 1, 0],
[0, 1, 1, 0, 1, 0, 0, 0, 1]], dtype=int64)
Looks kind of ugly, though. Let’s make a dataframe out of it!
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
That’s a little bit nicer, but if only there were a way to get a list of the original words. I don’t know which words “column number 2” is!
Oh but there is a way to get the words:
['and', 'butter', 'cars', 'gloomy', 'hate', 'love', 'misery', 'pain', 'the']
And hey, what if we set it in as the column names for the dataframe???
|   | and | butter | cars | gloomy | hate | love | misery | pain | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
Oh boy. Such magic. Which sentence includes ‘cars’ the most times?
|   | and | butter | cars | gloomy | hate | love | misery | pain | the |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |