Installing some packages

You’ll need to install a handful of packages.

pip3 install textblob
python3 -m textblob.download_corpora
pip3 install spacy
python3 -m spacy download en
pip3 install scikit-learn

The libraries we’ll be using

TextBlob: “Simplified text processing”

Super simple, super easy, super misleading.

https://textblob.readthedocs.io/en/dev/

spaCy: “Industrial-Strength Natural Language Processing”

Super powerful, super incomplete (kind of), super nice website, super full of gotchas.

https://spacy.io/

scikit-learn: “Machine Learning in Python”

Super powerful (in a different way than spaCy), super popular, super not focused on text analysis.

http://scikit-learn.org/

Natural Language Toolkit (NLTK): “A leading platform for building Python programs to work with human language data”

Super old, super popular, super difficult to work with.

http://www.nltk.org/

Hey look we’re sentiment analysis experts now

With practically no knowledge and absolutely no respect for truth or accuracy, we can now analyze sentiment using TextBlob!

from textblob import TextBlob

blob = TextBlob("I hate driving the car.")
blob.sentiment
Sentiment(polarity=-0.8, subjectivity=0.9)
from textblob import TextBlob

blob = TextBlob("I don't know if i love driving the car SO MUCH.")
blob.sentiment
Sentiment(polarity=0.35, subjectivity=0.4)
tweet = '''I am very disappointed in China. Our foolish past leaders 
have allowed them to make hundreds of billions of dollars a year in trade, yet
they do NOTHING for us with North Korea, just talk. We will no longer allow this
to continue. China could easily solve this problem!'''

TextBlob(tweet).sentiment
Sentiment(polarity=-0.22777777777777777, subjectivity=0.6861111111111112)

Don’t ever do this unless you are reading movie reviews! You can use the documentation to build your own easily enough. We’re going to talk about how to do it step-by-step in other ways, too.
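
By the way, TextBlob does let you swap in a different analyzer. The sketch below uses its NaiveBayesAnalyzer, which is trained on a movie reviews corpus (so it's the one case where the warning above almost doesn't apply). It's slow the first time you run it, and your exact numbers will vary.

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Same sentence as before, but scored with the Naive Bayes analyzer
# instead of the default pattern-based one
blob = TextBlob("I hate driving the car.", analyzer=NaiveBayesAnalyzer())
blob.sentiment
# Something like Sentiment(classification='neg', p_pos=..., p_neg=...)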

Counting words with .count

When you first start counting words in Python, you’ll probably just try out .count. Let’s take a sentence and see how many times “fish” is used in it.

"I tried to fish for fish but I didn't catch any fish".count("fish")
3

Pretty simple, right? They don’t always need to be sentences, either, you can do this on longer text, too. Paragraphs, multiple lines, even whole books!

The big downside is that it’s going to count any instance of your word, even if it’s in the middle of a word.

"The canny toucan can't recant about the pelican's scant canteloupe".count("can")
7

That sentence has "can" in it at most once, so we’re going to need to do something better.

Tokenization

So I guess that isn’t going to work! Luckily for us, people worked for billions of years to solve this problem using something called tokenization. Tokenization in this situation means “splitting up a sentence into the parts that matter.”

You can think of it as breaking the sentence apart into words.

fish_sentence = "I tried to fish for fish but I didn't catch any fish"
fish_sentence.split(" ")
['I',
 'tried',
 'to',
 'fish',
 'for',
 'fish',
 'but',
 'I',
 "didn't",
 'catch',
 'any',
 'fish']
fish_sentence = "I tried to fish for fish but I didn't catch any fish"
fish_list = fish_sentence.split(" ")
fish_list.count("fish")
3
toucan_sentence = "The canny toucan can't recant about the pelican's scant canteloupe"
toucan_list = toucan_sentence.split(" ")
toucan_list.count("can")
0

Improving our tokenization

Splitting on spaces and counting words is fiiiine, but problems start to show up with capital letters and punctuation.

dinner_sentence = "Dinner was great tonight, I enjoyed the potatoes."
dinner_list = dinner_sentence.split(" ")
dinner_list
['Dinner', 'was', 'great', 'tonight,', 'I', 'enjoyed', 'the', 'potatoes.']
# How many times does 'dinner' appear?
dinner_list.count("dinner")
0
# How many times does 'potatoes' appear?
dinner_list.count("potatoes")
0
dinner_sentence = "Dinner was great tonight, I enjoyed the potatoes."
dinner_sentence = dinner_sentence.lower().replace(".", "")
dinner_list = dinner_sentence.split(" ")
dinner_list
['dinner', 'was', 'great', 'tonight,', 'i', 'enjoyed', 'the', 'potatoes']
dinner_list.count("dinner")
1
dinner_list.count("potatoes")
1
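
Notice that 'tonight,' still has the comma stuck to it, because we only replaced periods. One blunt-but-broader option (just a sketch using a regular expression) is to strip all punctuation at once:

import re

dinner_sentence = "Dinner was great tonight, I enjoyed the potatoes."
# Lowercase, then remove everything that isn't a letter, number or whitespace
cleaned = re.sub(r"[^\w\s]", "", dinner_sentence.lower())
cleaned.split(" ")
['dinner', 'was', 'great', 'tonight', 'i', 'enjoyed', 'the', 'potatoes']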

Tokenizing with TextBlob

TextBlob is easy to use, but not as full-featured as spaCy.

# First we'll import TextBlob
from textblob import TextBlob
# Then we'll use TextBlob
blob = TextBlob("The dangerous cats ran dangerously toward dangers.")

for token in blob.tokens:
    print(token)
The
dangerous
cats
ran
dangerously
toward
dangers
.
for word in blob.words:
    print(word)
The
dangerous
cats
ran
dangerously
toward
dangers
blob = TextBlob("I am a sentence")
blob.words
WordList(['I', 'am', 'a', 'sentence'])
blob = TextBlob("у меня зазвонил телефон")
blob.words
WordList(['у', 'меня', 'зазвонил', 'телефон'])
blob = TextBlob("私は鉛筆です")
blob.words
WordList(['私は鉛筆です'])

Notice how it keeps the period when you use .tokens?

Tokenizing with spaCy

spaCy can do anything, but is a bit more difficult to use than TextBlob.

# First we'll import spaCy
import spacy
nlp = spacy.load('en')
# Then we'll use spaCy
doc = nlp("The dangerous cats ran dangerously toward dangers")
tokens = [token for token in doc]
tokens
[The, dangerous, cats, ran, dangerously, toward, dangers]

Other magic with TextBlob

You’ll probably wind up using TextBlob for things, so here’s some other fun stuff it can do!

text = '''Today I went driving to the 
grocery store.  I hate to drive the car, 
but I love visiting the gas station!'''

All of the words

This is like tokens, but gets rid of the punctuation.

doc = TextBlob(text)
doc.words
WordList(['Today', 'I', 'went', 'driving', 'to', 'the', 'grocery', 'store', 'I', 'hate', 'to', 'drive', 'the', 'car', 'but', 'I', 'love', 'visiting', 'the', 'gas', 'station'])

Noun phrases

doc = TextBlob(text)
doc.noun_phrases
WordList(['grocery store', 'gas station'])

All of the sentences

doc = TextBlob(text)
doc.sentences
[Sentence("Today I went driving to the 
 grocery store."), Sentence("I hate to drive the car, 
 but I love visiting the gas station!")]
[len(sent.words) for sent in doc.sentences]
[8, 13]

n-grams

doc = TextBlob("I went to the pet store to buy a fish.")
doc.ngrams(2)
[WordList(['I', 'went']),
 WordList(['went', 'to']),
 WordList(['to', 'the']),
 WordList(['the', 'pet']),
 WordList(['pet', 'store']),
 WordList(['store', 'to']),
 WordList(['to', 'buy']),
 WordList(['buy', 'a']),
 WordList(['a', 'fish'])]
doc = TextBlob("He really likes cars. I don't like cars, but she said she would buy a car for the moon. He screamed.".lower())

# Shorter way
[ngram for ngram in doc.ngrams(2) if ngram[0] == 'he']

# Longer way
for ngram in doc.ngrams(2):
    if ngram[0] == 'he':
        print(ngram)
['he', 'really']
['he', 'screamed']

What can a word do?

doc = TextBlob("The dangerous cats ran dangerously toward dangers.")
words = doc.words
words
WordList(['The', 'dangerous', 'cats', 'ran', 'dangerously', 'toward', 'dangers'])
words[2]
'cats'
words[2].pluralize()
'catss'
words[2].pluralize().pluralize().pluralize().pluralize()
'catssesses'
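
Pluralizing isn’t the only trick, either. Word objects have a handful of other methods; for example (a quick sketch of just one more), .singularize() goes the other direction:

words[2].singularize()
'cat'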

Stems and lemmas with TextBlob

This is something a word can do, but it’s pretty important.

blob = TextBlob("The dangerous cats ran dangerously toward dangers.")
words = blob.words

for word in words:
    print("ORIGINAL:", word, "| LEMMA:", word.lemmatize(), "| STEM:", word.stem())
ORIGINAL: The | LEMMA: The | STEM: the
ORIGINAL: dangerous | LEMMA: dangerous | STEM: danger
ORIGINAL: cats | LEMMA: cat | STEM: cat
ORIGINAL: ran | LEMMA: ran | STEM: ran
ORIGINAL: dangerously | LEMMA: dangerously | STEM: danger
ORIGINAL: toward | LEMMA: toward | STEM: toward
ORIGINAL: dangers | LEMMA: danger | STEM: danger
blob.tokens.count('danger')
0
blob.words
WordList(['The', 'dangerous', 'cats', 'ran', 'dangerously', 'toward', 'dangers'])

The lemmatizing in TextBlob is terrible. It thinks everything is a noun, and you have to specifically tell it otherwise.

TextBlob("fairies").words[0].lemmatize()
'fairy'
TextBlob("fairies").words[0].stem()
'fairi'
TextBlob("cats").words[0].lemmatize()
'cat'
TextBlob("running").words[0].lemmatize()
'running'
TextBlob("running").words[0].stem()
'run'
TextBlob("ran").words[0].stem()
'ran'
TextBlob("running").words[0].lemmatize('v')
'run'
TextBlob("ran").words[0].lemmatize('v')
'run'
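
If you don’t want to hand-type 'v' for every verb, one rough workaround (a sketch, not something built into TextBlob) is to grab each word’s part-of-speech tag from blob.tags and translate it into the single-letter code that .lemmatize() expects:

from textblob import TextBlob, Word

blob = TextBlob("The dangerous cats ran dangerously toward dangers.")

# Translate Penn Treebank tags (what blob.tags gives you) into the
# single-letter codes that .lemmatize() understands
def lemma_pos(tag):
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # treat everything else as a noun

for word, tag in blob.tags:
    print(word, "->", Word(word).lemmatize(lemma_pos(tag)))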

Lemmas with spaCy

spaCy is much more powerful than TextBlob, but it’s a little more difficult to use.

# First we'll import spaCy
import spacy
nlp = spacy.load('en')
# Then we'll use spaCy
doc = nlp("The dangerous cats ran dangerously toward dangers.")
tokens = [token for token in doc]
tokens
[The, dangerous, cats, ran, dangerously, toward, dangers, .]
# Then we'll use spaCy
doc = nlp("The dangerous cats ran dangerously toward dangers.")
tokens = [token.lemma for token in doc]
tokens
[501, 2321, 2481, 1022, 18671, 3185, 4541, 453]
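
Those numbers are spaCy’s internal IDs for each lemma, which isn’t very readable. If you want the actual strings back, the underscore version of the attribute, token.lemma_, should do it:

doc = nlp("The dangerous cats ran dangerously toward dangers.")
[token.lemma_ for token in doc]
# Strings instead of IDs, e.g. 'cat' for 'cats' and 'run' for 'ran'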

Sentiment with spaCy???

phrases = [
    'i love cars', 
    'i hate cars', 
    'i butter cars', 
    'misery and gloomy pain cars'
]

for phrase in phrases:
    doc = nlp(phrase)
    print("The sentiment for", doc, "is", doc[0].sentiment)
The sentiment for i love cars is 0.0
The sentiment for i hate cars is 0.0
The sentiment for i butter cars is 0.0
The sentiment for misery and gloomy pain cars is 0.0
words = ['love', 'hate', 'butter', 'misery and gloomy pain']
for word in words:
    blob = TextBlob(word)
    print("The sentiment for", word, "is", blob.sentiment)
The sentiment for love is Sentiment(polarity=0.5, subjectivity=0.6)
The sentiment for hate is Sentiment(polarity=-0.8, subjectivity=0.9)
The sentiment for butter is Sentiment(polarity=0.0, subjectivity=0.0)
The sentiment for misery and gloomy pain is Sentiment(polarity=0.0, subjectivity=0.0)

Intro to scikit-learn (sklearn)

Scikit-learn is for machine learning, which it turns out is kind of what we’re doing.

phrases = [
    'i love cars', 
    'i hate cars', 
    'cars butter cars', 
    'misery and gloomy pain cars',
    'the cars hate butter'
]

Words into numbers: Vectorization

The process of converting words (which computers can’t understand) to numbers (which computers can understand) is called vectorization. So what we’re about to do is…. vectorization!

We take our vectorizer - the CountVectorizer - from a part of scikit-learn called feature_extraction.text. In machine learning, “features” are things that make an object unique. It’s how you compare things and classify things and know how each thing is different.

In this case, our things are sentences, and you know how they’re different because they have different words: our “features” are word counts. That’s why this next line reads like it does!

from sklearn.feature_extraction.text import CountVectorizer

Now that we’ve imported the vectorizer - which will convert words to numbers - we need to use the vectorizer to count all of the words in our sentences. CountVectorizer always takes a list of documents, never just one! It’s kind of boring to compare one document to nothing.

# Give me a THING that will count words for me!!!!!
vec = CountVectorizer()
# I have some sentences, please count the words in them
matrix = vec.fit_transform(phrases)
# What did you find?????
matrix
<5x9 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

A sparse matrix is just a Python way of saying “we have a list of lists but there isn’t much stuff in it, so I don’t want to show it to you.” We’ll use .toarray() to see the result.

Each row is a sentence, and each column is a word. The numbers are how many times that word appears in that sentence.

# Our content has been.... VECTORIZED!!!!
matrix.toarray()
array([[0, 0, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 1, 2, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0, 0, 1, 1, 0],
       [0, 1, 1, 0, 1, 0, 0, 0, 1]], dtype=int64)

Looks kind of ugly, though. Let’s make a dataframe out of it!

import pandas as pd

pd.DataFrame(matrix.toarray())
   0  1  2  3  4  5  6  7  8
0  0  0  1  0  0  1  0  0  0
1  0  0  1  0  1  0  0  0  0
2  0  1  2  0  0  0  0  0  0
3  1  0  1  1  0  0  1  1  0
4  0  1  1  0  1  0  0  0  1

That’s a little bit nicer, but if only there were a way to get a list of the original words. I don’t know which word “column number 2” is!

Oh but there is a way to get the words:

vec.get_feature_names()
['and', 'butter', 'cars', 'gloomy', 'hate', 'love', 'misery', 'pain', 'the']

And hey, what if we set it in as the column names for the dataframe???

docs = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
docs
   and  butter  cars  gloomy  hate  love  misery  pain  the
0    0       0     1       0     0     1       0     0    0
1    0       0     1       0     1     0       0     0    0
2    0       1     2       0     0     0       0     0    0
3    1       0     1       1     0     0       1     1    0
4    0       1     1       0     1     0       0     0    1

Oh boy. Such magic. Which sentence includes ‘cars’ the most times?

docs.sort_values(by='cars', ascending=False)
   and  butter  cars  gloomy  hate  love  misery  pain  the
2    0       1     2       0     0     0       0     0    0
0    0       0     1       0     0     1       0     0    0
1    0       0     1       0     1     0       0     0    0
3    1       0     1       1     0     0       1     1    0
4    0       1     1       0     1     0       0     0    1
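
And since every column is a word, you can also add up the columns to see which words show up the most across all of the sentences (a quick sketch using pandas):

# Total count for each word across every sentence, biggest first
docs.sum().sort_values(ascending=False)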