Since books are different based on the words that are inside of them, comparing books by comparing words seems to make sense.
%matplotlib inline
import pandas as pd
Let’s compare JRR Tolkien novels with Jane Austen novels
But before we begin, a reminder that you can read in files like this:
# contents = open("file.txt").read()
the_hobbit = open("tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt").read()
# Let's pull something out of the middle, maybe?
the_hobbit[7500:8000]
'lf our height, and smaller than the bearded Dwarves. Hobbits have no beards. There is little or no magic about them, except the ordinary everyday sort which helps them to disappear quietly and quickly when large stupid folk like you and me come blundering along, making a noise like elephants which they can hear a mile off. They are inclined to be fat in the stomach; they dress in bright colours (chiefly green and yellow); wear no shoes, because their feet grow natural leathery soles and thick wa'
Reading in many text files
The best way to read in many text files is to keep each category in its own
subdirectory, then use glob.glob
to find all of the filenames
import glob
filenames = glob.glob("tolkien/*.txt")
filenames
['tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt',
'tolkien/Lord of the Rings - 01 - The Fellowship of the Ring - J. R. R. Tolkien - 1955.txt',
'tolkien/Lord of the Rings - 02 - The Two Towers - J. R. R. Tolkien - 1965.txt',
'tolkien/Lord of the Rings - 03 - The Return of the King - J. R. R. Tolkien - 1965.txt']
contents = [open(file).read() for file in filenames]
tolkien_df = pd.DataFrame({
'filename': filenames,
'body': contents,
'author': 'JRR Tolkien'
})
tolkien_df.head()
author | body | filename | |
---|---|---|---|
0 | JRR Tolkien | THE HOBBIT\n\nOR\n\nTHERE AND BACK AGAIN\n\nBY... | tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt |
1 | JRR Tolkien | THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE ... | tolkien/Lord of the Rings - 01 - The Fellowshi... |
2 | JRR Tolkien | THE TWO TOWERS\n\n\nBEING THE SECOND PART OF\n... | tolkien/Lord of the Rings - 02 - The Two Tower... |
3 | JRR Tolkien | THE RETURN\n\nOF THE KING\n\n\n\n\nBEING THE T... | tolkien/Lord of the Rings - 03 - The Return of... |
# Get a list of all of the files inside of austen/
filenames = glob.glob("austen/*.txt")
# Read in each of those files, save the results
contents = [open(file).read() for file in filenames]
# And use all of that to build a dataframe
austen_df = pd.DataFrame({
'filename': filenames,
'body': contents,
'author': "Jane Austen"
})
austen_df.head()
author | body | filename | |
---|---|---|---|
0 | Jane Austen | The Project Gutenberg EBook of Emma, by Jane A... | austen/emma - 1815.txt |
1 | Jane Austen | The Project Gutenberg EBook of Mansfield Park,... | austen/mansfield_park - 1814.txt |
2 | Jane Austen | The Project Gutenberg EBook of Northanger Abbe... | austen/northanger_abbey - 1817.txt |
3 | Jane Austen | The Project Gutenberg EBook of Persuasion, by ... | austen/persuasion - 1817.txt |
4 | Jane Austen | The Project Gutenberg EBook of Pride and Preju... | austen/pride-and-prejudice - 1813.txt |
Now let’s combine them!
df = pd.concat([tolkien_df, austen_df], ignore_index=True)
df
author | body | filename | |
---|---|---|---|
0 | JRR Tolkien | THE HOBBIT\n\nOR\n\nTHERE AND BACK AGAIN\n\nBY... | tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt |
1 | JRR Tolkien | THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE ... | tolkien/Lord of the Rings - 01 - The Fellowshi... |
2 | JRR Tolkien | THE TWO TOWERS\n\n\nBEING THE SECOND PART OF\n... | tolkien/Lord of the Rings - 02 - The Two Tower... |
3 | JRR Tolkien | THE RETURN\n\nOF THE KING\n\n\n\n\nBEING THE T... | tolkien/Lord of the Rings - 03 - The Return of... |
4 | Jane Austen | The Project Gutenberg EBook of Emma, by Jane A... | austen/emma - 1815.txt |
5 | Jane Austen | The Project Gutenberg EBook of Mansfield Park,... | austen/mansfield_park - 1814.txt |
6 | Jane Austen | The Project Gutenberg EBook of Northanger Abbe... | austen/northanger_abbey - 1817.txt |
7 | Jane Austen | The Project Gutenberg EBook of Persuasion, by ... | austen/persuasion - 1817.txt |
8 | Jane Austen | The Project Gutenberg EBook of Pride and Preju... | austen/pride-and-prejudice - 1813.txt |
9 | Jane Austen | The Project Gutenberg EBook of Sense and Sensi... | austen/sense-and-sensibility - 1811.txt |
Counting words
What words are we interested in? Right now just he and she.
df['body']
0 THE HOBBIT\n\nOR\n\nTHERE AND BACK AGAIN\n\nBY...
1 THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE ...
2 THE TWO TOWERS\n\n\nBEING THE SECOND PART OF\n...
3 THE RETURN\n\nOF THE KING\n\n\n\n\nBEING THE T...
4 The Project Gutenberg EBook of Emma, by Jane A...
5 The Project Gutenberg EBook of Mansfield Park,...
6 The Project Gutenberg EBook of Northanger Abbe...
7 The Project Gutenberg EBook of Persuasion, by ...
8 The Project Gutenberg EBook of Pride and Preju...
9 The Project Gutenberg EBook of Sense and Sensi...
Name: body, dtype: object
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary=['he', 'she'],
use_idf=False,
norm='l1') # ELL - ONE
matrix = vectorizer.fit_transform(df['body'])
matrix
<10x2 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
matrix.toarray()
array([[ 9.99479167e-01, 5.20833333e-04],
[ 9.50636448e-01, 4.93635517e-02],
[ 9.63263864e-01, 3.67361356e-02],
[ 9.19724051e-01, 8.02759486e-02],
[ 4.36008677e-01, 5.63991323e-01],
[ 4.07700422e-01, 5.92299578e-01],
[ 3.32725061e-01, 6.67274939e-01],
[ 4.56872038e-01, 5.43127962e-01],
[ 4.39113170e-01, 5.60886830e-01],
[ 4.09373856e-01, 5.90626144e-01]])
When we do this analysis, giving in a vocabulary=
, we’re getting back a
percentage of usage.
- if a book uses “he” once and “she” once, it’s going to be 0.5 and 0.5
- if a book uses “he” a hundred times and “she” a hundred times, it’s still going to be 0.5 and 0.5.
- if a book is ten million pages long and only uses the word ‘he’ once and the word ‘she’ none at all, it’s going to be 1.0
You need to think, do you care about the proportion relative to the other words?
results = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
results
he | she | |
---|---|---|
0 | 0.999479 | 0.000521 |
1 | 0.950636 | 0.049364 |
2 | 0.963264 | 0.036736 |
3 | 0.919724 | 0.080276 |
4 | 0.436009 | 0.563991 |
5 | 0.407700 | 0.592300 |
6 | 0.332725 | 0.667275 |
7 | 0.456872 | 0.543128 |
8 | 0.439113 | 0.560887 |
9 | 0.409374 | 0.590626 |
Store our original information back into the dataframe
df['author']
0 JRR Tolkien
1 JRR Tolkien
2 JRR Tolkien
3 JRR Tolkien
4 Jane Austen
5 Jane Austen
6 Jane Austen
7 Jane Austen
8 Jane Austen
9 Jane Austen
Name: author, dtype: object
# If you get an error here, you forgot ignore_index
# when doing pd.concat
results['author'] = df['author']
results['filename'] = df['filename']
results
he | she | author | filename | |
---|---|---|---|---|
0 | 0.999479 | 0.000521 | JRR Tolkien | tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt |
1 | 0.950636 | 0.049364 | JRR Tolkien | tolkien/Lord of the Rings - 01 - The Fellowshi... |
2 | 0.963264 | 0.036736 | JRR Tolkien | tolkien/Lord of the Rings - 02 - The Two Tower... |
3 | 0.919724 | 0.080276 | JRR Tolkien | tolkien/Lord of the Rings - 03 - The Return of... |
4 | 0.436009 | 0.563991 | Jane Austen | austen/emma - 1815.txt |
5 | 0.407700 | 0.592300 | Jane Austen | austen/mansfield_park - 1814.txt |
6 | 0.332725 | 0.667275 | Jane Austen | austen/northanger_abbey - 1817.txt |
7 | 0.456872 | 0.543128 | Jane Austen | austen/persuasion - 1817.txt |
8 | 0.439113 | 0.560887 | Jane Austen | austen/pride-and-prejudice - 1813.txt |
9 | 0.409374 | 0.590626 | Jane Austen | austen/sense-and-sensibility - 1811.txt |
colormap = {
'Jane Austen': '#66c2a5',
'JRR Tolkien': '#fc8d62'
}
colormap['Jane Austen']
'#66c2a5'
colormap['JRR Tolkien']
'#fc8d62'
df.author
0 JRR Tolkien
1 JRR Tolkien
2 JRR Tolkien
3 JRR Tolkien
4 Jane Austen
5 Jane Austen
6 Jane Austen
7 Jane Austen
8 Jane Austen
9 Jane Austen
Name: author, dtype: object
colormap = {
'Jane Austen': '#66c2a5',
'JRR Tolkien': '#fc8d62'
}
# STEP TWO: Use .apply to convert it into a list of colors
# df.author = get the author column
# .apply = do something to each row
# colormap[authorname] = get the value from the dictionary we just made
colors = df.author.apply(lambda authorname: colormap[authorname])
colors
0 #fc8d62
1 #fc8d62
2 #fc8d62
3 #fc8d62
4 #66c2a5
5 #66c2a5
6 #66c2a5
7 #66c2a5
8 #66c2a5
9 #66c2a5
Name: author, dtype: object
import matplotlib.pyplot as plt
# STEP ONE: Make a dictionary with column values
# as keys, and then the color as the value
colormap = {
'Jane Austen': '#66c2a5',
'JRR Tolkien': '#fc8d62'
}
# STEP TWO: Use .apply to convert it into a list of colors
colors = df.author.apply(lambda authorname: colormap[authorname])
# df.author = give me just the author column
# .apply = do something for every single row
# colormap
ax = results.plot(y='she', kind='barh', x='filename', color=[colors], legend=False)
ax.set_title("JRR Tolkien is a man's man")
<matplotlib.text.Text at 0x110855dd8>
Trying it again
This time we’re going to count all of the other words, too. Then we’ll pull out ‘he’ and ‘she’ specifically.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
matrix = vectorizer.fit_transform(df['body'])
matrix
<10x24446 sparse matrix of type '<class 'numpy.float64'>'
with 76411 stored elements in Compressed Sparse Row format>
Have ALL of the words available
results = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
results
000 | 007 | 05 | 10 | 100 | 1000 | 1001 | 1002 | 10022 | 1003 | ... | óinand | óinto | ómaryo | ónen | óre | únótime | únótimë | úre | úrimë | úrui | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000011 | 0.00000 | 0.000021 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000011 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.000000 | 0.000005 | 0.00000 | 0.000027 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000005 | 0.000000 | ... | 0.000005 | 0.000005 | 0.000005 | 0.00000 | 0.000000 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 |
2 | 0.000000 | 0.000007 | 0.00000 | 0.000033 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000007 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3 | 0.000000 | 0.000005 | 0.00001 | 0.000131 | 0.000034 | 0.000048 | 0.000078 | 0.000058 | 0.000005 | 0.000019 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00001 | 0.000005 | 0.000005 | 0.000000 | 0.000005 | 0.000005 | 0.000005 |
4 | 0.000019 | 0.000000 | 0.00000 | 0.000013 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
5 | 0.000013 | 0.000000 | 0.00000 | 0.000019 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
6 | 0.000013 | 0.000000 | 0.00000 | 0.000013 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
7 | 0.000012 | 0.000000 | 0.00000 | 0.000012 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
8 | 0.000008 | 0.000000 | 0.00000 | 0.000008 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
9 | 0.000008 | 0.000000 | 0.00000 | 0.000008 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
10 rows × 24446 columns
But then only pull out the ones we want
results = results[['he', 'she']]
results
he | she | |
---|---|---|
0 | 0.020264 | 0.000011 |
1 | 0.016356 | 0.000849 |
2 | 0.017832 | 0.000680 |
3 | 0.014221 | 0.001241 |
4 | 0.011475 | 0.014843 |
5 | 0.009785 | 0.014215 |
6 | 0.007019 | 0.014077 |
7 | 0.011486 | 0.013655 |
8 | 0.010952 | 0.013989 |
9 | 0.009404 | 0.013568 |
Copy our author and filename back over
results['author'] = df['author']
results['filename'] = df['filename']
results
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""Entry point for launching an IPython kernel.
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
he | she | author | filename | |
---|---|---|---|---|
0 | 0.020264 | 0.000011 | JRR Tolkien | tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt |
1 | 0.016356 | 0.000849 | JRR Tolkien | tolkien/Lord of the Rings - 01 - The Fellowshi... |
2 | 0.017832 | 0.000680 | JRR Tolkien | tolkien/Lord of the Rings - 02 - The Two Tower... |
3 | 0.014221 | 0.001241 | JRR Tolkien | tolkien/Lord of the Rings - 03 - The Return of... |
4 | 0.011475 | 0.014843 | Jane Austen | austen/emma - 1815.txt |
5 | 0.009785 | 0.014215 | Jane Austen | austen/mansfield_park - 1814.txt |
6 | 0.007019 | 0.014077 | Jane Austen | austen/northanger_abbey - 1817.txt |
7 | 0.011486 | 0.013655 | Jane Austen | austen/persuasion - 1817.txt |
8 | 0.010952 | 0.013989 | Jane Austen | austen/pride-and-prejudice - 1813.txt |
9 | 0.009404 | 0.013568 | Jane Austen | austen/sense-and-sensibility - 1811.txt |
import matplotlib.pyplot as plt
colormap = {
'Jane Austen': '#66c2a5',
'JRR Tolkien': '#fc8d62'
}
colors = df.author.apply(lambda authorname: colormap[authorname])
ax = results.plot(y='she', kind='barh', x='filename', color=[colors], legend=False)
ax.set_title("JRR Tolkien is a man's man")
<matplotlib.text.Text at 0x1109e3a90>
import matplotlib.pyplot as plt
colormap = {
'Jane Austen': '#66c2a5',
'JRR Tolkien': '#fc8d62'
}
colors = df.author.apply(lambda authorname: colormap[authorname])
ax = results.plot(x='she', y='he', kind='scatter', color=colors, legend=False, xlim=(0,0.02), ylim=(0,0.02))
ax.set_title("JRR Tolkien is a man's man")
<matplotlib.text.Text at 0x110a0ab70>