Books differ in the words they use, so comparing word usage is a reasonable way to compare books.

%matplotlib inline
import pandas as pd

Let’s compare JRR Tolkien’s novels with Jane Austen’s novels.

But before we begin, a reminder that you can read in files like this:

# contents = open("file.txt").read()
the_hobbit = open("tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt").read()
# Let's pull something out of the middle, maybe?
the_hobbit[7500:8000]
'lf our height, and smaller than the bearded Dwarves. Hobbits have no beards. There is little or no magic about them, except the ordinary everyday sort which helps them to disappear quietly and quickly when large stupid folk like you and me come blundering along, making a noise like elephants which they can hear a mile off. They are inclined to be fat in the stomach; they dress in bright colours (chiefly green and yellow); wear no shoes, because their feet grow natural leathery soles and thick wa'

Reading in many text files

A convenient way to read in many text files is to keep each category in its own subdirectory, then use glob.glob to collect all of the filenames in it.

import glob

filenames = glob.glob("tolkien/*.txt")
filenames
['tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt',
 'tolkien/Lord of the Rings - 01 - The Fellowship of the Ring - J. R. R. Tolkien - 1955.txt',
 'tolkien/Lord of the Rings - 02 - The Two Towers - J. R. R. Tolkien - 1965.txt',
 'tolkien/Lord of the Rings - 03 - The Return of the King - J. R. R. Tolkien - 1965.txt']
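One thing to keep in mind is that glob.glob does not promise any particular order, so wrapping it in sorted() keeps the rows in the same order from run to run. The sketch below is only an illustration: it builds a throwaway folder with made-up file names (a_first.txt, b_second.txt, notes.md) rather than using the real tolkien/ directory.

```python
import glob
import os
import tempfile

# Build a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as folder:
    for name in ["b_second.txt", "a_first.txt", "notes.md"]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("sample text for " + name)

    # sorted() gives a stable order; glob.glob alone does not guarantee one
    filenames = sorted(glob.glob(os.path.join(folder, "*.txt")))

    # A with-block closes each file as soon as it has been read
    contents = []
    for filename in filenames:
        with open(filename, encoding="utf-8") as f:
            contents.append(f.read())

basenames = [os.path.basename(name) for name in filenames]
print(basenames)
```

Only the two .txt files are picked up; notes.md is ignored because it does not match the pattern.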
contents = [open(file).read() for file in filenames]
tolkien_df = pd.DataFrame({
    'filename': filenames,
    'body': contents,
    'author': 'JRR Tolkien'
})
tolkien_df.head()
author body filename
0 JRR Tolkien THE HOBBIT\n\nOR\n\nTHERE AND BACK AGAIN\n\nBY... tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt
1 JRR Tolkien THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE ... tolkien/Lord of the Rings - 01 - The Fellowshi...
2 JRR Tolkien THE TWO TOWERS\n\n\nBEING THE SECOND PART OF\n... tolkien/Lord of the Rings - 02 - The Two Tower...
3 JRR Tolkien THE RETURN\n\nOF THE KING\n\n\n\n\nBEING THE T... tolkien/Lord of the Rings - 03 - The Return of...
# Get a list of all of the files inside of austen/
filenames = glob.glob("austen/*.txt")
# Read in each of those files, save the results
contents = [open(file).read() for file in filenames]
# And use all of that to build a dataframe
austen_df = pd.DataFrame({
    'filename': filenames,
    'body': contents,
    'author': "Jane Austen"
})
austen_df.head()
author body filename
0 Jane Austen The Project Gutenberg EBook of Emma, by Jane A... austen/emma - 1815.txt
1 Jane Austen The Project Gutenberg EBook of Mansfield Park,... austen/mansfield_park - 1814.txt
2 Jane Austen The Project Gutenberg EBook of Northanger Abbe... austen/northanger_abbey - 1817.txt
3 Jane Austen The Project Gutenberg EBook of Persuasion, by ... austen/persuasion - 1817.txt
4 Jane Austen The Project Gutenberg EBook of Pride and Preju... austen/pride-and-prejudice - 1813.txt

Now let’s combine them!

df = pd.concat([tolkien_df, austen_df], ignore_index=True)
df
author body filename
0 JRR Tolkien THE HOBBIT\n\nOR\n\nTHERE AND BACK AGAIN\n\nBY... tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt
1 JRR Tolkien THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE ... tolkien/Lord of the Rings - 01 - The Fellowshi...
2 JRR Tolkien THE TWO TOWERS\n\n\nBEING THE SECOND PART OF\n... tolkien/Lord of the Rings - 02 - The Two Tower...
3 JRR Tolkien THE RETURN\n\nOF THE KING\n\n\n\n\nBEING THE T... tolkien/Lord of the Rings - 03 - The Return of...
4 Jane Austen The Project Gutenberg EBook of Emma, by Jane A... austen/emma - 1815.txt
5 Jane Austen The Project Gutenberg EBook of Mansfield Park,... austen/mansfield_park - 1814.txt
6 Jane Austen The Project Gutenberg EBook of Northanger Abbe... austen/northanger_abbey - 1817.txt
7 Jane Austen The Project Gutenberg EBook of Persuasion, by ... austen/persuasion - 1817.txt
8 Jane Austen The Project Gutenberg EBook of Pride and Preju... austen/pride-and-prejudice - 1813.txt
9 Jane Austen The Project Gutenberg EBook of Sense and Sensi... austen/sense-and-sensibility - 1811.txt
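To see what ignore_index=True is doing, here is a small sketch with two tiny made-up dataframes (first and second are stand-ins, not the real book data). Without it, each dataframe keeps its own row labels, so the combined index contains repeats.

```python
import pandas as pd

# Two tiny stand-in dataframes, each with its own 0, 1 index
first = pd.DataFrame({'author': ['JRR Tolkien', 'JRR Tolkien']})
second = pd.DataFrame({'author': ['Jane Austen', 'Jane Austen']})

# Without ignore_index, each frame keeps its original row labels
stacked = pd.concat([first, second])
print(list(stacked.index))

# With ignore_index=True the rows are renumbered from 0
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))
```

A clean 0-to-n index matters later on, when columns from df are copied into another dataframe row by row.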

Counting words

What words are we interested in? Right now just “he” and “she”.

df['body']
0    THE HOBBIT\n\nOR\n\nTHERE AND BACK AGAIN\n\nBY...
1    THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE ...
2    THE TWO TOWERS\n\n\nBEING THE SECOND PART OF\n...
3    THE RETURN\n\nOF THE KING\n\n\n\n\nBEING THE T...
4    The Project Gutenberg EBook of Emma, by Jane A...
5    The Project Gutenberg EBook of Mansfield Park,...
6    The Project Gutenberg EBook of Northanger Abbe...
7    The Project Gutenberg EBook of Persuasion, by ...
8    The Project Gutenberg EBook of Pride and Preju...
9    The Project Gutenberg EBook of Sense and Sensi...
Name: body, dtype: object
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(vocabulary=['he', 'she'], 
                             use_idf=False, 
                             norm='l1') # ELL - ONE
matrix = vectorizer.fit_transform(df['body'])
matrix
<10x2 sparse matrix of type '<class 'numpy.float64'>'
	with 20 stored elements in Compressed Sparse Row format>
matrix.toarray()
array([[  9.99479167e-01,   5.20833333e-04],
       [  9.50636448e-01,   4.93635517e-02],
       [  9.63263864e-01,   3.67361356e-02],
       [  9.19724051e-01,   8.02759486e-02],
       [  4.36008677e-01,   5.63991323e-01],
       [  4.07700422e-01,   5.92299578e-01],
       [  3.32725061e-01,   6.67274939e-01],
       [  4.56872038e-01,   5.43127962e-01],
       [  4.39113170e-01,   5.60886830e-01],
       [  4.09373856e-01,   5.90626144e-01]])

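When we pass in a vocabulary= (with use_idf=False and norm='l1'), each row is scaled so its values add up to 1. What we get back is each word’s share of the combined “he” and “she” count, not its share of the whole book.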

  • if a book uses “he” once and “she” once, it’s going to be 0.5 and 0.5
  • if a book uses “he” a hundred times and “she” a hundred times, it’s still going to be 0.5 and 0.5.
  • if a book is ten million pages long and only uses the word “he” once and the word “she” not at all, it’s going to be 1.0 and 0.0

So you need to decide: do you care about “he” relative to “she”, or about each word relative to all of the other words in the book?
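The three cases above can be checked by hand. This is a minimal sketch using only the standard library, not the vectorizer itself: share_of_vocabulary is a hypothetical helper, and its regular expression only roughly mirrors TfidfVectorizer's default tokenizer (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

def share_of_vocabulary(text, vocabulary=("he", "she")):
    """Return each vocabulary word's share of the vocabulary-only total."""
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(token for token in tokens if token in vocabulary)
    total = sum(counts.values())
    # Avoid dividing by zero when no vocabulary word appears
    return {word: (counts[word] / total if total else 0.0) for word in vocabulary}

once_each = share_of_vocabulary("He left. She stayed.")
hundred_each = share_of_vocabulary("he she " * 100)
long_book = share_of_vocabulary("the road goes ever on " * 1000 + "he")
print(once_each, hundred_each, long_book)
```

The length of the book never enters the calculation, which is why the very long text with a single “he” still comes out as all “he”.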

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
results = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
results
he she
0 0.999479 0.000521
1 0.950636 0.049364
2 0.963264 0.036736
3 0.919724 0.080276
4 0.436009 0.563991
5 0.407700 0.592300
6 0.332725 0.667275
7 0.456872 0.543128
8 0.439113 0.560887
9 0.409374 0.590626

Store our original information back into the dataframe

df['author']
0    JRR Tolkien
1    JRR Tolkien
2    JRR Tolkien
3    JRR Tolkien
4    Jane Austen
5    Jane Austen
6    Jane Austen
7    Jane Austen
8    Jane Austen
9    Jane Austen
Name: author, dtype: object
# If you get an error here, you forgot ignore_index
# when doing pd.concat
results['author'] = df['author']
results['filename'] = df['filename']
results
he she author filename
0 0.999479 0.000521 JRR Tolkien tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt
1 0.950636 0.049364 JRR Tolkien tolkien/Lord of the Rings - 01 - The Fellowshi...
2 0.963264 0.036736 JRR Tolkien tolkien/Lord of the Rings - 02 - The Two Tower...
3 0.919724 0.080276 JRR Tolkien tolkien/Lord of the Rings - 03 - The Return of...
4 0.436009 0.563991 Jane Austen austen/emma - 1815.txt
5 0.407700 0.592300 Jane Austen austen/mansfield_park - 1814.txt
6 0.332725 0.667275 Jane Austen austen/northanger_abbey - 1817.txt
7 0.456872 0.543128 Jane Austen austen/persuasion - 1817.txt
8 0.439113 0.560887 Jane Austen austen/pride-and-prejudice - 1813.txt
9 0.409374 0.590626 Jane Austen austen/sense-and-sensibility - 1811.txt
colormap = {
    'Jane Austen': '#66c2a5',
    'JRR Tolkien': '#fc8d62'
}
colormap['Jane Austen']
'#66c2a5'
colormap['JRR Tolkien']
'#fc8d62'
df.author
0    JRR Tolkien
1    JRR Tolkien
2    JRR Tolkien
3    JRR Tolkien
4    Jane Austen
5    Jane Austen
6    Jane Austen
7    Jane Austen
8    Jane Austen
9    Jane Austen
Name: author, dtype: object
colormap = {
    'Jane Austen': '#66c2a5',
    'JRR Tolkien': '#fc8d62'
}

# STEP TWO: Use .apply to convert it into a list of colors
# df.author = get the author column
# .apply = do something to each row
# colormap[authorname] = get the value from the dictionary we just made
colors = df.author.apply(lambda authorname: colormap[authorname])
colors
0    #fc8d62
1    #fc8d62
2    #fc8d62
3    #fc8d62
4    #66c2a5
5    #66c2a5
6    #66c2a5
7    #66c2a5
8    #66c2a5
9    #66c2a5
Name: author, dtype: object
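In plain Python, the .apply step is just one dictionary lookup per row. The sketch below uses a short hand-written list (authors) as a stand-in for the author column.

```python
colormap = {
    'Jane Austen': '#66c2a5',
    'JRR Tolkien': '#fc8d62'
}

# A short stand-in for the author column
authors = ['JRR Tolkien', 'JRR Tolkien', 'Jane Austen']

# What .apply(lambda authorname: colormap[authorname]) does, row by row
colors = [colormap[authorname] for authorname in authors]
print(colors)
```

On a real pandas Series, df.author.map(colormap) should give the same result as the .apply version with the lambda.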
import matplotlib.pyplot as plt

# STEP ONE: Make a dictionary with column values
# as keys, and then the color as the value
colormap = {
    'Jane Austen': '#66c2a5',
    'JRR Tolkien': '#fc8d62'
}

# STEP TWO: Use .apply to convert it into a list of colors
colors = df.author.apply(lambda authorname: colormap[authorname])

# df.author = give me just the author column
# .apply = do something for every single row
# colormap[authorname] = look up that author's color in the dictionary


ax = results.plot(y='she', kind='barh', x='filename', color=[colors], legend=False)
ax.set_title("JRR Tolkien is a man's man")

[Figure: horizontal bar chart of the “she” share for each book, one bar per filename, colored by author]

Trying it again

This time we’re going to count all of the other words, too. Then we’ll pull out ‘he’ and ‘she’ specifically.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
matrix = vectorizer.fit_transform(df['body'])
matrix
<10x24446 sparse matrix of type '<class 'numpy.float64'>'
	with 76411 stored elements in Compressed Sparse Row format>

Have ALL of the words available

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
results = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
results
000 007 05 10 100 1000 1001 1002 10022 1003 ... óinand óinto ómaryo ónen óre únótime únótimë úre úrimë úrui
0 0.000000 0.000011 0.00000 0.000021 0.000000 0.000000 0.000000 0.000000 0.000011 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000005 0.00000 0.000027 0.000000 0.000000 0.000000 0.000000 0.000005 0.000000 ... 0.000005 0.000005 0.000005 0.00000 0.000000 0.000000 0.000005 0.000000 0.000000 0.000000
2 0.000000 0.000007 0.00000 0.000033 0.000000 0.000000 0.000000 0.000000 0.000007 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000005 0.00001 0.000131 0.000034 0.000048 0.000078 0.000058 0.000005 0.000019 ... 0.000000 0.000000 0.000000 0.00001 0.000005 0.000005 0.000000 0.000005 0.000005 0.000005
4 0.000019 0.000000 0.00000 0.000013 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000013 0.000000 0.00000 0.000019 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000013 0.000000 0.00000 0.000013 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.000012 0.000000 0.00000 0.000012 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000008 0.000000 0.00000 0.000008 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000008 0.000000 0.00000 0.000008 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

10 rows × 24446 columns

But then only pull out the ones we want

results = results[['he', 'she']]
results
he she
0 0.020264 0.000011
1 0.016356 0.000849
2 0.017832 0.000680
3 0.014221 0.001241
4 0.011475 0.014843
5 0.009785 0.014215
6 0.007019 0.014077
7 0.011486 0.013655
8 0.010952 0.013989
9 0.009404 0.013568
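These numbers are much smaller than before because each count is now divided by the total of every word in the book, not just by “he” plus “she”. Here is a minimal standard-library sketch of that arithmetic; share_of_all_words is a hypothetical helper, the sample sentence is made up, and the tokenizing only roughly mirrors TfidfVectorizer's defaults.

```python
import re

def share_of_all_words(text, words=("he", "she")):
    """Return each word's count divided by the total number of tokens."""
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    total = len(tokens)
    # Avoid dividing by zero on empty text
    return {word: (tokens.count(word) / total if total else 0.0) for word in words}

sample = "He said she would come, and he waited by the gate all day"
shares = share_of_all_words(sample)
print(shares)
```

The sample has 13 tokens, so “he” (used twice) gets 2/13 and “she” (used once) gets 1/13. The two shares no longer add up to 1.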

Copy our author and filename back over

# .copy() gives us an independent dataframe, so adding columns
# doesn't trigger a SettingWithCopyWarning
results = results.copy()
results['author'] = df['author']
results['filename'] = df['filename']
results
he she author filename
0 0.020264 0.000011 JRR Tolkien tolkien/Hobbit, The - J. R. R. Tolkien - 1960.txt
1 0.016356 0.000849 JRR Tolkien tolkien/Lord of the Rings - 01 - The Fellowshi...
2 0.017832 0.000680 JRR Tolkien tolkien/Lord of the Rings - 02 - The Two Tower...
3 0.014221 0.001241 JRR Tolkien tolkien/Lord of the Rings - 03 - The Return of...
4 0.011475 0.014843 Jane Austen austen/emma - 1815.txt
5 0.009785 0.014215 Jane Austen austen/mansfield_park - 1814.txt
6 0.007019 0.014077 Jane Austen austen/northanger_abbey - 1817.txt
7 0.011486 0.013655 Jane Austen austen/persuasion - 1817.txt
8 0.010952 0.013989 Jane Austen austen/pride-and-prejudice - 1813.txt
9 0.009404 0.013568 Jane Austen austen/sense-and-sensibility - 1811.txt
import matplotlib.pyplot as plt

colormap = {
    'Jane Austen': '#66c2a5',
    'JRR Tolkien': '#fc8d62'
}
colors = df.author.apply(lambda authorname: colormap[authorname])

ax = results.plot(y='she', kind='barh', x='filename', color=[colors], legend=False)
ax.set_title("JRR Tolkien is a man's man")

[Figure: horizontal bar chart of the “she” share among all words for each book, one bar per filename, colored by author]

import matplotlib.pyplot as plt

colormap = {
    'Jane Austen': '#66c2a5',
    'JRR Tolkien': '#fc8d62'
}
colors = df.author.apply(lambda authorname: colormap[authorname])

ax = results.plot(x='she', y='he', kind='scatter', color=colors, legend=False, xlim=(0,0.02), ylim=(0,0.02))
ax.set_title("JRR Tolkien is a man's man")

[Figure: scatter plot with “she” on the x-axis and “he” on the y-axis, one point per book, colored by author]