Hip hop over time

Billboard has an “R and B / Hip Hop” list, which is a little absurd because the two genres aren’t quite the same. Let’s track how the lyrics on that list change over the years.

And let’s get this out of the way now: every time you make a vectorizer, you’ll want to ask yourself a few questions.

  • What kind of vectorizer do you need? CountVectorizer? TfIdf?
  • If TfIdf, do you use use_idf=True or use_idf=False?
  • Does your vectorizer require a certain vocabulary?
  • Do you care about multiple words (“he said” vs. “she said”)? If so, do you need to do something special so it pays attention to that?
  • Should you lemmatize/stem when you’re processing?
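
To make the multiple-words question concrete, here is a minimal plain-Python sketch of counting two-word phrases. In scikit-learn this is what the ngram_range parameter of the vectorizers controls; the helper name ngram_counts and the sample sentence are made up for illustration, and the tokenizing only approximates what CountVectorizer does by default.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-word phrases in text (hypothetical helper, not part of scikit-learn)."""
    # Lowercase, then keep tokens of two or more word characters,
    # similar in spirit to CountVectorizer's default tokenizing.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Slide a window of n tokens across the list
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

counts = ngram_counts("he said she said he said nothing")
print(counts["he said"], counts["she said"])
```

A single-word vectorizer would only see "he", "she" and "said" separately, so it could not tell the two phrases apart.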

Reading in the files

Getting a list of every file we’ll want to read in

import glob
filenames = glob.glob('hip-hop/*/*')
filenames[:5]
['hip-hop/1965/a-change-is-gonna-come-sam-cooke',
 'hip-hop/1965/a-lovers-concerto-the-toys',
 'hip-hop/1965/a-woman-can-change-a-man-joe-tex',
 'hip-hop/1965/a-womans-love-carla-thomas',
 'hip-hop/1965/aint-that-peculiar-marvin-gaye']

Reading in the files using a list comprehension

contents = [open(filename).read() for filename in filenames]
len(contents)
7458

Use the filenames and the contents to build a dataframe

import pandas as pd

df = pd.DataFrame({
    'lyrics': contents,
    'filename': filenames
})
df.head()
filename lyrics
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke [Verse 1]\nI was born by the river\nIn a littl...
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t...
2 hip-hop/1965/a-woman-can-change-a-man-joe-tex A man can say what he won't do\nBut if she rea...
3 hip-hop/1965/a-womans-love-carla-thomas When I ask you where you've been\nDon't get an...
4 hip-hop/1965/aint-that-peculiar-marvin-gaye [Verse 1]\nHoney you do me wrong but still I'm...

Extract the year into a different column

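# expand=False returns a Series instead of a DataFrame
# (and gets rid of a warning in older versions of pandas)
# The r'' raw string keeps \d from being treated as a string escape
df['year'] = df.filename.str.extract(r'hip-hop/(\d*)/', expand=False)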
df.head(2)
filename lyrics year
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke [Verse 1]\nI was born by the river\nIn a littl... 1965
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965

Use the year to create a datetime column

This is a bit of a lie, because the Billboard charts are weekly. I just didn’t save that information, so every song ends up on January 1 of its year!

df['datetime'] = pd.to_datetime(df['year'], format="%Y")
df.head(2)
filename lyrics year datetime
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke [Verse 1]\nI was born by the river\nIn a littl... 1965 1965-01-01
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965 1965-01-01

Extract the artist and song name into another column

df['title-artist'] = df.filename.str.extract(r'hip-hop/\d*/(.*)', expand=False)
df.head(2)
filename lyrics year datetime title-artist
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke [Verse 1]\nI was born by the river\nIn a littl... 1965 1965-01-01 a-change-is-gonna-come-sam-cooke
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965 1965-01-01 a-lovers-concerto-the-toys
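
If the two extraction patterns feel opaque, this small sketch runs the same idea on a single filename using Python's built-in re module. It assumes the hip-hop/YEAR/title-artist layout shown above; the variable names are just for illustration.

```python
import re

# One pattern with two capture groups: the year folder and everything after it
pattern = re.compile(r"hip-hop/(\d+)/(.*)")

match = pattern.match("hip-hop/1965/a-change-is-gonna-come-sam-cooke")
year, title_artist = match.groups()

# The year comes back as a string, not a number
print(repr(year), title_artist)
```

Each pair of parentheses is a capture group, which is the part that str.extract hands back as the new column.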

Cleaning up a little more

Let’s get rid of things like "[Verse 1]" while we’re at it.

df['lyrics'] = df['lyrics'].replace(r"\[.*?\]", "", regex=True).str.strip()
df.head(2)
filename lyrics year datetime title-artist
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke I was born by the river\nIn a little tent\nAnd... 1965 1965-01-01 a-change-is-gonna-come-sam-cooke
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965 1965-01-01 a-lovers-concerto-the-toys
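
The ? in the pattern matters. As a rough illustration (the sample line is invented, not taken from the dataset), here is what the non-greedy pattern removes compared with a greedy one:

```python
import re

line = "[Intro] Yeah [Verse 1] I was born by the river"

# Non-greedy: each [...] tag is matched and removed on its own
non_greedy = re.sub(r"\[.*?\]", "", line).strip()

# Greedy: one match runs from the first [ to the last ],
# so the real lyric between the two tags is deleted too
greedy = re.sub(r"\[.*\]", "", line).strip()

print(non_greedy)
print(greedy)
```

With the greedy version, "Yeah" disappears along with the tags, which is why the cleanup step uses .*? instead of .* here.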

OKAY! We did it. We’re done. It’s clean. Let’s get down to business.

Text analysis

What do you want to do?

from sklearn.feature_extraction.text import CountVectorizer

# Make a new Count Vectorizer!!!!
# Let's only look for 'gin' and 'patron'
vec = CountVectorizer(vocabulary=['gin', 'patron'])

# Say hey vectorizer, please read our stuff
matrix = vec.fit_transform(df['lyrics'])

# And make a dataframe out of it
# (get_feature_names() was removed in scikit-learn 1.2, so use get_feature_names_out())
results = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
results.head()
gin patron
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
df.head(3)
filename lyrics year datetime title-artist
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke I was born by the river\nIn a little tent\nAnd... 1965 1965-01-01 a-change-is-gonna-come-sam-cooke
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965 1965-01-01 a-lovers-concerto-the-toys
2 hip-hop/1965/a-woman-can-change-a-man-joe-tex A man can say what he won't do\nBut if she rea... 1965 1965-01-01 a-woman-can-change-a-man-joe-tex
df['gin'] = results['gin']
df['patron'] = results['patron']
df.head()
filename lyrics year datetime title-artist gin patron
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke I was born by the river\nIn a little tent\nAnd... 1965 1965-01-01 a-change-is-gonna-come-sam-cooke 0 0
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965 1965-01-01 a-lovers-concerto-the-toys 0 0
2 hip-hop/1965/a-woman-can-change-a-man-joe-tex A man can say what he won't do\nBut if she rea... 1965 1965-01-01 a-woman-can-change-a-man-joe-tex 0 0
3 hip-hop/1965/a-womans-love-carla-thomas When I ask you where you've been\nDon't get an... 1965 1965-01-01 a-womans-love-carla-thomas 0 0
4 hip-hop/1965/aint-that-peculiar-marvin-gaye Honey you do me wrong but still I'm crazy abou... 1965 1965-01-01 aint-that-peculiar-marvin-gaye 0 0
df.groupby('year')['gin'].sum().plot(kind='bar', figsize=(20,4))
(Bar chart: total count of “gin” per year.)

df.groupby('year')['patron'].sum().plot(kind='bar', figsize=(20,4))
(Bar chart: total count of “patron” per year.)

ARE RAPPERS ANGRY?????

Read in the emotional lexicon

filepath = "NRC-Emotion-Lexicon-v0.92/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
emolex_df = pd.read_csv(filepath,  names=["word", "emotion", "association"], skiprows=45, sep='\t')
emolex_df = emolex_df.pivot(index='word', columns='emotion', values='association').reset_index()
emolex_df.head(3)
emotion word anger anticipation disgust fear joy negative positive sadness surprise trust
0 aback 0 0 0 0 0 0 0 0 0 0
1 abacus 0 0 0 0 0 0 0 0 0 1
2 abandon 0 0 0 1 0 1 0 1 0 0

Pull out the words you want

We want the words marked as angry and the words marked as positive.

angry_words = emolex_df[emolex_df.anger == 1].word
positive_words = emolex_df[emolex_df.positive == 1].word

Get the percentage of words that we have emotional ratings for

from sklearn.feature_extraction.text import TfidfVectorizer

# I only want you to look for words in the emotional lexicon
# because we don't know what's up with the other words
vec = TfidfVectorizer(vocabulary=emolex_df.word,
                      use_idf=False, 
                      norm='l1') # ELL - ONE
matrix = vec.fit_transform(df['lyrics'])
# get_feature_names() was removed in scikit-learn 1.2, so use get_feature_names_out()
vocab = vec.get_feature_names_out()
wordcount_df = pd.DataFrame(matrix.toarray(), columns=vocab)
wordcount_df.head()
aback abacus abandon abandoned abandonment abate abatement abba abbot abbreviate ... zephyr zeppelin zest zip zodiac zone zoo zoological zoology zoom
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 14182 columns
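
One way to read those settings: with use_idf=False and norm='l1', each row is just word counts divided by the row total, and because the vocabulary is restricted to the lexicon, that total only covers words the lexicon knows about. The plain-Python sketch below shows that arithmetic. The helper l1_term_frequencies and the tiny vocabulary are made up for illustration, and it splits on whitespace instead of using scikit-learn's tokenizer.

```python
from collections import Counter

def l1_term_frequencies(text, vocabulary):
    """Share of each vocabulary word among the vocabulary words found in text.

    Hypothetical helper that mimics the arithmetic only, not scikit-learn itself.
    """
    # Words outside the vocabulary are ignored completely
    tokens = [token for token in text.lower().split() if token in vocabulary]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Dividing by the total makes each row sum to 1 (the "l1" part)
    return {word: (counts[word] / total if total else 0.0) for word in vocabulary}

freqs = l1_term_frequencies("love love hate the river", {"love", "hate", "abandon"})
print(freqs)
```

Here "the" and "river" are not in the toy vocabulary, so "love" gets 2/3 rather than 2/5. The percentages in the table are out of rated words, not out of all the words in the song.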

Add up the percent that are angry and the percent that are positive

df['anger'] = wordcount_df[angry_words].sum(axis=1)
df['positivity'] = wordcount_df[positive_words].sum(axis=1)
df.head(3)
filename lyrics year datetime title-artist gin patron anger positivity
0 hip-hop/1965/a-change-is-gonna-come-sam-cooke I was born by the river\nIn a little tent\nAnd... 1965 1965-01-01 a-change-is-gonna-come-sam-cooke 0 0 0.000000 0.142857
1 hip-hop/1965/a-lovers-concerto-the-toys How gentle is the rain\nThat falls softly on t... 1965 1965-01-01 a-lovers-concerto-the-toys 0 0 0.022727 0.409091
2 hip-hop/1965/a-woman-can-change-a-man-joe-tex A man can say what he won't do\nBut if she rea... 1965 1965-01-01 a-woman-can-change-a-man-joe-tex 0 0 0.000000 0.090909
df.plot(x='positivity', y='anger', kind='scatter')
(Scatter plot: anger against positivity, one point per song.)

ax = df.groupby('year')['anger'].mean().plot()
df.groupby('year')['positivity'].mean().plot(ax=ax, c='red')
(Line chart: mean anger and mean positivity per year, with positivity in red.)