Hip hop over time

Billboard has an “R and B / Hip Hop” chart, which is a little absurd because the genres aren’t quite the same. Let’s track how the lyrics on it change over the years.

And let’s get this out of the way now: every time you make a vectorizer, you’ll want to ask yourself a few questions.

What kind of vectorizer do you need? CountVectorizer? TfIdf?
If TfIdf, do you use use_idf=True or use_idf=False?
Does your vectorizer require a certain vocabulary?
Do you care about multiple words (“he said” vs. “she said”)? If so, do you need to do something special so it pays attention to that?
Should you lemmatize/stem when you’re processing?
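
To make the multiple-words question a bit more concrete, here is a minimal sketch. The three toy sentences are made up for illustration and are not from the lyrics. Setting ngram_range=(2, 2) asks CountVectorizer to count two-word phrases instead of single words, so "he said" and "she said" end up in separate columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for this example
texts = ["he said hello", "she said goodbye", "he said she said"]

# ngram_range=(2, 2) means: only count two-word phrases (bigrams)
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vec.fit_transform(texts)

# vocabulary_ maps each phrase to its column number,
# so sorting by that number gives the column order
phrases = sorted(bigram_vec.vocabulary_, key=bigram_vec.vocabulary_.get)
counts = bigram_matrix.toarray()

print(phrases)
print(counts)
```

Using ngram_range=(1, 2) instead would keep the single words and add the two-word phrases on top.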

Reading in the files

Getting a list of every file we’ll want to read in

import glob
filenames = glob.glob('hip-hop/*/*')
filenames[:5]

['hip-hop/1965/a-change-is-gonna-come-sam-cooke',
 'hip-hop/1965/a-lovers-concerto-the-toys',
 'hip-hop/1965/a-woman-can-change-a-man-joe-tex',
 'hip-hop/1965/a-womans-love-carla-thomas',
 'hip-hop/1965/aint-that-peculiar-marvin-gaye']

Reading in the files using a list comprehension

contents = [open(filename).read() for filename in filenames]
len(contents)
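
The list comprehension above never closes the files itself and relies on the platform's default encoding. If that ever causes trouble, a small helper is one way around it. This is only a sketch: read_file is a hypothetical helper name, the UTF-8 encoding is an assumption about how the lyrics were saved, and the temporary folder with two fake songs exists only so the example runs on its own.

```python
import glob
import os
import tempfile

def read_file(path):
    # The with-block closes the file as soon as we're done reading it.
    # encoding="utf-8" is an assumption about how the files were saved.
    with open(path, encoding="utf-8") as f:
        return f.read()

# Build a tiny fake hip-hop/<year>/<song> folder so the example is self-contained
with tempfile.TemporaryDirectory() as root:
    year_dir = os.path.join(root, "hip-hop", "1965")
    os.makedirs(year_dir)
    for name, text in [("song-a", "la la"), ("song-b", "na na")]:
        with open(os.path.join(year_dir, name), "w", encoding="utf-8") as f:
            f.write(text)

    # Same glob pattern as above, sorted so the order is predictable
    demo_filenames = sorted(glob.glob(os.path.join(root, "hip-hop", "*", "*")))
    demo_contents = [read_file(filename) for filename in demo_filenames]

print(len(demo_contents))
```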

Use the filenames and the contents to build a dataframe

import pandas as pd

df = pd.DataFrame({
    'lyrics': contents,
    'filename': filenames
})
df.head()

	filename	lyrics
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	[Verse 1]\nI was born by the river\nIn a littl...
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...
2	hip-hop/1965/a-woman-can-change-a-man-joe-tex	A man can say what he won't do\nBut if she rea...
3	hip-hop/1965/a-womans-love-carla-thomas	When I ask you where you've been\nDon't get an...
4	hip-hop/1965/aint-that-peculiar-marvin-gaye	[Verse 1]\nHoney you do me wrong but still I'm...

Extract the year into a different column

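# expand=False returns a Series instead of a one-column DataFrame
# The raw string (r'...') keeps Python from treating \d as a string escape
df['year'] = df.filename.str.extract(r'hip-hop/(\d*)/', expand=False)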
df.head(2)

	filename	lyrics	year
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	[Verse 1]\nI was born by the river\nIn a littl...	1965
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965

Use the year to create a datetime column

This is a bit of a lie, because the Billboard charts are weekly and every song here ends up dated January 1st. I just didn’t save that information!

df['datetime'] = pd.to_datetime(df['year'], format="%Y")
df.head(2)

	filename	lyrics	year	datetime
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	[Verse 1]\nI was born by the river\nIn a littl...	1965	1965-01-01
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965	1965-01-01
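
If you're wondering what format="%Y" does with a bare year, here is a tiny sketch on made-up values: pandas fills in the missing month and day with January 1st, which is why every song in the table above is dated 1965-01-01.

```python
import pandas as pd

# Two made-up year strings, like the ones we pulled out of the filenames
years = pd.Series(["1965", "1990"])

# %Y means "a four-digit year"; month and day default to January 1st
dates = pd.to_datetime(years, format="%Y")

print(dates)
print(dates.dt.year.tolist())
```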

Extract the artist and song name into another column

df['title-artist'] = df.filename.str.extract(r'hip-hop/\d*/(.*)', expand=False)
df.head(2)

	filename	lyrics	year	datetime	title-artist
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	[Verse 1]\nI was born by the river\nIn a littl...	1965	1965-01-01	a-change-is-gonna-come-sam-cooke
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965	1965-01-01	a-lovers-concerto-the-toys
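
The two extract patterns are easier to read when you try them on a single filename. Here is a sketch using Python's built-in re module instead of pandas, with one filename taken from the listing above. The parentheses mark the part we want to keep.

```python
import re

path = "hip-hop/1965/a-change-is-gonna-come-sam-cooke"

# (\d*) captures the run of digits between the two slashes
year = re.search(r"hip-hop/(\d*)/", path).group(1)

# (.*) captures everything after the year folder
title_artist = re.search(r"hip-hop/\d*/(.*)", path).group(1)

print(year)
print(title_artist)
```

The title and the artist stay glued together because the filename gives us no reliable way to tell where one ends and the other starts.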

Cleaning up a little more

Let’s get rid of things like "[Verse 1]" while we’re at it.

df['lyrics'] = df['lyrics'].replace(r"\[.*?\]", "", regex=True).str.strip()
df.head(2)

	filename	lyrics	year	datetime	title-artist
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	I was born by the river\nIn a little tent\nAnd...	1965	1965-01-01	a-change-is-gonna-come-sam-cooke
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965	1965-01-01	a-lovers-concerto-the-toys
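
The question mark in \[.*?\] matters more than it looks. Here is a sketch on a made-up line with two bracketed tags, again using the built-in re module: without the question mark the pattern is greedy and swallows everything from the first [ to the last ], lyrics included.

```python
import re

# A made-up line with two tags on it, for illustration only
line = "[Verse 1] I was born [Chorus] by the river"

# .*? is lazy: it stops at the first closing bracket it finds
lazy = re.sub(r"\[.*?\]", "", line).strip()

# .* is greedy: it runs all the way to the last closing bracket
greedy = re.sub(r"\[.*\]", "", line).strip()

print(repr(lazy))
print(repr(greedy))
```

The lazy version leaves a double space where the second tag used to be, which is harmless for counting words.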

OKAY! We did it. We’re done. It’s clean. Let’s get down to business.

Text analysis

What do you want to do?

from sklearn.feature_extraction.text import CountVectorizer

# Make a new Count Vectorizer!!!!
# Let's only look for 'gin' and 'patron'
vec = CountVectorizer(vocabulary=['gin', 'patron'])

# Say hey vectorizer, please read our stuff
matrix = vec.fit_transform(df['lyrics'])

# And make a dataframe out of it
# (get_feature_names() was removed in scikit-learn 1.2; older versions may need it instead)
results = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
results.head()

	gin	patron
0	0	0
1	0	0
2	0	0
3	0	0
4	0	0
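
All zeros for the first five songs is not very exciting, so here is a sketch on three made-up lines to show that a fixed vocabulary really does count only the words you hand it. Everything else is ignored, and the default lowercasing means "Gin" and "gin" land in the same column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up lines, for illustration only
toy_lyrics = ["Gin and juice", "patron, patron, gin", "nothing to see here"]
words = ["gin", "patron"]

# With vocabulary=..., the columns are exactly these words, in this order
toy_vec = CountVectorizer(vocabulary=words)
toy_matrix = toy_vec.fit_transform(toy_lyrics)

toy_results = pd.DataFrame(toy_matrix.toarray(), columns=words)
print(toy_results)
```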

df.head(3)

	filename	lyrics	year	datetime	title-artist
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	I was born by the river\nIn a little tent\nAnd...	1965	1965-01-01	a-change-is-gonna-come-sam-cooke
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965	1965-01-01	a-lovers-concerto-the-toys
2	hip-hop/1965/a-woman-can-change-a-man-joe-tex	A man can say what he won't do\nBut if she rea...	1965	1965-01-01	a-woman-can-change-a-man-joe-tex

df['gin'] = results['gin']
df['patron'] = results['patron']
df.head()

	filename	lyrics	year	datetime	title-artist
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	I was born by the river\nIn a little tent\nAnd...	1965	1965-01-01	a-change-is-gonna-come-sam-cooke
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965	1965-01-01	a-lovers-concerto-the-toys
2	hip-hop/1965/a-woman-can-change-a-man-joe-tex	A man can say what he won't do\nBut if she rea...	1965	1965-01-01	a-woman-can-change-a-man-joe-tex
3	hip-hop/1965/a-womans-love-carla-thomas	When I ask you where you've been\nDon't get an...	1965	1965-01-01	a-womans-love-carla-thomas
4	hip-hop/1965/aint-that-peculiar-marvin-gaye	Honey you do me wrong but still I'm crazy abou...	1965	1965-01-01	aint-that-peculiar-marvin-gaye

df.groupby('year')['gin'].sum().plot(kind='bar', figsize=(20,4))

[Figure: bar chart of total 'gin' counts per year]

df.groupby('year')['patron'].sum().plot(kind='bar', figsize=(20,4))

[Figure: bar chart of total 'patron' counts per year]
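
One thing to keep in mind with these bar charts: .sum() adds up every mention in a year, so a year with more songs can look boozier just because it has more songs. I haven't checked how even the song counts are in this dataset, so take this as a sketch on made-up numbers. Swapping .sum() for .mean() gives mentions per song instead.

```python
import pandas as pd

# Made-up counts: three songs in 1990, one song in 2005
toy = pd.DataFrame({
    "year": ["1990", "1990", "1990", "2005"],
    "gin": [1, 0, 2, 3],
})

# Total mentions per year
totals = toy.groupby("year")["gin"].sum()

# Average mentions per song in each year
per_song = toy.groupby("year")["gin"].mean()

print(totals)
print(per_song)
```

Both years have the same total here, but the single 2005 song mentions gin three times as often as the average 1990 song.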

ARE RAPPERS ANGRY?????

Read in the emotional lexicon

filepath = "NRC-Emotion-Lexicon-v0.92/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
emolex_df = pd.read_csv(filepath,  names=["word", "emotion", "association"], skiprows=45, sep='\t')
emolex_df = emolex_df.pivot(index='word', columns='emotion', values='association').reset_index()
emolex_df.head(3)

emotion	word	fear	negative	sadness	trust
0	aback	0	0	0	0
1	abacus	0	0	0	1
2	abandon	1	1	1	0
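
The pivot step is the confusing part of reading in the lexicon, so here is a sketch of what it does on a tiny hand-typed table. The raw file has one row per word-emotion pair ("long" format), and pivot turns that into one row per word with one column per emotion ("wide" format). The four rows below reuse the fear and trust values for abacus and abandon from the output above.

```python
import pandas as pd

# Long format: one row per (word, emotion) pair
long_df = pd.DataFrame({
    "word": ["abacus", "abacus", "abandon", "abandon"],
    "emotion": ["fear", "trust", "fear", "trust"],
    "association": [0, 1, 1, 0],
})

# Wide format: one row per word, one column per emotion
wide_df = long_df.pivot(index="word", columns="emotion", values="association").reset_index()

print(wide_df)
```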

Pull out the words you want

We want the angry words and the positive words.

angry_words = emolex_df[emolex_df.anger == 1].word
positive_words = emolex_df[emolex_df.positive == 1].word

Get the percentage of words that we have emotional ratings for

from sklearn.feature_extraction.text import TfidfVectorizer

# I only want you to look for words in the emotional lexicon
# because we don't know what's up with the other words
vec = TfidfVectorizer(vocabulary=emolex_df.word,
                      use_idf=False, 
                      norm='l1') # ELL - ONE
matrix = vec.fit_transform(df['lyrics'])
vocab = vec.get_feature_names_out()  # get_feature_names() on scikit-learn versions before 1.0
wordcount_df = pd.DataFrame(matrix.toarray(), columns=vocab)
wordcount_df.head()

	aback	abacus	abandon	abandoned	abandonment	abate	abatement	abba	abbot	abbreviate	...	zephyr	zeppelin	zest	zip	zodiac	zone	zoo	zoological	zoology	zoom
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 14182 columns
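
It's worth being clear about what these numbers are a percentage of. With use_idf=False and norm='l1', each row gets divided by its own total, and because we passed in a vocabulary, that total only includes words in the lexicon. Here is a sketch with a made-up three-word lexicon standing in for the real one: "dog" is half of the first text, but it isn't in the vocabulary, so it doesn't count toward the total at all.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A made-up stand-in for the real lexicon
toy_lexicon = ["happy", "angry", "sad"]

share_vec = TfidfVectorizer(vocabulary=toy_lexicon,
                            use_idf=False,
                            norm='l1')

# First text: 2 happy + 1 angry + 3 dogs. Second text: no lexicon words at all.
shares = share_vec.fit_transform(["happy happy angry dog dog dog", "dog cat"]).toarray()

print(shares)
```

The first row comes out as two thirds happy and one third angry, and the second row is all zeros because there was nothing to count.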

Add up the percent that are angry and the percent that are positive

df['anger'] = wordcount_df[angry_words].sum(axis=1)
df['positivity'] = wordcount_df[positive_words].sum(axis=1)
df.head(3)

	filename	lyrics	year	datetime	title-artist	anger	positivity
0	hip-hop/1965/a-change-is-gonna-come-sam-cooke	I was born by the river\nIn a little tent\nAnd...	1965	1965-01-01	a-change-is-gonna-come-sam-cooke	0.000000	0.142857
1	hip-hop/1965/a-lovers-concerto-the-toys	How gentle is the rain\nThat falls softly on t...	1965	1965-01-01	a-lovers-concerto-the-toys	0.022727	0.409091
2	hip-hop/1965/a-woman-can-change-a-man-joe-tex	A man can say what he won't do\nBut if she rea...	1965	1965-01-01	a-woman-can-change-a-man-joe-tex	0.000000	0.090909
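
Picking columns with a list of words and summing across them is all that's happening here. A sketch on a made-up table, with three invented word columns and a hypothetical two-word "angry" list:

```python
import pandas as pd

# Made-up word shares for two songs
toy_shares = pd.DataFrame({
    "furious": [0.5, 0.0],
    "rage": [0.25, 0.0],
    "joy": [0.25, 1.0],
})

# A hypothetical list of angry words
toy_angry_words = pd.Series(["furious", "rage"])

# Select just those columns, then add across each row (axis=1)
toy_anger = toy_shares[toy_angry_words].sum(axis=1)

print(toy_anger)
```

In the real lexicon a word can carry more than one label, so anger and positivity don't have to add up to 1.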

df.plot(x='positivity', y='anger', kind='scatter')

[Figure: scatter plot of anger (y) against positivity (x), one point per song]

ax = df.groupby('year')['anger'].mean().plot()
df.groupby('year')['positivity'].mean().plot(ax=ax, c='red')

[Figure: line chart of mean anger and mean positivity (red) per year]
