Algorithms: Rules of Play
- Name of the algorithm
- What it’s used for (classification, clustering, maybe other things?)
- Why it’s better/worse than other classification/clustering/etc. algorithms
- How to get our data into a format that is good for that algorithm
- REALISTIC data sets
- What the output means technically
- What the output means in like real life language and practically speaking
- What kind of datasets you use this algorithm for
- Examples of when it was used in journalism OR maybe could have been used
- Examples of when it was used period
- Pitfalls
- Maybe maybe maybe a little bit of math
- How to ground them for a less technical audience and to help engage them in what the algorithm is doing
Naive Bayes
Download and extract recipes.csv.zip from #algorithms and start a new Jupyter Notebook!!!!
Classification algorithm - spam filter
The more spammy words that are in an email, the more likely it is to be spam
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("../recipes.csv")
df.head()
 | cuisine | id | ingredient_list
---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
3 | indian | 22213 | water, vegetable oil, wheat, salt |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
QUESTION ONE: What are we doing and why are we using Naive Bayes?
We have a bunch of recipes in categories. Maybe someone sends us new recipes - what category do the new recipes belong in?
We’re going to train a classifier to recognize italian food, so that if someone sends us new recipes, we know whether it’s italian - because we love italian food and we only want to eat italian food.
RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.
For clustering
- You’ll get a lot of documents
- You feed it to an algorithm and tell it to create x number of categories - the machine gives you back categories whether they make sense or not
For classification (which we are doing now)
- You’ll get a lot of documents
- You’ll classify some of them into categories that you know and love
- You’ll ask the algorithm what categories a new bunch of unlabeled documents end up in
All mean the same thing: CATEGORY = CLASS = LABEL
The reason you use machine learning is to avoid doing things manually. So if you can do things manually, do it. Otherwise, try different algorithms until one works well (but you might need to know some upsides and downsides of each to interpret the results).
How does Naive Bayes work?
NAIVE BAYES WORKS WITH TEXT (kind of)
Bayes Theorem (kind of)
- If you see a word that is normally in a spam email, there’s a higher chance it’s spam
- If you see a word that is normally in a non-spam email, there’s a higher chance it’s not spam
Naive: the algorithm assumes every word/ingredient/etc. is independent of every other word
FOR US: If you see ingredients that are normally in italian food, it’s probably italian
Secret trick: you can’t just use the text directly - you have to convert it into numbers first
Types of Naive Bayes
Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.
Multinomial Naive Bayes - (multiple numbers): You count the words. You care about whether a word appears once or twice or three times or ten times. This is better for long passages.
Bernoulli Naive Bayes - True/False Bayes: You only care whether the word shows up (True) or doesn’t show up (False). This is better for short passages.
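Here’s a minimal sketch of the difference (not from the class notebook), using scikit-learn’s CountVectorizer, which we’ll meet properly later - word counts for the Multinomial flavor, plain yes/no for the Bernoulli flavor:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam spam spam eggs", "eggs toast"]

# Multinomial-style features: how many times each word shows up
count_vectorizer = CountVectorizer()
print(count_vectorizer.fit_transform(docs).toarray())
# [[1 3 0]
#  [1 0 1]]   <- columns are 'eggs', 'spam', 'toast'

# Bernoulli-style features: does the word show up at all (1) or not (0)
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(docs).toarray())
# [[1 1 0]
#  [1 0 1]]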
STEP ONE: Let’s convert our text data into numerical data
df.head()
 | cuisine | id | ingredient_list
---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
3 | indian | 22213 | water, vegetable oil, wheat, salt |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
Our problem: Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn’t matter
Two things to convert into numbers:
- Our labels (a.k.a. the categories everything belongs in)
- Our features
Converting our labels into numbers
We have two labels:
- italian = 1
- not italian = 0
df.head()
 | cuisine | id | ingredient_list
---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
3 | indian | 22213 | water, vegetable oil, wheat, salt |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
def make_label(cuisine):
    if cuisine == "italian":
        return 1
    else:
        return 0
df['label'] = df['cuisine'].apply(make_label)
df.head(10)
 | cuisine | id | ingredient_list | label
---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 |
3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 |
5 | jamaican | 6602 | plain flour, sugar, butter, eggs, fresh ginger... | 0 |
6 | spanish | 42779 | olive oil, salt, medium shrimp, pepper, garlic... | 0 |
7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 |
8 | mexican | 16903 | olive oil, purple onion, fresh pineapple, pork... | 0 |
9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 |
Converting our features into numbers
Feature selection: The process of selecting the features that matter, in this case - what ingredients do we want to look at?
Our features are going to be: whether it has spaghetti and whether it has curry powder
df['has_spaghetti'] = df['ingredient_list'].str.contains("spaghetti")
df['has_curry_powder'] = df['ingredient_list'].str.contains("curry powder")
df.head(10)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder
---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False |
3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 | False | False |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 | False | False |
5 | jamaican | 6602 | plain flour, sugar, butter, eggs, fresh ginger... | 0 | False | False |
6 | spanish | 42779 | olive oil, salt, medium shrimp, pepper, garlic... | 0 | False | False |
7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 | False | False |
8 | mexican | 16903 | olive oil, purple onion, fresh pineapple, pork... | 0 | False | False |
9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 | False | False |
Let’s run our tests
Let’s feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!
Looking at our labels
We stored it in label, and if it’s 0 it’s not italian, if it’s 1 it is italian
df['label'].head()
0 0
1 0
2 0
3 0
4 0
Name: label, dtype: int64
Looking at our features
We have two features: has_spaghetti and has_curry_powder.
df[['has_spaghetti', 'has_curry_powder']].head()
 | has_spaghetti | has_curry_powder
---|---|---
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
Now let’s finally do this
# We need to split into training and testing data
from sklearn.model_selection import train_test_split
# Splitting into...
# X = all our features
# y = all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1: italian or not italian)
    test_size=0.2) # 80% training, 20% testing
# Oh hey, it's just our features from the dataframe
X_train
 | has_spaghetti | has_curry_powder
---|---|---
18816 | False | False
30480 | False | False
19110 | False | False
29312 | False | False
23782 | False | False
... | ... | ...
39665 | False | False
13013 | False | False
31819 rows × 2 columns
# X is always the features, whether it's for training or for testing
X_test
 | has_spaghetti | has_curry_powder
---|---|---
23827 | False | False
24607 | False | False
16829 | False | False
6473 | False | False
23662 | False | False
... | ... | ...
31076 | False | False
104 | False | False
7955 rows × 2 columns
len(X_train)
31819
len(X_test)
7955
# We're testing on ~8000 and training on ~32000
# y_train is our labels that we are training on
y_train
18816 0
30480 0
19110 0
29312 1
23782 0
..
39665 0
13013 0
Name: label, dtype: int64
# And y_test is the labels we're testing on
y_test
23827 0
24607 0
16829 1
6473 0
23662 0
..
31076 0
104 0
Name: label, dtype: int64
print("Length of training labels:", len(y_train))
print("Length of testing labels:", len(y_test))
print("Length of training features:", len(X_train))
print("Length of testing features:", len(X_test))
Length of training labels: 31819
Length of testing labels: 7955
Length of training features: 31819
Length of testing features: 7955
Basically all that happened was train_test_split took us from having a nice dataframe where everything was together and split it into two groups of two - it separated our labels vs. our features, and our training data vs. our testing data.
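One gotcha that isn’t obvious from the above: train_test_split shuffles randomly, so every run gives a slightly different split (and slightly different scores). If you want the same numbers every time, here’s a minimal sketch - random_state is the only new piece:
# Passing random_state makes the split (and therefore the scores) reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']],
    df['label'],
    test_size=0.2,
    random_state=42)  # any fixed number works, 42 is just a convention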
Back to actually doing our fitting etc
# Splitting into...
# X = all our features
# y = all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1: italian or not italian)
    test_size=0.2) # 80% training, 20% testing
# Import naive_bayes to get access to ALL kinds of naive bayes classifiers
# But REMEMBER we're using Bernoulli because it's for true/false, which is fine
# for short passages
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Feed the classifier two things:
# * our training features (X_train)
# * our training labels (y_train)
# To help it study for the exam later when we test it
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
# This looks ugly, but it's the prediction for every recipe in our test set
# All those zeroes = predicted "not italian"
# Here it's predicting that the first three and the last three recipes aren't italian
clf.predict(X_test)
array([0, 0, 0, ..., 0, 0, 0])
# Naive Bayes can't overfit, really
# It can't "study too hard" and it can't "memorize the questions"
# (a decision tree can)
# So if we give it the training data back it will get some wrong
clf.score(X_train, y_train)
0.81083629278104274
clf.score(X_test, y_test)
0.80905091137649277
df['cuisine'].value_counts()
italian 7838
mexican 6438
southern_us 4320
indian 3003
chinese 2673
french 2646
cajun_creole 1546
thai 1539
japanese 1423
greek 1175
spanish 989
korean 830
vietnamese 825
moroccan 821
british 804
filipino 755
irish 667
jamaican 526
russian 489
brazilian 467
Name: cuisine, dtype: int64
df['has_spaghetti']
0        False
1        False
2        False
3        False
4        False
...
39772    False
39773    False
Name: has_spaghetti, dtype: bool
#df[['has_spaghetti', 'has_curry_powder']]
df[['has_spaghetti']]
 | has_spaghetti
---|---
0 | False
1 | False
2 | False
3 | False
4 | False
... | ...
39772 | False
39773 | False
39774 rows × 1 columns
df.head()
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder
---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False |
3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 | False | False |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 | False | False |
Wow, we did a really great job! Let’s try another cuisine
Step 1: Preparing our data
Creating labels that scikit-learn can use
Our cuisine is brazilian, so we’ll use 0 and 1 for whether it’s that cuisine or not
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0
df['is_brazilian'] = df['cuisine'].apply(make_label)
df.head(2)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian
---|---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 |
Creating features that scikit-learn can use
It’s Bernoulli Naive Bayes, so our features are True and False
df['has_salt'] = df['ingredient_list'].str.contains('salt')
df.head(2)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian | has_water | has_salt
---|---|---|---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 | False | False |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 | False | True |
Step 2: Create the test/train split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['is_brazilian'], # the second parameter is the LABEL (this is 0/1: brazilian or not brazilian)
    test_size=0.2) # 80% training, 20% testing
Step 3: Create classifier, train and test
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Fit with our training data
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.98821458876771739
clf.score(X_test, y_test)
0.9884349465744815
Dummy Classifier to see the baseline performance we need to beat
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy='most_frequent')
# Fit with our training data
dummy_clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
dummy_clf.score(X_train, y_train)
0.98821458876771739
dummy_clf.score(X_test, y_test)
0.9884349465744815
We just got destroyed by math: let’s actually understand Naive Bayes
Naive Bayes gives you back a probability for each possible label - so, % chance that it’s brazilian vs. the % chance that it is not brazilian. We’ll use this to see what went wrong.
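If you want to peek at those probabilities yourself, BernoulliNB has a predict_proba method. A quick sketch (not part of the original notebook), run on our brazilian classifier’s test features:
# Each row is [probability of "not brazilian", probability of "brazilian"]
# for one recipe in the test set
clf.predict_proba(X_test)[:5]
# You should see the "not brazilian" column close to 1 for basically every recipe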
Math stuff
Naive Bayes is all about calculating the probability of “B given A”, a.k.a. the chance of B being true if A is true.
- **Bayes Theorem:** P(B|A) = P(A and B) / P(A)
- P(A) means “what is the probability of A being true?”
- P(B|A) means “if A is true, what is the probability of B being true?”
- P(A and B) means “what is the probability of both A and B being true?”
Example: We have a recipe and it has water in it. Is it brazilian?
Hypothesis one: the recipe is brazilian
- P(B|A) would be “if it contains water, what is the chance that it is brazilian cuisine?”
- P(A and B) would be “what is the chance that it contains both water and is brazilian?”
- P(A) would be “what is the chance that this contains water?”
# P(B|A) = P(A and B)/P(A)
# P(A and B)
# Probability that a recipe has water and is brazilian
# How many recipes have water AND are brazilian?
len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])
109
# P(A)
# How many recipes have water? (careful: we want the count of True values,
# not len(), which would count every single recipe)
df['has_water'].sum()
9494
# P(B|A)
# The chance that a recipe is brazilian if it has water in it
# (109 water-and-brazilian recipes out of 9494 recipes with water)
109/9494
0.0114809353275...
Hypothesis two: the recipe is NOT brazilian
- P(B|A) would be “if it contains water, what is the chance that it is NOT brazilian cuisine?”
- P(A and B) would be “what is the chance that it contains both water and is NOT brazilian?”
- P(A) would be “what is the chance that this contains water?”
# P(A and B)
# Probability that a recipe has water and is NOT brazilian
# How many recipes have water AND are NOT brazilian?
len(df[(df['has_water']) & (df['cuisine'] != 'brazilian')])
9385
# P(A)
# How many recipes have water?
df['has_water'].sum()
9494
# P(B|A)
# The chance that a recipe is NOT brazilian if it has water in it
9385/9494
0.9885190646724...
What this boils down to
No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? Doesn’t really matter, it’s probably not brazilian.
len(df[df['cuisine'] == 'brazilian'])
467
len(df)
39774
# Only a little bit over 1% of our recipes are brazilian
# so even though it ALWAYS says "not brazilian", it's usually right
467/39774
0.011741338562880274
1 - 467/39774
0.9882586614371197
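The same trap was hiding in our Italian classifier earlier: about 80% of recipes aren’t italian, so the ~0.81 score we were so proud of is barely better than always guessing “not italian”. A quick sanity check, using the italian count from value_counts above:
# Score of a classifier that always says "not italian" =
# the fraction of recipes that aren't italian
1 - 7838/39774
# about 0.803 - right around what our spaghetti/curry classifier scored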
Let’s fix up our labels
Before we had this:
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0
which does not scale well. If we wanted to add in more cuisines, we’d need to keep adding else-ifs again and again and again until our fingers fell off. And we’d probably misspell something. And if we’re anything, it’s LAZY.
LabelEncoder to the rescue: Converts categories into numeric labels
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# LabelEncoder has two parts: FIT and TRANSFORM
# FIT learns all of the possible labels
# TRANSFORM takes a list of categories and converts them into numbers
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])
LabelEncoder()
# Get the labels out as numbers
le.transform(['orange', 'blue', 'yellow'])
array([1, 0, 3])
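It also works in reverse, which we’ll lean on at the very end of these notes: inverse_transform turns the numbers back into the original category names. A tiny sketch:
# Numbers back into categories
le.inverse_transform([1, 0, 3])
# array(['orange', 'blue', 'yellow'], ...)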
# Send the label encoder each and every cuisine
le.fit(df['cuisine'])
LabelEncoder()
le.transform(df['cuisine'])
array([ 6, 16, 4, ..., 8, 3, 13])
df['cuisine_label'] = le.transform(df['cuisine'])
df.head(3)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian | has_water | has_salt | cuisine_label
---|---|---|---|---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 | False | False | 6 |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 | False | True | 16 |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False | 0 | False | True | 4 |
Let’s train and test with our new labels
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Learn how related every cuisine is to water and salt
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.19840346962506678
clf.score(X_test, y_test)
0.20251414204902576
Let’s add some more features to see if we can do a better job
Right now I’m only looking at water and salt which doesn’t tell you much, maybe you’re looking at tortillas or cumin or soy sauce which tells you a little bit more.
df['has_miso'] = df['ingredient_list'].str.contains("miso")
df['has_soy_sauce'] = df['ingredient_list'].str.contains("soy sauce")
df['has_cilantro'] = df['ingredient_list'].str.contains("cilantro")
df['has_black_olives'] = df['ingredient_list'].str.contains("black olives")
df['has_tortillas'] = df['ingredient_list'].str.contains("tortillas")
df['has_turmeric'] = df['ingredient_list'].str.contains("turmeric")
df['has_pistachios'] = df['ingredient_list'].str.contains("pistachios")
df['has_lemongrass'] = df['ingredient_list'].str.contains("lemongrass")
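Side note: if typing all of those has_ lines by hand feels tedious, a loop builds the same thing - a minimal sketch that rebuilds the same columns as the lines above:
# One True/False column per ingredient we care about
ingredients_to_check = ['miso', 'soy sauce', 'cilantro', 'black olives',
                        'tortillas', 'turmeric', 'pistachios', 'lemongrass']
for ingredient in ingredients_to_check:
    df['has_' + ingredient.replace(' ', '_')] = df['ingredient_list'].str.contains(ingredient)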
Our new feature set is!!! df[['has_spaghetti', 'has_miso', 'has_soy_sauce',
'has_cilantro','has_black_olives','has_tortillas','has_turmeric',
'has_pistachios','has_lemongrass']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Learn how related every cuisine is to our new ingredient features
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.37232471165027187
clf.score(X_test, y_test)
0.36379635449402892
This is taking forever, please let there be an automatic way to pick out all of the words
from sklearn.feature_extraction.text import CountVectorizer
# STEP ONE: .fit to learn all of the words
# STEP TWO: .transform to turn a sentence into numbers
#vectorizer = CountVectorizer()
# ngram_range=(1,2) means we get 'olive' and 'oil' AND 'olive oil', instead of just 'olive' and 'oil'
# max_features=3000 means only keep the 3000 most frequent words/ngrams
vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)
# We have some sentences
# We're going to feed it to the vectorizer
# and it's going to learn all of the words
sentences = [
    "cats are cool",
    "dogs are cool"
]
vectorizer.fit(sentences)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=3000, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
# We're going to take some sentences and feed them to the vectorizer
# and it's going to convert them into numbers
vectorizer.transform(sentences)
<2x7 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
# But it looks bad to look at so I'll use .toarray()
vectorizer.transform(sentences).toarray()
array([[1, 1, 1, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1, 1]])
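If you’re wondering which column corresponds to which word/ngram, the vectorizer can tell you. A quick peek (not in the original notebook; the method is get_feature_names_out() in newer scikit-learn, get_feature_names() in older versions):
# Which column is which word or two-word phrase?
vectorizer.get_feature_names_out()
# ['are', 'are cool', 'cats', 'cats are', 'cool', 'dogs', 'dogs are']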
# In our case, our text is the list of ingredients. We can get it through
df['ingredient_list'].head()
0 romaine lettuce, black olives, grape tomatoes,...
1 plain flour, ground pepper, salt, tomatoes, gr...
2 eggs, pepper, salt, mayonaise, cooking oil, gr...
3 water, vegetable oil, wheat, salt
4 black pepper, shallots, cornflour, cayenne pep...
Name: ingredient_list, dtype: object
# Dear vectorizer, please learn all of these words
vectorizer.fit(df['ingredient_list'])
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=3000, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
# Dear vectorizer, please convert ingredient_list into features
# That we can do machine learning on
every_single_word_features = vectorizer.transform(df['ingredient_list'])
every_single_word_features
<39774x3000 sparse matrix of type '<class 'numpy.int64'>'
with 1243216 stored elements in Compressed Sparse Row format>
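A quick aside on why scikit-learn hands this back as a sparse matrix instead of a normal table: almost every cell is zero, because most recipes only mention a handful of the 3,000 words/ngrams. A rough sanity check, using the numbers straight from the repr above:
# Fraction of cells that are non-zero
every_single_word_features.nnz / (every_single_word_features.shape[0] * every_single_word_features.shape[1])
# roughly 0.01 - about 1% of the 39774 x 3000 cells actually have anything in them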
Now let’s try with our new complete labels and our new complete features that include every single word
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    every_single_word_features,
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing
This is Naive Bayes with every word as a feature pushed through the CountVectorizer
print("This is Naive Bayes")
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
%time clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", clf.score(X_test, y_test))
This is Naive Bayes
CPU times: user 55.8 ms, sys: 17.2 ms, total: 73 ms
Wall time: 109 ms
Training score: (stuff it already knows) 0.714384487256
Testing score: (stuff it hasn't seen before): 0.680578252671
But maybe it’s just chance? Let’s try the Dummy Classifier
from sklearn.dummy import DummyClassifier
print("This is the Dummy Classifier")
dummy_clf = DummyClassifier()
%time dummy_clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", dummy_clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", dummy_clf.score(X_test, y_test))
This is the Dummy Classifier
CPU times: user 2.58 ms, sys: 397 µs, total: 2.98 ms
Wall time: 2.41 ms
Training score: (stuff it already knows) 0.100254564883
Testing score: (stuff it hasn't seen before): 0.0999371464488
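One wrinkle: this time we didn’t pass strategy='most_frequent', so the dummy (with the default strategy in the scikit-learn version used for these notes) guesses cuisines at random in proportion to how common each one is - that’s why it only scores about 10%. A most-frequent dummy would answer “italian” every time and score roughly 20% (italian is 7838 of 39774 recipes), which Naive Bayes still beats comfortably. A sketch:
# The "always guess the most common cuisine" baseline
from sklearn.dummy import DummyClassifier
most_frequent_clf = DummyClassifier(strategy='most_frequent')
most_frequent_clf.fit(X_train, y_train)
print("Most-frequent dummy test score:", most_frequent_clf.score(X_test, y_test))
# should land around 0.20 - the share of recipes that are italian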
This is a Decision Tree with every single feature from the CountVectorizer
print("This is a Decision Tree")
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()
%time tree_clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", tree_clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", tree_clf.score(X_test, y_test))
This is a Decision Tree
CPU times: user 15.4 s, sys: 340 ms, total: 15.8 s
Wall time: 19.7 s
Training score: (stuff it already knows) 0.999780005657
Testing score: (stuff it hasn't seen before): 0.638592080453
from sklearn.ensemble import RandomForestClassifier
print("This is a Random Forest")
forest_clf = RandomForestClassifier()
%time forest_clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", forest_clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", forest_clf.score(X_test, y_test))
This is a Random Forest
CPU times: user 10 s, sys: 288 ms, total: 10.3 s
Wall time: 13.6 s
Training score: (stuff it already knows) 0.992645903391
Testing score: (stuff it hasn't seen before): 0.706096794469
How do you do this in the real world with new data?
every_single_word_features = vectorizer.transform(df['ingredient_list'])
# Import the Naive bayes thing
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
# Give the classifier EVERYTHING we know, not holding back anything
clf.fit(every_single_word_features, df['cuisine_label'])
# We have some new stuff we have not categorized
incoming_recipes = [
    "spaghetti tomato sauce garlic onion water",
    "soy sauce ginger sugar butter",
    "green papaya thai chilies palm sugar",
    "butter oil salt black pepper water milk bubblegumpie"
]
features_for_new_recipes = vectorizer.transform(incoming_recipes)
features_for_new_recipes
<4x3000 sparse matrix of type '<class 'numpy.int64'>'
with 35 stored elements in Compressed Sparse Row format>
predictions = clf.predict(features_for_new_recipes)
predictions
array([ 4, 11, 4, 16])
# The predictions are all categories that the labelencoder decided on
# Let's convert those numeric ones back into real fun cuisine words
le.inverse_transform(predictions)
array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)
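To make that easier to read, you can line each incoming recipe up with its predicted cuisine - a small closing sketch:
# Pair each new recipe with the cuisine the classifier guessed for it
for recipe, cuisine in zip(incoming_recipes, le.inverse_transform(predictions)):
    print(cuisine, "->", recipe)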