Algorithms: Rules of Play
- Name of the algorithm
- What it’s used for (classification, clustering, maybe other things?)
- Why it’s better/worse than other classification/clustering/etc. algorithms
- How to get our data into a format that is good for that algorithm
- REALISTIC data sets
- What the output means technically
- What the output means in like real life language and practically speaking
- What kind of datasets you use this algorithm for
- Examples of when it was used in journalism OR maybe could have been used
- Examples of when it was used period
- Pitfalls
- Maybe maybe maybe a little bit of math
- How to ground them for a less technical audience and to help engage them in what the algorithm is doing
Naive Bayes
Download and extract recipes.csv.zip from #algorithms and start a new Jupyter Notebook!!!!
Classification algorithm - spam filter
The more spammy words that are in an email, the more likely it is to be spam
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("../recipes.csv")
df.head()
 | cuisine | id | ingredient_list
---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
3 | indian | 22213 | water, vegetable oil, wheat, salt |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
QUESTION ONE: What are we doing and why are we using Naive Bayes?
We have a bunch of recipes in categories. Maybe someone sends us new recipes - what category do the new recipes belong in?
We’re going to train a classifier to recognize italian food, so that if someone sends us new recipes, we know whether it’s italian - because we love italian food and we only want to eat italian food.
RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.
For clustering
- You’ll get a lot of documents
- You feed it to an algorithm and tell it to create x number of categories - the machine gives you back categories whether they make sense or not
For classification (which we are doing now)
- You’ll get a lot of documents
- You’ll classify some of them into categories that you know and love
- You’ll ask the algorithm what categories a new bunch of unlabeled documents end up in
All mean the same thing: CATEGORY = CLASS = LABEL
The reason you use machine learning is to avoid doing things manually. So if you can do things manually, do it. Otherwise, try different algorithms until one works well (but you might need to know some upsides and downsides of each to interpret the results).
How does Naive Bayes work?
NAIVE BAYES WORKS WITH TEXT (kind of)
Bayes Theorem (kind of)
- If you see a word that is normally in a spam email, there’s a higher chance it’s spam
- If you see a word that is normally in a non-spam email, there’s a higher chance it’s not spam
Naive: the algorithm assumes every word/ingredient/etc. is independent of every other word
FOR US: If you see ingredients that are normally in italian food, it’s probably italian
Secret trick: you can’t just use the text directly - you have to convert it into numbers first
Types of Naive Bayes
Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.
Multinomial Naive Bayes - (multiple numbers): You count the words. You care about whether a word appears once or twice or three times or ten times. This is better for long passages.
Bernoulli Naive Bayes - True/False Bayes: You only care whether the word shows up (True) or doesn’t show up (False). This is better for short passages.
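Here’s a minimal sketch of the difference (not from the class notebook), using scikit-learn’s CountVectorizer, which we’ll meet properly later - word counts for the Multinomial flavor, plain yes/no for the Bernoulli flavor:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam spam spam eggs", "eggs toast"]

# Multinomial-style features: how many times each word shows up
count_vectorizer = CountVectorizer()
print(count_vectorizer.fit_transform(docs).toarray())
# [[1 3 0]
#  [1 0 1]]   <- columns are 'eggs', 'spam', 'toast'

# Bernoulli-style features: does the word show up at all (1) or not (0)
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(docs).toarray())
# [[1 1 0]
#  [1 0 1]]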
STEP ONE: Let’s convert our text data into numerical data
df.head()
 | cuisine | id | ingredient_list
---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
3 | indian | 22213 | water, vegetable oil, wheat, salt |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
Our problem: Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn’t matter
Two things to convert into numbers:
- Our labels (a.k.a. the categories everything belongs in)
- Our features
Converting our labels into numbers
We have two labels:
- italian = 1
- not italian = 0
df.head()
 | cuisine | id | ingredient_list
---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
3 | indian | 22213 | water, vegetable oil, wheat, salt |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
def make_label(cuisine):
    if cuisine == "italian":
        return 1
    else:
        return 0
df['label'] = df['cuisine'].apply(make_label)
df.head(10)
 | cuisine | id | ingredient_list | label
---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 |
3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 |
5 | jamaican | 6602 | plain flour, sugar, butter, eggs, fresh ginger... | 0 |
6 | spanish | 42779 | olive oil, salt, medium shrimp, pepper, garlic... | 0 |
7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 |
8 | mexican | 16903 | olive oil, purple onion, fresh pineapple, pork... | 0 |
9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 |
Converting our features into numbers
Feature selection: The process of selecting the features that matter, in this case - what ingredients do we want to look at?
Our features are going to be: whether it has spaghetti and whether it has curry powder
df['has_spaghetti'] = df['ingredient_list'].str.contains("spaghetti")
df['has_curry_powder'] = df['ingredient_list'].str.contains("curry powder")
df.head(10)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder
---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False |
3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 | False | False |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 | False | False |
5 | jamaican | 6602 | plain flour, sugar, butter, eggs, fresh ginger... | 0 | False | False |
6 | spanish | 42779 | olive oil, salt, medium shrimp, pepper, garlic... | 0 | False | False |
7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 | False | False |
8 | mexican | 16903 | olive oil, purple onion, fresh pineapple, pork... | 0 | False | False |
9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 | False | False |
Let’s run our tests
Let’s feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!
Looking at our labels
We stored it in label, and if it’s 0 it’s not italian, if it’s 1 it is italian
df['label'].head()
0 0
1 0
2 0
3 0
4 0
Name: label, dtype: int64
Looking at our features
We have two features: has_spaghetti and has_curry_powder.
df[['has_spaghetti', 'has_curry_powder']].head()
 | has_spaghetti | has_curry_powder
---|---|---
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
Now let’s finally do this
# We need to split into training and testing data
from sklearn.model_selection import train_test_split
# Splitting into...
# X = all our features
# y = all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1: italian or not italian)
    test_size=0.2) # 80% training, 20% testing
# Oh hey, it's just our features from the dataframe
X_train
 | has_spaghetti | has_curry_powder
---|---|---
18816 | False | False
30480 | False | False
19110 | False | False
29312 | False | False
23782 | False | False
... | ... | ...
39665 | False | False
13013 | False | False
31819 rows × 2 columns
# X is always the features, whether it's for training or for testing
X_test
 | has_spaghetti | has_curry_powder
---|---|---
23827 | False | False
24607 | False | False
16829 | False | False
6473 | False | False
23662 | False | False
... | ... | ...
31076 | False | False
104 | False | False
7955 rows × 2 columns
len(X_train)
31819
len(X_test)
7955
# We're testing on ~8000 and training on ~32000
# y_train is our labels that we are training on
y_train
18816 0
30480 0
19110 0
29312 1
23782 0
..
39665 0
13013 0
Name: label, dtype: int64
# And y_test is the labels we're testing on
y_test
23827 0
24607 0
16829 1
6473 0
23662 0
..
31076 0
104 0
Name: label, dtype: int64
print("Length of training labels:", len(y_train))
print("Length of testing labels:", len(y_test))
print("Length of training features:", len(X_train))
print("Length of testing features:", len(X_test))
Length of training labels: 31819
Length of testing labels: 7955
Length of training features: 31819
Length of testing features: 7955
Basically all that happened was train_test_split took us from having a nice dataframe where everything was together and split it into two groups of two - it separated our labels vs. our features, and our training data vs. our testing data.
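One gotcha that isn’t obvious from the above: train_test_split shuffles randomly, so every run gives a slightly different split (and slightly different scores). If you want the same numbers every time, here’s a minimal sketch - random_state is the only new piece:
# Passing random_state makes the split (and therefore the scores) reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']],
    df['label'],
    test_size=0.2,
    random_state=42)  # any fixed number works, 42 is just a convention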
Back to actually doing our fitting etc
# Splitting into...
# X = all our features
# y = all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1: italian or not italian)
    test_size=0.2) # 80% training, 20% testing
# Import naive_bayes to get access to ALL kinds of naive bayes classifiers
# But REMEMBER we're using Bernoulli because it's for true/false, which is fine
# for short passages
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Feed the classifier two things:
# * our training features (X_train)
# * our training labels (y_train)
# To help it study for the exam later when we test it
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
# This looks ugly, but it's the prediction for every recipe in our test set
# All those zeroes = predicted "not italian"
# Here it's predicting that the first three and the last three recipes aren't italian
clf.predict(X_test)
array([0, 0, 0, ..., 0, 0, 0])
# Naive Bayes can't overfit, really
# It can't "study too hard" and it can't "memorize the questions"
# (a decision tree can)
# So if we give it the training data back it will get some wrong
clf.score(X_train, y_train)
0.81083629278104274
clf.score(X_test, y_test)
0.80905091137649277
df['cuisine'].value_counts()
italian 7838
mexican 6438
southern_us 4320
indian 3003
chinese 2673
french 2646
cajun_creole 1546
thai 1539
japanese 1423
greek 1175
spanish 989
korean 830
vietnamese 825
moroccan 821
british 804
filipino 755
irish 667
jamaican 526
russian 489
brazilian 467
Name: cuisine, dtype: int64
df['has_spaghetti']
0        False
1        False
2        False
3        False
4        False
...
39772    False
39773    False
Name: has_spaghetti, dtype: bool
#df[['has_spaghetti', 'has_curry_powder']]
df[['has_spaghetti']]
 | has_spaghetti
---|---
0 | False
1 | False
2 | False
3 | False
4 | False
... | ...
39772 | False
39773 | False
39774 rows × 1 columns
df.head()
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder
---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False |
3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 | False | False |
4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 | False | False |
Wow, we did a really great job! Let’s try another cuisine
Step 1: Preparing our data
Creating labels that scikit-learn can use
Our cuisine is brazilian, so we’ll use 0 and 1 for whether it’s that cuisine or not
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0
df['is_brazilian'] = df['cuisine'].apply(make_label)
df.head(2)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian
---|---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 |
Creating features that scikit-learn can use
It’s Bernoulli Naive Bayes, so our features are True and False
df['has_salt'] = df['ingredient_list'].str.contains('salt')
df.head(2)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian | has_water | has_salt
---|---|---|---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 | False | False |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 | False | True |
Step 2: Create the test/train split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['is_brazilian'], # the second parameter is the LABEL (this is 0/1: brazilian or not brazilian)
    test_size=0.2) # 80% training, 20% testing
Step 3: Create classifier, train and test
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Fit with our training data
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.98821458876771739
clf.score(X_test, y_test)
0.9884349465744815
Dummy Classifier to see the baseline performance we need to beat
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy='most_frequent')
# Fit with our training data
dummy_clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
dummy_clf.score(X_train, y_train)
0.98821458876771739
dummy_clf.score(X_test, y_test)
0.9884349465744815
We just got destroyed by math: let’s actually understand Naive Bayes
Naive Bayes gives you back a probability for each possible label - so, % chance that it’s brazilian vs. the % chance that it is not brazilian. We’ll use this to see what went wrong.
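If you want to peek at those probabilities yourself, BernoulliNB has a predict_proba method. A quick sketch (not part of the original notebook), run on our brazilian classifier’s test features:
# Each row is [probability of "not brazilian", probability of "brazilian"]
# for one recipe in the test set
clf.predict_proba(X_test)[:5]
# You should see the "not brazilian" column close to 1 for basically every recipe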
Math stuff
Naive Bayes is all about calculating the probability of “B given A”, a.k.a. the chance of B being true if A is true.
- **Bayes Theorem:** P(B|A) = P(A and B) / P(A)
- P(A) means “what is the probability of A being true?”
- P(B|A) means “if A is true, what is the probability of B being true?”
- P(A and B) means “what is the probability of both A and B being true?”
Example: We have a recipe and it has water in it. Is it brazilian?
Hypothesis one: the recipe is brazilian
- P(B|A) would be “if it contains water, what is the chance that it is brazilian cuisine?”
- P(A and B) would be “what is the chance that it contains both water and is brazilian?”
- P(A) would be “what is the chance that this contains water?”
# P(B|A) = P(A and B)/P(A)
# P(A and B)
# Probability that a recipe has water and is brazilian
# How many recipes have water AND are brazilian?
len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])
109
# P(A)
# How many recipes have water? (careful: we want the count of True values,
# not len(), which would count every single recipe)
df['has_water'].sum()
9494
# P(B|A)
# The chance that a recipe is brazilian if it has water in it
# (109 water-and-brazilian recipes out of 9494 recipes with water)
109/9494
0.0114809353275...
Hypothesis two: the recipe is NOT brazilian
- P(B|A) would be “if it contains water, what is the chance that it is NOT brazilian cuisine?”
- P(A and B) would be “what is the chance that it contains both water and is NOT brazilian?”
- P(A) would be “what is the chance that this contains water?”
# P(A and B)
# Probability that a recipe has water and is NOT brazilian
# How many recipes have water AND are NOT brazilian?
len(df[(df['has_water']) & (df['cuisine'] != 'brazilian')])
9385
# P(A)
# How many recipes have water?
df['has_water'].sum()
9494
# P(B|A)
# The chance that a recipe is NOT brazilian if it has water in it
9385/9494
0.9885190646724...
What this boils down to
No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? Doesn’t really matter, it’s probably not brazilian.
len(df[df['cuisine'] == 'brazilian'])
467
len(df)
39774
# Only a little bit over 1% of our recipes are brazilian
# so even though it ALWAYS says "not brazilian", it's usually right
467/39774
0.011741338562880274
1 - 467/39774
0.9882586614371197
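The same trap was hiding in our Italian classifier earlier: about 80% of recipes aren’t italian, so the ~0.81 score we were so proud of is barely better than always guessing “not italian”. A quick sanity check, using the italian count from value_counts above:
# Score of a classifier that always says "not italian" =
# the fraction of recipes that aren't italian
1 - 7838/39774
# about 0.803 - right around what our spaghetti/curry classifier scored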
Let’s fix up our labels
Before we had this:
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0
which does not scale well. If we wanted to add in more cuisines, we’d need to keep adding else-ifs again and again and again until our fingers fell off. And we’d probably misspell something. And if we’re anything, it’s LAZY.
LabelEncoder to the rescue: Converts categories into numeric labels
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# LabelEncoder has two parts: FIT and TRANSFORM
# FIT learns all of the possible labels
# TRANSFORM takes a list of categories and converts them into numbers
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])
LabelEncoder()
# Get the labels out as numbers
le.transform(['orange', 'blue', 'yellow'])
array([1, 0, 3])
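It also works in reverse, which we’ll lean on at the very end of these notes: inverse_transform turns the numbers back into the original category names. A tiny sketch:
# Numbers back into categories
le.inverse_transform([1, 0, 3])
# array(['orange', 'blue', 'yellow'], ...)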
# Send the label encoder each and every cuisine
le.fit(df['cuisine'])
LabelEncoder()
le.transform(df['cuisine'])
array([ 6, 16, 4, ..., 8, 3, 13])
df['cuisine_label'] = le.transform(df['cuisine'])
df.head(3)
 | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian | has_water | has_salt | cuisine_label
---|---|---|---|---|---|---|---|---|---|---
0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 | False | False | 6 |
1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 | False | True | 16 |
2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False | 0 | False | True | 4 |
Let’s train and test with our new labels
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Learn how related every cuisine is to water and salt
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.19840346962506678
clf.score(X_test, y_test)
0.20251414204902576
Let’s add some more features to see if we can do a better job
Right now I’m only looking at water and salt which doesn’t tell you much, maybe you’re looking at tortillas or cumin or soy sauce which tells you a little bit more.
df['has_miso'] = df['ingredient_list'].str.contains("miso")
df['has_soy_sauce'] = df['ingredient_list'].str.contains("soy sauce")
df['has_cilantro'] = df['ingredient_list'].str.contains("cilantro")
df['has_black_olives'] = df['ingredient_list'].str.contains("black olives")
df['has_tortillas'] = df['ingredient_list'].str.contains("tortillas")
df['has_turmeric'] = df['ingredient_list'].str.contains("turmeric")
df['has_pistachios'] = df['ingredient_list'].str.contains("pistachios")
df['has_lemongrass'] = df['ingredient_list'].str.contains("lemongrass")
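Side note: if typing all of those has_ lines by hand feels tedious, a loop builds the same thing - a minimal sketch that rebuilds the same columns as the lines above:
# One True/False column per ingredient we care about
ingredients_to_check = ['miso', 'soy sauce', 'cilantro', 'black olives',
                        'tortillas', 'turmeric', 'pistachios', 'lemongrass']
for ingredient in ingredients_to_check:
    df['has_' + ingredient.replace(' ', '_')] = df['ingredient_list'].str.contains(ingredient)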
Our new feature set is!!! df[['has_spaghetti', 'has_miso', 'has_soy_sauce',
'has_cilantro','has_black_olives','has_tortillas','has_turmeric',
'has_pistachios','has_lemongrass']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing
from sklearn import naive_bayes
# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()
# Learn how related every cuisine is to our new ingredient features
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.37232471165027187
clf.score(X_test, y_test)
0.36379635449402892
This is taking forever, please let there be an automatic way to pick out all of the words
from sklearn.feature_extraction.text import CountVectorizer
# STEP ONE: .fit to learn all of the words
# STEP TWO: .transform to turn a sentence into numbers
#vectorizer = CountVectorizer()
# ngram_range=(1,2) means we get 'olive' and 'oil' AND 'olive oil', instead of just 'olive' and 'oil'
# max_features=3000 means only keep the 3000 most frequent words/ngrams
vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)
# We have some sentences
# We're going to feed it to the vectorizer
# and it's going to learn all of the words
sentences = [
    "cats are cool",
    "dogs are cool"
]
vectorizer.fit(sentences)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=3000, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
# We're going to take some sentences and feed them to the vectorizer
# and it's going to convert them into numbers
vectorizer.transform(sentences)
<2x7 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
# But it looks bad to look at so I'll use .toarray()
vectorizer.transform(sentences).toarray()
array([[1, 1, 1, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1, 1]])
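If you’re wondering which column corresponds to which word/ngram, the vectorizer can tell you. A quick peek (not in the original notebook; the method is get_feature_names_out() in newer scikit-learn, get_feature_names() in older versions):
# Which column is which word or two-word phrase?
vectorizer.get_feature_names_out()
# ['are', 'are cool', 'cats', 'cats are', 'cool', 'dogs', 'dogs are']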
# In our case, our text is the list of ingredients. We can get it through
df['ingredient_list'].head()
0 romaine lettuce, black olives, grape tomatoes,...
1 plain flour, ground pepper, salt, tomatoes, gr...
2 eggs, pepper, salt, mayonaise, cooking oil, gr...
3 water, vegetable oil, wheat, salt
4 black pepper, shallots, cornflour, cayenne pep...
Name: ingredient_list, dtype: object
# Dear vectorizer, please learn all of these words
vectorizer.fit(df['ingredient_list'])
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=3000, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
# Dear vectorizer, please convert ingredient_list into features
# That we can do machine learning on
every_single_word_features = vectorizer.transform(df['ingredient_list'])
every_single_word_features
<39774x3000 sparse matrix of type '<class 'numpy.int64'>'
with 1243216 stored elements in Compressed Sparse Row format>
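A quick aside on why scikit-learn hands this back as a sparse matrix instead of a normal table: almost every cell is zero, because most recipes only mention a handful of the 3,000 words/ngrams. A rough sanity check, using the numbers straight from the repr above:
# Fraction of cells that are non-zero
every_single_word_features.nnz / (every_single_word_features.shape[0] * every_single_word_features.shape[1])
# roughly 0.01 - about 1% of the 39774 x 3000 cells actually have anything in them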
Now let’s try with our new complete labels and our new complete features that include every single word
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    every_single_word_features,
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing
This is Naive Bayes with every word as a feature pushed through the CountVectorizer
print("This is Naive Bayes")
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
%time clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", clf.score(X_test, y_test))
This is Naive Bayes
CPU times: user 55.8 ms, sys: 17.2 ms, total: 73 ms
Wall time: 109 ms
Training score: (stuff it already knows) 0.714384487256
Testing score: (stuff it hasn't seen before): 0.680578252671
But maybe it’s just chance? Let’s try the Dummy Classifier
from sklearn.dummy import DummyClassifier
print("This is the Dummy Classifier")
dummy_clf = DummyClassifier()
%time dummy_clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", dummy_clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", dummy_clf.score(X_test, y_test))
This is the Dummy Classifier
CPU times: user 2.58 ms, sys: 397 µs, total: 2.98 ms
Wall time: 2.41 ms
Training score: (stuff it already knows) 0.100254564883
Testing score: (stuff it hasn't seen before): 0.0999371464488
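One wrinkle: this time we didn’t pass strategy='most_frequent', so the dummy (with the default strategy in the scikit-learn version used for these notes) guesses cuisines at random in proportion to how common each one is - that’s why it only scores about 10%. A most-frequent dummy would answer “italian” every time and score roughly 20% (italian is 7838 of 39774 recipes), which Naive Bayes still beats comfortably. A sketch:
# The "always guess the most common cuisine" baseline
from sklearn.dummy import DummyClassifier
most_frequent_clf = DummyClassifier(strategy='most_frequent')
most_frequent_clf.fit(X_train, y_train)
print("Most-frequent dummy test score:", most_frequent_clf.score(X_test, y_test))
# should land around 0.20 - the share of recipes that are italian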
This is a Decision Tree with every single feature from the CountVectorizer
print("This is a Decision Tree")
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()
%time tree_clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", tree_clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", tree_clf.score(X_test, y_test))
This is a Decision Tree
CPU times: user 15.4 s, sys: 340 ms, total: 15.8 s
Wall time: 19.7 s
Training score: (stuff it already knows) 0.999780005657
Testing score: (stuff it hasn't seen before): 0.638592080453
from sklearn.ensemble import RandomForestClassifier
print("This is a Random Forest")
forest_clf = RandomForestClassifier()
%time forest_clf.fit(X_train, y_train)
# How does it do on the training data?
print("Training score: (stuff it already knows)", forest_clf.score(X_train, y_train))
# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", forest_clf.score(X_test, y_test))
This is a Random Forest
CPU times: user 10 s, sys: 288 ms, total: 10.3 s
Wall time: 13.6 s
Training score: (stuff it already knows) 0.992645903391
Testing score: (stuff it hasn't seen before): 0.706096794469
How do you do this in the real world with new data?
every_single_word_features = vectorizer.transform(df['ingredient_list'])
# Import the Naive bayes thing
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
# Give the classifier EVERYTHING we know, not holding back anything
clf.fit(every_single_word_features, df['cuisine_label'])
# We have some new stuff we have not categorized
incoming_recipes = [
    "spaghetti tomato sauce garlic onion water",
    "soy sauce ginger sugar butter",
    "green papaya thai chilies palm sugar",
    "butter oil salt black pepper water milk bubblegumpie"
]
features_for_new_recipes = vectorizer.transform(incoming_recipes)
features_for_new_recipes
<4x3000 sparse matrix of type '<class 'numpy.int64'>'
with 35 stored elements in Compressed Sparse Row format>
predictions = clf.predict(features_for_new_recipes)
predictions
array([ 4, 11, 4, 16])
# The predictions are all categories that the labelencoder decided on
# Let's convert those numeric ones back into real fun cuisine words
le.inverse_transform(predictions)
array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)
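To make that easier to read, you can line each incoming recipe up with its predicted cuisine - a small closing sketch:
# Pair each new recipe with the cuisine the classifier guessed for it
for recipe, cuisine in zip(incoming_recipes, le.inverse_transform(predictions)):
    print(cuisine, "->", recipe)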