Algorithms: Rules of Play

  1. Name of the algorithm
  2. What it’s used for (classification, clustering, maybe other things?)
  3. Why is it better/worse than other classification/clustering/etc algorithms
  4. How to get our data into a format that is good for that algorithm
  5. REALISTIC data sets
  6. What the output means technically
  8. What the output means in real-life language and practically speaking
  8. What kind of datasets you use this algorithm for
  9. Examples of when it was used in journalism OR maybe could have been used
  10. Examples of when it was used period
  11. Pitfalls
  12. Maybe maybe maybe a little bit of math
  13. How to ground them for a less technical audience and to help engage them in what the algorithm is doing

Naive Bayes

Download and extract recipes.csv.zip from #algorithms and start a new Jupyter Notebook!!!!

Classification algorithm - spam filter

The more spammy words that are in an email, the more likely it is to be spam

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("../recipes.csv")
df.head()
cuisine id ingredient_list
0 greek 10259 romaine lettuce, black olives, grape tomatoes,...
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...
3 indian 22213 water, vegetable oil, wheat, salt
4 indian 13162 black pepper, shallots, cornflour, cayenne pep...

QUESTION ONE: What are we doing and why are we using Naive Bayes?

We have a bunch of recipes in categories. If someone sends us new recipes, what category do the new recipes belong in?

We’re going to train a classifier to recognize italian food, so that if someone sends us new recipes, we know whether they’re italian. (We love italian food and we only want to eat italian food.)

RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.

For clustering

  1. You’ll get a lot of documents
  2. You feed it to an algorithm and tell it to create x number of categories
  3. The machine gives you back categories whether they make sense or not

For classification (which we are doing now)

  1. You’ll get a lot of documents
  2. You’ll classify some of them into categories that you know and love
  3. You’ll ask the algorithm what categories a new bunch of unlabeled documents end up in

All mean the same thing: CATEGORY = CLASS = LABEL

The reason you use machine learning is to avoid doing things manually. So if you can do things manually, do that. Otherwise, try different algorithms until one works well (but you might need to know some of the upsides and downsides of each to interpret the results).
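To make "try different algorithms" concrete, here's a minimal sketch (not part of the original notebook, and assuming a train/test split like the one we build below): loop over a couple of classifiers and compare their test scores.

# A rough sketch: fit a few classifiers on the same split and compare scores
# Assumes X_train, X_test, y_train, y_test already exist (see below)
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

for candidate in [BernoulliNB(), DecisionTreeClassifier()]:
    candidate.fit(X_train, y_train)
    print(candidate.__class__.__name__, candidate.score(X_test, y_test))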

How does Naive Bayes work?

NAIVE BAYES WORKS WITH TEXT (kind of)

Bayes Theorem (kind of)

  • If you see a word that is normally in a spam email, there’s a higher chance it’s spam
  • If you see a word that is normally in a non-spam email, there’s a higher chance it’s not spam

Naive: the algorithm assumes every word/ingredient/etc. is independent of every other word

FOR US: If you see ingredients that are normally in italian food, it’s probably italian

Secret trick: you can’t just use text, you have to convert it into numbers
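Just to make that concrete, here's a tiny hand-rolled sketch (not from the notebook) of what "converting text into numbers" looks like, using a made-up four-ingredient vocabulary:

# A made-up vocabulary and one ingredient list, converted by hand
vocabulary = ["basil", "garlic", "soy sauce", "tortillas"]
ingredients = "chopped tomatoes, fresh basil, garlic, extra-virgin olive oil"

# Bernoulli-style features: True/False for "does this ingredient appear?"
features = [word in ingredients for word in vocabulary]
print(features)
# [True, True, False, False]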

Types of Naive Bayes

Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.

Multinomial Naive Bayes (multiple numbers): You count the words. You care about whether a word appears once or twice or three times or ten times. This is better for long passages.

Bernoulli Naive Bayes - True/False Bayes: You only care if the word shows up (True) or it doesn’t show up (False) - this is better for short passages
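A minimal sketch of the difference (made-up word counts, not our recipe data): MultinomialNB looks at how many times each word shows up, while BernoulliNB with binarize=0.0 just turns every nonzero count into True.

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Made-up data: rows are documents, columns are word counts, labels are 0/1
word_counts = [[3, 0, 1],
               [0, 2, 0],
               [4, 1, 0],
               [0, 0, 2]]
labels = [1, 0, 1, 0]

# Multinomial cares about the counts themselves
multinomial_clf = MultinomialNB().fit(word_counts, labels)

# Bernoulli only cares whether a word appears at all (any count > 0 becomes True)
bernoulli_clf = BernoulliNB(binarize=0.0).fit(word_counts, labels)

print(multinomial_clf.predict([[2, 0, 0]]), bernoulli_clf.predict([[2, 0, 0]]))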

STEP ONE: Let’s convert our text data into numerical data

df.head()
cuisine id ingredient_list
0 greek 10259 romaine lettuce, black olives, grape tomatoes,...
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...
3 indian 22213 water, vegetable oil, wheat, salt
4 indian 13162 black pepper, shallots, cornflour, cayenne pep...

Our problem: Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn’t matter

Two things to convert into numbers:

  • Our labels (a.k.a. the categories everything belongs in)
  • Our features

Converting our labels into numbers

We have two labels

  • italian = 1
  • not italian = 0
df.head()
cuisine id ingredient_list
0 greek 10259 romaine lettuce, black olives, grape tomatoes,...
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...
3 indian 22213 water, vegetable oil, wheat, salt
4 indian 13162 black pepper, shallots, cornflour, cayenne pep...
def make_label(cuisine):
    if cuisine == "italian":
        return 1
    else:
        return 0
df['label'] = df['cuisine'].apply(make_label)
df.head(10)
cuisine id ingredient_list label
0 greek 10259 romaine lettuce, black olives, grape tomatoes,... 0
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... 0
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... 0
3 indian 22213 water, vegetable oil, wheat, salt 0
4 indian 13162 black pepper, shallots, cornflour, cayenne pep... 0
5 jamaican 6602 plain flour, sugar, butter, eggs, fresh ginger... 0
6 spanish 42779 olive oil, salt, medium shrimp, pepper, garlic... 0
7 italian 3735 sugar, pistachio nuts, white almond bark, flou... 1
8 mexican 16903 olive oil, purple onion, fresh pineapple, pork... 0
9 italian 12734 chopped tomatoes, fresh basil, garlic, extra-v... 1

Converting our features into numbers

Feature selection: The process of selecting the features that matter. In this case: which ingredients do we want to look at?

Our features are going to be whether it has spaghetti and whether it has curry powder

df['has_spaghetti'] = df['ingredient_list'].str.contains("spaghetti")
df['has_curry_powder'] = df['ingredient_list'].str.contains("curry powder")
df.head(10)
cuisine id ingredient_list label has_spaghetti has_curry_powder
0 greek 10259 romaine lettuce, black olives, grape tomatoes,... 0 False False
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... 0 False False
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... 0 False False
3 indian 22213 water, vegetable oil, wheat, salt 0 False False
4 indian 13162 black pepper, shallots, cornflour, cayenne pep... 0 False False
5 jamaican 6602 plain flour, sugar, butter, eggs, fresh ginger... 0 False False
6 spanish 42779 olive oil, salt, medium shrimp, pepper, garlic... 0 False False
7 italian 3735 sugar, pistachio nuts, white almond bark, flou... 1 False False
8 mexican 16903 olive oil, purple onion, fresh pineapple, pork... 0 False False
9 italian 12734 chopped tomatoes, fresh basil, garlic, extra-v... 1 False False

Let’s run our tests

Let’s feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!

Looking at our labels

We stored it in label: if it’s 0 it’s not italian, if it’s 1 it is italian

df['label'].head()
0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

Looking at our features

We have two features has_spaghetti and has_curry_powder.

df[['has_spaghetti', 'has_curry_powder']].head()
has_spaghetti has_curry_powder
0 False False
1 False False
2 False False
3 False False
4 False False

Now let’s finally do this

# We need to split into training and testing data
from sklearn.model_selection import train_test_split
# Splitting into...
# X = all our features
# y = all our labels
# X_train = the features we train on (80%)
# y_train = the labels we train on (80%)
# X_test = the features we test on (20%)
# y_test = the labels we test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (0 = not italian, 1 = italian)
    test_size=0.2) # 80% training, 20% testing
# Oh hey, it's just our features from the dataframe
X_train
has_spaghetti has_curry_powder
18816 False False
30480 False False
19110 False False
29312 False False
23782 False False
... ... ...
39665 False False
13013 False False

31819 rows × 2 columns

# X is always the features, whether it's for training or for testing
X_test
has_spaghetti has_curry_powder
23827 False False
24607 False False
16829 False False
6473 False False
23662 False False
... ... ...
31076 False False
104 False False

7955 rows × 2 columns

len(X_train)
31819
len(X_test)
7955
# We're testing on ~8,000 and training on ~32,000
# y_train is the labels we are training on
y_train
18816    0
30480    0
19110    0
29312    1
23782    0
        ..
39665    0
13013    0
Name: label, dtype: int64
# And y_test is the labels we're testing on
y_test
23827    0
24607    0
16829    1
6473     0
23662    0
        ..
31076    0
104      0
Name: label, dtype: int64
print("Length of training labels:", len(y_train))
print("Length of testing labels:", len(y_test))
print("Length of training features:", len(X_train))
print("Length of testing features:", len(X_test))
Length of training labels: 31819
Length of testing labels: 7955
Length of training features: 31819
Length of testing features: 7955

Basically, all that happened was that train_test_split took our nice dataframe where everything was together and split it two ways: our labels vs. our features, and our training data vs. our testing data.
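One thing to know (not shown in the notebook): the split is random, so every re-run gives a different 80/20 shuffle and slightly different scores. If you want a repeatable split, a minimal sketch is to pass random_state:

from sklearn.model_selection import train_test_split

# Same call as above, but random_state pins the shuffle so the split
# (and therefore the scores) come out the same every time you run it
X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # FEATURES
    df['label'],                               # LABEL (0 = not italian, 1 = italian)
    test_size=0.2,                             # 80% training, 20% testing
    random_state=42)                           # any fixed number works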

Back to actually doing our fitting etc

# Splitting into...
# X = all our features
# y = all our labels
# X_train = the features we train on (80%)
# y_train = the labels we train on (80%)
# X_test = the features we test on (20%)
# y_test = the labels we test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (0 = not italian, 1 = italian)
    test_size=0.2) # 80% training, 20% testing
# Import naive_bayes to get access to ALL kinds of naive bayes classifiers
# But REMEMBER we're using Bernoulli because it's for true/false which is fine
# for small passages
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Feed the classifier two things:
#   * our training features (X_train)
#   * our training labels (y_train)
# To help it study for the exam later when we test it
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
# This looks ugly, but it's the prediction for every recipe in the test set
# All those zeroes = predicted "not italian"
# The display only shows the first three and the last three predictions
clf.predict(X_test)
array([0, 0, 0, ..., 0, 0, 0])
# Naive Bayes doesn't really overfit
# It can't "study too hard", it can't "memorize the questions"
# (a decision tree can)
# So even if we give it the training data back, it will get some wrong
clf.score(X_train, y_train)
0.81083629278104274
clf.score(X_test, y_test)
0.80905091137649277
df['cuisine'].value_counts()
italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64
df['has_spaghetti']
0        False
1        False
2        False
3        False
4        False
         ...  
39772    False
39773    False
Name: has_spaghetti, dtype: bool
#df[['has_spaghetti', 'has_curry_powder']]
df[['has_spaghetti']]
has_spaghetti
0 False
1 False
2 False
3 False
4 False
... ...
39772 False
39773 False

39774 rows × 1 columns

df.head()
cuisine id ingredient_list label has_spaghetti has_curry_powder
0 greek 10259 romaine lettuce, black olives, grape tomatoes,... 0 False False
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... 0 False False
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... 0 False False
3 indian 22213 water, vegetable oil, wheat, salt 0 False False
4 indian 13162 black pepper, shallots, cornflour, cayenne pep... 0 False False

Wow, we did a really great job! Let’s try another cuisine

Step 1: Preparing our data

Creating labels that scikit-learn can use

Our cuisine is brazilian, so we’ll use 0 and 1 for whether it’s that cuisine or not

def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0

df['is_brazilian'] = df['cuisine'].apply(make_label)
df.head(2)
cuisine id ingredient_list label has_spaghetti has_curry_powder is_brazilian
0 greek 10259 romaine lettuce, black olives, grape tomatoes,... 0 False False 0
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... 0 False False 0

Creating features that scikit-learn can use

It’s Bernoulli Naive Bayes, so our features are True and False

df['has_water'] = df['ingredient_list'].str.contains('water')
df['has_salt'] = df['ingredient_list'].str.contains('salt')
df.head(2)
cuisine id ingredient_list label has_spaghetti has_curry_powder is_brazilian has_water has_salt
0 greek 10259 romaine lettuce, black olives, grape tomatoes,... 0 False False 0 False False
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... 0 False False 0 False True

Step 2: Create the test/train split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['is_brazilian'], # the second parameter is the LABEL (0 = not brazilian, 1 = brazilian)
    test_size=0.2) # 80% training, 20% testing

Step 3: Create classifier, train and test

from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Fit with our training data
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.98821458876771739
clf.score(X_test, y_test)
0.9884349465744815

Dummy Classifier to see the no-effort baseline we need to beat

from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy='most_frequent')

# Fit with our training data
dummy_clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
dummy_clf.score(X_train, y_train)
0.98821458876771739
dummy_clf.score(X_test, y_test)
0.9884349465744815

We just got destroyed by math: let’s actually understand Naive Bayes

Naive Bayes gives you back a probability for each possible label - so, % chance that it’s brazilian vs. the % chance that it is not brazilian. We’ll use this to see what went wrong.
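For example, a minimal sketch (using the brazilian classifier and split from above) asks for those probabilities with predict_proba, which isn't shown elsewhere in these notes:

# The columns of predict_proba line up with clf.classes_
print(clf.classes_)               # [0 1]  ->  not brazilian, brazilian
print(clf.predict_proba(X_test[:5]))
# Each row adds up to 1. If the "not brazilian" column is always close to 0.99,
# the classifier will never predict brazilian, no matter what the features say.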

Math stuff

Naive Bayes is all about calculating the probability of “B given A”, a.k.a., the chance of B being true if A is true.

  • Bayes Theorem: P(B|A) = P(A and B)/P(A)
  • P(A) means “what is the probability of A being true?”
  • P(B|A) means “if A is true, what is the probability of B being true?”
  • P(A and B) means “what is the probability of both A and B being true?”

Example: We have a recipe and it has water in it. Is it brazilian?

Hypothesis one: the recipe is brazilian

  • P(B|A) would be “if it contains water, what is the chance that it is brazilian cuisine?”
  • P(A and B) would be “what is the chance that it contains both water and is brazilian?”
  • P(A) would be “what is the chance that this contains water?”
# P(B|A) = P(A and B)/P(A)
# P(A and B)
# Probability that a recipe has water and is brazilian

# How many recipes have water AND are brazilian?
len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])
109
# P(A): in this simplified version, we just divide by the total number of recipes
# (len() of a column counts every row, whether or not it has water)
len(df['has_water'])
39774
# The share of all recipes that have water AND are brazilian
# (strictly this is P(A and B), but since both hypotheses get divided by the
# same total, comparing these shares picks the same winner as comparing P(B|A))
109/39774
0.0027404837330919697

Hypothesis two: the recipe is NOT brazilian

  • P(B|A) would be “if it contains water, what is the chance that it is NOT brazilian cuisine?”
  • P(A and B) would be “what is the chance that it contains both water and is NOT brazilian?”
  • P(A) would be “what is the chance that this contains water?”
# P(A and B)
# Probability that a recipe has water and is NOT brazilian

# How many recipes have water AND are NOT brazilian?
len(df[(df['has_water']) & (df['cuisine'] != 'brazilian')])
9385
# P(A)
# Again, the total number of recipes (not just the ones with water)
len(df['has_water'])
39774
# The share of all recipes that have water AND are NOT brazilian
# (compare with the 0.0027 above: "not brazilian" wins by a landslide)
9385/39774
0.2359581636244783

What this boils down to

No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? Doesn’t really matter, it’s probably not brazilian.

len(df[df['cuisine'] == 'brazilian'])
467
len(df)
39774
# Only a little bit over 1% of our recipes are brazilian
# so even though it ALWAYS says "not brazilian", it's usually right
467/39774
0.011741338562880274
1 - 467/39774
0.9882586614371197
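For completeness, here is a minimal sketch (not in the original notebook) of the proper conditionals, dividing by only the recipes that actually contain water; the conclusion doesn't change:

has_water = df['has_water']
is_brazilian = df['cuisine'] == 'brazilian'

# Divide by the number of recipes that actually contain water
p_brazilian_given_water = (has_water & is_brazilian).sum() / has_water.sum()
p_not_brazilian_given_water = (has_water & ~is_brazilian).sum() / has_water.sum()

print(p_brazilian_given_water)      # still only around 1% of watery recipes are brazilian
print(p_not_brazilian_given_water)  # around 99%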

Let’s fix up our labels

Before we had this:

def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0

which does not scale well. If we wanted to add in more different cuisines, we’d need to keep adding in else ifs again and again and again until our fingers fell off. And we’d probably misspell something. And if we’re anything, it’s LAZY.

LabelEncoder to the rescue: Converts categories into numeric labels

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# LabelEncoder has two parts: FIT and TRANSFORM
# FIT learns all of the possible labels
# TRANSFORM takes a list of categories and converts them into numbers
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates 
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])
LabelEncoder()
# Get the labels out as numbers
le.transform(['orange', 'blue', 'yellow'])
array([1, 0, 3])
# Send the label encoder each and every cuisine
le.fit(df['cuisine'])
LabelEncoder()
le.transform(df['cuisine'])
array([ 6, 16,  4, ...,  8,  3, 13])
df['cuisine_label'] = le.transform(df['cuisine'])
df.head(3)
cuisine id ingredient_list label has_spaghetti has_curry_powder is_brazilian has_water has_salt cuisine_label
0 greek 10259 romaine lettuce, black olives, grape tomatoes,... 0 False False 0 False False 6
1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... 0 False False 0 False True 16
2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... 0 False False 0 False True 4

Let’s train and test with our new labels

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, any of the 20 cuisines)
    test_size=0.2) # 80% training, 20% testing
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Learn how related every cuisine is to water and salt
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.19840346962506678
clf.score(X_test, y_test)
0.20251414204902576

Let’s add some more features to see if we can do a better job

Right now I’m only looking at water and salt, which doesn’t tell you much. Ingredients like tortillas or cumin or soy sauce tell you a little bit more.

df['has_miso'] = df['ingredient_list'].str.contains("miso")
df['has_soy_sauce'] = df['ingredient_list'].str.contains("soy sauce")
df['has_cilantro'] = df['ingredient_list'].str.contains("cilantro")
df['has_black_olives'] = df['ingredient_list'].str.contains("black olives")
df['has_tortillas'] = df['ingredient_list'].str.contains("tortillas")
df['has_turmeric'] = df['ingredient_list'].str.contains("turmeric")
df['has_pistachios'] = df['ingredient_list'].str.contains("pistachios")
df['has_lemongrass'] = df['ingredient_list'].str.contains("lemongrass")

Our new feature set is!!! df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, any of the 20 cuisines)
    test_size=0.2) # 80% training, 20% testing
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Learn how related every cuisine is to each of our new ingredient features
clf.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.score(X_train, y_train)
0.37232471165027187
clf.score(X_test, y_test)
0.36379635449402892

This is taking forever, please let there be an automatic way to pick out all of the words

from sklearn.feature_extraction.text import CountVectorizer

# STEP ONE: .fit to learn all of the words
# STEP TWO: .transform to turn a sentence into numbers

# vectorizer = CountVectorizer()
# ngram_range=(1,2) gives us 'olive', 'oil' AND 'olive oil', not just 'olive' and 'oil'
# max_features=3000 only keeps the 3000 most frequent words/ngrams
vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)
# We have some sentences
# We're going to feed it to the vectorizer
# and it's going to learn all of the words
sentences = [
    "cats are cool",
    "dogs are cool"
]
vectorizer.fit(sentences)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=3000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
# We're going to take some sentences and feed them to the vectorizer
# and it's going to convert them into numbers
vectorizer.transform(sentences)
<2x7 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>
# But it's hard to read, so I'll use .toarray()
vectorizer.transform(sentences).toarray()
array([[1, 1, 1, 1, 1, 0, 0],
       [1, 1, 0, 0, 1, 1, 1]])
# In our case, our text is the list of ingredients. We can get it through
df['ingredient_list'].head()
0    romaine lettuce, black olives, grape tomatoes,...
1    plain flour, ground pepper, salt, tomatoes, gr...
2    eggs, pepper, salt, mayonaise, cooking oil, gr...
3                    water, vegetable oil, wheat, salt
4    black pepper, shallots, cornflour, cayenne pep...
Name: ingredient_list, dtype: object
# Dear vectorizer, please learn all of these words
vectorizer.fit(df['ingredient_list'])
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=3000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
# Dear vectorizer, please convert ingredient_list into features
# That we can do machine learning on

every_single_word_features = vectorizer.transform(df['ingredient_list'])
every_single_word_features
<39774x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 1243216 stored elements in Compressed Sparse Row format>

Now let’s try with our new complete labels and our new complete features that include every single word

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    every_single_word_features,
    df['cuisine_label'], # the second parameter is the LABEL (0-19: southern_us, brazilian, any of the 20 cuisines)
    test_size=0.2) # 80% training, 20% testing

This is Naive Bayes with every word as a feature pushed through the CountVectorizer

print("This is Naive Bayes")

from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
%time clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", clf.score(X_test, y_test))
This is Naive Bayes
CPU times: user 55.8 ms, sys: 17.2 ms, total: 73 ms
Wall time: 109 ms
Training score: (stuff it already knows) 0.714384487256
Testing score: (stuff it hasn't seen before): 0.680578252671

But maybe it’s just chance? Let’s try the Dummy Classifier

from sklearn.dummy import DummyClassifier

print("This is the Dummy Classifier")

dummy_clf = DummyClassifier()
%time dummy_clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", dummy_clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", dummy_clf.score(X_test, y_test))
This is the Dummy Classifier
CPU times: user 2.58 ms, sys: 397 µs, total: 2.98 ms
Wall time: 2.41 ms
Training score: (stuff it already knows) 0.100254564883
Testing score: (stuff it hasn't seen before): 0.0999371464488

This is a Decision Tree with every single feature from the CountVectorizer

print("This is a Decision Tree")

from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()

%time tree_clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", tree_clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", tree_clf.score(X_test, y_test))
This is a Decision Tree
CPU times: user 15.4 s, sys: 340 ms, total: 15.8 s
Wall time: 19.7 s
Training score: (stuff it already knows) 0.999780005657
Testing score: (stuff it hasn't seen before): 0.638592080453
from sklearn.ensemble import RandomForestClassifier

print("This is a Random Forest")

tree_clf = RandomForestClassifier()

%time tree_clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", tree_clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", tree_clf.score(X_test, y_test))
This is a Random Forest
CPU times: user 10 s, sys: 288 ms, total: 10.3 s
Wall time: 13.6 s
Training score: (stuff it already knows) 0.992645903391
Testing score: (stuff it hasn't seen before): 0.706096794469

How do you do this in the real world with new data?

every_single_word_features = vectorizer.transform(df['ingredient_list'])
# Import the Naive Bayes thing
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()

# Give the classifier EVERYTHING we know, not holding back anything
clf.fit(every_single_word_features, df['cuisine_label'])

# We have some new stuff we have not categorized
incoming_recipes = [
    "spaghetti tomato sauce garlic onion water",
    "soy sauce ginger sugar butter",
    "green papaya thai chilies palm sugar",
    "butter oil salt black pepper water milk bubblegumpie"
]

features_for_new_recipes = vectorizer.transform(incoming_recipes)
features_for_new_recipes
<4x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 35 stored elements in Compressed Sparse Row format>
predictions = clf.predict(features_for_new_recipes)
predictions
array([ 4, 11,  4, 16])
# The predictions are numeric categories that the LabelEncoder decided on
# Let's convert those numbers back into real fun cuisine words
le.inverse_transform(predictions)
array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)
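To read that more easily, a small follow-up sketch (using the variables above) pairs each incoming recipe with the cuisine the classifier picked:

for recipe, cuisine in zip(incoming_recipes, le.inverse_transform(predictions)):
    print(cuisine, "-", recipe)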