Algorithms: Rules of Play
- Name of the algorithm
- What it’s used for (classification, clustering, or something else entirely)
- Why it’s better or worse than other algorithms for the same task
- How to get our data into a format that is good for that algorithm
- REALISTIC data sets
- What the output means technically
- What the output means in plain, practical, real-life language
- What kind of datasets you use this algorithm for
- Examples of when it was used in journalism OR maybe could have been used
- Examples of when it was used period
- Pitfalls
- Maybe maybe maybe a little bit of math
- How to ground them for a less technical audience and to help engage them in what the algorithm is doing
Naive Bayes
Download and extract recipes.csv.zip from #algorithms and start a new Jupyter Notebook!!!!
Classification algorithm - spam filter
The more spammy words there are in an email, the more likely it is to be spam
|   | cuisine | id | ingredient_list |
|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
QUESTION ONE: What are we doing and why are we using Naive Bayes?
We have a bunch of recipes that are already sorted into categories. If someone sends us new recipes, which categories do the new ones belong in?
We’re going to train a classifier to recognize Italian food, so that when someone sends us new recipes we’ll know whether they’re Italian (we love Italian food, and Italian food is all we want to eat).
RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.
For clustering
- You’ll get a lot of documents
- You feed them to an algorithm and tell it to create `x` number of categories
- The machine gives you back categories, whether they make sense or not
For classification (which we are doing now)
- You’ll get a lot of documents
- You’ll classify some of them into categories that you know and love
- You’ll ask the algorithm which categories a new bunch of unlabeled documents end up in
All mean the same thing: CATEGORY = CLASS = LABEL
The reason you use machine learning is to avoid doing things manually. So if you can do something manually, just do it manually. Otherwise, try different algorithms until one works well (but you might need to know some of the upsides and downsides of each to interpret the results).
How does Naive Bayes work?
NAIVE BAYES WORKS WITH TEXT (kind of)
Bayes Theorem (kind of)
- If you see a word that is normally in a spam email, there’s a higher chance it’s spam
- If you see a word that is normally in a non-spam email, there’s a higher chance it’s not spam
Naive: every word/ingredient/etc. is assumed to be independent of every other word
FOR US: If you see ingredients that are normally in italian food, it’s probably italian
Secret trick: you can’t just use text, you have to convert it into numbers
Types of Naive Bayes
Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.
Multinomial Naive Bayes (multiple numbers): You count the words. You care whether a word appears once or twice or three times or ten times. This is better for long passages.
Bernoulli Naive Bayes (True/False Bayes): You only care whether the word shows up (`True`) or doesn’t show up (`False`). This is better for short passages.
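Both flavors live in scikit-learn and take the same kind of input. A minimal sketch, with toy counts and labels made up purely for illustration:

```python
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import numpy as np

# Toy word counts: each row is a document, each column is a word
counts = np.array([
    [3, 0, 1],   # a "long" doc where one word shows up three times
    [0, 2, 0],
    [1, 1, 1],
])
labels = [1, 0, 0]  # made-up labels: 1 = spam, 0 = not spam

# Multinomial NB uses the counts themselves
MultinomialNB().fit(counts, labels)

# Bernoulli NB only cares about presence/absence: by default it
# binarizes anything above 0.0 into True
BernoulliNB().fit(counts, labels)
```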
STEP ONE: Let’s convert our text data into numerical data
|   | cuisine | id | ingredient_list |
|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
Our problem: everything is text - cuisine is text, the ingredient list is text, and id is a number but it doesn’t matter.
Two things to convert into numbers:
- Our labels (a.k.a. the categories everything belongs in)
- Our features
Converting our labels into numbers
We have two labels:
- italian = `1`
- not italian = `0`
|   | cuisine | id | ingredient_list |
|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... |
|   | cuisine | id | ingredient_list | label |
|---|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 |
| 5 | jamaican | 6602 | plain flour, sugar, butter, eggs, fresh ginger... | 0 |
| 6 | spanish | 42779 | olive oil, salt, medium shrimp, pepper, garlic... | 0 |
| 7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 |
| 8 | mexican | 16903 | olive oil, purple onion, fresh pineapple, pork... | 0 |
| 9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 |
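A minimal sketch of how you might build that label column, assuming the CSV loads into a pandas dataframe called `df`:

```python
import pandas as pd

df = pd.read_csv("recipes.csv")

# 1 if the cuisine is italian, 0 otherwise
df['label'] = (df['cuisine'] == 'italian').astype(int)
df.head(10)
```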
Converting our features into numbers
Feature selection: The process of selecting the features that matter, in this case - what ingredients do we want to look at?
Our features are going to be: whether it has spaghetti or not, and whether it has curry powder or not.
|   | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder |
|---|---|---|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 | False | False |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 | False | False |
| 5 | jamaican | 6602 | plain flour, sugar, butter, eggs, fresh ginger... | 0 | False | False |
| 6 | spanish | 42779 | olive oil, salt, medium shrimp, pepper, garlic... | 0 | False | False |
| 7 | italian | 3735 | sugar, pistachio nuts, white almond bark, flou... | 1 | False | False |
| 8 | mexican | 16903 | olive oil, purple onion, fresh pineapple, pork... | 0 | False | False |
| 9 | italian | 12734 | chopped tomatoes, fresh basil, garlic, extra-v... | 1 | False | False |
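A sketch of building those True/False columns, assuming `ingredient_list` is one big comma-separated string per recipe:

```python
# True if the ingredient string mentions the word anywhere at all
df['has_spaghetti'] = df['ingredient_list'].str.contains('spaghetti')
df['has_curry_powder'] = df['ingredient_list'].str.contains('curry powder')
```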
Let’s run our tests
Let’s feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!
Looking at our labels
We stored it in `label`: if it’s `0` it’s not Italian, and if it’s `1` it is Italian.
0 0
1 0
2 0
3 0
4 0
Name: label, dtype: int64
Look at our features
We have two features, `has_spaghetti` and `has_curry_powder`.
|   | has_spaghetti | has_curry_powder |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
| 4 | False | False |
Now let’s finally do this
|   | has_spaghetti | has_curry_powder |
|---|---|---|
| 18816 | False | False |
| 30480 | False | False |
| 19110 | False | False |
| 29312 | False | False |
| 23782 | False | False |
| ... | ... | ... |
| 39665 | False | False |
| 13013 | False | False |

31819 rows × 2 columns
|   | has_spaghetti | has_curry_powder |
|---|---|---|
| 23827 | False | False |
| 24607 | False | False |
| 16829 | False | False |
| ... | ... | ... |
| 19641 | False | True |
| 15389 | False | True |
| ... | ... | ... |
| 31076 | False | False |
| 104 | False | False |

7955 rows × 2 columns
31819
7955
18816    0
30480    0
19110    0
29312    1
23782    0
...
39665    0
13013    0
Name: label, dtype: int64
23827    0
24607    0
16829    1
6473     0
23662    0
...
31076    0
104      0
Name: label, dtype: int64
Length of training labels: 31819
Length of testing labels: 7955
Length of training features: 31819
Length of testing features: 7955
Basically, all that happened was that `train_test_split` took our one nice dataframe where everything was together and split it into two groups of two: our labels vs. our features, and our training data vs. our testing data.
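A sketch of that split, using the features and labels built above (in recent versions of scikit-learn this lives in `sklearn.model_selection`; the 80/20 proportions here match the 31819/7955 row counts shown earlier):

```python
from sklearn.model_selection import train_test_split

X = df[['has_spaghetti', 'has_curry_powder']]
y = df['label']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("Length of training labels:", len(y_train))
print("Length of testing labels:", len(y_test))
```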
Back to actually doing our fitting etc
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
array([0, 0, 0, ..., 0, 0, 0])
0.81083629278104274
0.80905091137649277
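The fitting itself is only a few lines. A sketch using the split above (the commented values are roughly what the outputs show):

```python
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()
clf.fit(X_train, y_train)      # learn from the training data

clf.predict(X_test)            # array([0, 0, 0, ..., 0, 0, 0])
clf.score(X_train, y_train)    # ~0.81 on recipes it has seen
clf.score(X_test, y_test)      # ~0.81 on recipes it hasn't
```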
italian 7838
mexican 6438
southern_us 4320
indian 3003
chinese 2673
french 2646
cajun_creole 1546
thai 1539
japanese 1423
greek 1175
spanish 989
korean 830
vietnamese 825
moroccan 821
british 804
filipino 755
irish 667
jamaican 526
russian 489
brazilian 467
Name: cuisine, dtype: int64
0        False
1        False
2        False
...
27       True
28       False
...
39767    True
39768    False
...
39773    False
Name: has_spaghetti, dtype: bool
|   | has_spaghetti |
|---|---|
| 0 | False |
| 1 | False |
| 2 | False |
| ... | ... |
| 27 | True |
| ... | ... |
| 39767 | True |
| ... | ... |
| 39773 | False |

39774 rows × 1 columns
|   | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder |
|---|---|---|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False |
| 3 | indian | 22213 | water, vegetable oil, wheat, salt | 0 | False | False |
| 4 | indian | 13162 | black pepper, shallots, cornflour, cayenne pep... | 0 | False | False |
Wow, we did a really great job! Let’s try another cuisine
Step 1: Preparing our data
Creating labels that scikit-learn can use
Our cuisine is brazilian, so we’ll use `0` and `1` for whether it’s brazilian or not.
|   | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian |
|---|---|---|---|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 |
Creating features that scikit-learn can use
It’s Bernoulli Naive Bayes, so our features are `True` and `False`.

|   | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian | has_water | has_salt |
|---|---|---|---|---|---|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 | False | False |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 | False | True |
Step 2: Create the test/train split
Step 3: Create classifier, train and test
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
0.98821458876771739
0.9884349465744815
Dummy Classifier to see baseline performance (the score you’d get without learning anything, by always guessing the most frequent class)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
0.98821458876771739
0.9884349465744815
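A sketch of that baseline check: scikit-learn’s `DummyClassifier` with `strategy='most_frequent'` always predicts whichever class is most common in the training data. If your real classifier can’t beat it, it hasn’t learned anything.

```python
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)

# Identical to our "really great" Naive Bayes scores above
dummy.score(X_train, y_train)   # ~0.988
dummy.score(X_test, y_test)     # ~0.988
```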
We just got destroyed by math: let’s actually understand Naive Bayes
Naive Bayes gives you back a probability for each possible label - so, % chance that it’s brazilian vs. the % chance that it is not brazilian. We’ll use this to see what went wrong.
Math stuff
Naive Bayes is all about calculating the probability of “B given A”, a.k.a., the chance of B being true if A is true.
**Bayes Theorem:** `P(B|A) = P(A and B) / P(A)`

- `P(A)` means “what is the probability of A being true?”
- `P(B|A)` means “if A is true, what is the probability of B being true?”
- `P(A and B)` means “what is the probability of both A and B being true?”
Example: We have a recipe and it has water in it. Is it brazilian?
Hypothesis one: the recipe is brazilian
- `P(B|A)` would be “if it contains water, what is the chance that it is brazilian cuisine?”
- `P(A and B)` would be “what is the chance that it contains both water and is brazilian?”
- `P(A)` would be “what is the chance that this contains water?”

109 of the 39774 recipes contain water AND are brazilian:

109 / 39774 = 0.0027404837330919697
Hypothesis two: the recipe is NOT brazilian
- `P(B|A)` would be “if it contains water, what is the chance that it is NOT brazilian cuisine?”
- `P(A and B)` would be “what is the chance that it contains both water and is NOT brazilian?”
- `P(A)` would be “what is the chance that this contains water?”

9385 of the 39774 recipes contain water AND are NOT brazilian:

9385 / 39774 = 0.2359581636244783
What this boils down to
No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? It doesn’t really matter: it’s probably not brazilian. Only 467 of the 39774 recipes are brazilian at all:

467 / 39774 = 0.011741338562880274
1 - 0.011741338562880274 = 0.9882586614371197

That 98.8% is the score our classifier (and the dummy classifier) got: you can score ~0.988 just by guessing “not brazilian” every single time.
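A sketch of that arithmetic in pandas, assuming the `has_water` column built earlier:

```python
water = df['has_water']                    # True/False feature column
brazilian = df['cuisine'] == 'brazilian'   # True/False per recipe

print((water & brazilian).sum() / len(df))    # ≈ 0.0027
print((water & ~brazilian).sum() / len(df))   # ≈ 0.2360
print(brazilian.mean())                       # ≈ 0.0117, the base rate
print(1 - brazilian.mean())                   # ≈ 0.9883
```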
Let’s fix up our labels
Before we had this:
```python
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0
```
which does not scale well. If we wanted to add more cuisines, we’d need to keep adding elifs again and again and again until our fingers fell off. And we’d probably misspell something. And if we’re anything, it’s LAZY.
LabelEncoder to the rescue: Converts categories into numeric labels
LabelEncoder()
array([1, 0, 3])
LabelEncoder()
array([ 6, 16, 4, ..., 8, 3, 13])
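A sketch of how LabelEncoder works, first on a made-up toy list and then on the real cuisine column (the exact numbers just come from the alphabetical order of the class names):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Toy example: each unique string gets a number, assigned alphabetically
le.fit_transform(['greek', 'brazilian', 'italian'])   # array([1, 0, 2])

# The real thing: one numeric label per cuisine
df['cuisine_label'] = le.fit_transform(df['cuisine'])
```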
|   | cuisine | id | ingredient_list | label | has_spaghetti | has_curry_powder | is_brazilian | has_water | has_salt | cuisine_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | greek | 10259 | romaine lettuce, black olives, grape tomatoes,... | 0 | False | False | 0 | False | False | 6 |
| 1 | southern_us | 25693 | plain flour, ground pepper, salt, tomatoes, gr... | 0 | False | False | 0 | False | True | 16 |
| 2 | filipino | 20130 | eggs, pepper, salt, mayonaise, cooking oil, gr... | 0 | False | False | 0 | False | True | 4 |
Let’s train and test with our new labels
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
0.19840346962506678
0.20251414204902576
Let’s add some more features to see if we can do a better job
Right now I’m only looking at water and salt, which doesn’t tell you much; ingredients like tortillas or cumin or soy sauce would tell you a little bit more.
Our new feature set is!!!

```python
df[['has_spaghetti', 'has_miso', 'has_soy_sauce',
    'has_cilantro', 'has_black_olives', 'has_tortillas',
    'has_turmeric', 'has_pistachios', 'has_lemongrass']]
```
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
0.37232471165027187
0.36379635449402892
This is taking forever, please let there be an automatic way to pick out all of the words
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=3000, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
<2x7 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
array([[1, 1, 1, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1, 1]])
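The outputs above come from CountVectorizer. A minimal sketch on two made-up strings (to inspect the vocabulary, older scikit-learn uses `get_feature_names()`, newer versions `get_feature_names_out()`):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform([
    "salt pepper garlic olive oil",
    "salt pepper soy sauce ginger",
])

matrix.toarray()   # one row per document, one column per word, counts inside
```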
0 romaine lettuce, black olives, grape tomatoes,...
1 plain flour, ground pepper, salt, tomatoes, gr...
2 eggs, pepper, salt, mayonaise, cooking oil, gr...
3 water, vegetable oil, wheat, salt
4 black pepper, shallots, cornflour, cayenne pep...
Name: ingredient_list, dtype: object
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=3000, min_df=1,
ngram_range=(1, 2), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
<39774x3000 sparse matrix of type '<class 'numpy.int64'>'
with 1243216 stored elements in Compressed Sparse Row format>
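And a sketch of the real run, matching the options shown in the output above (keep the 3,000 most frequent terms, counting single words and two-word phrases):

```python
vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 2))
all_features = vectorizer.fit_transform(df['ingredient_list'])
all_features.shape   # (39774, 3000)
```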
Now let’s try with our new complete labels and our new complete features that include every single word
This is Naive Bayes with every word as a feature pushed through the CountVectorizer
This is Naive Bayes
CPU times: user 55.8 ms, sys: 17.2 ms, total: 73 ms
Wall time: 109 ms
Training score: (stuff it already knows) 0.714384487256
Testing score: (stuff it hasn't seen before): 0.680578252671
But maybe it’s just chance? Let’s try the Dummy Classifier
This is the Dummy Classifier
CPU times: user 2.58 ms, sys: 397 µs, total: 2.98 ms
Wall time: 2.41 ms
Training score: (stuff it already knows) 0.100254564883
Testing score: (stuff it hasn't seen before): 0.0999371464488
This is a Decision Tree with every single feature from the CountVectorizer
This is a Decision Tree
CPU times: user 15.4 s, sys: 340 ms, total: 15.8 s
Wall time: 19.7 s
Training score: (stuff it already knows) 0.999780005657
Testing score: (stuff it hasn't seen before): 0.638592080453
This is a Random Forest
CPU times: user 10 s, sys: 288 ms, total: 10.3 s
Wall time: 13.6 s
Training score: (stuff it already knows) 0.992645903391
Testing score: (stuff it hasn't seen before): 0.706096794469
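A sketch of the bake-off above, with one hedge: the outputs don’t show which Naive Bayes was used here, so this sketch uses MultinomialNB since we’re feeding it word counts.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    all_features, df['cuisine_label'], test_size=0.2)

# Same features, same split, three different classifiers
for clf in [MultinomialNB(), DecisionTreeClassifier(), RandomForestClassifier()]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__)
    print("  Training score:", clf.score(X_train, y_train))
    print("  Testing score: ", clf.score(X_test, y_test))
```

Notice the Decision Tree scores nearly perfectly on the training data but much worse on the testing data: it memorized the recipes it saw instead of learning something general.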
How do you do this in the real world with new data?
<4x3000 sparse matrix of type '<class 'numpy.int64'>'
with 35 stored elements in Compressed Sparse Row format>
array([ 4, 11, 4, 16])
array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)
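A sketch of that last step: transform the new recipes with the vectorizer that was already fit (`transform`, NOT `fit_transform`, so the vocabulary stays the same), predict numeric labels with whichever classifier you trained, then turn the numbers back into cuisine names with the LabelEncoder. The four recipes below are made up for illustration:

```python
new_recipes = [
    "eggs, pepper, salt, mayonnaise, cooking oil, garlic",
    "soy sauce, mirin, rice, nori, sesame seeds",
    "pork, vinegar, garlic, bay leaves, peppercorns",
    "plain flour, buttermilk, butter, sugar, eggs",
]

new_features = vectorizer.transform(new_recipes)  # reuse the fitted vocabulary
predictions = clf.predict(new_features)           # numeric cuisine labels
le.inverse_transform(predictions)                 # back to names like 'filipino'
```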