{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Algorithms: Rules of Play\n", "\n", "1. Name of the algorithm\n", "2. What it's used for (classification, clustering, maybe other things?)\n", "3. Why is it better/worse than other classification/clustering/etc. algorithms\n", "4. How to get our data into a format that is good for that algorithm\n", "5. REALISTIC data sets\n", "6. What the output means technically\n", "7. What the output means in real-life language and practically speaking\n", "8. What kind of datasets you use this algorithm for\n", "9. Examples of when it was used in journalism OR maybe could have been used\n", "10. Examples of when it was used, period\n", "11. Pitfalls\n", "12. Maybe maybe maybe a little bit of math\n", "13. How to ground the algorithm for a less technical audience and to help engage them in what it is doing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Naive Bayes\n", "\n", "Download and extract `recipes.csv.zip` from `#algorithms` and start a new Jupyter Notebook!!!!\n", "\n", "**Classification algorithm** - spam filter\n", "\n", "The more spammy words there are in an email, the more likely it is to be spam" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_list
0greek10259romaine lettuce, black olives, grape tomatoes,...
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...
3indian22213water, vegetable oil, wheat, salt
4indian13162black pepper, shallots, cornflour, cayenne pep...
\n", "
" ], "text/plain": [ "       cuisine     id                                    ingredient_list\n", "0        greek  10259  romaine lettuce, black olives, grape tomatoes,...\n", "1  southern_us  25693  plain flour, ground pepper, salt, tomatoes, gr...\n", "2     filipino  20130  eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3       indian  22213                  water, vegetable oil, wheat, salt\n", "4       indian  13162  black pepper, shallots, cornflour, cayenne pep..." ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"recipes.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# QUESTION ONE: What are we doing and why are we using Naive Bayes?\n", "\n", "We have a bunch of recipes in categories. Say someone sends us new recipes: what category do the new recipes belong in?\n", "\n", "We're going to train a classifier to recognize italian food, so that if someone sends us new recipes, we'll know which ones are italian, because we love italian food and we only want to eat italian food.\n", "\n", "RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.\n", "\n", "**For clustering**\n", "\n", "1. You'll get a lot of documents\n", "2. You feed them to an algorithm and tell it to create `x` number of categories\n", "3. The machine gives you back categories whether they make sense or not\n", "\n", "**For classification (which we are doing now)**\n", "\n", "1. You'll get a lot of documents\n", "2. You'll classify some of them into categories that you know and love\n", "3. You'll ask the algorithm what categories a new bunch of unlabeled documents end up in\n", "\n", "These all mean the same thing: CATEGORY = CLASS = LABEL\n", "\n", "The reason you use machine learning is to avoid doing things manually. So if you can do things manually, do it. 
Otherwise just try different algorithms until one works well (but you might need to know some upsides or downsides of each to interpret that).\n", "\n", "## How does Naive Bayes work?\n", "\n", "NAIVE BAYES WORKS WITH TEXT (kind of)\n", "\n", "**Bayes' Theorem (kind of)**\n", "\n", "* If you see a word that is normally in a spam email, there's a higher chance it's spam\n", "* If you see a word that is normally in a non-spam email, there's a higher chance it's not spam\n", "\n", "**Naive:** every word/ingredient/etc. is assumed to be independent of every other word\n", "\n", "FOR US: If you see ingredients that are normally in italian food, it's probably italian\n", "\n", "Secret trick: you can't just use text, you have to convert it into numbers\n", "\n", "## Types of Naive Bayes\n", "\n", "Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.\n", "\n", "**Multinomial Naive Bayes (multiple numbers):** You count the words. You care about whether a word appears once or twice or three times or ten times. *This is better for long passages*\n", "\n", "**Bernoulli Naive Bayes - True/False Bayes:** You only care if the word shows up (`True`) or it doesn't show up (`False`) - *this is better for short passages*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# STEP ONE: Let's convert our text data into numerical data" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_list
0greek10259romaine lettuce, black olives, grape tomatoes,...
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...
3indian22213water, vegetable oil, wheat, salt
4indian13162black pepper, shallots, cornflour, cayenne pep...
\n", "
" ], "text/plain": [ " cuisine id ingredient_list\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,...\n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...\n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3 indian 22213 water, vegetable oil, wheat, salt\n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep..." ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Our problem:** Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn't matter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Two things to convert into numbers:**\n", "\n", "* Our labels (a.k.a. the categories everything belongs in)\n", "* Our features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting our labels into numbers\n", "\n", "We have two labels\n", "\n", "* italian = `1`\n", "* not italian = `0`" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_list
0greek10259romaine lettuce, black olives, grape tomatoes,...
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...
3indian22213water, vegetable oil, wheat, salt
4indian13162black pepper, shallots, cornflour, cayenne pep...
\n", "
" ], "text/plain": [ " cuisine id ingredient_list\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,...\n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...\n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3 indian 22213 water, vegetable oil, wheat, salt\n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep..." ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def make_label(cuisine):\n", " if cuisine == \"italian\":\n", " return 1\n", " else:\n", " return 0" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabel
0greek10259romaine lettuce, black olives, grape tomatoes,...0
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0
3indian22213water, vegetable oil, wheat, salt0
4indian13162black pepper, shallots, cornflour, cayenne pep...0
5jamaican6602plain flour, sugar, butter, eggs, fresh ginger...0
6spanish42779olive oil, salt, medium shrimp, pepper, garlic...0
7italian3735sugar, pistachio nuts, white almond bark, flou...1
8mexican16903olive oil, purple onion, fresh pineapple, pork...0
9italian12734chopped tomatoes, fresh basil, garlic, extra-v...1
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "3 indian 22213 water, vegetable oil, wheat, salt \n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep... \n", "5 jamaican 6602 plain flour, sugar, butter, eggs, fresh ginger... \n", "6 spanish 42779 olive oil, salt, medium shrimp, pepper, garlic... \n", "7 italian 3735 sugar, pistachio nuts, white almond bark, flou... \n", "8 mexican 16903 olive oil, purple onion, fresh pineapple, pork... \n", "9 italian 12734 chopped tomatoes, fresh basil, garlic, extra-v... \n", "\n", " label \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "5 0 \n", "6 0 \n", "7 1 \n", "8 0 \n", "9 1 " ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'] = df['cuisine'].apply(make_label)\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting our features into numbers\n", "\n", "**Feature selection:** The process of selecting the features that matter, in this case - what ingredients do we want to look at?\n", "\n", "Our feature is going to be: whether it has spaghetti or not" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powder
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0FalseFalse
3indian22213water, vegetable oil, wheat, salt0FalseFalse
4indian13162black pepper, shallots, cornflour, cayenne pep...0FalseFalse
5jamaican6602plain flour, sugar, butter, eggs, fresh ginger...0FalseFalse
6spanish42779olive oil, salt, medium shrimp, pepper, garlic...0FalseFalse
7italian3735sugar, pistachio nuts, white almond bark, flou...1FalseFalse
8mexican16903olive oil, purple onion, fresh pineapple, pork...0FalseFalse
9italian12734chopped tomatoes, fresh basil, garlic, extra-v...1FalseFalse
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "3 indian 22213 water, vegetable oil, wheat, salt \n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep... \n", "5 jamaican 6602 plain flour, sugar, butter, eggs, fresh ginger... \n", "6 spanish 42779 olive oil, salt, medium shrimp, pepper, garlic... \n", "7 italian 3735 sugar, pistachio nuts, white almond bark, flou... \n", "8 mexican 16903 olive oil, purple onion, fresh pineapple, pork... \n", "9 italian 12734 chopped tomatoes, fresh basil, garlic, extra-v... \n", "\n", " label has_spaghetti has_curry_powder \n", "0 0 False False \n", "1 0 False False \n", "2 0 False False \n", "3 0 False False \n", "4 0 False False \n", "5 0 False False \n", "6 0 False False \n", "7 1 False False \n", "8 0 False False \n", "9 1 False False " ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['has_spaghetti'] = df['ingredient_list'].str.contains(\"spaghetti\")\n", "df['has_curry_powder'] = df['ingredient_list'].str.contains(\"curry powder\")\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's run our tests\n", "\n", "Let's feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!\n", "\n", "### Looking at our labels\n", "\n", "We stored it in `label`, and if it's `0` it's not italian, if it's `1` it is Italian" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "Name: label, dtype: int64" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].head()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### Looking at our features\n", "\n", "We have two features: `has_spaghetti` and `has_curry_powder`." ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghettihas_curry_powder
0FalseFalse
1FalseFalse
2FalseFalse
3FalseFalse
4FalseFalse
\n", "
" ], "text/plain": [ "   has_spaghetti has_curry_powder\n", "0          False            False\n", "1          False            False\n", "2          False            False\n", "3          False            False\n", "4          False            False" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['has_spaghetti', 'has_curry_powder']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now let's finally do this" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We need to split into training and testing data\n", "# (in newer versions of scikit-learn this import moved to sklearn.model_selection)\n", "from sklearn.cross_validation import train_test_split" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Splitting into...\n", "# X = all our features\n", "# y = all our labels\n", "# X_train are our features to train on (80%)\n", "# y_train are our labels to train on (80%)\n", "# X_test are our features to test on (20%)\n", "# y_test are our labels to test on (20%)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES\n", "    df['label'], # the second parameter is the LABEL (0 = not italian, 1 = italian)\n", "    test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghettihas_curry_powder
18816FalseFalse
30480FalseFalse
19110FalseFalse
29312FalseFalse
23782FalseFalse
7907FalseFalse
2456FalseFalse
27221FalseFalse
5228FalseFalse
37623FalseFalse
18641FalseFalse
9029FalseFalse
5549FalseFalse
21559FalseFalse
12168FalseFalse
4836FalseFalse
3560FalseFalse
21578FalseFalse
33579FalseFalse
5965FalseFalse
32581FalseFalse
5274FalseFalse
1544FalseFalse
10885FalseFalse
22168FalseFalse
29798FalseFalse
31228FalseFalse
4636FalseFalse
38889FalseFalse
39444FalseFalse
.........
19956FalseFalse
14863FalseFalse
8335FalseFalse
21372FalseFalse
8720FalseFalse
11752FalseFalse
10551FalseFalse
37474FalseFalse
7905FalseFalse
5923FalseFalse
14526FalseFalse
673FalseFalse
30444FalseFalse
15322FalseFalse
5476FalseFalse
37545FalseFalse
32634FalseFalse
36936FalseFalse
18970FalseFalse
5622FalseFalse
10731FalseFalse
37097FalseFalse
5822FalseFalse
35856FalseFalse
7579FalseFalse
27918FalseFalse
8601FalseFalse
5245FalseFalse
39665FalseFalse
13013FalseFalse
\n", "

31819 rows × 2 columns

\n", "
" ], "text/plain": [ " has_spaghetti has_curry_powder\n", "18816 False False\n", "30480 False False\n", "19110 False False\n", "29312 False False\n", "23782 False False\n", "7907 False False\n", "2456 False False\n", "27221 False False\n", "5228 False False\n", "37623 False False\n", "18641 False False\n", "9029 False False\n", "5549 False False\n", "21559 False False\n", "12168 False False\n", "4836 False False\n", "3560 False False\n", "21578 False False\n", "33579 False False\n", "5965 False False\n", "32581 False False\n", "5274 False False\n", "1544 False False\n", "10885 False False\n", "22168 False False\n", "29798 False False\n", "31228 False False\n", "4636 False False\n", "38889 False False\n", "39444 False False\n", "... ... ...\n", "19956 False False\n", "14863 False False\n", "8335 False False\n", "21372 False False\n", "8720 False False\n", "11752 False False\n", "10551 False False\n", "37474 False False\n", "7905 False False\n", "5923 False False\n", "14526 False False\n", "673 False False\n", "30444 False False\n", "15322 False False\n", "5476 False False\n", "37545 False False\n", "32634 False False\n", "36936 False False\n", "18970 False False\n", "5622 False False\n", "10731 False False\n", "37097 False False\n", "5822 False False\n", "35856 False False\n", "7579 False False\n", "27918 False False\n", "8601 False False\n", "5245 False False\n", "39665 False False\n", "13013 False False\n", "\n", "[31819 rows x 2 columns]" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Oh hey, it's just our features from the dataframe\n", "X_train" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghettihas_curry_powder
23827FalseFalse
24607FalseFalse
16829FalseFalse
6473FalseFalse
23662FalseFalse
19742FalseFalse
37244FalseFalse
19552FalseFalse
6361FalseFalse
6786FalseFalse
27241FalseFalse
9034FalseFalse
34423FalseFalse
33399FalseFalse
19641FalseTrue
15389FalseTrue
11627FalseFalse
25811FalseFalse
22079FalseFalse
5254FalseFalse
22499FalseFalse
18948FalseFalse
13672FalseFalse
31390FalseFalse
26623FalseFalse
36470FalseFalse
14916FalseFalse
22337FalseFalse
27339FalseFalse
38540FalseFalse
.........
3409FalseFalse
38281FalseFalse
12014FalseFalse
10908FalseFalse
4647FalseFalse
22629FalseFalse
32925FalseFalse
20743FalseFalse
25604FalseFalse
34821FalseFalse
38273FalseFalse
24241FalseFalse
28217FalseFalse
25094FalseFalse
9433FalseFalse
3755FalseFalse
12877FalseFalse
37839FalseFalse
30193FalseFalse
5866FalseFalse
22191FalseFalse
29451FalseTrue
29878FalseFalse
26103FalseFalse
9126FalseFalse
32127FalseFalse
34047FalseFalse
3324FalseFalse
31076FalseFalse
104FalseFalse
\n", "

7955 rows × 2 columns

\n", "
" ], "text/plain": [ " has_spaghetti has_curry_powder\n", "23827 False False\n", "24607 False False\n", "16829 False False\n", "6473 False False\n", "23662 False False\n", "19742 False False\n", "37244 False False\n", "19552 False False\n", "6361 False False\n", "6786 False False\n", "27241 False False\n", "9034 False False\n", "34423 False False\n", "33399 False False\n", "19641 False True\n", "15389 False True\n", "11627 False False\n", "25811 False False\n", "22079 False False\n", "5254 False False\n", "22499 False False\n", "18948 False False\n", "13672 False False\n", "31390 False False\n", "26623 False False\n", "36470 False False\n", "14916 False False\n", "22337 False False\n", "27339 False False\n", "38540 False False\n", "... ... ...\n", "3409 False False\n", "38281 False False\n", "12014 False False\n", "10908 False False\n", "4647 False False\n", "22629 False False\n", "32925 False False\n", "20743 False False\n", "25604 False False\n", "34821 False False\n", "38273 False False\n", "24241 False False\n", "28217 False False\n", "25094 False False\n", "9433 False False\n", "3755 False False\n", "12877 False False\n", "37839 False False\n", "30193 False False\n", "5866 False False\n", "22191 False False\n", "29451 False True\n", "29878 False False\n", "26103 False False\n", "9126 False False\n", "32127 False False\n", "34047 False False\n", "3324 False False\n", "31076 False False\n", "104 False False\n", "\n", "[7955 rows x 2 columns]" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# X is always the features, whether it's for training or for testing\n", "X_test" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "31819" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X_train)" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false }, "outputs": [ { 
"data": { "text/plain": [ "7955" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X_test)" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We're testing on ~8000 and training on ~32000" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "18816 0\n", "30480 0\n", "19110 0\n", "29312 1\n", "23782 0\n", "7907 0\n", "2456 0\n", "27221 0\n", "5228 0\n", "37623 0\n", "18641 0\n", "9029 0\n", "5549 0\n", "21559 0\n", "12168 0\n", "4836 0\n", "3560 0\n", "21578 0\n", "33579 0\n", "5965 0\n", "32581 0\n", "5274 1\n", "1544 1\n", "10885 0\n", "22168 0\n", "29798 1\n", "31228 0\n", "4636 1\n", "38889 0\n", "39444 0\n", " ..\n", "19956 0\n", "14863 1\n", "8335 0\n", "21372 1\n", "8720 0\n", "11752 0\n", "10551 1\n", "37474 1\n", "7905 1\n", "5923 1\n", "14526 1\n", "673 0\n", "30444 0\n", "15322 0\n", "5476 0\n", "37545 1\n", "32634 0\n", "36936 0\n", "18970 0\n", "5622 0\n", "10731 0\n", "37097 0\n", "5822 0\n", "35856 0\n", "7579 0\n", "27918 1\n", "8601 0\n", "5245 0\n", "39665 0\n", "13013 0\n", "Name: label, dtype: int64" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# y_train is our labels that we are training one\n", "y_train" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "23827 0\n", "24607 0\n", "16829 1\n", "6473 0\n", "23662 0\n", "19742 0\n", "37244 0\n", "19552 1\n", "6361 0\n", "6786 0\n", "27241 0\n", "9034 1\n", "34423 1\n", "33399 0\n", "19641 0\n", "15389 0\n", "11627 0\n", "25811 0\n", "22079 1\n", "5254 0\n", "22499 0\n", "18948 1\n", "13672 1\n", "31390 0\n", "26623 1\n", "36470 0\n", "14916 1\n", "22337 0\n", "27339 0\n", "38540 0\n", " ..\n", "3409 1\n", "38281 0\n", "12014 1\n", "10908 0\n", "4647 0\n", "22629 0\n", "32925 0\n", 
"20743    0\n", "25604    1\n", "34821    0\n", "38273    1\n", "24241    1\n", "28217    0\n", "25094    0\n", "9433     0\n", "3755     1\n", "12877    0\n", "37839    0\n", "30193    0\n", "5866     0\n", "22191    0\n", "29451    0\n", "29878    0\n", "26103    0\n", "9126     0\n", "32127    0\n", "34047    0\n", "3324     0\n", "31076    0\n", "104      0\n", "Name: label, dtype: int64" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# And y_test is the labels we're testing on\n", "y_test" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of training labels: 31819\n", "Length of testing labels: 7955\n", "Length of training features: 31819\n", "Length of testing features: 7955\n" ] } ], "source": [ "print(\"Length of training labels:\", len(y_train))\n", "print(\"Length of testing labels:\", len(y_test))\n", "print(\"Length of training features:\", len(X_train))\n", "print(\"Length of testing features:\", len(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically all that happened was `train_test_split` took our nice dataframe where everything was together and split it into two groups of two: our labels vs. our features, and our training data vs. our testing data."
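] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just to make the mechanics concrete, here is a minimal sketch of the same split on a tiny made-up dataframe (the `df_toy` frame and the `random_state` seed are illustrative assumptions, not part of the class data; in newer scikit-learn versions `train_test_split` lives in `sklearn.model_selection` instead of `sklearn.cross_validation`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A hypothetical 100-row stand-in for the recipes dataframe
df_toy = pd.DataFrame({'has_spaghetti': [True, False] * 50,
                       'label': [1, 0] * 50})

X_train, X_test, y_train, y_test = train_test_split(
    df_toy[['has_spaghetti']],  # features
    df_toy['label'],            # labels
    test_size=0.2,              # hold out 20% of the rows for testing
    random_state=0)             # fixed seed so the split is repeatable

print(len(X_train), len(X_test))  # 80 20
```

With 100 rows and `test_size=0.2` you always get 80 training rows and 20 testing rows - the same 80/20 split as above, just small enough to see.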
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Back to actually doing our fitting etc." ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Splitting into...\n", "# X = all our features\n", "# y = all our labels\n", "# X_train are our features to train on (80%)\n", "# y_train are our labels to train on (80%)\n", "# X_test are our features to test on (20%)\n", "# y_test are our labels to test on (20%)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES\n", "    df['label'], # the second parameter is the LABEL (0 = not italian, 1 = italian)\n", "    test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import naive_bayes to get access to ALL kinds of naive bayes classifiers\n", "# But REMEMBER we're using Bernoulli because it's for true/false features,\n", "# which is fine for short passages\n", "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Feed the classifier two things:\n", "# * our training features (X_train)\n", "# * our training labels (y_train)\n", "# To help it study for the exam later when we test it\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, ..., 0, 0, 0])" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This looks ugly, but it's the predicted label for every recipe in X_test\n", "# All those zeroes = not italian\n", "# We know the first three aren't italian 
and the last three aren't italian\n", "clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.81083629278104274" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Naive Bayes can't overfit, really\n", "# It can't \"study too hard\" it can't \"memorize the questions\"\n", "# (a decision tree can)\n", "# So if we give it the training data back it will get some wrong\n", "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.80905091137649277" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "italian 7838\n", "mexican 6438\n", "southern_us 4320\n", "indian 3003\n", "chinese 2673\n", "french 2646\n", "cajun_creole 1546\n", "thai 1539\n", "japanese 1423\n", "greek 1175\n", "spanish 989\n", "korean 830\n", "vietnamese 825\n", "moroccan 821\n", "british 804\n", "filipino 755\n", "irish 667\n", "jamaican 526\n", "russian 489\n", "brazilian 467\n", "Name: cuisine, dtype: int64" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['cuisine'].value_counts()" ] }, { "cell_type": "code", "execution_count": 211, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 True\n", "28 False\n", "29 False\n", 
" ... \n", "39744 False\n", "39745 False\n", "39746 False\n", "39747 False\n", "39748 False\n", "39749 False\n", "39750 False\n", "39751 False\n", "39752 False\n", "39753 False\n", "39754 False\n", "39755 False\n", "39756 False\n", "39757 False\n", "39758 False\n", "39759 False\n", "39760 False\n", "39761 False\n", "39762 False\n", "39763 False\n", "39764 False\n", "39765 False\n", "39766 False\n", "39767 True\n", "39768 False\n", "39769 False\n", "39770 False\n", "39771 False\n", "39772 False\n", "39773 False\n", "Name: has_spaghetti, dtype: bool" ] }, "execution_count": 211, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['has_spaghetti']" ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghetti
0False
1False
2False
3False
4False
5False
6False
7False
8False
9False
10False
11False
12False
13False
14False
15False
16False
17False
18False
19False
20False
21False
22False
23False
24False
25False
26False
27True
28False
29False
......
39744False
39745False
39746False
39747False
39748False
39749False
39750False
39751False
39752False
39753False
39754False
39755False
39756False
39757False
39758False
39759False
39760False
39761False
39762False
39763False
39764False
39765False
39766False
39767True
39768False
39769False
39770False
39771False
39772False
39773False
\n", "

39774 rows × 1 columns

\n", "
" ], "text/plain": [ " has_spaghetti\n", "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 True\n", "28 False\n", "29 False\n", "... ...\n", "39744 False\n", "39745 False\n", "39746 False\n", "39747 False\n", "39748 False\n", "39749 False\n", "39750 False\n", "39751 False\n", "39752 False\n", "39753 False\n", "39754 False\n", "39755 False\n", "39756 False\n", "39757 False\n", "39758 False\n", "39759 False\n", "39760 False\n", "39761 False\n", "39762 False\n", "39763 False\n", "39764 False\n", "39765 False\n", "39766 False\n", "39767 True\n", "39768 False\n", "39769 False\n", "39770 False\n", "39771 False\n", "39772 False\n", "39773 False\n", "\n", "[39774 rows x 1 columns]" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#df[['has_spaghetti', 'has_curry_powder']]\n", "df[['has_spaghetti']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 155, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powder
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0FalseFalse
3indian22213water, vegetable oil, wheat, salt0FalseFalse
4indian13162black pepper, shallots, cornflour, cayenne pep...0FalseFalse
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "3 indian 22213 water, vegetable oil, wheat, salt \n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep... \n", "\n", " label has_spaghetti has_curry_powder \n", "0 0 False False \n", "1 0 False False \n", "2 0 False False \n", "3 0 False False \n", "4 0 False False " ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wow, we did a really great job! Let's try another cuisine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Preparing our data\n", "\n", "### Creating labels that scikit-learn can use\n", "\n", "Our cuisine is , so we'll do `0` and `1` as to whether it's that cuisine or not " ] }, { "cell_type": "code", "execution_count": 156, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make_label(cuisine):\n", " if cuisine == \"brazilian\":\n", " return 1\n", " else:\n", " return 0\n", "\n", "df['is_brazilian'] = df['cuisine'].apply(make_label)" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powderis_brazilian
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse0
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse0
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "\n", " label has_spaghetti has_curry_powder is_brazilian \n", "0 0 False False 0 \n", "1 0 False False 0 " ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating features that scikit-learn can use\n", "\n", "It's Bernoulli Naive Bayes, so it's `True` and `False`" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['has_water'] = df['ingredient_list'].str.contains('water')\n", "df['has_salt'] = df['ingredient_list'].str.contains('salt')" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powderis_brazilianhas_waterhas_salt
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse0FalseFalse
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse0FalseTrue
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "\n", " label has_spaghetti has_curry_powder is_brazilian has_water has_salt \n", "0 0 False False 0 False False \n", "1 0 False False 0 False True " ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Create the test/train split" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['has_water', 'has_salt']], # the first is our FEATURES\n", " df['is_brazilian'], # the second parameter is the LABEL (this is 0/1, not italian/italian)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Create classifier, train and test" ] }, { "cell_type": "code", "execution_count": 161, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Fit with our training data\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 162, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.98821458876771739" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "0.9884349465744815" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dummy Classifier to see baseline performance\n", "\n", "A dummy classifier doesn't learn anything - it just always predicts the most common label. Any real classifier needs to beat it." ] }, { "cell_type": "code", "execution_count": 164, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.dummy import DummyClassifier" ] }, { "cell_type": "code", "execution_count": 165, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DummyClassifier(constant=None, random_state=None, strategy='most_frequent')" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'most_frequent' means \"always guess the most common label\"\n", "dummy_clf = DummyClassifier(strategy='most_frequent')\n", "\n", "# Fit with our training data\n", "dummy_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 166, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.98821458876771739" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9884349465744815" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# We just got destroyed by math: let's actually understand Naive Bayes\n", "\n", "Naive Bayes gives you back a probability for each possible label - so, the % chance that it's brazilian vs. the % chance that it is not brazilian. 
We'll use this to see what went wrong.\n", "\n", "**Math stuff**\n", "\n", "Naive Bayes is all about calculating the probability of \"B given A\", a.k.a. the chance of B being true if A is true.\n", "\n", "* **Bayes' Theorem:** `P(B|A) = P(A and B)/P(A)`\n", "\n", "* `P(A)` means \"what is the probability of A being true?\"\n", "* `P(B|A)` means \"if A is true, what is the probability of B being true?\"\n", "* `P(A and B)` means \"what is the probability of both A and B being true?\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: We have a recipe and it has water in it. Is it brazilian?\n", "\n", "**Hypothesis one: the recipe is brazilian**\n", "\n", "* `P(B|A)` would be \"if it contains water, what is the chance that it is brazilian cuisine?\"\n", "* `P(A and B)` would be \"what is the chance that it contains both water and is brazilian?\"\n", "* `P(A)` would be \"what is the chance that this contains water?\"" ] }, { "cell_type": "code", "execution_count": 168, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# P(B|A) = P(A and B)/P(A)" ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "109" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# Probability that a recipe has water and is brazilian\n", "\n", "# How many recipes have water AND are brazilian?\n", "len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "39774" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The total number of recipes (careful: len() counts EVERY row here,\n", "# not just the recipes with water)\n", "# P(A) is the same for both hypotheses, so dividing both counts by this\n", "# same total still lets us compare them fairly\n", "len(df['has_water'])" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": false }, 
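The two-hypothesis comparison can be squeezed into one self-contained sketch of Bayes' theorem. The counts below are made-up toy numbers for illustration, not the real recipes.csv counts:

```python
# Toy counts, NOT the real recipe data - just to show the arithmetic
total_recipes = 1000
water_and_brazilian = 3        # have water AND are brazilian
water_and_not_brazilian = 240  # have water AND are NOT brazilian

# P(A): the chance a recipe contains water at all
p_water = (water_and_brazilian + water_and_not_brazilian) / total_recipes

# P(B|A) = P(A and B) / P(A) for each hypothesis
p_brazilian_given_water = (water_and_brazilian / total_recipes) / p_water
p_not_brazilian_given_water = (water_and_not_brazilian / total_recipes) / p_water

# The two hypotheses always add up to 100%,
# and "not brazilian" wins in a landslide
print(p_brazilian_given_water, p_not_brazilian_given_water)
```

Because `P(A)` is identical for both hypotheses, dividing by it never changes which hypothesis wins - which is why comparing the raw `P(A and B)` counts is enough.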
"outputs": [ { "data": { "text/plain": [ "0.0027404837330919697" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# The chance that a recipe has water AND is brazilian\n", "# (dividing by the true P(A) would scale both hypotheses by the same\n", "# amount, so comparing these numbers is enough)\n", "109/39774" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hypothesis two: the recipe is NOT brazilian**\n", "\n", "* `P(B|A)` would be \"if it contains water, what is the chance that it is NOT brazilian cuisine?\"\n", "* `P(A and B)` would be \"what is the chance that it contains both water and is NOT brazilian?\"\n", "* `P(A)` would be \"what is the chance that this contains water?\"" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9385" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# Probability that a recipe has water and is NOT brazilian\n", "\n", "# How many recipes have water AND are NOT brazilian?\n", "len(df[(df['has_water']) & (df['cuisine'] != 'brazilian')])" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "39774" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The total number of recipes again (our shared denominator)\n", "len(df['has_water'])" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.2359581636244783" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# The chance that a recipe has water AND is NOT brazilian -\n", "# almost a hundred times bigger than the brazilian version\n", "9385/39774" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What this boils down to\n", "\n", "No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? 
Doesn't really matter, it's probably not brazilian.\n" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "467" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df['cuisine'] == 'brazilian'])" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "39774" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 177, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.011741338562880274" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only a little bit over 1% of our recipes are brazilian,\n", "# so even though it ALWAYS says \"not brazilian\", it's usually right\n", "467/39774" ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9882586614371197" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 - 467/39774" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's fix up our labels\n", "\n", "Before we had this:\n", "\n", " def make_label(cuisine):\n", " if cuisine == \"brazilian\":\n", " return 1\n", " else:\n", " return 0\n", "\n", "which does not scale well. If we wanted to add more cuisines, we'd need to keep adding else-ifs again and again until our fingers fell off. And we'd probably misspell something. 
And if we're anything, it's LAZY.\n", "\n", "## LabelEncoder to the rescue: Converts categories into numeric labels" ] }, { "cell_type": "code", "execution_count": 179, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "le = preprocessing.LabelEncoder()" ] }, { "cell_type": "code", "execution_count": 180, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# LabelEncoder has two parts: FIT and TRANSFORM\n", "# FIT learns all of the possible labels\n", "# TRANSFORM takes a list of categories and converts them into numbers" ] }, { "cell_type": "code", "execution_count": 181, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LabelEncoder()" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Teach the label encoder all of the possible labels\n", "# It doesn't care about duplicates \n", "le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])" ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 3])" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the labels out as numbers\n", "le.transform(['orange', 'blue', 'yellow'])" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LabelEncoder()" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Send the label encoder each and every cuisine\n", "le.fit(df['cuisine'])" ] }, { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 6, 16, 4, ..., 8, 3, 13])" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "le.transform(df['cuisine'])" ] }, { "cell_type": "code", "execution_count": 185, "metadata": { 
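One thing worth knowing about `LabelEncoder`: it can also translate the numbers back. A minimal round-trip sketch using the same toy colors as above:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

# classes_ is the sorted list of unique labels the encoder learned
print(list(le.classes_))  # ['blue', 'orange', 'red', 'yellow']

# transform: strings -> numbers; inverse_transform: numbers -> strings
codes = le.transform(['orange', 'blue', 'yellow'])
print(list(le.inverse_transform(codes)))  # ['orange', 'blue', 'yellow']
```

That's why `'orange'` came out as `1` above: the classes are numbered in alphabetical order, and `inverse_transform` is the way to turn predictions back into readable cuisine names.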
"collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powderis_brazilianhas_waterhas_saltcuisine_label
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse0FalseFalse6
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse0FalseTrue16
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0FalseFalse0FalseTrue4
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "\n", " label has_spaghetti has_curry_powder is_brazilian has_water has_salt \\\n", "0 0 False False 0 False False \n", "1 0 False False 0 False True \n", "2 0 False False 0 False True \n", "\n", " cuisine_label \n", "0 6 \n", "1 16 \n", "2 4 " ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['cuisine_label'] = le.transform(df['cuisine'])\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's train and test with our new labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 186, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['has_water', 'has_salt']], # the first is our FEATURES\n", " df['cuisine_label'], # the second parameter is the LABEL (0-16, southern us, brazilian, anything really)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Learn how related every cuisine is to water and salt\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
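Earlier we said Naive Bayes hands back a probability for every possible label; `predict_proba` is how you actually look at those probabilities. A self-contained sketch on made-up true/false features (not our recipe data):

```python
import numpy as np
from sklearn import naive_bayes

# Toy data: two true/false features, three classes (0, 1, 2)
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 2, 0, 1])

clf = naive_bayes.BernoulliNB()
clf.fit(X, y)

# One row per sample, one column per class; each row sums to 1
probs = clf.predict_proba([[1, 0]])
print(clf.classes_)  # the column order of the probabilities
print(probs)
```

`.predict()` just picks the column with the highest probability, so `predict_proba` is useful when you want to see *how confident* the classifier was, not just its final answer.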
"text/plain": [ "0.19840346962506678" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.20251414204902576" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's add some more features to see if we can do a better job\n", "\n", "Right now we're only looking at water and salt, which don't tell you much. Tortillas, cumin, or soy sauce would tell you a lot more." ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['has_miso'] = df['ingredient_list'].str.contains(\"miso\")\n", "df['has_soy_sauce'] = df['ingredient_list'].str.contains(\"soy sauce\")\n", "df['has_cilantro'] = df['ingredient_list'].str.contains(\"cilantro\")\n", "df['has_black_olives'] = df['ingredient_list'].str.contains(\"black olives\")\n", "df['has_tortillas'] = df['ingredient_list'].str.contains(\"tortillas\")\n", "df['has_turmeric'] = df['ingredient_list'].str.contains(\"turmeric\")\n", "df['has_pistachios'] = df['ingredient_list'].str.contains(\"pistachios\")\n", "df['has_lemongrass'] = df['ingredient_list'].str.contains(\"lemongrass\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our new feature set is!!! 
`df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']]`" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES\n", " df['cuisine_label'], # the second parameter is the LABEL (0-19: southern us, brazilian, anything really)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Learn how related every cuisine is to each of our ingredient features\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.37232471165027187" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 194, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.36379635449402892" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# This is taking forever, please let there be an automatic way to pick out all of the words" ] }, { "cell_type": "code", 
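Before jumping to a fully automatic approach: those eight `has_...` lines are the same statement repeated, so hypothetically you could generate them in a loop. A sketch on a tiny stand-in DataFrame (not recipes.csv):

```python
import pandas as pd

# Stand-in data: two fake recipes, not the real dataset
df = pd.DataFrame({'ingredient_list': [
    'soy sauce, miso, scallions',
    'tortillas, cilantro, lime',
]})

ingredients = ['miso', 'soy sauce', 'cilantro', 'tortillas']
for ingredient in ingredients:
    # same as df['has_soy_sauce'] = df['ingredient_list'].str.contains('soy sauce'), etc.
    column = 'has_' + ingredient.replace(' ', '_')
    df[column] = df['ingredient_list'].str.contains(ingredient)

feature_columns = ['has_' + i.replace(' ', '_') for i in ingredients]
print(df[feature_columns])
```

Adding a new ingredient is now one string in a list instead of another copy-pasted line - fewer chances to misspell a column name.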
"execution_count": 195, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# STEP ONE: .fit to learn all of the words\n", "# STEP TWO: .transform to turn a sentence into numbers\n", "\n", "#vectorizer = CountVectorizer()\n", "# So now 'olive' and 'oil' and 'olive oil' instead of just 'olive' and 'oil'\n", "# Only pick the top 3000 most frequent ngrams\n", "vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)" ] }, { "cell_type": "code", "execution_count": 196, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=3000, min_df=1,\n", " ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We have some sentences\n", "# We're going to feed it to the vectorizer\n", "# and it's going to learn all of the words\n", "sentences = [\n", " \"cats are cool\",\n", " \"dogs are cool\"\n", "]\n", "vectorizer.fit(sentences)" ] }, { "cell_type": "code", "execution_count": 197, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<2x7 sparse matrix of type ''\n", "\twith 10 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We're going to take some sentences and feed it to the vectorizer\n", "# and its' going to convert it into numbers\n", "vectorizer.transform(sentences)" ] }, { "cell_type": "code", "execution_count": 198, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 1, 1, 0, 0],\n", " [1, 1, 0, 0, 1, 1, 1]])" ] }, 
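To see exactly which unigrams and bigrams the vectorizer learned from those two sentences, you can peek at its `vocabulary_` dict (it maps each term to its column number); `fit_transform` does the fit and transform steps in one call:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "cats are cool",
    "dogs are cool",
]

# ngram_range=(1, 2): single words AND two-word phrases
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(sentences)

# Four unigrams plus three bigrams = the 7 columns of the matrix
print(sorted(vectorizer.vocabulary_))
# -> ['are', 'are cool', 'cats', 'cats are', 'cool', 'dogs', 'dogs are']
print(matrix.toarray())
```

This matches the 2x7 matrix shown above: for "cats are cool" the first five columns light up and the `dogs` columns stay zero.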
"execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# But it looks bad to look at so I'll use .toarray()\n", "vectorizer.transform(sentences).toarray()" ] }, { "cell_type": "code", "execution_count": 199, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 romaine lettuce, black olives, grape tomatoes,...\n", "1 plain flour, ground pepper, salt, tomatoes, gr...\n", "2 eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3 water, vegetable oil, wheat, salt\n", "4 black pepper, shallots, cornflour, cayenne pep...\n", "Name: ingredient_list, dtype: object" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In our case, our text is the list of ingredients. We can get it through\n", "df['ingredient_list'].head()" ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=3000, min_df=1,\n", " ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dear vectorizer, please learn all of these words\n", "vectorizer.fit(df['ingredient_list'])" ] }, { "cell_type": "code", "execution_count": 201, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<39774x3000 sparse matrix of type ''\n", "\twith 1243216 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 201, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dear vectorizer, please convert ingredient_list into features\n", "# That we can do machine learning on\n", "\n", "every_single_word_features = 
vectorizer.transform(df['ingredient_list'])\n", "every_single_word_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Now let's try with our new complete labels and our new complete features that includes every single word" ] }, { "cell_type": "code", "execution_count": 202, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " every_single_word_features,\n", " df['cuisine_label'], # the second parameter is the LABEL (0-16, southern us, brazilian, anything really)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# This is Naive Bayes with every word as a feature pushed through the CountVectorizer" ] }, { "cell_type": "code", "execution_count": 203, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Naive Bayes\n", "CPU times: user 55.8 ms, sys: 17.2 ms, total: 73 ms\n", "Wall time: 109 ms\n", "Training score: (stuff it already knows) 0.714384487256\n", "Testing score: (stuff it hasn't seen before): 0.680578252671\n" ] } ], "source": [ "print(\"This is Naive Bayes\")\n", "\n", "from sklearn import naive_bayes\n", "clf = naive_bayes.BernoulliNB()\n", "%time clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data?\n", "print(\"Testing score: (stuff it hasn't seen before):\", clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# But maybe it's just chance? 
Let's try the Dummy Classifier" ] }, { "cell_type": "code", "execution_count": 210, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the Dummy Classifier\n", "CPU times: user 2.58 ms, sys: 397 µs, total: 2.98 ms\n", "Wall time: 2.41 ms\n", "Training score: (stuff it already knows) 0.100254564883\n", "Testing score: (stuff it hasn't seen before): 0.0999371464488\n" ] } ], "source": [ "from sklearn.dummy import DummyClassifier\n", "\n", "print(\"This is the Dummy Classifier\")\n", "\n", "# The dummy ignores the features and just guesses based on how common each class is,\n", "# so its score is our 'random chance' baseline\n", "dummy_clf = DummyClassifier()\n", "%time dummy_clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", dummy_clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data?\n", "print(\"Testing score: (stuff it hasn't seen before):\", dummy_clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# This is a Decision Tree with every single feature from the CountVectorizer" ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a Decision Tree\n", "CPU times: user 15.4 s, sys: 340 ms, total: 15.8 s\n", "Wall time: 19.7 s\n", "Training score: (stuff it already knows) 0.999780005657\n", "Testing score: (stuff it hasn't seen before): 0.638592080453\n" ] } ], "source": [ "print(\"This is a Decision Tree\")\n", "\n", "from sklearn import tree\n", "tree_clf = tree.DecisionTreeClassifier()\n", "\n", "%time tree_clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", tree_clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data? A near-perfect training score with a much\n", "# lower testing score means the tree memorized the training data (overfitting)\n", "print(\"Testing score: (stuff it hasn't seen before):\", tree_clf.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": 205, "metadata": { "collapsed": false 
}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a Random Forest\n", "CPU times: user 10 s, sys: 288 ms, total: 10.3 s\n", "Wall time: 13.6 s\n", "Training score: (stuff it already knows) 0.992645903391\n", "Testing score: (stuff it hasn't seen before): 0.706096794469\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "print(\"This is a Random Forest\")\n", "\n", "tree_clf = RandomForestClassifier()\n", "\n", "%time tree_clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", tree_clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data?\n", "print(\"Testing score: (stuff it hasn't seen before):\", tree_clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How do you do this in the real world with new data?" ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "collapsed": true }, "outputs": [], "source": [ "every_single_word_features = vectorizer.transform(df['ingredient_list'])\n" ] }, { "cell_type": "code", "execution_count": 207, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<4x3000 sparse matrix of type ''\n", "\twith 35 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 207, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the Naive bayes thing\n", "from sklearn import naive_bayes\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Give the classifier EVERYTHING we know, not holding back anything\n", "clf.fit(every_single_word_features, df['cuisine_label'])\n", "\n", "# We have some new stuff we have not categorized\n", "incoming_recipes = [\n", " \"spaghetti tomato sauce garlic onion water\",\n", " \"soy sauce ginger sugar butter\",\n", " \"green papaya thai chilies palm sugar\",\n", " \"butter oil salt black pepper water milk bubblegumpie\"\n", "]\n", "\n", 
"features_for_new_recipes = vectorizer.transform(incoming_recipes)\n", "features_for_new_recipes" ] }, { "cell_type": "code", "execution_count": 208, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 4, 11, 4, 16])" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions = clf.predict(features_for_new_recipes)\n", "predictions" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The predictions are all categories that the labelencoder decided on\n", "# Let's convert those numeric ones back into real fun cuisine words\n", "le.inverse_transform(predictions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }