{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Algorithms: Rules of Play\n", "\n", "1. Name of the algorithm\n", "2. What it's used for (classification, clustering, maybe other things?)\n", "3. Why is it better/worse than other classification/clustering/etc. algorithms\n", "4. How to get our data into a format that is good for that algorithm\n", "5. REALISTIC data sets\n", "6. What the output means technically\n", "7. What the output means in real-life language and practically speaking\n", "8. What kind of datasets you use this algorithm for\n", "9. Examples of when it was used in journalism OR maybe could have been used\n", "10. Examples of when it was used, period\n", "11. Pitfalls\n", "12. Maybe maybe maybe a little bit of math\n", "13. How to ground the algorithm for a less technical audience and to help engage them in what it is doing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Naive Bayes\n", "\n", "Download and extract `recipes.csv.zip` from `#algorithms` and start a new Jupyter Notebook!!!!\n", "\n", "**Classification algorithm** - spam filter\n", "\n", "The more spammy words there are in an email, the more likely it is to be spam" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_list
0greek10259romaine lettuce, black olives, grape tomatoes,...
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...
3indian22213water, vegetable oil, wheat, salt
4indian13162black pepper, shallots, cornflour, cayenne pep...
\n", "
" ], "text/plain": [ "       cuisine     id                                    ingredient_list\n", "0        greek  10259  romaine lettuce, black olives, grape tomatoes,...\n", "1  southern_us  25693  plain flour, ground pepper, salt, tomatoes, gr...\n", "2     filipino  20130  eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3       indian  22213                  water, vegetable oil, wheat, salt\n", "4       indian  13162  black pepper, shallots, cornflour, cayenne pep..." ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"recipes.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# QUESTION ONE: What are we doing and why are we using Naive Bayes?\n", "\n", "We have a bunch of recipes in categories. Say someone sends us new recipes: what category do the new recipes belong in?\n", "\n", "We're going to train a classifier to recognize italian food, so that if someone sends us new recipes, we'll know which ones are italian, because we love italian food and we only want to eat italian food.\n", "\n", "RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.\n", "\n", "**For clustering**\n", "\n", "1. You'll get a lot of documents\n", "2. You feed them to an algorithm and tell it to create `x` number of categories\n", "3. The machine gives you back categories whether they make sense or not\n", "\n", "**For classification (which we are doing now)**\n", "\n", "1. You'll get a lot of documents\n", "2. You'll classify some of them into categories that you know and love\n", "3. You'll ask the algorithm what categories a new bunch of unlabeled documents end up in\n", "\n", "These all mean the same thing: CATEGORY = CLASS = LABEL\n", "\n", "The reason you use machine learning is to avoid doing things manually. So if you can do things manually, do it. 
Otherwise just try different algorithms until one works well (but you might need to know some upsides or downsides of each to interpret that).\n", "\n", "## How does Naive Bayes work?\n", "\n", "NAIVE BAYES WORKS WITH TEXT (kind of)\n", "\n", "**Bayes' Theorem (kind of)**\n", "\n", "* If you see a word that is normally in a spam email, there's a higher chance it's spam\n", "* If you see a word that is normally in a non-spam email, there's a higher chance it's not spam\n", "\n", "**Naive:** every word/ingredient/etc. is assumed to be independent of every other word\n", "\n", "FOR US: If you see ingredients that are normally in italian food, it's probably italian\n", "\n", "Secret trick: you can't just use text, you have to convert it into numbers\n", "\n", "## Types of Naive Bayes\n", "\n", "Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.\n", "\n", "**Multinomial Naive Bayes (multiple numbers):** You count the words. You care about whether a word appears once or twice or three times or ten times. *This is better for long passages*\n", "\n", "**Bernoulli Naive Bayes - True/False Bayes:** You only care if the word shows up (`True`) or it doesn't show up (`False`) - *this is better for short passages*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# STEP ONE: Let's convert our text data into numerical data" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_list
0greek10259romaine lettuce, black olives, grape tomatoes,...
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...
3indian22213water, vegetable oil, wheat, salt
4indian13162black pepper, shallots, cornflour, cayenne pep...
\n", "
" ], "text/plain": [ " cuisine id ingredient_list\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,...\n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...\n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3 indian 22213 water, vegetable oil, wheat, salt\n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep..." ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Our problem:** Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn't matter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Two things to convert into numbers:**\n", "\n", "* Our labels (a.k.a. the categories everything belongs in)\n", "* Our features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting our labels into numbers\n", "\n", "We have two labels\n", "\n", "* italian = `1`\n", "* not italian = `0`" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_list
0greek10259romaine lettuce, black olives, grape tomatoes,...
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...
3indian22213water, vegetable oil, wheat, salt
4indian13162black pepper, shallots, cornflour, cayenne pep...
\n", "
" ], "text/plain": [ " cuisine id ingredient_list\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,...\n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr...\n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3 indian 22213 water, vegetable oil, wheat, salt\n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep..." ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def make_label(cuisine):\n", " if cuisine == \"italian\":\n", " return 1\n", " else:\n", " return 0" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabel
0greek10259romaine lettuce, black olives, grape tomatoes,...0
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0
3indian22213water, vegetable oil, wheat, salt0
4indian13162black pepper, shallots, cornflour, cayenne pep...0
5jamaican6602plain flour, sugar, butter, eggs, fresh ginger...0
6spanish42779olive oil, salt, medium shrimp, pepper, garlic...0
7italian3735sugar, pistachio nuts, white almond bark, flou...1
8mexican16903olive oil, purple onion, fresh pineapple, pork...0
9italian12734chopped tomatoes, fresh basil, garlic, extra-v...1
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "3 indian 22213 water, vegetable oil, wheat, salt \n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep... \n", "5 jamaican 6602 plain flour, sugar, butter, eggs, fresh ginger... \n", "6 spanish 42779 olive oil, salt, medium shrimp, pepper, garlic... \n", "7 italian 3735 sugar, pistachio nuts, white almond bark, flou... \n", "8 mexican 16903 olive oil, purple onion, fresh pineapple, pork... \n", "9 italian 12734 chopped tomatoes, fresh basil, garlic, extra-v... \n", "\n", " label \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "5 0 \n", "6 0 \n", "7 1 \n", "8 0 \n", "9 1 " ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'] = df['cuisine'].apply(make_label)\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting our features into numbers\n", "\n", "**Feature selection:** The process of selecting the features that matter, in this case - what ingredients do we want to look at?\n", "\n", "Our feature is going to be: whether it has spaghetti or not" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powder
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0FalseFalse
3indian22213water, vegetable oil, wheat, salt0FalseFalse
4indian13162black pepper, shallots, cornflour, cayenne pep...0FalseFalse
5jamaican6602plain flour, sugar, butter, eggs, fresh ginger...0FalseFalse
6spanish42779olive oil, salt, medium shrimp, pepper, garlic...0FalseFalse
7italian3735sugar, pistachio nuts, white almond bark, flou...1FalseFalse
8mexican16903olive oil, purple onion, fresh pineapple, pork...0FalseFalse
9italian12734chopped tomatoes, fresh basil, garlic, extra-v...1FalseFalse
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "3 indian 22213 water, vegetable oil, wheat, salt \n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep... \n", "5 jamaican 6602 plain flour, sugar, butter, eggs, fresh ginger... \n", "6 spanish 42779 olive oil, salt, medium shrimp, pepper, garlic... \n", "7 italian 3735 sugar, pistachio nuts, white almond bark, flou... \n", "8 mexican 16903 olive oil, purple onion, fresh pineapple, pork... \n", "9 italian 12734 chopped tomatoes, fresh basil, garlic, extra-v... \n", "\n", " label has_spaghetti has_curry_powder \n", "0 0 False False \n", "1 0 False False \n", "2 0 False False \n", "3 0 False False \n", "4 0 False False \n", "5 0 False False \n", "6 0 False False \n", "7 1 False False \n", "8 0 False False \n", "9 1 False False " ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['has_spaghetti'] = df['ingredient_list'].str.contains(\"spaghetti\")\n", "df['has_curry_powder'] = df['ingredient_list'].str.contains(\"curry powder\")\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's run our tests\n", "\n", "Let's feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!\n", "\n", "### Looking at our labels\n", "\n", "We stored it in `label`, and if it's `0` it's not italian, if it's `1` it is Italian" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "Name: label, dtype: int64" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].head()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### Looking at our features\n", "\n", "We have two features: `has_spaghetti` and `has_curry_powder`." ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghettihas_curry_powder
0FalseFalse
1FalseFalse
2FalseFalse
3FalseFalse
4FalseFalse
\n", "
" ], "text/plain": [ "   has_spaghetti has_curry_powder\n", "0          False            False\n", "1          False            False\n", "2          False            False\n", "3          False            False\n", "4          False            False" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['has_spaghetti', 'has_curry_powder']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now let's finally do this" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We need to split into training and testing data\n", "# (in newer versions of scikit-learn this import moved to sklearn.model_selection)\n", "from sklearn.cross_validation import train_test_split" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Splitting into...\n", "# X = all our features\n", "# y = all our labels\n", "# X_train are our features to train on (80%)\n", "# y_train are our labels to train on (80%)\n", "# X_test are our features to test on (20%)\n", "# y_test are our labels to test on (20%)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES\n", "    df['label'], # the second parameter is the LABEL (0 = not italian, 1 = italian)\n", "    test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghettihas_curry_powder
18816FalseFalse
30480FalseFalse
19110FalseFalse
29312FalseFalse
23782FalseFalse
7907FalseFalse
2456FalseFalse
27221FalseFalse
5228FalseFalse
37623FalseFalse
18641FalseFalse
9029FalseFalse
5549FalseFalse
21559FalseFalse
12168FalseFalse
4836FalseFalse
3560FalseFalse
21578FalseFalse
33579FalseFalse
5965FalseFalse
32581FalseFalse
5274FalseFalse
1544FalseFalse
10885FalseFalse
22168FalseFalse
29798FalseFalse
31228FalseFalse
4636FalseFalse
38889FalseFalse
39444FalseFalse
.........
19956FalseFalse
14863FalseFalse
8335FalseFalse
21372FalseFalse
8720FalseFalse
11752FalseFalse
10551FalseFalse
37474FalseFalse
7905FalseFalse
5923FalseFalse
14526FalseFalse
673FalseFalse
30444FalseFalse
15322FalseFalse
5476FalseFalse
37545FalseFalse
32634FalseFalse
36936FalseFalse
18970FalseFalse
5622FalseFalse
10731FalseFalse
37097FalseFalse
5822FalseFalse
35856FalseFalse
7579FalseFalse
27918FalseFalse
8601FalseFalse
5245FalseFalse
39665FalseFalse
13013FalseFalse
\n", "

31819 rows × 2 columns

\n", "
" ], "text/plain": [ " has_spaghetti has_curry_powder\n", "18816 False False\n", "30480 False False\n", "19110 False False\n", "29312 False False\n", "23782 False False\n", "7907 False False\n", "2456 False False\n", "27221 False False\n", "5228 False False\n", "37623 False False\n", "18641 False False\n", "9029 False False\n", "5549 False False\n", "21559 False False\n", "12168 False False\n", "4836 False False\n", "3560 False False\n", "21578 False False\n", "33579 False False\n", "5965 False False\n", "32581 False False\n", "5274 False False\n", "1544 False False\n", "10885 False False\n", "22168 False False\n", "29798 False False\n", "31228 False False\n", "4636 False False\n", "38889 False False\n", "39444 False False\n", "... ... ...\n", "19956 False False\n", "14863 False False\n", "8335 False False\n", "21372 False False\n", "8720 False False\n", "11752 False False\n", "10551 False False\n", "37474 False False\n", "7905 False False\n", "5923 False False\n", "14526 False False\n", "673 False False\n", "30444 False False\n", "15322 False False\n", "5476 False False\n", "37545 False False\n", "32634 False False\n", "36936 False False\n", "18970 False False\n", "5622 False False\n", "10731 False False\n", "37097 False False\n", "5822 False False\n", "35856 False False\n", "7579 False False\n", "27918 False False\n", "8601 False False\n", "5245 False False\n", "39665 False False\n", "13013 False False\n", "\n", "[31819 rows x 2 columns]" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Oh hey, it's just our features from the dataframe\n", "X_train" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghettihas_curry_powder
23827FalseFalse
24607FalseFalse
16829FalseFalse
6473FalseFalse
23662FalseFalse
19742FalseFalse
37244FalseFalse
19552FalseFalse
6361FalseFalse
6786FalseFalse
27241FalseFalse
9034FalseFalse
34423FalseFalse
33399FalseFalse
19641FalseTrue
15389FalseTrue
11627FalseFalse
25811FalseFalse
22079FalseFalse
5254FalseFalse
22499FalseFalse
18948FalseFalse
13672FalseFalse
31390FalseFalse
26623FalseFalse
36470FalseFalse
14916FalseFalse
22337FalseFalse
27339FalseFalse
38540FalseFalse
.........
3409FalseFalse
38281FalseFalse
12014FalseFalse
10908FalseFalse
4647FalseFalse
22629FalseFalse
32925FalseFalse
20743FalseFalse
25604FalseFalse
34821FalseFalse
38273FalseFalse
24241FalseFalse
28217FalseFalse
25094FalseFalse
9433FalseFalse
3755FalseFalse
12877FalseFalse
37839FalseFalse
30193FalseFalse
5866FalseFalse
22191FalseFalse
29451FalseTrue
29878FalseFalse
26103FalseFalse
9126FalseFalse
32127FalseFalse
34047FalseFalse
3324FalseFalse
31076FalseFalse
104FalseFalse
\n", "

7955 rows × 2 columns

\n", "
" ], "text/plain": [ " has_spaghetti has_curry_powder\n", "23827 False False\n", "24607 False False\n", "16829 False False\n", "6473 False False\n", "23662 False False\n", "19742 False False\n", "37244 False False\n", "19552 False False\n", "6361 False False\n", "6786 False False\n", "27241 False False\n", "9034 False False\n", "34423 False False\n", "33399 False False\n", "19641 False True\n", "15389 False True\n", "11627 False False\n", "25811 False False\n", "22079 False False\n", "5254 False False\n", "22499 False False\n", "18948 False False\n", "13672 False False\n", "31390 False False\n", "26623 False False\n", "36470 False False\n", "14916 False False\n", "22337 False False\n", "27339 False False\n", "38540 False False\n", "... ... ...\n", "3409 False False\n", "38281 False False\n", "12014 False False\n", "10908 False False\n", "4647 False False\n", "22629 False False\n", "32925 False False\n", "20743 False False\n", "25604 False False\n", "34821 False False\n", "38273 False False\n", "24241 False False\n", "28217 False False\n", "25094 False False\n", "9433 False False\n", "3755 False False\n", "12877 False False\n", "37839 False False\n", "30193 False False\n", "5866 False False\n", "22191 False False\n", "29451 False True\n", "29878 False False\n", "26103 False False\n", "9126 False False\n", "32127 False False\n", "34047 False False\n", "3324 False False\n", "31076 False False\n", "104 False False\n", "\n", "[7955 rows x 2 columns]" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# X is always the features, whether it's for training or for testing\n", "X_test" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "31819" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X_train)" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false }, "outputs": [ { 
"data": { "text/plain": [ "7955" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X_test)" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We're testing on ~8000 and training on ~32000" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "18816 0\n", "30480 0\n", "19110 0\n", "29312 1\n", "23782 0\n", "7907 0\n", "2456 0\n", "27221 0\n", "5228 0\n", "37623 0\n", "18641 0\n", "9029 0\n", "5549 0\n", "21559 0\n", "12168 0\n", "4836 0\n", "3560 0\n", "21578 0\n", "33579 0\n", "5965 0\n", "32581 0\n", "5274 1\n", "1544 1\n", "10885 0\n", "22168 0\n", "29798 1\n", "31228 0\n", "4636 1\n", "38889 0\n", "39444 0\n", " ..\n", "19956 0\n", "14863 1\n", "8335 0\n", "21372 1\n", "8720 0\n", "11752 0\n", "10551 1\n", "37474 1\n", "7905 1\n", "5923 1\n", "14526 1\n", "673 0\n", "30444 0\n", "15322 0\n", "5476 0\n", "37545 1\n", "32634 0\n", "36936 0\n", "18970 0\n", "5622 0\n", "10731 0\n", "37097 0\n", "5822 0\n", "35856 0\n", "7579 0\n", "27918 1\n", "8601 0\n", "5245 0\n", "39665 0\n", "13013 0\n", "Name: label, dtype: int64" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# y_train is our labels that we are training one\n", "y_train" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "23827 0\n", "24607 0\n", "16829 1\n", "6473 0\n", "23662 0\n", "19742 0\n", "37244 0\n", "19552 1\n", "6361 0\n", "6786 0\n", "27241 0\n", "9034 1\n", "34423 1\n", "33399 0\n", "19641 0\n", "15389 0\n", "11627 0\n", "25811 0\n", "22079 1\n", "5254 0\n", "22499 0\n", "18948 1\n", "13672 1\n", "31390 0\n", "26623 1\n", "36470 0\n", "14916 1\n", "22337 0\n", "27339 0\n", "38540 0\n", " ..\n", "3409 1\n", "38281 0\n", "12014 1\n", "10908 0\n", "4647 0\n", "22629 0\n", "32925 0\n", 
"20743    0\n", "25604    1\n", "34821    0\n", "38273    1\n", "24241    1\n", "28217    0\n", "25094    0\n", "9433     0\n", "3755     1\n", "12877    0\n", "37839    0\n", "30193    0\n", "5866     0\n", "22191    0\n", "29451    0\n", "29878    0\n", "26103    0\n", "9126     0\n", "32127    0\n", "34047    0\n", "3324     0\n", "31076    0\n", "104      0\n", "Name: label, dtype: int64" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# And y_test is the labels we're testing on\n", "y_test" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of training labels: 31819\n", "Length of testing labels: 7955\n", "Length of training features: 31819\n", "Length of testing features: 7955\n" ] } ], "source": [ "print(\"Length of training labels:\", len(y_train))\n", "print(\"Length of testing labels:\", len(y_test))\n", "print(\"Length of training features:\", len(X_train))\n", "print(\"Length of testing features:\", len(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically all that happened was `train_test_split` took our nice dataframe where everything was together and split it into two groups of two: our labels vs. our features, and our training data vs. our testing data."
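] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just to make the mechanics concrete, here is a minimal sketch of the same split on a tiny made-up dataframe (the `df_toy` frame and the `random_state` seed are illustrative assumptions, not part of the class data; in newer scikit-learn versions `train_test_split` lives in `sklearn.model_selection` instead of `sklearn.cross_validation`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A hypothetical 100-row stand-in for the recipes dataframe
df_toy = pd.DataFrame({'has_spaghetti': [True, False] * 50,
                       'label': [1, 0] * 50})

X_train, X_test, y_train, y_test = train_test_split(
    df_toy[['has_spaghetti']],  # features
    df_toy['label'],            # labels
    test_size=0.2,              # hold out 20% of the rows for testing
    random_state=0)             # fixed seed so the split is repeatable

print(len(X_train), len(X_test))  # 80 20
```

With 100 rows and `test_size=0.2` you always get 80 training rows and 20 testing rows - the same 80/20 split as above, just small enough to see.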
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Back to actually doing our fitting etc." ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Splitting into...\n", "# X = all our features\n", "# y = all our labels\n", "# X_train are our features to train on (80%)\n", "# y_train are our labels to train on (80%)\n", "# X_test are our features to test on (20%)\n", "# y_test are our labels to test on (20%)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES\n", "    df['label'], # the second parameter is the LABEL (0 = not italian, 1 = italian)\n", "    test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import naive_bayes to get access to ALL kinds of naive bayes classifiers\n", "# But REMEMBER we're using Bernoulli because it's for true/false features,\n", "# which is fine for short passages\n", "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Feed the classifier two things:\n", "# * our training features (X_train)\n", "# * our training labels (y_train)\n", "# To help it study for the exam later when we test it\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, ..., 0, 0, 0])" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This looks ugly, but it's the predicted label for every recipe in X_test\n", "# All those zeroes = not italian\n", "# We know the first three aren't italian 
and the last three aren't italian\n", "clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.81083629278104274" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Naive Bayes can't overfit, really\n", "# It can't \"study too hard\" it can't \"memorize the questions\"\n", "# (a decision tree can)\n", "# So if we give it the training data back it will get some wrong\n", "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.80905091137649277" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "italian 7838\n", "mexican 6438\n", "southern_us 4320\n", "indian 3003\n", "chinese 2673\n", "french 2646\n", "cajun_creole 1546\n", "thai 1539\n", "japanese 1423\n", "greek 1175\n", "spanish 989\n", "korean 830\n", "vietnamese 825\n", "moroccan 821\n", "british 804\n", "filipino 755\n", "irish 667\n", "jamaican 526\n", "russian 489\n", "brazilian 467\n", "Name: cuisine, dtype: int64" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['cuisine'].value_counts()" ] }, { "cell_type": "code", "execution_count": 211, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 True\n", "28 False\n", "29 False\n", 
" ... \n", "39744 False\n", "39745 False\n", "39746 False\n", "39747 False\n", "39748 False\n", "39749 False\n", "39750 False\n", "39751 False\n", "39752 False\n", "39753 False\n", "39754 False\n", "39755 False\n", "39756 False\n", "39757 False\n", "39758 False\n", "39759 False\n", "39760 False\n", "39761 False\n", "39762 False\n", "39763 False\n", "39764 False\n", "39765 False\n", "39766 False\n", "39767 True\n", "39768 False\n", "39769 False\n", "39770 False\n", "39771 False\n", "39772 False\n", "39773 False\n", "Name: has_spaghetti, dtype: bool" ] }, "execution_count": 211, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['has_spaghetti']" ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
has_spaghetti
0False
1False
2False
3False
4False
5False
6False
7False
8False
9False
10False
11False
12False
13False
14False
15False
16False
17False
18False
19False
20False
21False
22False
23False
24False
25False
26False
27True
28False
29False
......
39744False
39745False
39746False
39747False
39748False
39749False
39750False
39751False
39752False
39753False
39754False
39755False
39756False
39757False
39758False
39759False
39760False
39761False
39762False
39763False
39764False
39765False
39766False
39767True
39768False
39769False
39770False
39771False
39772False
39773False
\n", "

39774 rows × 1 columns

\n", "
" ], "text/plain": [ " has_spaghetti\n", "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 True\n", "28 False\n", "29 False\n", "... ...\n", "39744 False\n", "39745 False\n", "39746 False\n", "39747 False\n", "39748 False\n", "39749 False\n", "39750 False\n", "39751 False\n", "39752 False\n", "39753 False\n", "39754 False\n", "39755 False\n", "39756 False\n", "39757 False\n", "39758 False\n", "39759 False\n", "39760 False\n", "39761 False\n", "39762 False\n", "39763 False\n", "39764 False\n", "39765 False\n", "39766 False\n", "39767 True\n", "39768 False\n", "39769 False\n", "39770 False\n", "39771 False\n", "39772 False\n", "39773 False\n", "\n", "[39774 rows x 1 columns]" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#df[['has_spaghetti', 'has_curry_powder']]\n", "df[['has_spaghetti']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 155, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powder
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0FalseFalse
3indian22213water, vegetable oil, wheat, salt0FalseFalse
4indian13162black pepper, shallots, cornflour, cayenne pep...0FalseFalse
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "3 indian 22213 water, vegetable oil, wheat, salt \n", "4 indian 13162 black pepper, shallots, cornflour, cayenne pep... \n", "\n", " label has_spaghetti has_curry_powder \n", "0 0 False False \n", "1 0 False False \n", "2 0 False False \n", "3 0 False False \n", "4 0 False False " ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wow, we did a really great job! Let's try another cuisine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Preparing our data\n", "\n", "### Creating labels that scikit-learn can use\n", "\n", "Our cuisine is , so we'll do `0` and `1` as to whether it's that cuisine or not " ] }, { "cell_type": "code", "execution_count": 156, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make_label(cuisine):\n", " if cuisine == \"brazilian\":\n", " return 1\n", " else:\n", " return 0\n", "\n", "df['is_brazilian'] = df['cuisine'].apply(make_label)" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powderis_brazilian
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse0
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse0
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "\n", " label has_spaghetti has_curry_powder is_brazilian \n", "0 0 False False 0 \n", "1 0 False False 0 " ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating features that scikit-learn can use\n", "\n", "It's Bernoulli Naive Bayes, so it's `True` and `False`" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['has_water'] = df['ingredient_list'].str.contains('water')\n", "df['has_salt'] = df['ingredient_list'].str.contains('salt')" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powderis_brazilianhas_waterhas_salt
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse0FalseFalse
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse0FalseTrue
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "\n", " label has_spaghetti has_curry_powder is_brazilian has_water has_salt \n", "0 0 False False 0 False False \n", "1 0 False False 0 False True " ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Create the test/train split" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['has_water', 'has_salt']], # the first is our FEATURES\n", " df['is_brazilian'], # the second parameter is the LABEL (this is 0/1, not italian/italian)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Create classifier, train and test" ] }, { "cell_type": "code", "execution_count": 161, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Fit with our training data\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 162, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.98821458876771739" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "0.9884349465744815" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dummy Classifier to see baseline performance\n", "\n", "A dummy classifier doesn't learn anything - it just always predicts the most common label. Any real classifier needs to beat it." ] }, { "cell_type": "code", "execution_count": 164, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.dummy import DummyClassifier" ] }, { "cell_type": "code", "execution_count": 165, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DummyClassifier(constant=None, random_state=None, strategy='most_frequent')" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'most_frequent' means \"always guess the most common label\"\n", "dummy_clf = DummyClassifier(strategy='most_frequent')\n", "\n", "# Fit with our training data\n", "dummy_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 166, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.98821458876771739" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9884349465744815" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# We just got destroyed by math: let's actually understand Naive Bayes\n", "\n", "Naive Bayes gives you back a probability for each possible label - so, the % chance that it's brazilian vs. the % chance that it is not brazilian. 
We'll use this to see what went wrong.\n", "\n", "**Math stuff**\n", "\n", "Naive Bayes is all about calculating the probability of \"B given A\", a.k.a. the chance of B being true if A is true.\n", "\n", "* **Bayes' Theorem:** `P(B|A) = P(A and B)/P(A)`\n", "\n", "* `P(A)` means \"what is the probability of A being true?\"\n", "* `P(B|A)` means \"if A is true, what is the probability of B being true?\"\n", "* `P(A and B)` means \"what is the probability of both A and B being true?\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: We have a recipe and it has water in it. Is it brazilian?\n", "\n", "**Hypothesis one: the recipe is brazilian**\n", "\n", "* `P(B|A)` would be \"if it contains water, what is the chance that it is brazilian cuisine?\"\n", "* `P(A and B)` would be \"what is the chance that it contains both water and is brazilian?\"\n", "* `P(A)` would be \"what is the chance that this contains water?\"" ] }, { "cell_type": "code", "execution_count": 168, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# P(B|A) = P(A and B)/P(A)" ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "109" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# Probability that a recipe has water and is brazilian\n", "\n", "# How many recipes have water AND are brazilian?\n", "len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "39774" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The total number of recipes (careful: len() counts EVERY row here,\n", "# not just the recipes with water)\n", "# P(A) is the same for both hypotheses, so dividing both counts by this\n", "# same total still lets us compare them fairly\n", "len(df['has_water'])" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": false }, 
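The two-hypothesis comparison can be squeezed into one self-contained sketch of Bayes' theorem. The counts below are made-up toy numbers for illustration, not the real recipes.csv counts:

```python
# Toy counts, NOT the real recipe data - just to show the arithmetic
total_recipes = 1000
water_and_brazilian = 3        # have water AND are brazilian
water_and_not_brazilian = 240  # have water AND are NOT brazilian

# P(A): the chance a recipe contains water at all
p_water = (water_and_brazilian + water_and_not_brazilian) / total_recipes

# P(B|A) = P(A and B) / P(A) for each hypothesis
p_brazilian_given_water = (water_and_brazilian / total_recipes) / p_water
p_not_brazilian_given_water = (water_and_not_brazilian / total_recipes) / p_water

# The two hypotheses always add up to 100%,
# and "not brazilian" wins in a landslide
print(p_brazilian_given_water, p_not_brazilian_given_water)
```

Because `P(A)` is identical for both hypotheses, dividing by it never changes which hypothesis wins - which is why comparing the raw `P(A and B)` counts is enough.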
"outputs": [ { "data": { "text/plain": [ "0.0027404837330919697" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# The chance that a recipe has water AND is brazilian\n", "# (dividing by the true P(A) would scale both hypotheses by the same\n", "# amount, so comparing these numbers is enough)\n", "109/39774" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hypothesis two: the recipe is NOT brazilian**\n", "\n", "* `P(B|A)` would be \"if it contains water, what is the chance that it is NOT brazilian cuisine?\"\n", "* `P(A and B)` would be \"what is the chance that it contains both water and is NOT brazilian?\"\n", "* `P(A)` would be \"what is the chance that this contains water?\"" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9385" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# Probability that a recipe has water and is NOT brazilian\n", "\n", "# How many recipes have water AND are NOT brazilian?\n", "len(df[(df['has_water']) & (df['cuisine'] != 'brazilian')])" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "39774" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The total number of recipes again (our shared denominator)\n", "len(df['has_water'])" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.2359581636244783" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(A and B)\n", "# The chance that a recipe has water AND is NOT brazilian -\n", "# almost a hundred times bigger than the brazilian version\n", "9385/39774" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What this boils down to\n", "\n", "No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? 
Doesn't really matter, it's probably not brazilian.\n" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "467" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df['cuisine'] == 'brazilian'])" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "39774" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 177, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.011741338562880274" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only a little bit over 1% of our recipes are brazilian,\n", "# so even though it ALWAYS says \"not brazilian\", it's usually right\n", "467/39774" ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9882586614371197" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 - 467/39774" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's fix up our labels\n", "\n", "Before we had this:\n", "\n", " def make_label(cuisine):\n", " if cuisine == \"brazilian\":\n", " return 1\n", " else:\n", " return 0\n", "\n", "which does not scale well. If we wanted to add more cuisines, we'd need to keep adding else-ifs again and again until our fingers fell off. And we'd probably misspell something. 
And if we're anything, it's LAZY.\n", "\n", "## LabelEncoder to the rescue: Converts categories into numeric labels" ] }, { "cell_type": "code", "execution_count": 179, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "le = preprocessing.LabelEncoder()" ] }, { "cell_type": "code", "execution_count": 180, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# LabelEncoder has two parts: FIT and TRANSFORM\n", "# FIT learns all of the possible labels\n", "# TRANSFORM takes a list of categories and converts them into numbers" ] }, { "cell_type": "code", "execution_count": 181, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LabelEncoder()" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Teach the label encoder all of the possible labels\n", "# It doesn't care about duplicates \n", "le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])" ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 3])" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the labels out as numbers\n", "le.transform(['orange', 'blue', 'yellow'])" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LabelEncoder()" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Send the label encoder each and every cuisine\n", "le.fit(df['cuisine'])" ] }, { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 6, 16, 4, ..., 8, 3, 13])" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "le.transform(df['cuisine'])" ] }, { "cell_type": "code", "execution_count": 185, "metadata": { 
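One thing worth knowing about `LabelEncoder`: it can also translate the numbers back. A minimal round-trip sketch using the same toy colors as above:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

# classes_ is the sorted list of unique labels the encoder learned
print(list(le.classes_))  # ['blue', 'orange', 'red', 'yellow']

# transform: strings -> numbers; inverse_transform: numbers -> strings
codes = le.transform(['orange', 'blue', 'yellow'])
print(list(le.inverse_transform(codes)))  # ['orange', 'blue', 'yellow']
```

That's why `'orange'` came out as `1` above: the classes are numbered in alphabetical order, and `inverse_transform` is the way to turn predictions back into readable cuisine names.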
"collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cuisineidingredient_listlabelhas_spaghettihas_curry_powderis_brazilianhas_waterhas_saltcuisine_label
0greek10259romaine lettuce, black olives, grape tomatoes,...0FalseFalse0FalseFalse6
1southern_us25693plain flour, ground pepper, salt, tomatoes, gr...0FalseFalse0FalseTrue16
2filipino20130eggs, pepper, salt, mayonaise, cooking oil, gr...0FalseFalse0FalseTrue4
\n", "
" ], "text/plain": [ " cuisine id ingredient_list \\\n", "0 greek 10259 romaine lettuce, black olives, grape tomatoes,... \n", "1 southern_us 25693 plain flour, ground pepper, salt, tomatoes, gr... \n", "2 filipino 20130 eggs, pepper, salt, mayonaise, cooking oil, gr... \n", "\n", " label has_spaghetti has_curry_powder is_brazilian has_water has_salt \\\n", "0 0 False False 0 False False \n", "1 0 False False 0 False True \n", "2 0 False False 0 False True \n", "\n", " cuisine_label \n", "0 6 \n", "1 16 \n", "2 4 " ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['cuisine_label'] = le.transform(df['cuisine'])\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's train and test with our new labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 186, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['has_water', 'has_salt']], # the first is our FEATURES\n", " df['cuisine_label'], # the second parameter is the LABEL (0-16, southern us, brazilian, anything really)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Learn how related every cuisine is to water and salt\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
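Earlier we said Naive Bayes hands back a probability for every possible label; `predict_proba` is how you actually look at those probabilities. A self-contained sketch on made-up true/false features (not our recipe data):

```python
import numpy as np
from sklearn import naive_bayes

# Toy data: two true/false features, three classes (0, 1, 2)
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 2, 0, 1])

clf = naive_bayes.BernoulliNB()
clf.fit(X, y)

# One row per sample, one column per class; each row sums to 1
probs = clf.predict_proba([[1, 0]])
print(clf.classes_)  # the column order of the probabilities
print(probs)
```

`.predict()` just picks the column with the highest probability, so `predict_proba` is useful when you want to see *how confident* the classifier was, not just its final answer.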
"text/plain": [ "0.19840346962506678" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.20251414204902576" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's add some more features to see if we can do a better job\n", "\n", "Right now we're only looking at water and salt, which don't tell you much. Tortillas, cumin, or soy sauce would tell you a lot more." ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['has_miso'] = df['ingredient_list'].str.contains(\"miso\")\n", "df['has_soy_sauce'] = df['ingredient_list'].str.contains(\"soy sauce\")\n", "df['has_cilantro'] = df['ingredient_list'].str.contains(\"cilantro\")\n", "df['has_black_olives'] = df['ingredient_list'].str.contains(\"black olives\")\n", "df['has_tortillas'] = df['ingredient_list'].str.contains(\"tortillas\")\n", "df['has_turmeric'] = df['ingredient_list'].str.contains(\"turmeric\")\n", "df['has_pistachios'] = df['ingredient_list'].str.contains(\"pistachios\")\n", "df['has_lemongrass'] = df['ingredient_list'].str.contains(\"lemongrass\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our new feature set is!!! 
`df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']]`" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES\n", " df['cuisine_label'], # the second parameter is the LABEL (0-19: southern us, brazilian, anything really)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import naive_bayes\n", "\n", "# Create a Bernoulli Naive Bayes classifier\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Learn how related every cuisine is to each of our ingredient features\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.37232471165027187" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 194, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.36379635449402892" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# This is taking forever, please let there be an automatic way to pick out all of the words" ] }, { "cell_type": "code", 
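Before jumping to a fully automatic approach: those eight `has_...` lines are the same statement repeated, so hypothetically you could generate them in a loop. A sketch on a tiny stand-in DataFrame (not recipes.csv):

```python
import pandas as pd

# Stand-in data: two fake recipes, not the real dataset
df = pd.DataFrame({'ingredient_list': [
    'soy sauce, miso, scallions',
    'tortillas, cilantro, lime',
]})

ingredients = ['miso', 'soy sauce', 'cilantro', 'tortillas']
for ingredient in ingredients:
    # same as df['has_soy_sauce'] = df['ingredient_list'].str.contains('soy sauce'), etc.
    column = 'has_' + ingredient.replace(' ', '_')
    df[column] = df['ingredient_list'].str.contains(ingredient)

feature_columns = ['has_' + i.replace(' ', '_') for i in ingredients]
print(df[feature_columns])
```

Adding a new ingredient is now one string in a list instead of another copy-pasted line - fewer chances to misspell a column name.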
"execution_count": 195, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# STEP ONE: .fit to learn all of the words\n", "# STEP TWO: .transform to turn a sentence into numbers\n", "\n", "#vectorizer = CountVectorizer()\n", "# So now 'olive' and 'oil' and 'olive oil' instead of just 'olive' and 'oil'\n", "# Only pick the top 3000 most frequent ngrams\n", "vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)" ] }, { "cell_type": "code", "execution_count": 196, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=3000, min_df=1,\n", " ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We have some sentences\n", "# We're going to feed it to the vectorizer\n", "# and it's going to learn all of the words\n", "sentences = [\n", " \"cats are cool\",\n", " \"dogs are cool\"\n", "]\n", "vectorizer.fit(sentences)" ] }, { "cell_type": "code", "execution_count": 197, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<2x7 sparse matrix of type ''\n", "\twith 10 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We're going to take some sentences and feed it to the vectorizer\n", "# and its' going to convert it into numbers\n", "vectorizer.transform(sentences)" ] }, { "cell_type": "code", "execution_count": 198, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 1, 1, 0, 0],\n", " [1, 1, 0, 0, 1, 1, 1]])" ] }, 
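To see exactly which unigrams and bigrams the vectorizer learned from those two sentences, you can peek at its `vocabulary_` dict (it maps each term to its column number); `fit_transform` does the fit and transform steps in one call:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "cats are cool",
    "dogs are cool",
]

# ngram_range=(1, 2): single words AND two-word phrases
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(sentences)

# Four unigrams plus three bigrams = the 7 columns of the matrix
print(sorted(vectorizer.vocabulary_))
# -> ['are', 'are cool', 'cats', 'cats are', 'cool', 'dogs', 'dogs are']
print(matrix.toarray())
```

This matches the 2x7 matrix shown above: for "cats are cool" the first five columns light up and the `dogs` columns stay zero.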
"execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# But it looks bad to look at so I'll use .toarray()\n", "vectorizer.transform(sentences).toarray()" ] }, { "cell_type": "code", "execution_count": 199, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 romaine lettuce, black olives, grape tomatoes,...\n", "1 plain flour, ground pepper, salt, tomatoes, gr...\n", "2 eggs, pepper, salt, mayonaise, cooking oil, gr...\n", "3 water, vegetable oil, wheat, salt\n", "4 black pepper, shallots, cornflour, cayenne pep...\n", "Name: ingredient_list, dtype: object" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In our case, our text is the list of ingredients. We can get it through\n", "df['ingredient_list'].head()" ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=3000, min_df=1,\n", " ngram_range=(1, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dear vectorizer, please learn all of these words\n", "vectorizer.fit(df['ingredient_list'])" ] }, { "cell_type": "code", "execution_count": 201, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<39774x3000 sparse matrix of type ''\n", "\twith 1243216 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 201, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dear vectorizer, please convert ingredient_list into features\n", "# That we can do machine learning on\n", "\n", "every_single_word_features = 
vectorizer.transform(df['ingredient_list'])\n", "every_single_word_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Now let's try with our new complete labels and our new complete features that includes every single word" ] }, { "cell_type": "code", "execution_count": 202, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " every_single_word_features,\n", " df['cuisine_label'], # the second parameter is the LABEL (0-16, southern us, brazilian, anything really)\n", " test_size=0.2) # 80% training, 20% testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# This is Naive Bayes with every word as a feature pushed through the CountVectorizer" ] }, { "cell_type": "code", "execution_count": 203, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Naive Bayes\n", "CPU times: user 55.8 ms, sys: 17.2 ms, total: 73 ms\n", "Wall time: 109 ms\n", "Training score: (stuff it already knows) 0.714384487256\n", "Testing score: (stuff it hasn't seen before): 0.680578252671\n" ] } ], "source": [ "print(\"This is Naive Bayes\")\n", "\n", "from sklearn import naive_bayes\n", "clf = naive_bayes.BernoulliNB()\n", "%time clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data?\n", "print(\"Testing score: (stuff it hasn't seen before):\", clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# But maybe it's just chance? 
Let's try the Dummy Classifier" ] }, { "cell_type": "code", "execution_count": 210, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the Dummy Classifier\n", "CPU times: user 2.58 ms, sys: 397 µs, total: 2.98 ms\n", "Wall time: 2.41 ms\n", "Training score: (stuff it already knows) 0.100254564883\n", "Testing score: (stuff it hasn't seen before): 0.0999371464488\n" ] } ], "source": [ "from sklearn.dummy import DummyClassifier\n", "\n", "print(\"This is the Dummy Classifier\")\n", "\n", "# The dummy ignores the features and just guesses based on how common each class is,\n", "# so its score is our 'random chance' baseline\n", "dummy_clf = DummyClassifier()\n", "%time dummy_clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", dummy_clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data?\n", "print(\"Testing score: (stuff it hasn't seen before):\", dummy_clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# This is a Decision Tree with every single feature from the CountVectorizer" ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a Decision Tree\n", "CPU times: user 15.4 s, sys: 340 ms, total: 15.8 s\n", "Wall time: 19.7 s\n", "Training score: (stuff it already knows) 0.999780005657\n", "Testing score: (stuff it hasn't seen before): 0.638592080453\n" ] } ], "source": [ "print(\"This is a Decision Tree\")\n", "\n", "from sklearn import tree\n", "tree_clf = tree.DecisionTreeClassifier()\n", "\n", "%time tree_clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", tree_clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data? A near-perfect training score with a much\n", "# lower testing score means the tree memorized the training data (overfitting)\n", "print(\"Testing score: (stuff it hasn't seen before):\", tree_clf.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": 205, "metadata": { "collapsed": false 
}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a Random Forest\n", "CPU times: user 10 s, sys: 288 ms, total: 10.3 s\n", "Wall time: 13.6 s\n", "Training score: (stuff it already knows) 0.992645903391\n", "Testing score: (stuff it hasn't seen before): 0.706096794469\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "print(\"This is a Random Forest\")\n", "\n", "tree_clf = RandomForestClassifier()\n", "\n", "%time tree_clf.fit(X_train, y_train)\n", "\n", "# How does it do on the training data?\n", "print(\"Training score: (stuff it already knows)\", tree_clf.score(X_train, y_train))\n", "\n", "# How does it do on the testing data?\n", "print(\"Testing score: (stuff it hasn't seen before):\", tree_clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How do you do this in the real world with new data?" ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "collapsed": true }, "outputs": [], "source": [ "every_single_word_features = vectorizer.transform(df['ingredient_list'])\n" ] }, { "cell_type": "code", "execution_count": 207, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<4x3000 sparse matrix of type ''\n", "\twith 35 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 207, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the Naive bayes thing\n", "from sklearn import naive_bayes\n", "clf = naive_bayes.BernoulliNB()\n", "\n", "# Give the classifier EVERYTHING we know, not holding back anything\n", "clf.fit(every_single_word_features, df['cuisine_label'])\n", "\n", "# We have some new stuff we have not categorized\n", "incoming_recipes = [\n", " \"spaghetti tomato sauce garlic onion water\",\n", " \"soy sauce ginger sugar butter\",\n", " \"green papaya thai chilies palm sugar\",\n", " \"butter oil salt black pepper water milk bubblegumpie\"\n", "]\n", "\n", 
"features_for_new_recipes = vectorizer.transform(incoming_recipes)\n", "features_for_new_recipes" ] }, { "cell_type": "code", "execution_count": 208, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 4, 11, 4, 16])" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions = clf.predict(features_for_new_recipes)\n", "predictions" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The predictions are all categories that the labelencoder decided on\n", "# Let's convert those numeric ones back into real fun cuisine words\n", "le.inverse_transform(predictions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }