โ† back to class-06

Every solution to missing data problems in pandas

import pandas as pd

df = pd.read_csv("potato-tweets.csv")
df.head()
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
2 unknown NaN Jaiography
3 unknown And with the potatoes done, the farm is done! ... NaN
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman

Counting missing values

df.info() will give you the number of non-missing values (the non-null count). Up at the top it tells you there are 96 total entries, so it's up to you to do the math yourself.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  96 non-null     object
 1   text       93 non-null     object
 2   user       89 non-null     object
dtypes: object(3)
memory usage: 2.4+ KB

For the user column (the person's username), 96 total rows minus 89 existing usernames = 7 rows missing usernames. I find this hard to follow, though, so here's another option:

df.isna().sum()
sentiment    0
text         3
user         7
dtype: int64

Why does this work?

df.isna() gives you True where a cell is missing data, and False where it is not missing data. It looks like this:

df.isna()
sentiment text user
0 False False False
1 False False False
2 False True False
3 False False True
4 False False False
... ... ... ...
91 False False False
92 False False False
93 False False True
94 False False False
95 False False False

96 rows ร— 3 columns

We can see row 2 is missing text, while rows 3 and 93 are missing a user. When you connect this to .sum(), pandas will add up all of the True values in each row.

df.isna().sum()
sentiment    0
text         3
user         7
dtype: int64

Finding rows with missing data

Although missing data is usually thought of as a bad thing, sometimes you want to find the rows with missing data.

df = pd.read_csv("potato-tweets.csv")
df.head()

The one thing that will not work is trying to do == "NaN". That's because NaN is not a string, it's a special kind of pandas/numpy data that means "this is missing."

# This won't find you any users
df[df.user == 'NaN']
sentiment text user

Instead, to find rows where the data is missing you need to use df.user.isna() (...or df.user.isnull(), which does the exact same thing).

df[df.user.isna()]
sentiment text user
3 unknown And with the potatoes done, the farm is done! ... NaN
6 unknown Potatoes and eggs are just ๐Ÿ‘Œ๐Ÿฝ๐Ÿ‘Œ๐Ÿฝ NaN
8 neutral @LucyENichol โ€œI want to be Smiths crispsโ€ sang... NaN
11 unknown @iamjaxteller kaรง tradede bir likidite oluyorsun? NaN
16 neutral @DebraGetsPilled\nClamps ๐Ÿ˜‰\n\nhttps://t.co/uuW... NaN
23 neutral No offense to all the potatoes , don't get me ... NaN
93 neutral 63 Potatoes NaN

Find rows with ANY missing data

Another option is to find rows with ANY missing data. In this case, we have a few options. We'll start with df.isna() so we can look at what the missing data looks like.

df.isna().head()
sentiment text user
0 False False False
1 False False False
2 False True False
3 False False True
4 False False False

If we have True and False, we can ask them "hey, for each row, are ANY of you missing data?" by using .any(axis=1). The axis=1 part means "do this for each row instead of each column."

df.isna().any(axis=1).head()
0    False
1    False
2     True
3     True
4    False
dtype: bool

And now we can feed that into df[...] to only see the rows with missing data.

df[df.isna().any(axis=1)]
sentiment text user
2 unknown NaN Jaiography
3 unknown And with the potatoes done, the farm is done! ... NaN
6 unknown Potatoes and eggs are just ๐Ÿ‘Œ๐Ÿฝ๐Ÿ‘Œ๐Ÿฝ NaN
8 neutral @LucyENichol โ€œI want to be Smiths crispsโ€ sang... NaN
11 unknown @iamjaxteller kaรง tradede bir likidite oluyorsun? NaN
16 neutral @DebraGetsPilled\nClamps ๐Ÿ˜‰\n\nhttps://t.co/uuW... NaN
19 neutral NaN OdsRasheed
23 neutral No offense to all the potatoes , don't get me ... NaN
48 unknown NaN i_cry_potatoes
93 neutral 63 Potatoes NaN

Removing rows with missing data

Maybe the rows with missing data are getting in the way of your analysis. Let's remove them!

import pandas as pd

df = pd.read_csv("potato-tweets.csv")
df.head()
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
2 unknown NaN Jaiography
3 unknown And with the potatoes done, the farm is done! ... NaN
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman

Remove rows based on all columns

By default, you'll be removing rows where ANY of the columns are missing. I like to print the size of our data before and after dropping missing data to stay cognizant of whether we've made a mistake or whether we're throwing away a large chunk of our data.

print("Before dropping data we were", df.shape)

dropped = df.dropna()
print("After dropping data we are", dropped.shape)

dropped.head()
Before dropping data we were (96, 3)
After dropping data we are (86, 3)
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman
5 positive RT @junedarville: โค๏ธ ๐ƒ๐š๐ฎ๐ฉ๐ก๐ข๐ง๐จ๐ข๐ฌ๐ž ๐๐จ๐ญ๐š๐ญ๐จ๐ž๐ฌ\nโค๏ธ ... myphillymedia
7 neutral RT @HalflingDancer: B/W and Fighter proceed to... Presto_Magician

In this case I removed all rows that included any missing data. I went from 96 total columns down to 86.

Remove rows based on specific columns

Usually you don't want to be so aggressive, though, and instead you only want to drop rows if they're missing data in specific columns. For instance, maybe I only want to remove rows that are missing tweet text.

print("Before dropping data we were", df.shape)

dropped = df.dropna(subset=['text'])
print("After dropping data we are", dropped.shape)

dropped.head()
Before dropping data we were (96, 3)
After dropping data we are (93, 3)
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
3 unknown And with the potatoes done, the farm is done! ... NaN
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman
5 positive RT @junedarville: โค๏ธ ๐ƒ๐š๐ฎ๐ฉ๐ก๐ข๐ง๐จ๐ข๐ฌ๐ž ๐๐จ๐ญ๐š๐ญ๐จ๐ž๐ฌ\nโค๏ธ ... myphillymedia

This time .dropna() only removed three rows! You can also pass a list to subset=, for example in this case I could do subset=['text', 'user'] to pay attention to both the text and user columns.

Converting values to missing

In pandas, NaN is missing data. We can see from the dataset below that row 2 is missing text while row 3 is missing a username.

df = pd.read_csv("potato-tweets.csv")
df.head()
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
2 unknown NaN Jaiography
3 unknown And with the potatoes done, the farm is done! ... NaN
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman

They are NaN beacuse there's nothing in that cell when you read in the time. Sometimes you want other values to be counted as NaN, though! Let's look at sentiment.

df.sentiment.value_counts()
neutral     42
positive    23
negative    17
unknown     14
Name: sentiment, dtype: int64

The value of "unknown" isn't very helpful to us, and NaN usually is a good equivalent for missing or unknown data. If we want to replace "unknown" with NaN, we have two options.

Converting missing data with .replace

First, we can could .replace to replace all unknowns with NaN. In this case, we need to import the numpy library to talk about NaN (the string "NaN" isn't the same as actual missing-data NaN!).

import numpy as np

df.sentiment = df.sentiment.replace("unknown", np.nan)
df.sentiment.value_counts()
neutral     42
positive    23
negative    17
Name: sentiment, dtype: int64

The unknown values have now disappeared! We can confirm they're still there by adding dropna=False to our value counts...

df.sentiment.value_counts(dropna=False)
neutral     42
positive    23
negative    17
NaN         14
Name: sentiment, dtype: int64

...or by just looking at our dataset.

df.head()
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
2 NaN NaN Jaiography
3 NaN And with the potatoes done, the farm is done! ... NaN
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman

Reading in missing data with na_values

Another alternative is to go back to when we read the data in. Using read_csv, you can list values you want pandas to consider as NaN when reading in the dataset.

import pandas as pd

df = pd.read_csv("potato-tweets.csv", na_values=["unknown"])
df.head()
sentiment text user
0 positive Variety is the spice of life, and that's why w... nojolondon
1 neutral la ptite frite dans les potatoes๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜๐Ÿ˜ 8LU3H0UR
2 NaN NaN Jaiography
3 NaN And with the potatoes done, the farm is done! ... NaN
4 neutral @AlacritysWhatev @AriMelber As is the gravy ma... adivawoman

This time every time "unknown" shows up it's automatically converted to NaN by pandas! In this case we hope there's no tweet by a user named "unknown" and no one who just tweeted the text "unknown," but we'll take that risk...

How to fix .str.contains says "Cannot mask with non-boolean array containing NA / NaN values"

If we only want tweets about mashed potatoes, we end up in trouble! We get the error Cannot mask with non-boolean array containing NA / NaN values

df[df.text.str.contains("mashed")]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_5991/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3446 
   3447         # Do we have a (boolean) 1d indexer?
-> 3448         if com.is_bool_indexer(key):
   3449             return self._getitem_bool_array(key)
   3450 

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
    137                     # Don't raise on e.g. ["A", "B", np.nan], see
    138                     #  test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139                     raise ValueError(na_msg)
    140                 return False
    141             return True

ValueError: Cannot mask with non-boolean array containing NA / NaN values

This is because .str.contains does not work with columns that are missing data. Or at least, you need to tell it what to do. In this case, when we have a tweet that is missing text, we want to count that as False.

df[df.text.str.contains("mashed", na=False)]
sentiment text user
9 neutral kurutau mashed potatoes append KuruC_ebooks
21 positive @InfernoMeaCulpa โ€œWhatโ€™s not to understand. So... villainousbvtch
27 positive RT @AriMelber: Are mashed potatoes really โ€œwor... v_vossie
35 negative @FanSidedNHL Some dude tried to do that to me ... RogueChristLord
40 neutral RT @fatfatpankocat: Heaping pile of mashed pot... LurkerWojox
53 positive RT @AriMelber: Are mashed potatoes really โ€œwor... lise_latulippe
60 negative Last year I made; mashed potatoes, baked chick... AshleyDavene
63 neutral RT @fatfatpankocat: Heaping pile of mashed pot... masayuki__san
78 neutral @AriMelber How often are you all eating these ... PeachValleyView
80 negative @AriMelber Mine are๐Ÿ˜‰\n\nI know my granny's sec... LockUpTrumpNow
81 neutral Go head, put some truffle on your mashed potatoes MeechiiMeech

Maybe there are some instances where want to include rows that are missing the data, but 99% of the time you'll be using .str.contains(..., na=False).

โ† back to class-06