Every solution to missing data problems in pandas

import pandas as pd

df = pd.read_csv("potato-tweets.csv")
df.head()

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
2	unknown	NaN	Jaiography
3	unknown	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman

Counting missing values

df.info() will give you the number of non-missing values (the non-null count). Up at the top it tells you there are 96 total entries, so it's up to you to do the math yourself.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  96 non-null     object
 1   text       93 non-null     object
 2   user       89 non-null     object
dtypes: object(3)
memory usage: 2.4+ KB

For the user column (the person's username), 96 total rows minus 89 existing usernames = 7 rows missing usernames. I find this hard to follow, though, so here's another option:

df.isna().sum()

sentiment    0
text         3
user         7
dtype: int64

Why does this work?

df.isna() gives you True where a cell is missing data, and False where it is not missing data. It looks like this:

df.isna()

	sentiment	text	user
0	False	False	False
1	False	False	False
2	False	True	False
3	False	False	True
4	False	False	False
...	...	...	...
91	False	False	False
92	False	False	False
93	False	False	True
94	False	False	False
95	False	False	False

96 rows × 3 columns

We can see row 2 is missing text, while rows 3 and 93 are missing a user. When you connect this to .sum(), pandas will add up all of the True values in each row.

df.isna().sum()

sentiment    0
text         3
user         7
dtype: int64

Finding rows with missing data

Although missing data is usually thought of as a bad thing, sometimes you want to find the rows with missing data.

df = pd.read_csv("potato-tweets.csv")
df.head()

The one thing that will not work is trying to do == "NaN". That's because NaN is not a string, it's a special kind of pandas/numpy data that means "this is missing."

# This won't find you any users
df[df.user == 'NaN']

	sentiment	text	user

Instead, to find rows where the data is missing you need to use df.user.isna() (...or df.user.isnull(), which does the exact same thing).

df[df.user.isna()]

	sentiment	text	user
3	unknown	And with the potatoes done, the farm is done! ...	NaN
6	unknown	Potatoes and eggs are just 👌🏽👌🏽	NaN
8	neutral	@LucyENichol “I want to be Smiths crisps” sang...	NaN
11	unknown	@iamjaxteller kaç tradede bir likidite oluyorsun?	NaN
16	neutral	@DebraGetsPilled\nClamps 😉\n\nhttps://t.co/uuW...	NaN
23	neutral	No offense to all the potatoes , don't get me ...	NaN
93	neutral	63 Potatoes	NaN

Find rows with ANY missing data

Another option is to find rows with ANY missing data. In this case, we have a few options. We'll start with df.isna() so we can look at what the missing data looks like.

df.isna().head()

	sentiment	text	user
0	False	False	False
1	False	False	False
2	False	True	False
3	False	False	True
4	False	False	False

If we have True and False, we can ask them "hey, for each row, are ANY of you missing data?" by using .any(axis=1). The axis=1 part means "do this for each row instead of each column."

df.isna().any(axis=1).head()

0    False
1    False
2     True
3     True
4    False
dtype: bool

And now we can feed that into df[...] to only see the rows with missing data.

df[df.isna().any(axis=1)]

	sentiment	text	user
2	unknown	NaN	Jaiography
3	unknown	And with the potatoes done, the farm is done! ...	NaN
6	unknown	Potatoes and eggs are just 👌🏽👌🏽	NaN
8	neutral	@LucyENichol “I want to be Smiths crisps” sang...	NaN
11	unknown	@iamjaxteller kaç tradede bir likidite oluyorsun?	NaN
16	neutral	@DebraGetsPilled\nClamps 😉\n\nhttps://t.co/uuW...	NaN
19	neutral	NaN	OdsRasheed
23	neutral	No offense to all the potatoes , don't get me ...	NaN
48	unknown	NaN	i_cry_potatoes
93	neutral	63 Potatoes	NaN

Removing rows with missing data

Maybe the rows with missing data are getting in the way of your analysis. Let's remove them!

import pandas as pd

df = pd.read_csv("potato-tweets.csv")
df.head()

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
2	unknown	NaN	Jaiography
3	unknown	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman

Remove rows based on all columns

By default, you'll be removing rows where ANY of the columns are missing. I like to print the size of our data before and after dropping missing data to stay cognizant of whether we've made a mistake or whether we're throwing away a large chunk of our data.

print("Before dropping data we were", df.shape)

dropped = df.dropna()
print("After dropping data we are", dropped.shape)

dropped.head()

Before dropping data we were (96, 3)
After dropping data we are (86, 3)

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman
5	positive	RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ...	myphillymedia
7	neutral	RT @HalflingDancer: B/W and Fighter proceed to...	Presto_Magician

In this case I removed all rows that included any missing data. I went from 96 total columns down to 86.

Remove rows based on specific columns

Usually you don't want to be so aggressive, though, and instead you only want to drop rows if they're missing data in specific columns. For instance, maybe I only want to remove rows that are missing tweet text.

print("Before dropping data we were", df.shape)

dropped = df.dropna(subset=['text'])
print("After dropping data we are", dropped.shape)

dropped.head()

Before dropping data we were (96, 3)
After dropping data we are (93, 3)

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
3	unknown	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman
5	positive	RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ...	myphillymedia

This time .dropna() only removed three rows! You can also pass a list to subset=, for example in this case I could do subset=['text', 'user'] to pay attention to both the text and user columns.

Converting values to missing

In pandas, NaN is missing data. We can see from the dataset below that row 2 is missing text while row 3 is missing a username.

df = pd.read_csv("potato-tweets.csv")
df.head()

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
2	unknown	NaN	Jaiography
3	unknown	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman

They are NaN beacuse there's nothing in that cell when you read in the time. Sometimes you want other values to be counted as NaN, though! Let's look at sentiment.

df.sentiment.value_counts()

neutral     42
positive    23
negative    17
unknown     14
Name: sentiment, dtype: int64

The value of "unknown" isn't very helpful to us, and NaN usually is a good equivalent for missing or unknown data. If we want to replace "unknown" with NaN, we have two options.

Converting missing data with .replace

First, we can could .replace to replace all unknowns with NaN. In this case, we need to import the numpy library to talk about NaN (the string "NaN" isn't the same as actual missing-data NaN!).

import numpy as np

df.sentiment = df.sentiment.replace("unknown", np.nan)
df.sentiment.value_counts()

neutral     42
positive    23
negative    17
Name: sentiment, dtype: int64

The unknown values have now disappeared! We can confirm they're still there by adding dropna=False to our value counts...

df.sentiment.value_counts(dropna=False)

neutral     42
positive    23
negative    17
NaN         14
Name: sentiment, dtype: int64

...or by just looking at our dataset.

df.head()

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
2	NaN	NaN	Jaiography
3	NaN	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman

Reading in missing data with na_values

Another alternative is to go back to when we read the data in. Using read_csv, you can list values you want pandas to consider as NaN when reading in the dataset.

import pandas as pd

df = pd.read_csv("potato-tweets.csv", na_values=["unknown"])
df.head()

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
2	NaN	NaN	Jaiography
3	NaN	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman

This time every time "unknown" shows up it's automatically converted to NaN by pandas! In this case we hope there's no tweet by a user named "unknown" and no one who just tweeted the text "unknown," but we'll take that risk...

How to fix .str.contains says "Cannot mask with non-boolean array containing NA / NaN values"

If we only want tweets about mashed potatoes, we end up in trouble! We get the error Cannot mask with non-boolean array containing NA / NaN values

df[df.text.str.contains("mashed")]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_5991/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3446 
   3447         # Do we have a (boolean) 1d indexer?
-> 3448         if com.is_bool_indexer(key):
   3449             return self._getitem_bool_array(key)
   3450 

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
    137                     # Don't raise on e.g. ["A", "B", np.nan], see
    138                     #  test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139                     raise ValueError(na_msg)
    140                 return False
    141             return True

ValueError: Cannot mask with non-boolean array containing NA / NaN values

This is because .str.contains does not work with columns that are missing data. Or at least, you need to tell it what to do. In this case, when we have a tweet that is missing text, we want to count that as False.

df[df.text.str.contains("mashed", na=False)]

	sentiment	text	user
9	neutral	kurutau mashed potatoes append	KuruC_ebooks
21	positive	@InfernoMeaCulpa “What’s not to understand. So...	villainousbvtch
27	positive	RT @AriMelber: Are mashed potatoes really “wor...	v_vossie
35	negative	@FanSidedNHL Some dude tried to do that to me ...	RogueChristLord
40	neutral	RT @fatfatpankocat: Heaping pile of mashed pot...	LurkerWojox
53	positive	RT @AriMelber: Are mashed potatoes really “wor...	lise_latulippe
60	negative	Last year I made; mashed potatoes, baked chick...	AshleyDavene
63	neutral	RT @fatfatpankocat: Heaping pile of mashed pot...	masayuki__san
78	neutral	@AriMelber How often are you all eating these ...	PeachValleyView
80	negative	@AriMelber Mine are😉\n\nI know my granny's sec...	LockUpTrumpNow
81	neutral	Go head, put some truffle on your mashed potatoes	MeechiiMeech

Maybe there are some instances where want to include rows that are missing the data, but 99% of the time you'll be using .str.contains(..., na=False).

← back to class-06