Every solution to missing data problems in pandas
import pandas as pd
= pd.read_csv("potato-tweets.csv")
df df.head()
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
2 | unknown | NaN | Jaiography |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
Counting missing values
df.info()
will give you the number of non-missing values (the non-null count). Up at the top it tells you there are 96 total entries, so it's up to you to do the math yourself.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sentiment 96 non-null object
1 text 93 non-null object
2 user 89 non-null object
dtypes: object(3)
memory usage: 2.4+ KB
For the user
column (the person's username), 96 total rows minus 89 existing usernames = 7 rows missing usernames. I find this hard to follow, though, so here's another option:
sum() df.isna().
sentiment 0
text 3
user 7
dtype: int64
Why does this work?
df.isna()
gives you True
where a cell is missing data, and False
where it is not missing data. It looks like this:
df.isna()
sentiment | text | user | |
---|---|---|---|
0 | False | False | False |
1 | False | False | False |
2 | False | True | False |
3 | False | False | True |
4 | False | False | False |
... | ... | ... | ... |
91 | False | False | False |
92 | False | False | False |
93 | False | False | True |
94 | False | False | False |
95 | False | False | False |
96 rows ร 3 columns
We can see row 2 is missing text, while rows 3
and 93
are missing a user. When you connect this to .sum()
, pandas will add up all of the True
values in each row.
sum() df.isna().
sentiment 0
text 3
user 7
dtype: int64
Finding rows with missing data
Although missing data is usually thought of as a bad thing, sometimes you want to find the rows with missing data.
= pd.read_csv("potato-tweets.csv")
df df.head()
The one thing that will not work is trying to do == "NaN"
. That's because NaN
is not a string, it's a special kind of pandas/numpy data that means "this is missing."
# This won't find you any users
== 'NaN'] df[df.user
sentiment | text | user |
---|
Instead, to find rows where the data is missing you need to use df.user.isna()
(...or df.user.isnull()
, which does the exact same thing).
df[df.user.isna()]
sentiment | text | user | |
---|---|---|---|
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
6 | unknown | Potatoes and eggs are just ๐๐ฝ๐๐ฝ | NaN |
8 | neutral | @LucyENichol โI want to be Smiths crispsโ sang... | NaN |
11 | unknown | @iamjaxteller kaรง tradede bir likidite oluyorsun? | NaN |
16 | neutral | @DebraGetsPilled\nClamps ๐\n\nhttps://t.co/uuW... | NaN |
23 | neutral | No offense to all the potatoes , don't get me ... | NaN |
93 | neutral | 63 Potatoes | NaN |
Find rows with ANY missing data
Another option is to find rows with ANY missing data. In this case, we have a few options. We'll start with df.isna()
so we can look at what the missing data looks like.
df.isna().head()
sentiment | text | user | |
---|---|---|---|
0 | False | False | False |
1 | False | False | False |
2 | False | True | False |
3 | False | False | True |
4 | False | False | False |
If we have True
and False
, we can ask them "hey, for each row, are ANY of you missing data?" by using .any(axis=1)
. The axis=1
part means "do this for each row instead of each column."
any(axis=1).head() df.isna().
0 False
1 False
2 True
3 True
4 False
dtype: bool
And now we can feed that into df[...]
to only see the rows with missing data.
any(axis=1)] df[df.isna().
sentiment | text | user | |
---|---|---|---|
2 | unknown | NaN | Jaiography |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
6 | unknown | Potatoes and eggs are just ๐๐ฝ๐๐ฝ | NaN |
8 | neutral | @LucyENichol โI want to be Smiths crispsโ sang... | NaN |
11 | unknown | @iamjaxteller kaรง tradede bir likidite oluyorsun? | NaN |
16 | neutral | @DebraGetsPilled\nClamps ๐\n\nhttps://t.co/uuW... | NaN |
19 | neutral | NaN | OdsRasheed |
23 | neutral | No offense to all the potatoes , don't get me ... | NaN |
48 | unknown | NaN | i_cry_potatoes |
93 | neutral | 63 Potatoes | NaN |
Removing rows with missing data
Maybe the rows with missing data are getting in the way of your analysis. Let's remove them!
import pandas as pd
= pd.read_csv("potato-tweets.csv")
df df.head()
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
2 | unknown | NaN | Jaiography |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
Remove rows based on all columns
By default, you'll be removing rows where ANY of the columns are missing. I like to print the size of our data before and after dropping missing data to stay cognizant of whether we've made a mistake or whether we're throwing away a large chunk of our data.
print("Before dropping data we were", df.shape)
= df.dropna()
dropped print("After dropping data we are", dropped.shape)
dropped.head()
Before dropping data we were (96, 3)
After dropping data we are (86, 3)
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
5 | positive | RT @junedarville: โค๏ธ ๐๐๐ฎ๐ฉ๐ก๐ข๐ง๐จ๐ข๐ฌ๐ ๐๐จ๐ญ๐๐ญ๐จ๐๐ฌ\nโค๏ธ ... | myphillymedia |
7 | neutral | RT @HalflingDancer: B/W and Fighter proceed to... | Presto_Magician |
In this case I removed all rows that included any missing data. I went from 96 total columns down to 86.
Remove rows based on specific columns
Usually you don't want to be so aggressive, though, and instead you only want to drop rows if they're missing data in specific columns. For instance, maybe I only want to remove rows that are missing tweet text.
print("Before dropping data we were", df.shape)
= df.dropna(subset=['text'])
dropped print("After dropping data we are", dropped.shape)
dropped.head()
Before dropping data we were (96, 3)
After dropping data we are (93, 3)
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
5 | positive | RT @junedarville: โค๏ธ ๐๐๐ฎ๐ฉ๐ก๐ข๐ง๐จ๐ข๐ฌ๐ ๐๐จ๐ญ๐๐ญ๐จ๐๐ฌ\nโค๏ธ ... | myphillymedia |
This time .dropna()
only removed three rows! You can also pass a list to subset=
, for example in this case I could do subset=['text', 'user']
to pay attention to both the text
and user
columns.
Converting values to missing
In pandas, NaN
is missing data. We can see from the dataset below that row 2 is missing text while row 3 is missing a username.
= pd.read_csv("potato-tweets.csv")
df df.head()
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
2 | unknown | NaN | Jaiography |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
They are NaN
beacuse there's nothing in that cell when you read in the time. Sometimes you want other values to be counted as NaN
, though! Let's look at sentiment.
df.sentiment.value_counts()
neutral 42
positive 23
negative 17
unknown 14
Name: sentiment, dtype: int64
The value of "unknown" isn't very helpful to us, and NaN
usually is a good equivalent for missing or unknown data. If we want to replace "unknown" with NaN
, we have two options.
Converting missing data with .replace
First, we can could .replace
to replace all unknowns with NaN
. In this case, we need to import the numpy library to talk about NaN
(the string "NaN"
isn't the same as actual missing-data NaN
!).
import numpy as np
= df.sentiment.replace("unknown", np.nan)
df.sentiment df.sentiment.value_counts()
neutral 42
positive 23
negative 17
Name: sentiment, dtype: int64
The unknown values have now disappeared! We can confirm they're still there by adding dropna=False
to our value counts...
=False) df.sentiment.value_counts(dropna
neutral 42
positive 23
negative 17
NaN 14
Name: sentiment, dtype: int64
...or by just looking at our dataset.
df.head()
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
2 | NaN | NaN | Jaiography |
3 | NaN | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
Reading in missing data with na_values
Another alternative is to go back to when we read the data in. Using read_csv
, you can list values you want pandas to consider as NaN
when reading in the dataset.
import pandas as pd
= pd.read_csv("potato-tweets.csv", na_values=["unknown"])
df df.head()
sentiment | text | user | |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes๐๐๐๐๐๐๐ | 8LU3H0UR |
2 | NaN | NaN | Jaiography |
3 | NaN | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
This time every time "unknown" shows up it's automatically converted to NaN
by pandas! In this case we hope there's no tweet by a user named "unknown" and no one who just tweeted the text "unknown," but we'll take that risk...
How to fix .str.contains says "Cannot mask with non-boolean array containing NA / NaN values"
If we only want tweets about mashed potatoes, we end up in trouble! We get the error Cannot mask with non-boolean array containing NA / NaN values
str.contains("mashed")] df[df.text.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_5991/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3446
3447 # Do we have a (boolean) 1d indexer?
-> 3448 if com.is_bool_indexer(key):
3449 return self._getitem_bool_array(key)
3450
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
137 # Don't raise on e.g. ["A", "B", np.nan], see
138 # test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139 raise ValueError(na_msg)
140 return False
141 return True
ValueError: Cannot mask with non-boolean array containing NA / NaN values
This is because .str.contains
does not work with columns that are missing data. Or at least, you need to tell it what to do. In this case, when we have a tweet that is missing text, we want to count that as False
.
str.contains("mashed", na=False)] df[df.text.
sentiment | text | user | |
---|---|---|---|
9 | neutral | kurutau mashed potatoes append | KuruC_ebooks |
21 | positive | @InfernoMeaCulpa โWhatโs not to understand. So... | villainousbvtch |
27 | positive | RT @AriMelber: Are mashed potatoes really โwor... | v_vossie |
35 | negative | @FanSidedNHL Some dude tried to do that to me ... | RogueChristLord |
40 | neutral | RT @fatfatpankocat: Heaping pile of mashed pot... | LurkerWojox |
53 | positive | RT @AriMelber: Are mashed potatoes really โwor... | lise_latulippe |
60 | negative | Last year I made; mashed potatoes, baked chick... | AshleyDavene |
63 | neutral | RT @fatfatpankocat: Heaping pile of mashed pot... | masayuki__san |
78 | neutral | @AriMelber How often are you all eating these ... | PeachValleyView |
80 | negative | @AriMelber Mine are๐\n\nI know my granny's sec... | LockUpTrumpNow |
81 | neutral | Go head, put some truffle on your mashed potatoes | MeechiiMeech |
Maybe there are some instances where want to include rows that are missing the data, but 99% of the time you'll be using .str.contains(..., na=False)
.