How to use all of the str.contains options
import pandas as pd
pd.set_option("display.max_rows", 10)
pd.set_option("display.min_rows", 10)
df = pd.read_csv("potato-tweets.csv")
df
| | sentiment | text | user |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes | 8LU3H0UR |
2 | unknown | NaN | Jaiography |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
... | ... | ... | ... |
91 | neutral | Soviet kids made toys from POTATOES! (PICS) ht... | therussophile |
92 | unknown | I like potatoes | harrywlc |
93 | neutral | 63 Potatoes | NaN |
94 | unknown | @MeganReports Carrots are great - I grew up wi... | JustinReady |
95 | neutral | RT @kaiken99: He is Marcel, a creature of the ... | the_eismen |
96 rows × 3 columns
Using str.contains with missing data
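By default, .str.contains has a panic attack if you try to filter on a column that is missing data. Missing values come back as NaN rather than True or False, and pandas refuses to filter rows with a mask like that.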
df[df.text.str.contains("mashed")]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_34648/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3446
3447 # Do we have a (boolean) 1d indexer?
-> 3448 if com.is_bool_indexer(key):
3449 return self._getitem_bool_array(key)
3450
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
137 # Don't raise on e.g. ["A", "B", np.nan], see
138 # test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139 raise ValueError(na_msg)
140 return False
141 return True
ValueError: Cannot mask with non-boolean array containing NA / NaN values
If you use .str.contains to filter a column with missing data, you get the error Cannot mask with non-boolean array containing NA / NaN values. When this happens, pass na=False to tell .str.contains to count missing data as False.
df[df.text.str.contains("mashed", na=False)]
| | sentiment | text | user |
---|---|---|---|
9 | neutral | kurutau mashed potatoes append | KuruC_ebooks |
21 | positive | @InfernoMeaCulpa “What’s not to understand. So... | villainousbvtch |
27 | positive | RT @AriMelber: Are mashed potatoes really “wor... | v_vossie |
35 | negative | @FanSidedNHL Some dude tried to do that to me ... | RogueChristLord |
40 | neutral | RT @fatfatpankocat: Heaping pile of mashed pot... | LurkerWojox |
... | ... | ... | ... |
60 | negative | Last year I made; mashed potatoes, baked chick... | AshleyDavene |
63 | neutral | RT @fatfatpankocat: Heaping pile of mashed pot... | masayuki__san |
78 | neutral | @AriMelber How often are you all eating these ... | PeachValleyView |
80 | negative | @AriMelber Mine are\n\nI know my granny's sec... | LockUpTrumpNow |
81 | neutral | Go head, put some truffle on your mashed potatoes | MeechiiMeech |
11 rows × 3 columns
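To see what na is doing, here is a minimal sketch on a tiny made-up table. The tweets DataFrame and its rows are hypothetical stand-ins for potato-tweets.csv, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the tweets table: the middle row has no text
tweets = pd.DataFrame({
    "text": ["I love mashed potatoes", None, "63 Potatoes"],
    "user": ["a", "b", "c"],
})

# na=False: rows with missing text count as "no match"
mask = tweets.text.str.contains("mashed", na=False)
print(mask.tolist())

# na=True: rows with missing text count as a match instead
mask_keep_missing = tweets.text.str.contains("mashed", na=True)
print(mask_keep_missing.tolist())

# Because the mask is all True/False, filtering works
mashed = tweets[mask]
print(mashed)
```

na=False is usually what you want for filtering, since a tweet with no text can't contain the word you're looking for.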
Is str.contains case-sensitive?
By default, .str.contains is case-sensitive. That means if we search for uppercase letters, we only get matches that are also uppercase.
df[df.text.str.contains("POTATO", na=False)]
| | sentiment | text | user |
---|---|---|---|
72 | neutral | TANJIA GAVE ME SEED POTATOES :((( | rcmmel |
91 | neutral | Soviet kids made toys from POTATOES! (PICS) ht... | therussophile |
If we want .str.contains to not be case-sensitive, we can pass case=False to it.
df[df.text.str.contains("POTATO", na=False, case=False)]
| | sentiment | text | user |
---|---|---|---|
0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
1 | neutral | la ptite frite dans les potatoes | 8LU3H0UR |
3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
5 | positive | RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ... | myphillymedia |
... | ... | ... | ... |
90 | positive | RT @green_pills2021: Crunchy, healthy and supe... | LucaMatteoRosso |
91 | neutral | Soviet kids made toys from POTATOES! (PICS) ht... | therussophile |
92 | unknown | I like potatoes | harrywlc |
93 | neutral | 63 Potatoes | NaN |
95 | neutral | RT @kaiken99: He is Marcel, a creature of the ... | the_eismen |
70 rows × 3 columns
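Here is a small sketch comparing the two modes side by side. The texts Series is made up for illustration. The flags=re.IGNORECASE version is another way to spell the same thing, using the flags parameter that .str.contains passes through to Python's re module.

```python
import re
import pandas as pd

# Made-up examples, not rows from potato-tweets.csv
texts = pd.Series(["SEED POTATOES", "I like potatoes", "63 Potatoes", "carrots"])

# Default: only the all-caps row matches
exact = texts.str.contains("POTATO")
print(exact.tolist())

# case=False: any capitalization matches
any_case = texts.str.contains("POTATO", case=False)
print(any_case.tolist())

# Same idea, written as a regular expression flag
with_flag = texts.str.contains("POTATO", flags=re.IGNORECASE)
print(with_flag.tolist())
```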
Regular expressions with .str.contains
Regular expressions are a fancy way of doing searches. They use special characters that mean something other than the literal character, and .str.contains treats its search text as a regular expression by default.
string | meaning |
---|---|
.* | match anything |
^ | start of the text |
$ | end of the text |
? | the thing before is optional |
\d | number character (digit) |
[ASDF] | A or S or D or F |
For example, if we only wanted tweets that started with RT...
# Searching for text that starts with RT
df[df.text.str.contains("^RT", na=False)]
| | sentiment | text | user |
---|---|---|---|
5 | positive | RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ... | myphillymedia |
7 | neutral | RT @HalflingDancer: B/W and Fighter proceed to... | Presto_Magician |
14 | neutral | RT @CoralCityCamera: A manatee trio of the ten... | skippz666 |
17 | negative | RT @MaxCCurtis: Imagine Doctor Who: Flux from ... | aquatimelord |
18 | neutral | RT @DesignationSix: I would ask Anthony Walker... | Kath2252 |
... | ... | ... | ... |
70 | positive | RT @OrbitalGardens: Right, it's time to kick o... | Helenintgarden |
71 | negative | RT @MarshalPapworth: A little bit of #mondaymo... | HarperAdamsUni |
79 | neutral | RT @TestKitchen: Tag yourself, we’re Garlic Ma... | stephen40290427 |
90 | positive | RT @green_pills2021: Crunchy, healthy and supe... | LucaMatteoRosso |
95 | neutral | RT @kaiken99: He is Marcel, a creature of the ... | the_eismen |
27 rows × 3 columns
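The other special characters from the table work the same way. This is a rough sketch that tries each one on a few made-up strings (the texts Series and the @someone handle are hypothetical, not from the tweets file).

```python
import pandas as pd

# Made-up examples, one per row
texts = pd.Series([
    "RT @someone: mashed potatoes",
    "63 Potatoes",
    "I like potatos",
    "carrots only",
])

# ^ is the start of the text
starts_with_rt = texts.str.contains("^RT")

# $ is the end of the text
ends_with_oes = texts.str.contains("oes$")

# \d is a digit (the r"..." keeps Python from touching the backslash)
has_digit = texts.str.contains(r"\d")

# ? makes the thing before it optional, so "potatoes" and "potatos" both match
optional_e = texts.str.contains("potatoe?s")

# [Pp] is P or p
either_p = texts.str.contains("[Pp]otatoes")

# .* matches anything in between
mashed_then_potatoes = texts.str.contains("mashed.*potatoes")

for name, result in [
    ("^RT", starts_with_rt),
    ("oes$", ends_with_oes),
    (r"\d", has_digit),
    ("potatoe?s", optional_e),
    ("[Pp]otatoes", either_p),
    ("mashed.*potatoes", mashed_then_potatoes),
]:
    print(name, result.tolist())
```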
If you want to turn off regular expression support for .str.contains, you can use regex=False. The search text is then matched literally, special characters and all.
# Literally searching for ^RT
df[df.text.str.contains("^RT", na=False, regex=False)]
| | sentiment | text | user |
---|---|---|---|
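The result is empty because no tweet contains the literal characters ^RT. To make the difference easier to see, here is a minimal sketch on made-up strings (the texts Series is hypothetical). It also shows re.escape from Python's standard library, which is another way to search literally while leaving regular expressions turned on.

```python
import re
import pandas as pd

# Made-up examples, not rows from potato-tweets.csv
texts = pd.Series([
    "RT @someone: hi",
    "what does ^RT mean?",
    "potatoes cost $3.50",
    "3750 potatoes",
])

# As a regular expression, ^RT means "starts with RT"
as_regex = texts.str.contains("^RT")

# With regex=False, it means the three characters ^, R, T
as_literal = texts.str.contains("^RT", regex=False)

# As a regular expression, the . in 3.50 matches any character, so 3750 matches too
price_regex = texts.str.contains("3.50")

# With regex=False, only a real 3.50 matches
price_literal = texts.str.contains("3.50", regex=False)

# re.escape adds backslashes so the regular expression matches literally
price_escaped = texts.str.contains(re.escape("3.50"))

print(as_regex.tolist())
print(as_literal.tolist())
print(price_regex.tolist())
print(price_literal.tolist())
print(price_escaped.tolist())
```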