How to use all of the str.contains options

import pandas as pd

pd.set_option("display.max_rows", 10)
pd.set_option("display.min_rows", 10)

df = pd.read_csv("potato-tweets.csv")
df
     sentiment  text                                               user
0    positive   Variety is the spice of life, and that's why w...  nojolondon
1    neutral    la ptite frite dans les potatoes                   8LU3H0UR
2    unknown    NaN                                                Jaiography
3    unknown    And with the potatoes done, the farm is done! ...  NaN
4    neutral    @AlacritysWhatev @AriMelber As is the gravy ma...  adivawoman
...  ...        ...                                                ...
91   neutral    Soviet kids made toys from POTATOES! (PICS) ht...  therussophile
92   unknown    I like potatoes                                    harrywlc
93   neutral    63 Potatoes                                        NaN
94   unknown    @MeganReports Carrots are great - I grew up wi...  JustinReady
95   neutral    RT @kaiken99: He is Marcel, a creature of the ...  the_eismen

96 rows × 3 columns

Using str.contains with missing data

By default, .str.contains doesn't know what to do with missing data: it returns NaN for those rows instead of True or False, and pandas refuses to filter with the result.

df[df.text.str.contains("mashed")]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_34648/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3446 
   3447         # Do we have a (boolean) 1d indexer?
-> 3448         if com.is_bool_indexer(key):
   3449             return self._getitem_bool_array(key)
   3450 

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
    137                     # Don't raise on e.g. ["A", "B", np.nan], see
    138                     #  test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139                     raise ValueError(na_msg)
    140                 return False
    141             return True

ValueError: Cannot mask with non-boolean array containing NA / NaN values

The error Cannot mask with non-boolean array containing NA / NaN values means the True/False list you are filtering with has missing values in it. To fix it, pass na=False, which tells .str.contains to count missing data as False.

df[df.text.str.contains("mashed", na=False)]
     sentiment  text                                               user
9    neutral    kurutau mashed potatoes append                     KuruC_ebooks
21   positive   @InfernoMeaCulpa “What’s not to understand. So...  villainousbvtch
27   positive   RT @AriMelber: Are mashed potatoes really “wor...  v_vossie
35   negative   @FanSidedNHL Some dude tried to do that to me ...  RogueChristLord
40   neutral    RT @fatfatpankocat: Heaping pile of mashed pot...  LurkerWojox
...  ...        ...                                                ...
60   negative   Last year I made; mashed potatoes, baked chick...  AshleyDavene
63   neutral    RT @fatfatpankocat: Heaping pile of mashed pot...  masayuki__san
78   neutral    @AriMelber How often are you all eating these ...  PeachValleyView
80   negative   @AriMelber Mine are😉\n\nI know my granny's sec...  LockUpTrumpNow
81   neutral    Go head, put some truffle on your mashed potatoes  MeechiiMeech

11 rows × 3 columns
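
To see where the error comes from, it can help to look at what .str.contains returns on its own. This is a minimal sketch using a tiny made-up DataFrame (the tweets and user names are invented for illustration, they are not rows from potato-tweets.csv). It assumes the text column holds ordinary Python objects, as in the traceback above; newer pandas string types may treat missing values differently.

```python
import pandas as pd

# Hypothetical stand-in for potato-tweets.csv: the second row has no text.
# dtype=object is an assumption that mirrors how read_csv loaded the column above.
tweets = pd.DataFrame({
    "text": pd.Series(
        ["I love mashed potatoes", None, "Baked potatoes tonight"],
        dtype=object,
    ),
    "user": ["a_user", "b_user", "c_user"],
})

# Without na=..., the missing row comes back as missing, not True or False
raw_mask = tweets.text.str.contains("mashed")
print(raw_mask.isna().sum())  # how many rows are neither True nor False

# With na=False, missing text simply counts as "no match"
clean_mask = tweets.text.str.contains("mashed", na=False)
print(tweets[clean_mask])
```

Filtering with clean_mask works because every value in it is True or False.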

Is str.contains case-sensitive?

By default, .str.contains uses exact case matching. That means if we search for uppercase letters, it will only show us uppercase matches.

df[df.text.str.contains("POTATO", na=False)]
     sentiment  text                                               user
72   neutral    TANJIA GAVE ME SEED POTATOES :(((                  rcmmel
91   neutral    Soviet kids made toys from POTATOES! (PICS) ht...  therussophile

If we want .str.contains to not be case-sensitive, we can pass case=False to it.

df[df.text.str.contains("POTATO", na=False, case=False)]
     sentiment  text                                               user
0    positive   Variety is the spice of life, and that's why w...  nojolondon
1    neutral    la ptite frite dans les potatoes                   8LU3H0UR
3    unknown    And with the potatoes done, the farm is done! ...  NaN
4    neutral    @AlacritysWhatev @AriMelber As is the gravy ma...  adivawoman
5    positive   RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ...  myphillymedia
...  ...        ...                                                ...
90   positive   RT @green_pills2021: Crunchy, healthy and supe...  LucaMatteoRosso
91   neutral    Soviet kids made toys from POTATOES! (PICS) ht...  therussophile
92   unknown    I like potatoes                                    harrywlc
93   neutral    63 Potatoes                                        NaN
95   neutral    RT @kaiken99: He is Marcel, a creature of the ...  the_eismen

70 rows × 3 columns
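
The same comparison can be tried on a handful of rows. This is only a sketch: the Series below is made up for illustration, with a few tweet-like strings and one missing value.

```python
import pandas as pd

# Hypothetical tweet text, including one missing value
text = pd.Series(
    ["SEED POTATOES", "I like potatoes", "63 Potatoes", None],
    dtype=object,
)

# Default: the case has to match exactly
exact = text.str.contains("POTATO", na=False)

# case=False: upper and lower case are treated the same
any_case = text.str.contains("POTATO", na=False, case=False)

print(exact.sum())     # how many rows match with exact case
print(any_case.sum())  # how many rows match when case is ignored
```

Calling .sum() on the True/False result is a quick way to count matches, since True counts as 1 and False as 0.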

Regular expressions with .str.contains

Regular expressions are a fancy way of doing searches. They use special characters that mean something other than the literal character, and .str.contains treats your search text as a regular expression by default.

pattern   meaning
.*        match anything (any run of characters, including none)
^         start of the text
$         end of the text
?         the thing before it is optional
\d        number character (digit)
[ASDF]    A or S or D or F
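
Each pattern in the table can be tried out with .str.contains. The sketch below runs them against a few made-up strings (invented for illustration, not taken from the tweets file) and prints how many rows each one matches.

```python
import pandas as pd

# Hypothetical tweet text, including one missing value
text = pd.Series([
    "RT @someone: potatoes are great",
    "I ate 63 potatoes",
    "#mashedpotato forever",
    "Mashed potato is the BEST",
    None,
], dtype=object)

# Pattern -> plain-English meaning
patterns = {
    "^RT": "starts with RT",
    "potatoes$": "ends with potatoes",
    r"\d": "contains a digit",
    "RT.*great": "RT, then anything, then great",
    "[Mm]ashed ?potato": "M or m, then ashed, an optional space, then potato",
}

matches = {}
for pattern, meaning in patterns.items():
    matches[pattern] = text.str.contains(pattern, na=False)
    print(f"{pattern!r} ({meaning}): {matches[pattern].sum()} match(es)")
```

The r in front of "\d" makes it a raw string, so Python passes the backslash through to the regular expression untouched.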

For example, if we only wanted tweets that started with RT...

# Searching for text that starts with RT
df[df.text.str.contains("^RT", na=False)]
     sentiment  text                                               user
5    positive   RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ...  myphillymedia
7    neutral    RT @HalflingDancer: B/W and Fighter proceed to...  Presto_Magician
14   neutral    RT @CoralCityCamera: A manatee trio of the ten...  skippz666
17   negative   RT @MaxCCurtis: Imagine Doctor Who: Flux from ...  aquatimelord
18   neutral    RT @DesignationSix: I would ask Anthony Walker...  Kath2252
...  ...        ...                                                ...
70   positive   RT @OrbitalGardens: Right, it's time to kick o...  Helenintgarden
71   negative   RT @MarshalPapworth: A little bit of #mondaymo...  HarperAdamsUni
79   neutral    RT @TestKitchen: Tag yourself, we’re Garlic Ma...  stephen40290427
90   positive   RT @green_pills2021: Crunchy, healthy and supe...  LucaMatteoRosso
95   neutral    RT @kaiken99: He is Marcel, a creature of the ...  the_eismen

27 rows × 3 columns

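If you want to turn off regular expression support for .str.contains, pass regex=False. The search below then looks for the literal characters ^RT, and no tweet contains them, so the result is empty.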

# Literally searching for ^RT
df[df.text.str.contains("^RT", na=False, regex=False)]
sentiment text user
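
To see the difference on rows that do contain special characters, here is a small sketch with made-up strings (invented for illustration, not from the tweets file).

```python
import pandas as pd

# Hypothetical tweet text, including one missing value
text = pd.Series([
    "RT @someone: potatoes",
    "What does ^RT even mean?",
    "Price: $5 (cheap!)",
    None,
], dtype=object)

# As a regular expression, ^RT means "RT at the start of the text"
as_regex = text.str.contains("^RT", na=False)

# With regex=False, ^RT means the three characters ^, R, T anywhere in the text
as_literal = text.str.contains("^RT", na=False, regex=False)

# regex=False is also handy when the search text is full of special characters
has_price = text.str.contains("$5", na=False, regex=False)

print(text[as_regex])
print(text[as_literal])
print(text[has_price])
```

As a regular expression, "$5" would never match anything, because $ means the end of the text and nothing can come after it.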
