How to use all of the str.contains options

import pandas as pd

pd.set_option("display.max_rows", 10)
pd.set_option("display.min_rows", 10)

df = pd.read_csv("potato-tweets.csv")
df

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
2	unknown	NaN	Jaiography
3	unknown	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman
...	...	...	...
91	neutral	Soviet kids made toys from POTATOES! (PICS) ht...	therussophile
92	unknown	I like potatoes	harrywlc
93	neutral	63 Potatoes	NaN
94	unknown	@MeganReports Carrots are great - I grew up wi...	JustinReady
95	neutral	RT @kaiken99: He is Marcel, a creature of the ...	the_eismen

96 rows × 3 columns

Using str.contains with missing data

By default, filtering with .str.contains has a panic attack if the column you're searching has missing data.

df[df.text.str.contains("mashed")]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_34648/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3446 
   3447         # Do we have a (boolean) 1d indexer?
-> 3448         if com.is_bool_indexer(key):
   3449             return self._getitem_bool_array(key)
   3450 

~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
    137                     # Don't raise on e.g. ["A", "B", np.nan], see
    138                     #  test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139                     raise ValueError(na_msg)
    140                 return False
    141             return True

ValueError: Cannot mask with non-boolean array containing NA / NaN values

If you try to filter with .str.contains on a column with missing data, you get the error Cannot mask with non-boolean array containing NA / NaN values. That's because the rows with missing text come back as NaN instead of True or False, and pandas can't filter with a NaN. When this happens, pass na=False to tell .str.contains to count missing data as False.

df[df.text.str.contains("mashed", na=False)]

	sentiment	text	user
9	neutral	kurutau mashed potatoes append	KuruC_ebooks
21	positive	@InfernoMeaCulpa “What’s not to understand. So...	villainousbvtch
27	positive	RT @AriMelber: Are mashed potatoes really “wor...	v_vossie
35	negative	@FanSidedNHL Some dude tried to do that to me ...	RogueChristLord
40	neutral	RT @fatfatpankocat: Heaping pile of mashed pot...	LurkerWojox
...	...	...	...
60	negative	Last year I made; mashed potatoes, baked chick...	AshleyDavene
63	neutral	RT @fatfatpankocat: Heaping pile of mashed pot...	masayuki__san
78	neutral	@AriMelber How often are you all eating these ...	PeachValleyView
80	negative	@AriMelber Mine are😉\n\nI know my granny's sec...	LockUpTrumpNow
81	neutral	Go head, put some truffle on your mashed potatoes	MeechiiMeech

11 rows × 3 columns
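
To see what na=False actually changes, here is a minimal sketch on a small made-up Series. The values are hypothetical stand-ins, not rows from potato-tweets.csv, and dtype=object is an assumption meant to mimic how the pandas version in the traceback above stores text.

```python
import pandas as pd

# Hypothetical stand-in for the text column, with one missing value.
# dtype=object is an assumption: it mimics how older pandas stores text.
text = pd.Series(["mashed potatoes", None, "baked potato"], dtype=object)

# Without na=..., the row with missing text does not become True or False
raw_mask = text.str.contains("mashed")

# With na=False, the row with missing text is counted as "no match"
clean_mask = text.str.contains("mashed", na=False)

print(raw_mask)
print(clean_mask)
```

Only clean_mask is guaranteed to be all True/False, which is what filtering with df[...] needs.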

Is str.contains case-sensitive?

By default, .str.contains is case-sensitive. That means if we search for uppercase letters, it will only show us matches that are also uppercase.

df[df.text.str.contains("POTATO", na=False)]

	sentiment	text	user
72	neutral	TANJIA GAVE ME SEED POTATOES :(((	rcmmel
91	neutral	Soviet kids made toys from POTATOES! (PICS) ht...	therussophile

If we want .str.contains to not be case-sensitive, we can pass case=False to it.

df[df.text.str.contains("POTATO", na=False, case=False)]

	sentiment	text	user
0	positive	Variety is the spice of life, and that's why w...	nojolondon
1	neutral	la ptite frite dans les potatoes😍😍😍😍😍😍😍	8LU3H0UR
3	unknown	And with the potatoes done, the farm is done! ...	NaN
4	neutral	@AlacritysWhatev @AriMelber As is the gravy ma...	adivawoman
5	positive	RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ...	myphillymedia
...	...	...	...
90	positive	RT @green_pills2021: Crunchy, healthy and supe...	LucaMatteoRosso
91	neutral	Soviet kids made toys from POTATOES! (PICS) ht...	therussophile
92	unknown	I like potatoes	harrywlc
93	neutral	63 Potatoes	NaN
95	neutral	RT @kaiken99: He is Marcel, a creature of the ...	the_eismen

70 rows × 3 columns
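
The same comparison can be sketched on a tiny made-up Series, which makes it easier to see exactly which rows flip. The example strings are hypothetical, loosely modeled on the tweets above.

```python
import pandas as pd

# Hypothetical example text, including one missing value
text = pd.Series(["SEED POTATOES", "I like potatoes", "63 Potatoes", None])

# Case-sensitive (the default): only the all-caps row matches
exact = text.str.contains("POTATO", na=False)

# case=False: upper, lower and mixed case all match
any_case = text.str.contains("POTATO", na=False, case=False)

print(exact.tolist())     # [True, False, False, False]
print(any_case.tolist())  # [True, True, True, False]
```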

Regular expressions with .str.contains

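Regular expressions are a fancy way of doing searches. They use special characters that stand for a pattern instead of matching themselves literally. By default, .str.contains treats whatever you search for as a regular expression.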

pattern	meaning
.*	match anything (any number of any characters)
^	start of the text
$	end of the text
?	the thing before is optional
\d	a digit
[ASDF]	one character that is A or S or D or F

For example, if we only wanted tweets that started with RT...

# Searching for text that starts with RT
df[df.text.str.contains("^RT", na=False)]

	sentiment	text	user
5	positive	RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ...	myphillymedia
7	neutral	RT @HalflingDancer: B/W and Fighter proceed to...	Presto_Magician
14	neutral	RT @CoralCityCamera: A manatee trio of the ten...	skippz666
17	negative	RT @MaxCCurtis: Imagine Doctor Who: Flux from ...	aquatimelord
18	neutral	RT @DesignationSix: I would ask Anthony Walker...	Kath2252
...	...	...	...
70	positive	RT @OrbitalGardens: Right, it's time to kick o...	Helenintgarden
71	negative	RT @MarshalPapworth: A little bit of #mondaymo...	HarperAdamsUni
79	neutral	RT @TestKitchen: Tag yourself, we’re Garlic Ma...	stephen40290427
90	positive	RT @green_pills2021: Crunchy, healthy and supe...	LucaMatteoRosso
95	neutral	RT @kaiken99: He is Marcel, a creature of the ...	the_eismen

27 rows × 3 columns
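
The other patterns from the table work the same way. Here is a rough sketch of each on a small made-up Series. The strings are hypothetical, including the deliberately misspelled "potatos", and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical example text, including a misspelling and a missing value
text = pd.Series([
    "RT @someone: mashed potatoes",
    "63 Potatoes",
    "I like potatos",
    None,
])

# ^ anchors the match to the start of the text
starts_with_rt = text.str.contains("^RT", na=False)

# $ anchors the match to the end of the text
ends_with_otatoes = text.str.contains("otatoes$", na=False)

# \d matches a digit (raw string so the backslash is left alone)
has_digit = text.str.contains(r"\d", na=False)

# [Pp] matches either an uppercase or a lowercase p
either_p = text.str.contains("[Pp]otatoes", na=False)

# e? makes the "e" optional, so "potatoes" and "potatos" both match
optional_e = text.str.contains("potatoe?s", na=False)

# .* matches anything in between "RT" and "potatoes"
rt_then_potatoes = text.str.contains("^RT.*potatoes", na=False)

print(starts_with_rt.tolist())     # [True, False, False, False]
print(ends_with_otatoes.tolist())  # [True, True, False, False]
print(has_digit.tolist())          # [False, True, False, False]
print(either_p.tolist())           # [True, True, False, False]
print(optional_e.tolist())         # [True, False, True, False]
print(rt_then_potatoes.tolist())   # [True, False, False, False]
```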

If you want to turn off regular expression support for .str.contains, you can pass regex=False. The search text is then matched literally, character for character.

# Literally searching for ^RT
df[df.text.str.contains("^RT", na=False, regex=False)]

	sentiment	text	user
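
The result above is empty because no tweet contains the literal characters ^RT. A small sketch with made-up strings (hypothetical, not from the dataset) shows where regex=False is useful: searching for text that happens to contain special characters such as ^ or $.

```python
import pandas as pd

# Hypothetical example text, including a missing value
text = pd.Series([
    "RT @someone: potatoes",
    "the ^RT pattern",
    "Potatoes cost $5 now",
    None,
])

# As a regular expression, ^RT means "starts with RT"
rt_as_regex = text.str.contains("^RT", na=False)

# With regex=False, ^RT means the three characters ^, R, T
rt_as_literal = text.str.contains("^RT", na=False, regex=False)

# As a regular expression, $5 asks for a 5 after the end of the text,
# which can never happen, so nothing matches
price_as_regex = text.str.contains("$5", na=False)

# With regex=False, $5 is just a dollar sign followed by a 5
price_as_literal = text.str.contains("$5", na=False, regex=False)

print(rt_as_regex.tolist())       # [True, False, False, False]
print(rt_as_literal.tolist())     # [False, True, False, False]
print(price_as_regex.tolist())    # [False, False, False, False]
print(price_as_literal.tolist())  # [False, False, True, False]
```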
