How to use all of the str.contains options
import pandas as pd
pd.set_option("display.max_rows", 10)
pd.set_option("display.min_rows", 10)
df = pd.read_csv("potato-tweets.csv")
df

| | sentiment | text | user |
|---|---|---|---|
| 0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
| 1 | neutral | la ptite frite dans les potatoes | 8LU3H0UR |
| 2 | unknown | NaN | Jaiography |
| 3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
| 4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
| ... | ... | ... | ... |
| 91 | neutral | Soviet kids made toys from POTATOES! (PICS) ht... | therussophile |
| 92 | unknown | I like potatoes | harrywlc |
| 93 | neutral | 63 Potatoes | NaN |
| 94 | unknown | @MeganReports Carrots are great - I grew up wi... | JustinReady |
| 95 | neutral | RT @kaiken99: He is Marcel, a creature of the ... | the_eismen |
96 rows × 3 columns
Using str.contains with missing data
By default, .str.contains doesn't know what to do with missing data: rows with no text come back as NaN instead of True or False, so filtering the dataframe with the result raises an error.
df[df.text.str.contains("mashed")]

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_34648/1924286915.py in <module>
----> 1 df[df.text.str.contains("mashed")]
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3446
3447 # Do we have a (boolean) 1d indexer?
-> 3448 if com.is_bool_indexer(key):
3449 return self._getitem_bool_array(key)
3450
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py in is_bool_indexer(key)
137 # Don't raise on e.g. ["A", "B", np.nan], see
138 # test_loc_getitem_list_of_labels_categoricalindex_with_na
--> 139 raise ValueError(na_msg)
140 return False
141 return True
ValueError: Cannot mask with non-boolean array containing NA / NaN values
The error Cannot mask with non-boolean array containing NA / NaN values means the filter has missing values mixed in with its True and False values. When this happens, pass na=False to tell .str.contains to count missing data as False.
df[df.text.str.contains("mashed", na=False)]

| | sentiment | text | user |
|---|---|---|---|
| 9 | neutral | kurutau mashed potatoes append | KuruC_ebooks |
| 21 | positive | @InfernoMeaCulpa “What’s not to understand. So... | villainousbvtch |
| 27 | positive | RT @AriMelber: Are mashed potatoes really “wor... | v_vossie |
| 35 | negative | @FanSidedNHL Some dude tried to do that to me ... | RogueChristLord |
| 40 | neutral | RT @fatfatpankocat: Heaping pile of mashed pot... | LurkerWojox |
| ... | ... | ... | ... |
| 60 | negative | Last year I made; mashed potatoes, baked chick... | AshleyDavene |
| 63 | neutral | RT @fatfatpankocat: Heaping pile of mashed pot... | masayuki__san |
| 78 | neutral | @AriMelber How often are you all eating these ... | PeachValleyView |
| 80 | negative | @AriMelber Mine are\n\nI know my granny's sec... | LockUpTrumpNow |
| 81 | neutral | Go head, put some truffle on your mashed potatoes | MeechiiMeech |
11 rows × 3 columns
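To see what na=False changes without loading the tweets, here is a minimal sketch on a tiny made-up column (the three strings are invented for illustration and are not from potato-tweets.csv). How the missing row shows up without na=False depends on your pandas version, so the sketch only prints it.

```python
import pandas as pd

# A tiny made-up column with one missing value
text = pd.Series(["mashed potatoes", None, "baked potato"])

# Without na=..., the missing row is typically reported as a missing value
# rather than True/False (the exact result depends on the pandas version)
raw_mask = text.str.contains("mashed")
print(raw_mask.tolist())

# With na=False, the missing row counts as "not a match"
clean_mask = text.str.contains("mashed", na=False)
print(clean_mask.tolist())        # [True, False, False]

# A mask of only True/False values is safe to filter with
print(text[clean_mask].tolist())  # ['mashed potatoes']
```

Passing na=True instead would keep the rows with missing text, which is rarely what you want when searching.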
Is str.contains case-sensitive?
By default, .str.contains is case-sensitive. That means if we search for uppercase letters, it will only match uppercase letters.
df[df.text.str.contains("POTATO", na=False)]

| | sentiment | text | user |
|---|---|---|---|
| 72 | neutral | TANJIA GAVE ME SEED POTATOES :((( | rcmmel |
| 91 | neutral | Soviet kids made toys from POTATOES! (PICS) ht... | therussophile |
If we want .str.contains to not be case-sensitive, we can pass case=False to it.
df[df.text.str.contains("POTATO", na=False, case=False)]

| | sentiment | text | user |
|---|---|---|---|
| 0 | positive | Variety is the spice of life, and that's why w... | nojolondon |
| 1 | neutral | la ptite frite dans les potatoes | 8LU3H0UR |
| 3 | unknown | And with the potatoes done, the farm is done! ... | NaN |
| 4 | neutral | @AlacritysWhatev @AriMelber As is the gravy ma... | adivawoman |
| 5 | positive | RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ... | myphillymedia |
| ... | ... | ... | ... |
| 90 | positive | RT @green_pills2021: Crunchy, healthy and supe... | LucaMatteoRosso |
| 91 | neutral | Soviet kids made toys from POTATOES! (PICS) ht... | therussophile |
| 92 | unknown | I like potatoes | harrywlc |
| 93 | neutral | 63 Potatoes | NaN |
| 95 | neutral | RT @kaiken99: He is Marcel, a creature of the ... | the_eismen |
70 rows × 3 columns
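The difference between the two searches is easy to check on a handful of strings. This is a small sketch using a made-up Series (the sample texts are invented for illustration) rather than the tweet data.

```python
import pandas as pd

# Made-up examples with different capitalizations
text = pd.Series(["SEED POTATOES", "I like potatoes", "63 Potatoes", "carrots"])

# Default: the case has to match exactly
exact = text.str.contains("POTATO")
print(exact.tolist())     # [True, False, False, False]

# case=False: uppercase and lowercase are treated the same
any_case = text.str.contains("POTATO", case=False)
print(any_case.tolist())  # [True, True, True, False]
```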
Regular expressions with .str.contains
Regular expressions are a fancy way of doing searches. They're patterns where certain characters have a special meaning instead of matching themselves.
| pattern | meaning |
|---|---|
| .* | match anything (any run of characters, including none) |
| ^ | start of the text |
| $ | end of the text |
| ? | the thing before is optional |
| \d | number character (digit) |
| [ASDF] | A or S or D or F |
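Each pattern in the table can be tried on its own. The sketch below runs them against a small made-up Series (the sample strings and variable names are invented for illustration); the `r"..."` prefix on `\d` is just Python's raw-string syntax so the backslash reaches the regular expression untouched.

```python
import pandas as pd

# Made-up examples, not from the tweets file
text = pd.Series([
    "RT @someone: potatoes",
    "63 Potatoes",
    "I like potatoes",
    "potato",
])

# ^ : only match at the start of the text
starts_with_rt = text.str.contains("^RT")
print(starts_with_rt.tolist())   # [True, False, False, False]

# $ : only match at the end of the text
ends_with_es = text.str.contains("es$")
print(ends_with_es.tolist())     # [True, True, True, False]

# \d : any digit
has_digit = text.str.contains(r"\d")
print(has_digit.tolist())        # [False, True, False, False]

# ? : the "e" and the "s" are each optional, so "potato" also counts
potato_ending = text.str.contains("potatoe?s?$")
print(potato_ending.tolist())    # [True, False, True, True]

# [Pp] : either an uppercase or a lowercase p
either_p = text.str.contains("[Pp]otato")
print(either_p.tolist())         # [True, True, True, True]

# .* : anything at all between "RT" and "potatoes"
rt_then_potatoes = text.str.contains("^RT.*potatoes")
print(rt_then_potatoes.tolist()) # [True, False, False, False]
```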
For example, if we only wanted tweets that started with RT...
# Searching for text that starts with RT
df[df.text.str.contains("^RT", na=False)]

| | sentiment | text | user |
|---|---|---|---|
| 5 | positive | RT @junedarville: ❤️ 𝐃𝐚𝐮𝐩𝐡𝐢𝐧𝐨𝐢𝐬𝐞 𝐏𝐨𝐭𝐚𝐭𝐨𝐞𝐬\n❤️ ... | myphillymedia |
| 7 | neutral | RT @HalflingDancer: B/W and Fighter proceed to... | Presto_Magician |
| 14 | neutral | RT @CoralCityCamera: A manatee trio of the ten... | skippz666 |
| 17 | negative | RT @MaxCCurtis: Imagine Doctor Who: Flux from ... | aquatimelord |
| 18 | neutral | RT @DesignationSix: I would ask Anthony Walker... | Kath2252 |
| ... | ... | ... | ... |
| 70 | positive | RT @OrbitalGardens: Right, it's time to kick o... | Helenintgarden |
| 71 | negative | RT @MarshalPapworth: A little bit of #mondaymo... | HarperAdamsUni |
| 79 | neutral | RT @TestKitchen: Tag yourself, we’re Garlic Ma... | stephen40290427 |
| 90 | positive | RT @green_pills2021: Crunchy, healthy and supe... | LucaMatteoRosso |
| 95 | neutral | RT @kaiken99: He is Marcel, a creature of the ... | the_eismen |
27 rows × 3 columns
If you want to turn off regular expression support for .str.contains, you can pass regex=False. The search text is then matched literally, character for character.
# Literally searching for ^RT
df[df.text.str.contains("^RT", na=False, regex=False)]

| | sentiment | text | user |
|---|---|---|---|
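The table above is empty because no tweet contains the literal characters ^RT. A small sketch on a made-up Series (the sample strings are invented for illustration) shows when regex=False is the option you want, such as searching for text that includes characters like ^ or $.

```python
import pandas as pd

# Made-up examples, including one that really contains "^RT"
text = pd.Series([
    "RT @someone: potatoes",
    "type ^RT to retweet",
    "potatoes cost $5",
    "5 potatoes",
])

# As a regular expression, ^RT means "starts with RT"
as_regex = text.str.contains("^RT")
print(as_regex.tolist())        # [True, False, False, False]

# With regex=False, ^RT means the three characters ^, R, T
as_literal = text.str.contains("^RT", regex=False)
print(as_literal.tolist())      # [False, True, False, False]

# The same goes for other special characters, like the dollar sign
dollar_literal = text.str.contains("$5", regex=False)
print(dollar_literal.tolist())  # [False, False, True, False]
```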