Replacing with .str.replace and .replace
import pandas as pd
df = pd.DataFrame([
    { 'original': 'Potatoes', 'sentiment': -1 },
    { 'original': 'I hate bananas', 'sentiment': -1 },
    { 'original': 'I love potatoes', 'sentiment': 1 },
    { 'original': 'Potatoes are my favorite', 'sentiment': 1 },
    { 'original': 'I ate potatoes', 'sentiment': 0 }
])
Using .replace to replace exact values
When you use .replace, you're matching the exact value (aka the entire cell).
df['edited'] = df.sentiment.replace(-1, "negative")
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | negative |
1 | I hate bananas | -1 | negative |
2 | I love potatoes | 1 | 1 |
3 | Potatoes are my favorite | 1 | 1 |
4 | I ate potatoes | 0 | 0 |
Using .replace to replace multiple exact values
You can also ask .replace to replace multiple exact values by passing it a dictionary.
df['edited'] = df.sentiment.replace({
    -1: "negative",
    0: "neutral",
    1: "positive"
})
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | negative |
1 | I hate bananas | -1 | negative |
2 | I love potatoes | 1 | positive |
3 | Potatoes are my favorite | 1 | positive |
4 | I ate potatoes | 0 | neutral |
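One detail worth knowing about the dictionary form: values that have no key in the dictionary are left as they are. The short sketch below illustrates this with a made-up Series (the name scores and the extra value 2 are assumptions for illustration, not part of the data above).

```python
import pandas as pd

# Hypothetical data: 2 has no entry in the mapping below
scores = pd.Series([-1, 0, 1, 2])

labels = scores.replace({
    -1: "negative",
    0: "neutral",
    1: "positive"
})

# Cells equal to a dictionary key are swapped; everything else passes through
print(labels.tolist())
```

If you would rather have unmapped values flagged than silently kept, check for them before replacing.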
Comparing .replace and .str.replace
Both .replace and .str.replace replace things in your data. The difference is that .replace looks at the entire cell, while .str.replace looks for matches inside of the cell.
Let's see some examples.
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | negative |
1 | I hate bananas | -1 | negative |
2 | I love potatoes | 1 | positive |
3 | Potatoes are my favorite | 1 | positive |
4 | I ate potatoes | 0 | neutral |
df['edited'] = df.original.replace("Potatoes", "Chocolates")
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | Chocolates |
1 | I hate bananas | -1 | I hate bananas |
2 | I love potatoes | 1 | I love potatoes |
3 | Potatoes are my favorite | 1 | Potatoes are my favorite |
4 | I ate potatoes | 0 | I ate potatoes |
.replace will only replace "Potatoes" if it finds an exact match. Notice how "Potatoes are my favorite" is untouched, but the first row changed from "Potatoes" to "Chocolates".
df['edited'] = df.original.str.replace("Potatoes", "Chocolate")
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | Chocolate |
1 | I hate bananas | -1 | I hate bananas |
2 | I love potatoes | 1 | I love potatoes |
3 | Potatoes are my favorite | 1 | Chocolate are my favorite |
4 | I ate potatoes | 0 | I ate potatoes |
.str.replace will replace "Potatoes" even inside of a sentence. Notice how "Potatoes are my favorite" is now "Chocolate are my favorite".
Making .str.replace not case sensitive
By default, both .replace and .str.replace are case sensitive. They need an exact match: uppercase and lowercase are treated differently.
df['edited'] = df.original.str.replace("Potatoes", "Chocolate")
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | Chocolate |
1 | I hate bananas | -1 | I hate bananas |
2 | I love potatoes | 1 | I love potatoes |
3 | Potatoes are my favorite | 1 | Chocolate are my favorite |
4 | I ate potatoes | 0 | I ate potatoes |
Notice how "I love potatoes" is still about potatoes and not chocolate. If you want pandas to ignore case while replacing strings, use case=False.
df['edited'] = df.original.str.replace("Potatoes", "Chocolate", case=False)
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | Chocolate |
1 | I hate bananas | -1 | I hate bananas |
2 | I love potatoes | 1 | I love Chocolate |
3 | Potatoes are my favorite | 1 | Chocolate are my favorite |
4 | I ate potatoes | 0 | I ate Chocolate |
You cannot make .replace case-insensitive unless you use regular expressions.
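For the curious, here is a minimal sketch of what the regular-expression route might look like. It assumes you pass regex=True to .replace and put the inline flag (?i) at the start of the pattern; the Series s is made up for illustration.

```python
import pandas as pd

# Hypothetical data with mixed capitalization
s = pd.Series(["Potatoes", "I love potatoes", "POTATOES"])

# (?i) turns on case-insensitive matching; ^ and $ anchor the pattern to the
# whole cell, so only cells that are exactly "potatoes" (in any case) change
whole_cell = s.replace(r"(?i)^potatoes$", "Chocolates", regex=True)
print(whole_cell.tolist())
```

The anchors matter here: with regex=True and no ^ or $, .replace would also substitute matches inside longer strings, which makes it behave more like .str.replace.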
Removing parts of strings
If you want to remove something from a cell, use .str.replace to replace it with an empty string "".
df['edited'] = df.original.str.replace("I ", "")
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | Potatoes |
1 | I hate bananas | -1 | hate bananas |
2 | I love potatoes | 1 | love potatoes |
3 | Potatoes are my favorite | 1 | Potatoes are my favorite |
4 | I ate potatoes | 0 | ate potatoes |
This is really useful for data cleaning, especially if you don't know regular expressions.
dirty = pd.DataFrame([
    { 'phrase': 'Please call 555-1212 for assistance' },
    { 'phrase': 'Please call 332-3456 for assistance' },
    { 'phrase': 'Please call 123-4333 for assistance' },
])
dirty
| | phrase |
---|---|
0 | Please call 555-1212 for assistance |
1 | Please call 332-3456 for assistance |
2 | Please call 123-4333 for assistance |
dirty.phrase = dirty.phrase.str.replace("Please call ", "").str.replace(" for assistance", "")
dirty
| | phrase |
---|---|
0 | 555-1212 |
1 | 332-3456 |
2 | 123-4333 |
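One caveat when the text you are removing contains punctuation: pandas versions before 2.0 treated the pattern in .str.replace as a regular expression by default, so characters like parentheses could be interpreted as regex syntax. Passing regex=False explicitly asks for a plain-text match regardless of version. The sketch below uses made-up data (the phones Series and the " (home)" suffix are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data: the suffix we want to strip contains parentheses
phones = pd.Series(["555-1212 (home)", "332-3456 (home)"])

# regex=False treats " (home)" as literal text, parentheses included
cleaned = phones.str.replace(" (home)", "", regex=False)
print(cleaned.tolist())
```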
Don't confuse .str.replace and .replace
Even though they're very similar in many situations, sometimes you'll run into errors because you're treating one like the other.
Only replace can use dictionaries
When you use replace, you can replace multiple values at once.
df['edited'] = df.sentiment.replace({
    -1: "negative",
    0: "neutral",
    1: "positive"
})
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | negative |
1 | I hate bananas | -1 | negative |
2 | I love potatoes | 1 | positive |
3 | Potatoes are my favorite | 1 | positive |
4 | I ate potatoes | 0 | neutral |
If you try to do that with .str.replace, you get an error (at least in the pandas version used here): replace() missing 1 required positional argument: 'repl'. This means "You didn't tell me what to replace with what," even though it feels like you tried.
# This will not work
df['edited'] = df.original.str.replace({
    "potatoes": "chocolate",
    "love": "hate"
})
df
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/ipykernel_44881/629204424.py in <module>
----> 1 df['edited'] = df.original.str.replace({
2 "potatoes": "chocolate",
3 "love": "hate"
4 })
5 df
~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/strings/accessor.py in wrapper(self, *args, **kwargs)
114 )
115 raise TypeError(msg)
--> 116 return func(self, *args, **kwargs)
117
118 wrapper.__name__ = func_name
TypeError: replace() missing 1 required positional argument: 'repl'
The easiest fix is to do your replacing one step at a time, feeding the result of each replacement into the next. If the second line read from df.original again, it would overwrite the first replacement.
df['edited'] = df.original.str.replace("potatoes", "chocolate")
df['edited'] = df.edited.str.replace("love", "hate")
df
| | original | sentiment | edited |
---|---|---|---|
0 | Potatoes | -1 | Potatoes |
1 | I hate bananas | -1 | I hate bananas |
2 | I love potatoes | 1 | I hate chocolate |
3 | Potatoes are my favorite | 1 | Potatoes are my favorite |
4 | I ate potatoes | 0 | I ate chocolate |
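If you have more than a couple of replacements, one possible approach is to keep them in a dictionary anyway and loop over it, applying .str.replace once per pair. This is only a sketch: the names replacements and edited are made up for illustration, and the DataFrame is rebuilt here so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame([
    { 'original': 'Potatoes', 'sentiment': -1 },
    { 'original': 'I hate bananas', 'sentiment': -1 },
    { 'original': 'I love potatoes', 'sentiment': 1 },
    { 'original': 'Potatoes are my favorite', 'sentiment': 1 },
    { 'original': 'I ate potatoes', 'sentiment': 0 }
])

# Hypothetical mapping of text to find -> text to put in its place
replacements = {
    "potatoes": "chocolate",
    "love": "hate"
}

# Start from the original column and keep building on the previous result,
# so each replacement is applied on top of the earlier ones
edited = df['original']
for old, new in replacements.items():
    edited = edited.str.replace(old, new, regex=False)

df['edited'] = edited
print(df['edited'].tolist())
```

The replacements run in dictionary order, so if one replacement produces text that a later key matches, the later one will change it again.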