Class notes on replacing values and strings
Here is everything that was written on the board for class on June 19. You’ll definitely need more than this to complete the notebooks, so ask in Slack!
Opening files
Opening tab-separated files
Even if a file doesn’t have the .tsv extension, files that are separated with tabs instead of commas are pretty common. You can pass sep="\t" to read_csv to get things read in correctly.
import pandas as pd

# You represent tab with \t since you can't actually type it
df = pd.read_csv("../filename.csv", sep="\t")
Ignoring bad lines
It isn’t good practice to ignore errors when reading in files, but if you have hundreds of thousands of lines and you’re getting unsolvable errors on a few of them, I give you permission to just skip those.
# Newer versions of pandas use on_bad_lines;
# older versions used error_bad_lines=False instead
df = pd.read_csv("bad-file.csv", on_bad_lines="skip")
Renaming columns on an existing dataframe
If you’re renaming the columns on a dataframe that already exists, you can use df.rename.
# Rename the "ctgy" column to "category"
# and the "addr" column to "address"
df.rename(columns={'ctgy': 'category', 'addr': 'address'}, inplace=True)
We use inplace=True to have the edits saved to the original dataframe.
Filtering columns
Filtering columns based on a string (NaN values error)
When you use .str.contains to filter a dataframe, you often come across an error that yells about NaN values. To show you why, let’s say I’m looking for Sam with .str.contains("Sam"):
- If the column is Samuel, it gives True
- If the column is Ambrosia, it gives False
- If the column is empty, it gives NaN
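Here’s a tiny sketch of that behavior using a made-up Series (not from our dataset), so you can see all three results at once.
import pandas as pd
import numpy as np

names = pd.Series(["Samuel", "Ambrosia", np.nan])
# Gives True, False, NaN
names.str.contains("Sam")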
These NaN values cause problems when you use them to filter a dataframe. Instead, we need to say “hey, if the cell is empty, just give us False”
# Match Sammy, Samuel, Sam
df[df.name.str.contains("Sam", na=False)]
Filtering columns based on a regular expression
Luckily for us, .str.contains automatically understands regular expressions for matching. We don’t need to import or set any other parameters.
# Match all falls off of bicycles:
# - FELL OFF A BIKE
# - HAD A FALL FROM BICYCLE
# - etc
df[df.narr1.str.contains("F[AE]LL.*BI[KC]")]
Filtering columns based on multiple possible values
If you want to check a column to see if it matches one of a set of values, you can do it in an ugly bad terrible long way.
# This way is BAD and takes up too much space
df[(df.month == 'June') | (df.month == 'July') | (df.month == 'August')]
You can also just use .isin, which is one hundred times easier.
summer = ["June", "July", "August"]
df[df.month.isin(summer)]
Replacing values
Replacing entire cells, single values
When you use .replace, it doesn’t care about data types; it does any converting for you. For example, if 1 codes for Male, we can replace all of them with either of the following.
# Save back into the dataframe using inplace=True
df['sex'].replace(1, "Male", inplace=True)
# Save back into the dataframe with =
df['sex'] = df['sex'].replace(1, "Male")
You can use inplace=True if you want to save the result back into the column.
Replacing entire cells, multiple values
You can also send .replace a dictionary to make multiple replacements. The one below turns 1 into Male, 2 into Female, and 0 into Not Recorded.
replacements = {
1: 'Male',
2: 'Female',
0: 'Not Recorded'
}
df['sex'].replace(replacements, inplace=True)
This is really useful when you have a big long list of codes to replace.
Replacing cells with NaN
You CAN’T just replace with "NaN", as that’s a string, and it will cause you problems later. The “real” NaN comes from numpy, the numeric powerhouse hiding inside of pandas.
To replace all of the “Unknown” body parts with NaN, you could use the following code.
import numpy as np
# Save it back into the column with =
df['body_part'] = df['body_part'].replace('Unknown', np.nan)
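If you want to double-check that it worked, .isna() can count the missing values for you (just a quick sanity check, not something we covered in class).
# Count how many NaN values the column has now
df['body_part'].isna().sum()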
Replacing cells based on regular expressions
In class I told you about .str.replace, but hey, we can actually use .replace with regular expressions, too!
People are very bad at spelling my first name, so we could replace it.
# Replace all of the terrible spellings of my name
df['name'].replace("J[O]H?NATH?[AO]N?", "JONATHAN", regex=True, inplace=True)
You need to make sure to specify both regex=True to use regular expressions and inplace=True to save it back into the column.
Replacing numbers with padded numbers
Sometimes you need leading zeroes, and sometimes those leading zeros aren’t there! For example, I might like 0000834 to be my ID number, but in the file it’s 834 and pandas read it in wrong.
# Force id_code column to be a string
df = pd.read_csv("filename.csv", dtype={'id_code': 'str'})
Sometimes it’s wrong in the file, though, and we need to forcibly add something to the left or right of the string.
# Force id_code column to be 10 characters long,
# making sure to pad it with a 0 character
# So 834 becomes 0000000834
df.id_code.str.pad(10, fillchar='0')
If the field is an integer, though, it won’t work! You can force it to be a string with .astype(str). And yes, even though you’re forcing it to be a string, you still need to do .str before .pad.
# Force id_code to be a string
# Then pad it on the left to 10 characters,
# making sure it adds in a 0 character
df.id_code.astype(str).str.pad(10, fillchar='0')
Be sure to save it back into the column when you are done; it won’t happen automatically.
df['id_code'] = df.id_code.astype(str).str.pad(10, fillchar='0')
And if for some reason you want to pad on the right instead of the left, you can give it side='right'.
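For example, a quick sketch with the same id_code column:
# Pad on the right instead, so 834 becomes 8340000000
df['id_code'] = df.id_code.astype(str).str.pad(10, side='right', fillchar='0')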
Extracting from a column
Extracting from a column using regular expressions
You can use .str.extract and regular expressions to grab part of a string. The captured part - whatever is in the ( ) - gets returned.
# If my column is "Orlando, Florida", I can
# pull out the state with a regex:
# "get everything after a comma and space"
df.citystate.str.extract(", (.*)")
It’s especially useful for taking part of a string and making it a new column.
df['name'] = df.greeting.str.extract("My name is (.*)")
df.head()
Very useful for addresses (cities, states, etc) and dates (year, month, etc). Dates have more “proper” ways of doing this that we’ll learn later, but this works fine, too.
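For example, here’s a sketch with a made-up crash_date column full of "03/15/2018"-style strings (the column name is hypothetical).
# Grab the four-digit year and save it as a new column
# expand=False keeps the result a Series instead of a dataframe
df['year'] = df.crash_date.str.extract(r"(\d{4})", expand=False)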
Graphing and sorting
Keep missing values when doing value_counts()
By default, value_counts ignores NaN values. You can force it to pay attention to them, though!
df.sex.value_counts(dropna=False)
Change graph size
You can use figsize to change the plot size. You pass it two numbers - the width and the height. They’re in inches, I think.
df.company.plot(kind='barh', figsize=(4,20))
The (4,20) thing is just like a list, although technically it’s called a tuple. It’s different mainly because it uses parens instead of square brackets. You can probably just use a list if it helps you remember better. Or you can just memorize this and be cool and program like everyone else!
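If the list version helps, it looks like this (matplotlib is happy with either one):
# A list works just as well as a tuple for figsize
df.company.plot(kind='barh', figsize=[4, 20])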
Sorting the bars in a graph
To sort the values in a graph, you need to sort before you graph. There are two ways to sort.
Let’s say we did a value_counts(). The thing to remember is that the thing on the left is the index, and the things on the right are the values.
# Countries and number of bears in them
# (each row of df is one bear sighting)
df.country.value_counts()
------------
China       80
USA         45
Slovakia    22
Italy       12
If we use .sort_values(), it sorts by the right-hand side (the values), smallest first. You can mess with ascending if you’d like, too.
# Ordered by number of bears, fewest first
df.country.value_counts().sort_values()
------------
Italy       12
Slovakia    22
USA         45
China       80
If we use .sort_index(), it sorts by the left-hand side (the index). You can mess with ascending if you’d like, too.
# Ordered by country name
df.country.value_counts().sort_index()
------------
China       80
Italy       12
Slovakia    22
USA         45
Once you’ve sorted here, then you can plot.
# Graph the countries in alphabetical order
df.country.value_counts().sort_index().plot(kind='barh')
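And if you’d rather have the bars ordered by size instead of by name, sort by the values before plotting.
# Graph the countries ordered by number of bears
df.country.value_counts().sort_values().plot(kind='barh')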