Class notes on replacing values and strings
Here is everything that was written on the board for class on June 19. You’ll definitely need more than this to complete the notebooks, so ask in Slack!
Opening files
Opening tab-separated files
Even if a file doesn’t have the .tsv extension, files that are separated with tabs instead of commas are pretty common. You can pass sep="\t" to read_csv to get things read in correctly.
import pandas as pd

# You represent tab with \t since you can't actually type it
df = pd.read_csv("../filename.csv", sep="\t")
Ignoring bad lines
It isn’t good practice to ignore errors when reading in files, but if you have hundreds of thousands of lines and you’re getting unsolvable errors on a few of them, I give you permission to just skip those.
# Newer versions of pandas use on_bad_lines;
# older versions used error_bad_lines=False instead
df = pd.read_csv("bad-file.csv", on_bad_lines="skip")
Renaming columns on an existing dataframe
If you’re renaming the columns on a dataframe that already exists, you can use df.rename.
# Rename the "ctgy" column to "category"
# and the "addr" column to "address"
df.rename(columns={'ctgy': 'category', 'addr': 'address'}, inplace=True)
We use inplace=True to have the edits saved to the original dataframe.
Filtering columns
Filtering columns based on a string (NaN values error)
When you use .str.contains to filter a dataframe, you often come across an error that yells about NaN values. To show you why, let’s say I’m looking for Sam with .str.contains("Sam"):
- If the column is Samuel, it gives True
- If the column is Ambrosia, it gives False
- If the column is empty, it gives NaN
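Here’s a tiny sketch of that behavior using a made-up Series (not from our dataset), so you can see all three results at once.
import pandas as pd
import numpy as np

names = pd.Series(["Samuel", "Ambrosia", np.nan])
# Gives True, False, NaN
names.str.contains("Sam")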
These NaN values cause problems when you use them to filter a dataframe. Instead, we need to say “hey, if the cell is empty, just give us False”
# Match Sammy, Samuel, Sam
df[df.name.str.contains("Sam", na=False)]
Filtering columns based on a regular expression
Luckily for us, .str.contains automatically understands regular expressions for matching. We don’t need to import or set any other parameters.
# Match all falls off of bicycles:
# - FELL OFF A BIKE
# - HAD A FALL FROM BICYCLE
# - etc
df[df.narr1.str.contains("F[AE]LL.*BI[KC]")]
Filtering columns based on multiple possible values
If you want to check a column to see if it matches one of a set of values, you can do it in an ugly bad terrible long way.
# This way is BAD and takes up too much space
df[(df.month == 'June') | (df.month == 'July') | (df.month == 'August')]
You can also just use .isin, which is one hundred times easier.
summer = ["June", "July", "August"]
df[df.month.isin(summer)]
Replacing values
Replacing entire cells, single values
When you use .replace, it doesn’t care about data types; it does any converting for you. For example, if 1 codes for Male, we can replace all of them with either of the following.
# Save back into the dataframe using inplace=True
df['sex'].replace(1, "Male", inplace=True)
# Save back into the dataframe with =
df['sex'] = df['sex'].replace(1, "Male")
You can use inplace=True if you want to save the result back into the column.
Replacing entire cells, multiple values
You can also send .replace a dictionary to make multiple replacements. The one below turns 1 into Male, 2 into Female, and 0 into Not Recorded.
replacements = {
1: 'Male',
2: 'Female',
0: 'Not Recorded'
}
df['sex'].replace(replacements, inplace=True)
This is really useful when you have a big long list of codes to replace.
Replacing cells with NaN
You CAN’T just replace with "NaN", as that’s a string, and it will cause you problems later. The “real” NaN comes from numpy, the numeric powerhouse hiding inside of pandas.
To replace all of the “Unknown” body parts with NaN, you could use the following code.
import numpy as np
# Save it back into the column with =
df['body_part'] = df['body_part'].replace('Unknown', np.nan)
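If you want to double-check that it worked, .isna() can count the missing values for you (just a quick sanity check, not something we covered in class).
# Count how many NaN values the column has now
df['body_part'].isna().sum()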
Replacing cells based on regular expressions
In class I told you about .str.replace, but hey, we can actually use .replace with regular expressions, too!
People are very bad at spelling my first name, so we could replace it.
# Replace all of the terrible spellings of my name
df['name'].replace("J[O]H?NATH?[AO]N?", "JONATHAN", regex=True, inplace=True)
You need to make sure to specify both regex=True to use regular expressions and inplace=True to save it back into the column.
Replacing numbers with padded numbers
Sometimes you need leading zeroes, and sometimes those leading zeros aren’t there! For example, I might like 0000834 to be my ID number, but in the file it’s 834 and pandas read it in wrong.
# Force id_code column to be a string
df = pd.read_csv("filename.csv", dtype={'id_code': 'str'})
Sometimes it’s wrong in the file, though, and we need to forcibly add something to the left or right of the string.
# Force id_code column to be 10 characters long,
# making sure to pad it with a 0 character
# So 834 becomes 0000000834
df.id_code.str.pad(10, fillchar='0')
If the field is an integer, though, it won’t work! You can force it to be a string with .astype(str). And yes, even though you’re forcing it to be a string, you still need to do .str before .pad.
# Force id_code to be a string
# Then pad it on the left to 10 characters,
# making sure it adds in a 0 character
df.id_code.astype(str).str.pad(10, fillchar='0')
Be sure to save it back into the column when you are done; it won’t happen automatically.
df['id_code'] = df.id_code.astype(str).str.pad(10, fillchar='0')
And if for some reason you want to pad on the right instead of the left, you can give it side='right'.
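For example, a quick sketch with the same id_code column:
# Pad on the right instead, so 834 becomes 8340000000
df['id_code'] = df.id_code.astype(str).str.pad(10, side='right', fillchar='0')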
Extracting from a column
Extracting from a column using regular expressions
You can use .str.extract and regular expressions to grab part of a string. The captured part - whatever is in the ( ) - gets returned.
# If my column is "Orlando, Florida", I can
# pull out the state with a regex:
# "get everything after a comma and space"
df.citystate.str.extract(", (.*)")
It’s especially useful for taking part of a string and making it a new column.
df['name'] = df.greeting.str.extract("My name is (.*)")
df.head()
Very useful for addresses (cities, states, etc) and dates (year, month, etc). Dates have more “proper” ways of doing this that we’ll learn later, but this works fine, too.
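For example, here’s a sketch with a made-up crash_date column full of "03/15/2018"-style strings (the column name is hypothetical).
# Grab the four-digit year and save it as a new column
# expand=False keeps the result a Series instead of a dataframe
df['year'] = df.crash_date.str.extract(r"(\d{4})", expand=False)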
Graphing and sorting
Keep missing values when doing value_counts()
By default, value_counts ignores NaN values. You can force it to pay attention to them, though!
df.sex.value_counts(dropna=False)
Change graph size
You can use figsize to change the plot size. You pass it two numbers - the width and the height. They’re in inches, I think.
df.company.plot(kind='barh', figsize=(4,20))
The (4,20) thing is just like a list, although technically it’s called a tuple. It’s different mainly because it uses parens instead of square brackets. You can probably just use a list if it helps you remember better. Or you can just memorize this and be cool and program like everyone else!
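If the list version helps, it looks like this (matplotlib is happy with either one):
# A list works just as well as a tuple for figsize
df.company.plot(kind='barh', figsize=[4, 20])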
Sorting the bars in a graph
To sort the values in a graph, you need to sort before you graph. There are two ways to sort.
Let’s say we did a value_counts(). The thing to remember is that the thing on the left is the index, and the things on the right are the values.
# Countries and number of bears in them
# (each row of df is one bear sighting)
df.country.value_counts()
------------
China       80
USA         45
Slovakia    22
Italy       12
If we use .sort_values(), it sorts by the right-hand side (the values), smallest first. You can mess with ascending if you’d like, too.
# Ordered by number of bears, fewest first
df.country.value_counts().sort_values()
------------
Italy       12
Slovakia    22
USA         45
China       80
If we use .sort_index(), it sorts by the left-hand side (the index). You can mess with ascending if you’d like, too.
# Ordered by country name
df.country.value_counts().sort_index()
------------
China       80
Italy       12
Slovakia    22
USA         45
Once you’ve sorted here, then you can plot.
# Graph the countries in alphabetical order
df.country.value_counts().sort_index().plot(kind='barh')
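And if you’d rather have the bars ordered by size instead of by name, sort by the values before plotting.
# Graph the countries ordered by number of bears
df.country.value_counts().sort_values().plot(kind='barh')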