Text and opening files and formatting strings
Opening files
Opening Excel files
df = pd.read_excel("../filename.xlsx")
Opening Excel files to specific sheets
We didn’t talk about this in class, but sometimes you have an Excel file with multiple sheets. If you don’t want the “default” sheet, you’ll need to specify the sheet name with sheet_name.
df = pd.read_excel("filename.xlsx", sheet_name="Summary Data")
Ignoring the first few rows
Sometimes the first few rows of a CSV or Excel file aren’t data - they’re notes or titles or even the data dictionary! To get around this, use skiprows= when reading in your file.
df = pd.read_csv("something.csv", skiprows=10)
You can skip rows in Excel files, too!
df = pd.read_excel("something.xlsx", skiprows=10)
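To see skiprows in action without hunting down a file, here’s a small sketch that builds a fake CSV in memory - the notes, names, and ages are all made up for illustration:

```python
import io

import pandas as pd

# A fake CSV where the first two lines are notes, not data
raw = io.StringIO(
    "Downloaded from the data portal\n"
    "Note: ages are self-reported\n"
    "name,age\n"
    "Anna,34\n"
    "Ben,29\n"
)

# skiprows=2 jumps past the notes so the real header line is used
df = pd.read_csv(raw, skiprows=2)
print(df)
```

Without skiprows=2, pandas would treat “Downloaded from the data portal” as the header and everything would be one garbled column.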
Changing how pandas displays things
You can set a lot of different display options in pandas - here’s a list of them, but some common commands are below.
Changing the number of rows displayed
Pandas doesn’t display every row of your dataframe when you do df - it usually cuts off some in the middle with a .... Instead of a ..., sometimes you want to see more of your data! You can use display.max_rows to change the number of rows pandas displays.
# Show 100 rows at a time
pd.set_option("display.max_rows", 100)
Changing the number of columns displayed
Similar to the above, pandas won’t show every single column if it thinks you have too many. If you see a ... in your column listings, it’s because pandas has hidden some! Fix it with the code below.
pd.set_option("display.max_columns", 50)
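If you want to check what a setting currently is, or put one back to normal, pandas also has get_option and reset_option. A quick sketch:

```python
import pandas as pd

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 50)

# get_option tells you the current value of a setting
print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_columns"))

# reset_option puts a single setting back to its default
pd.reset_option("display.max_rows")
```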
Changing the way floats are formatted
I hate scientific notation, and maybe you do, too! This prevents .describe() from showing you ugly e’s all over the place.
# Show floats with two decimal points after
pd.set_option("display.float_format", "{:.2f}".format)
# Show floats with zero decimal points, and a comma separator
pd.set_option("display.float_format", "{:,.0f}".format)
Beware that this will make your years ugly, but that’s life I guess!
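Here’s a small sketch of the before-and-after - the numbers are invented just to show the formatting:

```python
import pandas as pd

s = pd.Series([1234.5678, 9876543.21])

# Two decimal places with a comma separator
pd.set_option("display.float_format", "{:,.2f}".format)
print(s)

# Put the default formatting back when you're done
pd.reset_option("display.float_format")
```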
Only selecting a few columns
While we used display.max_columns to increase the number of columns displayed at a time, you can also have pandas only select a subset of columns.
# Only show 'age' 'height' and 'name' columns
# Even if there are hundreds more!
df[['age', 'height', 'name']]
You can even save this into a new dataframe!
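A minimal sketch, with made-up columns, of selecting a few columns and saving the result into a new dataframe:

```python
import pandas as pd

# A made-up dataframe with more columns than we want to look at
df = pd.DataFrame({
    "name": ["Anna", "Ben"],
    "age": [34, 29],
    "height": [160, 180],
    "shoe_size": [38, 44],
})

# Double brackets: the inner pair is a *list* of column names
smaller = df[["age", "height", "name"]]
print(smaller)
```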
Searching for strings
Dealing with NaN/missing data when using .str.contains
Let’s say we were searching for middle names that begin with Steph.
df['middle_name'].str.contains("^Steph")
It would be True for Stephan and False for Hanami, but what if someone didn’t have a middle name? In this case, it gives back NaN.
This causes a big problem when selecting inside of dataframes. Basically, pandas needs you to say either YES we want missing data or NO we don’t want missing data. Any time you get an error like cannot index with vector containing NA / NaN values, you need to fix this.
Luckily when you use .str.contains, you can say “here is what I want you to do with missing data.” Usually you want to exclude missing data.
# Count missing data as "NO it doesn't start with Steph"
df['middle_name'].str.contains("^Steph", na=False)
Sometimes you want to include rows with missing data, though, so you’ll use na=True.
# Count missing data as "YES it starts with Steph"
df['middle_name'].str.contains("^Steph", na=True)
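Here’s the behavior on a tiny made-up Series, using the same Steph example:

```python
import numpy as np
import pandas as pd

middle_name = pd.Series(["Stephan", "Hanami", np.nan])

# na=False: the missing name counts as "NO, it doesn't start with Steph"
print(middle_name.str.contains("^Steph", na=False).tolist())

# na=True: the missing name counts as "YES"
print(middle_name.str.contains("^Steph", na=True).tolist())
```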
Converting things to ints and floats
If you try to use .astype(int) to convert something to an integer, it will not work if there are NaN values in the column.
# Error if missing data!
df.age.astype(int)
In order to turn a column with missing data into a number, you need to do .astype(float).
# Works okay if missing data!
df.age.astype(float)
Usually a number is a number is a number, so floats are just as good as integers. If not, there is probably another way to work around your problem.
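A quick sketch of both conversions on a made-up column with a missing value:

```python
import numpy as np
import pandas as pd

age = pd.Series([23, 45, np.nan])

# This raises an error - NaN has no integer representation
try:
    age.astype(int)
except ValueError as err:
    print("couldn't convert:", err)

# This keeps the NaN and converts the rest just fine
print(age.astype(float).tolist())
```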
Histograms
Histograms vs. value counts
You never do a histogram of a value_counts(). First off, value_counts is for categorical columns and hist is for numeric data. Also, hist does the counting for you!
If you ever find yourself typing .value_counts().hist() you probably just want .hist(), or maybe .value_counts().plot().
Specifying the number of bins
You can use bins to specify the number of groups that your data is put into.
df.age.hist(bins=30)
Specifying specific bin ranges
When you use bins, you can also give it a list of breaks to use when dividing up your data. The only gotcha is that you have to give it an upper bound, otherwise it will leave off any data above your last break.
df.age.hist(bins=[0, 12, 18, 24, 35, 100, 200])
A sneaky way to do this automatically would be to use df.age.max() as the final break.
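To see which bin each value would land in without drawing a plot, pd.cut accepts the same kind of break list - a sketch with made-up ages:

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 15, 20, 30, 50, 110]})

# Use the column max as the final break so nothing gets left off
breaks = [0, 12, 18, 24, 35, 100, df.age.max()]

# pd.cut assigns each value to a bin - the same grouping
# df.age.hist(bins=breaks) would draw
counts = pd.cut(df.age, bins=breaks).value_counts().sort_index()
print(counts)
```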
Rolling averages
A rolling average takes an average across a window of time. Instead of just looking at the value in 1995, you might look at values from 1990-2000 to even things out a bit. For 1996 you’d look at 1991-2001, etc.
In pandas you can use df.rolling to compute all sorts of aggregates over time - means, medians, sums, cumulative sums, etc.
df.rolling(10, on='year', min_periods=5).mean()
The code above computes a rolling mean for the dataframe df with a 10-measurement window, using the year column to order everything. If it doesn’t find at least 5 measurements in a window, it just gives NaN.
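Here’s the windowing behavior on a tiny made-up Series (no on='year' needed, since a plain Series is already in order):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# 3-measurement window, but produce a result once 2 measurements exist
rolled = s.rolling(3, min_periods=2).mean()
print(rolled.tolist())
```

The first result is NaN because only one measurement is available there, which is below min_periods.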
The result of df.rolling is just a normal dataframe, so a bare .mean().plot() won’t give you what you want - you’ll need to actually call .plot and select the columns you’re interested in with x='..', y='..'.