Text and opening files and formatting strings
Opening files
Opening Excel files
df = pd.read_excel("../filename.xlsx")
Opening Excel files to specific sheets
We didn’t talk about this in class, but sometimes you have an Excel file with multiple sheets. If you don’t want the “default” sheet, you’ll need to specify the sheet name with sheet_name.
df = pd.read_excel("filename.xlsx", sheet_name="Summary Data")
Ignoring the first few rows
Sometimes the first few rows of a CSV or Excel file aren’t data - they’re notes or titles or even the data dictionary! To get around this, use skiprows= when reading in your file.
df = pd.read_csv("something.csv", skiprows=10)
You can skip rows in Excel files, too!
df = pd.read_excel("something.xlsx", skiprows=10)
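To see skiprows in action without hunting down a file, here’s a small sketch that builds a fake CSV in memory - the notes, names, and ages are all made up for illustration:

```python
import io

import pandas as pd

# A fake CSV where the first two lines are notes, not data
raw = io.StringIO(
    "Downloaded from the data portal\n"
    "Note: ages are self-reported\n"
    "name,age\n"
    "Anna,34\n"
    "Ben,29\n"
)

# skiprows=2 jumps past the notes so the real header line is used
df = pd.read_csv(raw, skiprows=2)
print(df)
```

Without skiprows=2, pandas would treat “Downloaded from the data portal” as the header and everything would be one garbled column.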
Changing how pandas displays things
You can set a lot of different display options in pandas - here’s a list of them, but some common commands are below.
Changing the number of rows displayed
Pandas doesn’t display every row of your dataframe when you do df - it usually cuts off some in the middle with a .... Instead of a ..., sometimes you want to see more of your data! You can use display.max_rows to change the number of rows pandas displays.
# Show 100 rows at a time
pd.set_option("display.max_rows", 100)
Changing the number of columns displayed
Similar to the above, pandas won’t show every single column if it thinks you have too many. If you see a ... in your column listings, it’s because pandas has hidden some! Fix it with the code below.
pd.set_option("display.max_columns", 50)
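If you want to check what a setting currently is, or put one back to normal, pandas also has get_option and reset_option. A quick sketch:

```python
import pandas as pd

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 50)

# get_option tells you the current value of a setting
print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_columns"))

# reset_option puts a single setting back to its default
pd.reset_option("display.max_rows")
```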
Changing the way floats are formatted
I hate scientific notation, and maybe you do, too! This prevents .describe() from showing you ugly e’s all over the place.
# Show floats with two decimal points after
pd.set_option("display.float_format", "{:.2f}".format)
# Show floats with zero decimal points, and a comma separator
pd.set_option("display.float_format", "{:,.0f}".format)
Beware that this will make your years ugly, but that’s life I guess!
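Here’s a small sketch of the before-and-after - the numbers are invented just to show the formatting:

```python
import pandas as pd

s = pd.Series([1234.5678, 9876543.21])

# Two decimal places with a comma separator
pd.set_option("display.float_format", "{:,.2f}".format)
print(s)

# Put the default formatting back when you're done
pd.reset_option("display.float_format")
```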
Only selecting a few columns
While we used display.max_columns to increase the number of columns displayed at a time, you can also have pandas only select a subset of columns.
# Only show 'age' 'height' and 'name' columns
# Even if there are hundreds more!
df[['age', 'height', 'name']]
You can even save this into a new dataframe!
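A minimal sketch, with made-up columns, of selecting a few columns and saving the result into a new dataframe:

```python
import pandas as pd

# A made-up dataframe with more columns than we want to look at
df = pd.DataFrame({
    "name": ["Anna", "Ben"],
    "age": [34, 29],
    "height": [160, 180],
    "shoe_size": [38, 44],
})

# Double brackets: the inner pair is a *list* of column names
smaller = df[["age", "height", "name"]]
print(smaller)
```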
Searching for strings
Dealing with NaN/missing data when using .str.contains
Let’s say we were searching for middle names that begin with Steph.
df['middle_name'].str.contains("^Steph")
It would be True for Stephan and False for Hanami, but what if someone didn’t have a middle name? In this case, it gives back NaN.
This causes a big problem when selecting inside of dataframes. Basically, pandas needs you to say either YES we want missing data or NO we don’t want missing data. Any time you get an error like cannot index with vector containing NA / NaN values, you need to fix this.
Luckily when you use .str.contains, you can say “here is what I want you to do with missing data.” Usually you want to exclude missing data.
# Count missing data as "NO it doesn't start with Steph"
df['middle_name'].str.contains("^Steph", na=False)
Sometimes you want to include rows with missing data, though, so you’ll use na=True.
# Count missing data as "YES it starts with Steph"
df['middle_name'].str.contains("^Steph", na=True)
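Here’s the behavior on a tiny made-up Series, using the same Steph example:

```python
import numpy as np
import pandas as pd

middle_name = pd.Series(["Stephan", "Hanami", np.nan])

# na=False: the missing name counts as "NO, it doesn't start with Steph"
print(middle_name.str.contains("^Steph", na=False).tolist())

# na=True: the missing name counts as "YES"
print(middle_name.str.contains("^Steph", na=True).tolist())
```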
Converting things to ints and floats
If you try to use .astype(int) to convert something to an integer, it will not work if there are NaN values in the column.
# Error if missing data!
df.age.astype(int)
In order to turn a column with missing data into a number, you need to do .astype(float).
# Works okay if missing data!
df.age.astype(float)
Usually a number is a number is a number, so floats are just as good as integers. If not, there is probably another way to work around your problem.
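A quick sketch of both conversions on a made-up column with a missing value:

```python
import numpy as np
import pandas as pd

age = pd.Series([23, 45, np.nan])

# This raises an error - NaN has no integer representation
try:
    age.astype(int)
except ValueError as err:
    print("couldn't convert:", err)

# This keeps the NaN and converts the rest just fine
print(age.astype(float).tolist())
```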
Histograms
Histograms vs. value counts
You never do a histogram of a value_counts(). First off, value_counts is for categorical columns and hist is for numeric data. Also, hist does the counting for you!
If you ever find yourself typing .value_counts().hist() you probably just want .hist(), or maybe .value_counts().plot().
Specifying the number of bins
You can use bins to specify the number of groups that your data is put into.
df.age.hist(bins=30)
Specifying specific bin ranges
When you use bins, you can also give it a list of breaks to use when dividing up your data. The only gotcha is that you have to give it an upper bound, otherwise it will leave off any data above your last break.
df.age.hist(bins=[0, 12, 18, 24, 35, 100, 200])
A sneaky way to do this automatically would be to use df.age.max() as the final break.
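To see which bin each value would land in without drawing a plot, pd.cut accepts the same kind of break list - a sketch with made-up ages:

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 15, 20, 30, 50, 110]})

# Use the column max as the final break so nothing gets left off
breaks = [0, 12, 18, 24, 35, 100, df.age.max()]

# pd.cut assigns each value to a bin - the same grouping
# df.age.hist(bins=breaks) would draw
counts = pd.cut(df.age, bins=breaks).value_counts().sort_index()
print(counts)
```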
Rolling averages
A rolling average takes an average across a window of time. Instead of just looking at the value in 1995, you might look at values from 1990-2000 to even things out a bit. For 1996 you’d look at 1991-2001, etc.
In pandas you can use df.rolling to compute all sorts of aggregates over time - means, medians, sums, cumulative sums, etc.
df.rolling(10, on='year', min_periods=5).mean()
The code above computes a rolling mean for the dataframe df with a 10-measurement window, using the year column to order everything. If it doesn’t find at least 5 measurements in a window, it just gives NaN.
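Here’s the windowing behavior on a tiny made-up Series (no on='year' needed, since a plain Series is already in order):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# 3-measurement window, but produce a result once 2 measurements exist
rolled = s.rolling(3, min_periods=2).mean()
print(rolled.tolist())
```

The first result is NaN because only one measurement is available there, which is below min_periods.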
The result of df.rolling is just a normal dataframe, so a bare .mean().plot() won’t give you what you want - you’ll need to actually call .plot and select the columns you’re interested in with x='..', y='..'.