Debugging astype errors in pandas¶
In this notebook, we'll look at a solution to the pandas error message ValueError: invalid literal for int() with base 10
The problem¶
When cleaning up a large CSV or Excel file using pandas, you often need to convert columns from strings to numbers. The most common approach is using .astype(int) or .astype(float).
Sometimes the data isn't perfect, though. Take a look at the dataset below - two of our amount_in_kg rows aren't actually numbers.
import pandas as pd
df = pd.read_csv("bad-data.csv")
df
| ingredient | amount_in_kg | |
|---|---|---|
| 0 | carrots | 1 |
| 1 | rice | 1.. |
| 2 | okra | NaN |
| 3 | onions | 3kg |
| 4 | red beans | 2 |
If we try to run .astype with float int, we'll end up with the error ValueError: invalid literal for int() with base 10 or ValueError: could not convert string to float: '1..'. This is because both the 1.. and 3kg values are not integers and need to be cleaned up.
df.amount_in_kg.astype(int)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_45372/2174048608.py in <cell line: 1>() ----> 1 df.amount_in_kg.astype(int) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors) 5910 else: 5911 # else, only a single dtype is given -> 5912 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) 5913 return self._constructor(new_data).__finalize__(self, method="astype") 5914 ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors) 417 418 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T: --> 419 return self.apply("astype", dtype=dtype, copy=copy, errors=errors) 420 421 def convert( ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs) 302 applied = b.apply(f, **kwargs) 303 else: --> 304 applied = getattr(b, f)(**kwargs) 305 except (TypeError, NotImplementedError): 306 if not ignore_failures: ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors) 578 values = self.values 579 --> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) 581 582 new_values = maybe_coerce_values(new_values) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors) 1290 1291 try: -> 1292 new_values = astype_array(values, dtype, copy=copy) 1293 except (ValueError, TypeError): 1294 # e.g. astype_nansafe can fail on object-dtype of strings ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy) 1235 1236 else: -> 1237 values = astype_nansafe(values, dtype, copy=copy) 1238 1239 # in pandas we don't store numpy str dtypes, so convert to object ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna) 1152 # work around NumPy brokenness, #1987 1153 if np.issubdtype(dtype.type, np.integer): -> 1154 return lib.astype_intsafe(arr, dtype) 1155 1156 # if we have a datetime/timedelta array of objects ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe() ValueError: invalid literal for int() with base 10: '1..'
df.amount_in_kg.astype(float)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_45372/2789198030.py in <cell line: 1>() ----> 1 df.amount_in_kg.astype(float) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors) 5910 else: 5911 # else, only a single dtype is given -> 5912 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) 5913 return self._constructor(new_data).__finalize__(self, method="astype") 5914 ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors) 417 418 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T: --> 419 return self.apply("astype", dtype=dtype, copy=copy, errors=errors) 420 421 def convert( ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs) 302 applied = b.apply(f, **kwargs) 303 else: --> 304 applied = getattr(b, f)(**kwargs) 305 except (TypeError, NotImplementedError): 306 if not ignore_failures: ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors) 578 values = self.values 579 --> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) 581 582 new_values = maybe_coerce_values(new_values) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors) 1290 1291 try: -> 1292 new_values = astype_array(values, dtype, copy=copy) 1293 except (ValueError, TypeError): 1294 # e.g. astype_nansafe can fail on object-dtype of strings ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy) 1235 1236 else: -> 1237 values = astype_nansafe(values, dtype, copy=copy) 1238 1239 # in pandas we don't store numpy str dtypes, so convert to object ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna) 1179 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype): 1180 # Explicit copy, or required since NumPy can't view from / to object. -> 1181 return arr.astype(dtype, copy=True) 1182 1183 return arr.astype(dtype, copy=copy) ValueError: could not convert string to float: '1..'
It's easy enough for us to see and fix the issue with this small data, but what if we have a larger dataset? We'll need to just select the rows with bad data for further inspection.
Finding the bad data¶
Instead of using .astype(int), we're going to convert our column with pd.to_numeric. It's a similar command with a very useful difference.
When .astype(int) encounters an error, it has two options:
- Raise an error
- Ignore the error and return the original data
pd.to_numeric has three options:
- Raise an error
- Ignore the error and return the original data
- Convert the "bad" data to
NaN
We'll take advantage of that last option! To find the bad data, we can use errors='coerce' to turn error values into NaN...
# The first and last rows were converted
pd.to_numeric(df.amount_in_kg, errors='coerce')
0 1.0 1 NaN 2 NaN 3 NaN 4 2.0 Name: amount_in_kg, dtype: float64
...and then use those NaN values to filter for all the rows with an unsuccessful conversion.
df[pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]
| ingredient | amount_in_kg | |
|---|---|---|
| 1 | rice | 1.. |
| 2 | okra | NaN |
| 3 | onions | 3kg |
In our case, okra originally had missing data for amount_in_kg, it isn't the result of an error in converting to a number. If we don't want it to show up, we can also filter for places where the original value was not missing.
df[df.amount_in_kg.notna() & pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]
| ingredient | amount_in_kg | |
|---|---|---|
| 1 | rice | 1.. |
| 3 | onions | 3kg |
Fixing the bad data¶
Now that we can see the error rows, we know exactly what values we need to replace in order to fix up our analysis!
df.amount_in_kg = df.amount_in_kg \
.str.replace("..", "", regex=False) \
.str.replace("kg", "") \
.astype(float)
df
| ingredient | amount_in_kg | |
|---|---|---|
| 0 | carrots | 1.0 |
| 1 | rice | 1.0 |
| 2 | okra | NaN |
| 3 | onions | 3.0 |
| 4 | red beans | 2.0 |