Debugging astype errors in pandas¶
In this notebook, we'll look at a solution to the pandas error message ValueError: invalid literal for int() with base 10
The problem¶
When cleaning up a large CSV or Excel file using pandas, you often need to convert columns from strings to numbers. The most common approach is using .astype(int)
or .astype(float)
.
Sometimes the data isn't perfect, though. Take a look at the dataset below - two of our amount_in_kg
rows aren't actually numbers.
import pandas as pd
df = pd.read_csv("bad-data.csv")
df
ingredient | amount_in_kg | |
---|---|---|
0 | carrots | 1 |
1 | rice | 1.. |
2 | okra | NaN |
3 | onions | 3kg |
4 | red beans | 2 |
If we try to run .astype
with float
int
, we'll end up with the error ValueError: invalid literal for int() with base 10 or ValueError: could not convert string to float: '1..'. This is because both the 1..
and 3kg
values are not integers and need to be cleaned up.
df.amount_in_kg.astype(int)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_45372/2174048608.py in <cell line: 1>() ----> 1 df.amount_in_kg.astype(int) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors) 5910 else: 5911 # else, only a single dtype is given -> 5912 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) 5913 return self._constructor(new_data).__finalize__(self, method="astype") 5914 ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors) 417 418 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T: --> 419 return self.apply("astype", dtype=dtype, copy=copy, errors=errors) 420 421 def convert( ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs) 302 applied = b.apply(f, **kwargs) 303 else: --> 304 applied = getattr(b, f)(**kwargs) 305 except (TypeError, NotImplementedError): 306 if not ignore_failures: ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors) 578 values = self.values 579 --> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) 581 582 new_values = maybe_coerce_values(new_values) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors) 1290 1291 try: -> 1292 new_values = astype_array(values, dtype, copy=copy) 1293 except (ValueError, TypeError): 1294 # e.g. astype_nansafe can fail on object-dtype of strings ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy) 1235 1236 else: -> 1237 values = astype_nansafe(values, dtype, copy=copy) 1238 1239 # in pandas we don't store numpy str dtypes, so convert to object ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna) 1152 # work around NumPy brokenness, #1987 1153 if np.issubdtype(dtype.type, np.integer): -> 1154 return lib.astype_intsafe(arr, dtype) 1155 1156 # if we have a datetime/timedelta array of objects ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe() ValueError: invalid literal for int() with base 10: '1..'
df.amount_in_kg.astype(float)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_45372/2789198030.py in <cell line: 1>() ----> 1 df.amount_in_kg.astype(float) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors) 5910 else: 5911 # else, only a single dtype is given -> 5912 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) 5913 return self._constructor(new_data).__finalize__(self, method="astype") 5914 ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors) 417 418 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T: --> 419 return self.apply("astype", dtype=dtype, copy=copy, errors=errors) 420 421 def convert( ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs) 302 applied = b.apply(f, **kwargs) 303 else: --> 304 applied = getattr(b, f)(**kwargs) 305 except (TypeError, NotImplementedError): 306 if not ignore_failures: ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors) 578 values = self.values 579 --> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) 581 582 new_values = maybe_coerce_values(new_values) ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors) 1290 1291 try: -> 1292 new_values = astype_array(values, dtype, copy=copy) 1293 except (ValueError, TypeError): 1294 # e.g. astype_nansafe can fail on object-dtype of strings ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy) 1235 1236 else: -> 1237 values = astype_nansafe(values, dtype, copy=copy) 1238 1239 # in pandas we don't store numpy str dtypes, so convert to object ~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna) 1179 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype): 1180 # Explicit copy, or required since NumPy can't view from / to object. -> 1181 return arr.astype(dtype, copy=True) 1182 1183 return arr.astype(dtype, copy=copy) ValueError: could not convert string to float: '1..'
It's easy enough for us to see and fix the issue with this small data, but what if we have a larger dataset? We'll need to just select the rows with bad data for further inspection.
Finding the bad data¶
Instead of using .astype(int)
, we're going to convert our column with pd.to_numeric
. It's a similar command with a very useful difference.
When .astype(int)
encounters an error, it has two options:
- Raise an error
- Ignore the error and return the original data
pd.to_numeric
has three options:
- Raise an error
- Ignore the error and return the original data
- Convert the "bad" data to
NaN
We'll take advantage of that last option! To find the bad data, we can use errors='coerce'
to turn error values into NaN
...
# The first and last rows were converted
pd.to_numeric(df.amount_in_kg, errors='coerce')
0 1.0 1 NaN 2 NaN 3 NaN 4 2.0 Name: amount_in_kg, dtype: float64
...and then use those NaN
values to filter for all the rows with an unsuccessful conversion.
df[pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]
ingredient | amount_in_kg | |
---|---|---|
1 | rice | 1.. |
2 | okra | NaN |
3 | onions | 3kg |
In our case, okra originally had missing data for amount_in_kg
, it isn't the result of an error in converting to a number. If we don't want it to show up, we can also filter for places where the original value was not missing.
df[df.amount_in_kg.notna() & pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]
ingredient | amount_in_kg | |
---|---|---|
1 | rice | 1.. |
3 | onions | 3kg |
Fixing the bad data¶
Now that we can see the error rows, we know exactly what values we need to replace in order to fix up our analysis!
df.amount_in_kg = df.amount_in_kg \
.str.replace("..", "", regex=False) \
.str.replace("kg", "") \
.astype(float)
df
ingredient | amount_in_kg | |
---|---|---|
0 | carrots | 1.0 |
1 | rice | 1.0 |
2 | okra | NaN |
3 | onions | 3.0 |
4 | red beans | 2.0 |