Debugging astype errors in pandas¶

In this notebook, we'll look at a solution to the pandas error message ValueError: invalid literal for int() with base 10

The problem¶

When cleaning up a large CSV or Excel file using pandas, you often need to convert columns from strings to numbers. The most common approach is using .astype(int) or .astype(float).

Sometimes the data isn't perfect, though. Take a look at the dataset below - two of our amount_in_kg rows aren't actually numbers.

In [14]:

            
                Copied!
                
import pandas as pd

df = pd.read_csv("bad-data.csv")
df
import pandas as pd

df = pd.read_csv("bad-data.csv")
df

Out[14]:

	ingredient	amount_in_kg
0	carrots	1
1	rice	1..
2	okra	NaN
3	onions	3kg
4	red beans	2

If we try to run .astype with float int, we'll end up with the error ValueError: invalid literal for int() with base 10 or ValueError: could not convert string to float: '1..'. This is because both the 1.. and 3kg values are not integers and need to be cleaned up.

In [33]:

            
                Copied!
                
df.amount_in_kg.astype(int)
df.amount_in_kg.astype(int)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_45372/2174048608.py in <cell line: 1>()
----> 1 df.amount_in_kg.astype(int)

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5910         else:
   5911             # else, only a single dtype is given
-> 5912             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5913             return self._constructor(new_data).__finalize__(self, method="astype")
   5914 

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    417 
    418     def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 419         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    420 
    421     def convert(

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    302                     applied = b.apply(f, **kwargs)
    303                 else:
--> 304                     applied = getattr(b, f)(**kwargs)
    305             except (TypeError, NotImplementedError):
    306                 if not ignore_failures:

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    578         values = self.values
    579 
--> 580         new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    581 
    582         new_values = maybe_coerce_values(new_values)

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors)
   1290 
   1291     try:
-> 1292         new_values = astype_array(values, dtype, copy=copy)
   1293     except (ValueError, TypeError):
   1294         # e.g. astype_nansafe can fail on object-dtype of strings

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy)
   1235 
   1236     else:
-> 1237         values = astype_nansafe(values, dtype, copy=copy)
   1238 
   1239     # in pandas we don't store numpy str dtypes, so convert to object

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1152         # work around NumPy brokenness, #1987
   1153         if np.issubdtype(dtype.type, np.integer):
-> 1154             return lib.astype_intsafe(arr, dtype)
   1155 
   1156         # if we have a datetime/timedelta array of objects

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '1..'

In [34]:

            
                Copied!
                
df.amount_in_kg.astype(float)
df.amount_in_kg.astype(float)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_45372/2789198030.py in <cell line: 1>()
----> 1 df.amount_in_kg.astype(float)

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5910         else:
   5911             # else, only a single dtype is given
-> 5912             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5913             return self._constructor(new_data).__finalize__(self, method="astype")
   5914 

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    417 
    418     def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 419         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    420 
    421     def convert(

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    302                     applied = b.apply(f, **kwargs)
    303                 else:
--> 304                     applied = getattr(b, f)(**kwargs)
    305             except (TypeError, NotImplementedError):
    306                 if not ignore_failures:

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    578         values = self.values
    579 
--> 580         new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    581 
    582         new_values = maybe_coerce_values(new_values)

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors)
   1290 
   1291     try:
-> 1292         new_values = astype_array(values, dtype, copy=copy)
   1293     except (ValueError, TypeError):
   1294         # e.g. astype_nansafe can fail on object-dtype of strings

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy)
   1235 
   1236     else:
-> 1237         values = astype_nansafe(values, dtype, copy=copy)
   1238 
   1239     # in pandas we don't store numpy str dtypes, so convert to object

~/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1179     if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
   1180         # Explicit copy, or required since NumPy can't view from / to object.
-> 1181         return arr.astype(dtype, copy=True)
   1182 
   1183     return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: '1..'

It's easy enough for us to see and fix the issue with this small data, but what if we have a larger dataset? We'll need to just select the rows with bad data for further inspection.

Finding the bad data¶

Instead of using .astype(int), we're going to convert our column with pd.to_numeric. It's a similar command with a very useful difference.

When .astype(int) encounters an error, it has two options:

Raise an error
Ignore the error and return the original data

pd.to_numeric has three options:

Raise an error
Ignore the error and return the original data
Convert the "bad" data to NaN

We'll take advantage of that last option! To find the bad data, we can use errors='coerce' to turn error values into NaN...

In [37]:

            
                Copied!
                
# The first and last rows were converted
pd.to_numeric(df.amount_in_kg, errors='coerce')
# The first and last rows were converted
pd.to_numeric(df.amount_in_kg, errors='coerce')

Out[37]:

0    1.0
1    NaN
2    NaN
3    NaN
4    2.0
Name: amount_in_kg, dtype: float64

...and then use those NaN values to filter for all the rows with an unsuccessful conversion.

In [38]:

            
                Copied!
                
df[pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]
df[pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]

Out[38]:

	ingredient	amount_in_kg
1	rice	1..
2	okra	NaN
3	onions	3kg

In our case, okra originally had missing data for amount_in_kg, it isn't the result of an error in converting to a number. If we don't want it to show up, we can also filter for places where the original value was not missing.

In [39]:

            
                Copied!
                
df[df.amount_in_kg.notna() & pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]
df[df.amount_in_kg.notna() & pd.to_numeric(df.amount_in_kg, errors='coerce').isna()]

Out[39]:

	ingredient	amount_in_kg
1	rice	1..
3	onions	3kg

Fixing the bad data¶

Now that we can see the error rows, we know exactly what values we need to replace in order to fix up our analysis!

In [41]:

            
                Copied!
                
df.amount_in_kg = df.amount_in_kg \
                    .str.replace("..", "", regex=False) \
                    .str.replace("kg", "") \
                    .astype(float)
df
df.amount_in_kg = df.amount_in_kg \
                    .str.replace("..", "", regex=False) \
                    .str.replace("kg", "") \
                    .astype(float)
df

Out[41]:

	ingredient	amount_in_kg
0	carrots	1.0
1	rice	1.0
2	okra	NaN
3	onions	3.0
4	red beans	2.0