Column is not converting to numeric when errors=coerce #17125
Comments
This would need a reproducible example.
I think it's a duplicate of #17007.
@mficek: Potentially, but I can't confirm yet.
It's certainly not an exact duplicate: the example shown in #17007 also fails on 0.19, while this one is reported to have worked in 0.19.
And the numbers stay as strings.
@Sarickshah: Thanks for this! Could you do us a favor and move your example into your issue description above? Also, if you could provide the output you're seeing as well as the expected output, that would be great for us as well.
On 0.19 the first one is coerced (which seems expected, since it raises an error on parsing).
Actually, this seems to work as well on 0.20.3:
@Sarickshah Can you show the exact output of what you get?
Looks fixed in 0.20.3.
Doesn't seem to be fixed. Could it be something to do with the Python binaries if it isn't reproducible? (Windows 7 x64 here.)
I think it has something to do with the long (> 20 character) number strings. This is taken from a sheet of ~6 million rows. If I do something like:
I get 5.2 million duplicate values. It seems like the function works until it encounters a problematic value 0.8 million rows in, and then assigns the last valid retval to the remaining 5.2 million rows. Edit: this works:
but this doesn't:
So it looks like any string with a character count >= 20 will break the to_numeric function.
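The 20-character boundary lines up with the int64/uint64 cutoff: the largest int64 value (9223372036854775807) has 19 digits, so any 20-digit string has to go down the uint64 path. A minimal sketch of that dtype boundary (my own illustration, not the reporter's original data):

```python
import pandas as pd

# A 19-digit string fits in int64; a 20-digit string only fits in uint64.
short = pd.to_numeric(pd.Series(["1234567890123456789"]))   # 19 chars
long_ = pd.to_numeric(pd.Series(["12345678901234567890"]))  # 20 chars

print(short.dtype)  # int64
print(long_.dtype)  # uint64
```

So "strings >= 20 characters" is really "values that overflow int64 and get parsed as uint64", which is where the coercion behavior diverges.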
Indeed, that example is not working correctly (both on master and on 0.20.3). The other example is working, though, so the difference indeed seems to be the large number. So it seems that when the value would be converted to uint64 (instead of int64), the coercion does not happen correctly.
So you can see that the parsing of the big value (> 20 chars) itself is working, as the return value is uint64. When a NaN has to be introduced, it should just be converted to float64, as happens with int64.
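A sketch of the two behaviors being contrasted here (my own example values, not from the thread): parsing a big value by itself succeeds and yields uint64, and for int64-range data an unparseable value coerces to NaN with a float64 fallback.

```python
import pandas as pd

# Parsing a value larger than int64 max works on its own: result is uint64.
big = pd.to_numeric(pd.Series(["18446744073709551615"]))
print(big.dtype)  # uint64

# For int64-range data, coercion introduces NaN and falls back to float64,
# which is the behavior the uint64 path should mirror.
mixed = pd.to_numeric(pd.Series(["1", "not a number"]), errors="coerce")
print(mixed.dtype)  # float64
```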
I think the biggest point of confusion is that no exception is raised when errors="coerce" fails to coerce anything. Since this is more a limitation of the underlying NumPy dtypes, I don't think there is a real fix here. Something simple like this would resolve the confusion, and users would then be able to figure out how best to handle it, whether by dropping large numbers from the dataframe or leaving them as objects and manually pruning errors. I don't think coercing uint64 to float64 is the best way to handle it, and I would go as far as to suggest there should also be a warning for int64 -> float64 conversion, because anything above 2**53 will create unforeseen problems for people unaware of float64's limitations. For example:
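The 2**53 point can be illustrated like this (my own sketch, not the commenter's original example): float64 has a 53-bit significand, so integers above 2**53 are no longer exactly representable and silently lose precision when cast.

```python
# float64 carries a 53-bit significand, so 2**53 and 2**53 + 1
# round to the same float64 value.
n = 2**53
print(float(n) == float(n + 1))  # True
print(int(float(n + 1)))         # 9007199254740992, not ...993
```

This is why an implicit int64/uint64 -> float64 conversion can quietly corrupt large identifiers such as account numbers.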
I read in my dataframe with
And then I run the code:
but the column does not get converted. When I use errors='raise' it gives me the numbers that are not convertible, but it should be dropping them with coerce. This was working perfectly in pandas 0.19, and I updated to 0.20.3. Did the way to_numeric works change between the two versions?
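The expected behavior being described can be sketched as follows (a hypothetical reconstruction: the "account_id" column name and sample values are placeholders, not taken from the report). With errors="coerce", unparseable values should become NaN and the column should convert to float64; the report is that this stops happening when 20+ character number strings are present.

```python
import pandas as pd

# Placeholder data standing in for the reporter's dataframe.
df = pd.DataFrame({"account_id": ["100", "200", "not-a-number"]})

# Expected: the bad value is coerced to NaN, the column becomes float64.
df["account_id"] = pd.to_numeric(df["account_id"], errors="coerce")
print(df["account_id"].dtype)  # float64
```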