API/ENH: read_csv handling of bad lines (too many/few fields) #15122
Comments
Sounds good (and I consider this a pretty critical enhancement). I wonder if we could give the user the option of customised error handling? For example, maybe the user knows that in the case of extra fields, one should actually remove the first fields, not the last. Or they may want different default values to use for missing fields. One could e.g. allow the user to pass a function:
One could provide shortcuts e.g. in |
@kodonnell See also #5686 for this idea of being able to specify a function to process a bad line in a custom way. |
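As a rough illustration of the callback idea raised here and in #5686, the sketch below uses only the standard-library `csv` module rather than the pandas parser. The function name `read_rows` and the handler signature `(fields, n_expected)` are hypothetical and not part of any pandas API; the handler is assumed to return a repaired row, or `None` to drop the line.

```python
import csv
import io


def read_rows(text, handler):
    """Parse CSV text, passing rows with a wrong field count to `handler`.

    Hypothetical helper for illustration only; not a pandas API.
    `handler(fields, n_expected)` returns a repaired list of fields,
    or None to drop the row.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_expected = len(header)
    rows = []
    for fields in reader:
        if len(fields) != n_expected:
            fields = handler(fields, n_expected)
            if fields is None:
                continue  # the handler chose to drop this line
            if len(fields) != n_expected:
                raise ValueError("handler returned a row of the wrong length")
        rows.append(fields)
    return header, rows


def drop_leading_or_pad(fields, n_expected):
    """Example handler: drop the *first* extra fields, pad short rows with ''."""
    if len(fields) > n_expected:
        return fields[-n_expected:]
    return fields + [""] * (n_expected - len(fields))


header, rows = read_rows("a,b,c\n1,2,3\n9,4,5,6\n7,8\n", drop_leading_or_pad)
print(rows)  # [['1', '2', '3'], ['4', '5', '6'], ['7', '8', '']]
```

A single callback of this kind covers both the "remove the first fields, not the last" case and custom defaults for missing fields, at the cost of running Python code for every bad line.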
Wow - how do you guys keep track of all of these issues?! Is your above suggestion still your preferred approach? |
The above idea is not yet fully worked out. For example, do we want to make a distinction between lines that have too many or too few fields when specifying how to deal with those (raise, ignore, ..)? Do we want to keep the 'warn' option as a separate keyword? A more detailed proposal of what the API could look like is certainly welcome. |
What I meant was that the above suggestion and #5686 are supersets of this. I.e. if we think that approach is more valuable, then we could focus on that instead of the API here. |
This is very useful, as @jorisvandenbossche pointed out. I have run into a situation where lines with fewer columns than expected must raise an error.
This gives the broadest coverage of all use cases. |
Currently `read_csv` has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):
- with `error_bad_lines=False`, rows with too many fields will be dropped instead of raising an error (and in that case, `warn_bad_lines` controls whether a warning is issued)
- with `usecols`, you can select certain columns, and in this way deal with rows with too many fields.
Some possibilities are missing in this scheme:
Apart from that, #5686 makes the request to be able to specify a custom function to process a bad line, to have even more control.
In #9549 (comment) (and surrounding comments) there was some discussion about how to integrate this, and some ideas there from @jreback and @selasley:
Provide more fine-grained control in a new keyword (and deprecate `error_bad_lines`), with values such as 'warn', 'skip' and 'process'; or leave out 'warn' and keep `warn_bad_lines` to be able to combine a warning with both 'skip' and 'process'.
We should further think about whether we can integrate this with the case of too few fields and not only too many.
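To make the trade-offs concrete, here is a minimal sketch of how such a keyword might behave, written against the standard-library `csv` module instead of the pandas parser. Everything in it is an assumption for discussion: the function name `parse_csv`, the separate `too_many` and `too_few` keywords, the 'error' value, the choice that 'warn' warns and then skips, and the reading of 'process' as truncating excess fields or padding missing ones with `None`.

```python
import csv
import io
import warnings

_MODES = {"error", "warn", "skip", "process"}


def parse_csv(text, too_many="error", too_few="process"):
    """Hypothetical reader with separate policies for long and short rows."""
    if too_many not in _MODES or too_few not in _MODES:
        raise ValueError("mode must be one of %s" % sorted(_MODES))
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n = len(header)
    rows = []
    for fields in reader:
        if len(fields) == n:
            rows.append(fields)
            continue
        # Pick the policy that matches the kind of bad line.
        mode = too_many if len(fields) > n else too_few
        msg = "line %d: expected %d fields, saw %d" % (
            reader.line_num, n, len(fields))
        if mode == "error":
            raise ValueError(msg)
        if mode == "warn":
            warnings.warn(msg)  # assumption: 'warn' means warn, then skip
            continue
        if mode == "skip":
            continue
        # mode == "process": truncate excess fields or pad with None
        rows.append((fields + [None] * n)[:n])
    return header, rows


data = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
header, rows = parse_csv(data, too_many="skip", too_few="process")
print(rows)  # [['1', '2', '3'], ['4', '5', None]]
```

Keeping two keywords lets the too-few case be configured independently of the too-many case; a single keyword would be simpler but could not express "raise on short rows, skip long ones".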
I think it would be nice to have some better control here, but we should think a bit about the best API for this.