API/ENH: read_csv handling of bad lines (too many/few fields) #15122

Closed
jorisvandenbossche opened this issue Jan 12, 2017 · 6 comments · Fixed by #40413
Labels: Deprecate (Functionality to remove in pandas), IO CSV (read_csv, to_csv)

Comments

@jorisvandenbossche
Member

Currently read_csv has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):

  • by default, it will raise an error for rows with too many fields, and fill with NaN for rows with too few fields
  • with error_bad_lines=False, rows with too many fields are dropped instead of raising an error (and in that case, warn_bad_lines controls whether a warning is emitted)
  • with usecols you can select certain columns, and in this way deal with rows with too many fields (see the sketch after this list).
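A minimal sketch of the behaviour described in the list above, using the error_bad_lines/warn_bad_lines keywords as they existed when this issue was opened (the inline CSV is made up for illustration):

import io
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"  # second data row has too many fields, third has too few

# Default: raises pandas.errors.ParserError because of the row with too many fields.
# pd.read_csv(io.StringIO(data))

# error_bad_lines=False drops the row with too many fields instead of raising;
# warn_bad_lines controls whether a warning is emitted for the dropped row.
df = pd.read_csv(io.StringIO(data), error_bad_lines=False, warn_bad_lines=True)

# The row with too few fields is kept and padded with NaN: a=8, b=9, c=NaN
print(df)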

Some possibilities are missing in this scheme:

Apart from that, #5686 requests the ability to specify a custom function to process a bad line, for even more control.

In #9549 (comment) (and surrounding comments) there was some discussion about how to integrate this, and an idea from that discussion from @jreback and @selasley:

Provide more fine-grained control in a new keyword (and deprecate error_bad_lines):

bad_lines='error'|'warn'|'skip'|'process'

or leave out 'warn' and keep warn_bad_lines to be able to combine a warning with both 'skip' and 'process'.

We should further think about whether we can integrate this with the case of too few fields and not only too many.
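As a sketch only, calls under the proposed keyword might look like the following (this is the proposal under discussion, not an existing pandas API, and the file name is a placeholder):

import pandas as pd

pd.read_csv("data.csv", bad_lines="error")    # raise on bad lines (current default for too many fields)
pd.read_csv("data.csv", bad_lines="warn")     # keep reading but emit a warning for each bad line
pd.read_csv("data.csv", bad_lines="skip")     # silently drop bad lines
pd.read_csv("data.csv", bad_lines="process")  # hand each bad line to a user-supplied function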

I think it would be nice to have some better control here, but we should think a bit about the best API for this.

@kodonnell

Sounds good (and I consider this a pretty critical enhancement). I wonder if we should give the user the option of customised error handling? For example, maybe the user knows that in the case of extra fields, one should actually remove the first fields, not the last. Or they may want different default values for missing fields. One could, e.g., allow the user to pass a function:

def row_handler(row, reason):
    """Handle a 'bad' row.

    row: list of fields (strings) for the 'bad' row.
    reason: the reason this row is 'bad'.
    Returns None if the row is to be ignored, otherwise a list of
    fields (strings) for the corrected row.
    """
    ...

One could provide shortcuts, e.g. instead of passing a function to read_csv, one passes the string 'ignore_errors', which is equivalent to passing lambda x, y: None, etc. In that sense, it can be made equivalent to your suggested API above, with the option of custom behaviour if required.
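A hypothetical illustration of such handlers and shortcuts (none of these names are part of pandas; the expected column count is made up):

# Remove the leading extra fields rather than the trailing ones,
# for the user who knows the spurious fields are at the start of the row.
def drop_leading_extras(row, reason):
    expected = 3  # hypothetical expected number of columns
    return row[-expected:] if len(row) > expected else row

# The 'ignore_errors' shortcut would be equivalent to a handler that drops every bad row.
ignore_errors = lambda row, reason: None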

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 31, 2017

@kodonnell See also #5686 for this idea of being able to specify a function to process a bad line in a custom way.

@kodonnell

Wow - how do you guys keep track of all of these issues?! Is your above suggestion still your preferred approach?

@jorisvandenbossche
Member Author

The above idea is not yet fully worked out. For example, do we want to make a distinction between lines that have too many or too few fields when specifying how to deal with them (raise, ignore, ..)? Do we want to keep the 'warn' option as a separate keyword?

A more detailed proposal of what the API could look like is certainly welcome.

@kodonnell

What I meant was more that the above suggestion and #5686 are supersets of this, i.e. if we think that approach is more valuable, then we could focus on it instead of the API here.

@xuancong84

This is very useful. As @jorisvandenbossche pointed out, I have encountered a situation where lines with fewer columns must raise an error.
In general, I would like to suggest adding the following two arguments:

bad_lines_if_cols = '<' | '>' | '!='
bad_lines = 'error'|'warn'|'skip'|'conform'

This gives the broadest coverage of use cases.
There are cases where extra columns should simply be ignored, and cases where fewer columns mean something went wrong in preprocessing and an error must be raised.
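A sketch of how the two proposed keywords could combine for these use cases (again, this is the proposal in this comment, not an existing pandas API, and the file name is a placeholder):

import pandas as pd

# Raise only when a row has fewer fields than expected (a sign that preprocessing went wrong):
pd.read_csv("data.csv", bad_lines_if_cols="<", bad_lines="error")

# Conform rows that have extra fields by ignoring the extra columns:
pd.read_csv("data.csv", bad_lines_if_cols=">", bad_lines="conform")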

@lithomas1 lithomas1 added the Deprecate Functionality to remove in pandas label Mar 11, 2021
@lithomas1 lithomas1 self-assigned this Mar 12, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 6, 2021