API/ENH: read_csv handling of bad lines (too many/few fields) #15122

Closed
jorisvandenbossche opened this issue Jan 12, 2017 · 6 comments · Fixed by #40413
Labels: Deprecate (Functionality to remove in pandas), IO CSV (read_csv, to_csv)

Comments

@jorisvandenbossche
Member

Currently read_csv has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):

  • by default, it will raise an error for rows with too many fields, and fill with NaN for rows with too few fields
  • with error_bad_lines=False, rows with too many fields are dropped instead of raising an error (and in that case, warn_bad_lines controls whether a warning is emitted)
  • with usecols you can select certain columns, and in this way deal with rows with too many fields (see the sketch after this list).
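A minimal sketch of the behaviour described in the list above, using the error_bad_lines/warn_bad_lines keywords as they existed when this issue was opened (the inline CSV is made up for illustration):

import io
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"  # second data row has too many fields, third has too few

# Default: raises pandas.errors.ParserError because of the row with too many fields.
# pd.read_csv(io.StringIO(data))

# error_bad_lines=False drops the row with too many fields instead of raising;
# warn_bad_lines controls whether a warning is emitted for the dropped row.
df = pd.read_csv(io.StringIO(data), error_bad_lines=False, warn_bad_lines=True)

# The row with too few fields is kept and padded with NaN: a=8, b=9, c=NaN
print(df)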

Some possibilities are missing in this scheme:

Apart from that, #5686 requests the ability to specify a custom function to process a bad line, for even more control.

In #9549 (comment) (and surrounding comments) there was some discussion about how to integrate this, and an idea from that discussion from @jreback and @selasley:

Provide more fine-grained control in a new keyword (and deprecate error_bad_lines):

bad_lines='error'|'warn'|'skip'|'process'

or leave out 'warn' and keep warn_bad_lines to be able to combine a warning with both 'skip' and 'process'.

We should further think about whether we can integrate this with the case of too few fields and not only too many.
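As a sketch only, calls under the proposed keyword might look like the following (this is the proposal under discussion, not an existing pandas API, and the file name is a placeholder):

import pandas as pd

pd.read_csv("data.csv", bad_lines="error")    # raise on bad lines (current default for too many fields)
pd.read_csv("data.csv", bad_lines="warn")     # keep reading but emit a warning for each bad line
pd.read_csv("data.csv", bad_lines="skip")     # silently drop bad lines
pd.read_csv("data.csv", bad_lines="process")  # hand each bad line to a user-supplied function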

I think it would be nice to have some better control here, but we should think a bit about the best API for this.

@kodonnell

Sounds good (and I consider this a pretty critical enhancement). I wonder if we should give the user the option of customised error handling? For example, maybe the user knows that in the case of extra fields, one should actually remove the first fields, not the last. Or they may want different default values for missing fields. One could, e.g., allow the user to pass a function:

def row_handler(row, reason):
    """Handle a 'bad' row.

    row: list of fields (strings) for the 'bad' row.
    reason: the reason this row is 'bad'.
    Returns None if the row is to be ignored, otherwise a list of
    fields (strings) for the corrected row.
    """
    ...

One could provide shortcuts, e.g. instead of passing a function to read_csv, one passes the string 'ignore_errors', which is equivalent to passing lambda x, y: None, etc. In that sense, it can be made equivalent to your suggested API above, with the option of custom behaviour if required.
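A hypothetical illustration of such handlers and shortcuts (none of these names are part of pandas; the expected column count is made up):

# Remove the leading extra fields rather than the trailing ones,
# for the user who knows the spurious fields are at the start of the row.
def drop_leading_extras(row, reason):
    expected = 3  # hypothetical expected number of columns
    return row[-expected:] if len(row) > expected else row

# The 'ignore_errors' shortcut would be equivalent to a handler that drops every bad row.
ignore_errors = lambda row, reason: None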

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 31, 2017

@kodonnell See also #5686 for this idea of being able to specify a function to process a bad line in a custom way.

@kodonnell

Wow - how do you guys keep track of all of these issues?! Is your above suggestion still your preferred approach?

@jorisvandenbossche
Member Author

The above idea is not yet fully worked out. For example, do we want to make a distinction between lines that have too many or too few fields when specifying how to deal with them (raise, ignore, ..)? Do we want to keep the 'warn' option as a separate keyword?

A more detailed proposal of what the API could look like is certainly welcome.

@kodonnell

What I meant was more that the above suggestion and #5686 are supersets of this, i.e. if we think that approach is more valuable, then we could focus on it instead of the API here.

@xuancong84

This is very useful. As @jorisvandenbossche pointed out, I have encountered a situation where lines with fewer columns must raise an error.
In general, I would like to suggest adding the following two arguments:

bad_lines_if_cols = '<' | '>' | '!='
bad_lines = 'error'|'warn'|'skip'|'conform'

This gives the broadest coverage of use cases.
There are cases where extra columns should simply be ignored, and cases where fewer columns mean something went wrong in preprocessing and an error must be raised.
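A sketch of how the two proposed keywords could combine for these use cases (again, this is the proposal in this comment, not an existing pandas API, and the file name is a placeholder):

import pandas as pd

# Raise only when a row has fewer fields than expected (a sign that preprocessing went wrong):
pd.read_csv("data.csv", bad_lines_if_cols="<", bad_lines="error")

# Conform rows that have extra fields by ignoring the extra columns:
pd.read_csv("data.csv", bad_lines_if_cols=">", bad_lines="conform")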

@lithomas1 lithomas1 added the Deprecate Functionality to remove in pandas label Mar 11, 2021
@lithomas1 lithomas1 self-assigned this Mar 12, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 6, 2021