CsvFormat infer_schema reports UnequalLengths error despite having quotes and escape in its options
#13087
Labels: bug
Describe the bug
CsvFormat infer_schema reports an UnequalLengths error even though quote and escape characters are set in its options. This would surprise users, because SessionContext::register_csv accepts CsvReadOptions, but infer_schema somehow does not fully use them.
To Reproduce
For this CSV file, test.csv:
Note that some columns are quoted with " and contain the escape character \ inside.
This test would fail:
The error is:
Encountered unequal lengths between records on CSV file whilst inferring schema. Expected 4 records, found 5 records
Expected behavior
register_csv should not return Err, because CsvReadOptions specifies the header, quote and escape characters. The underlying CSV reader should use these options to infer the schema.
Additional context
If a schema is provided to CsvReadOptions and it matches test.csv, then the test passes and the CSV table can be used.
After some debugging, I found that the creation of arrow::csv::reader::Format in CsvFormat::infer_schema_from_stream does not use the quote and escape settings in CsvFormat, which seems odd to me.
datafusion/datafusion/core/src/datasource/file_format/csv.rs
Lines 440 to 456 in f2da32b
I dug further into the arrow-csv and csv crates, and the quoting and escaping options are all there. I think that if the right options were passed through, infer_schema would be easier to use.
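To illustrate why dropping the quote and escape settings can produce an "Expected 4, found 5" mismatch, here is a minimal stand-alone sketch. It is not DataFusion or arrow-csv code: count_fields is a hypothetical helper, and the header and row are made-up sample data, not the contents of the test.csv from this report. It only assumes that a reader without quote handling treats every delimiter as a field boundary.

```rust
/// Counts the fields in a single CSV record.
/// `quote` and `escape` are optional; when `quote` is `None`, every
/// delimiter is treated as a field boundary. The escape character is
/// only honoured inside a quoted field.
/// Hypothetical helper for illustration, not the arrow-csv implementation.
fn count_fields(record: &str, delimiter: char, quote: Option<char>, escape: Option<char>) -> usize {
    let mut fields = 1;
    let mut in_quotes = false;
    let mut chars = record.chars();
    while let Some(c) = chars.next() {
        if in_quotes && Some(c) == escape {
            // Skip the escaped character so an escaped quote does not close the field.
            chars.next();
        } else if Some(c) == quote {
            in_quotes = !in_quotes;
        } else if c == delimiter && !in_quotes {
            fields += 1;
        }
    }
    fields
}

fn main() {
    // Made-up sample data: a 4-column header and a row with quoted,
    // escaped content that contains a delimiter.
    let header = "id,name,note,score";
    let row = r#"1,"Doe, Jane","said \"hi\"",9.5"#;

    let expected = count_fields(header, ',', None, None);
    let without_options = count_fields(row, ',', None, None);
    let with_options = count_fields(row, ',', Some('"'), Some('\\'));

    println!("header fields: {expected}");
    println!("row fields, quote/escape ignored: {without_options}");
    println!("row fields, quote/escape applied: {with_options}");
}
```

With this sample row, the comma inside the quoted name is counted as a delimiter when the quote setting is ignored, so the row appears to have one more field than the header. Once the quote and escape characters are applied, the counts agree.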