CsvFormat infer_schema reports UnequalLengths error despite having quotes and escape in its options #13087

Closed
kolulu23 opened this issue Oct 24, 2024 · 1 comment · Fixed by #13214
Labels: bug (Something isn't working)

Comments

@kolulu23
Describe the bug

CsvFormat infer_schema reports UnequalLengths error despite having quotes and escape in its options.

This would surprise users, because SessionContext::register_csv accepts CsvReadOptions but infer_schema does not fully use them.

To Reproduce

For this csv file test.csv:

c1,c2,c3,c4
2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0

Note that column c2 is quoted with " and contains an escaped quote (\") inside.

This test would fail:

#[cfg(test)]
mod test {
    use datafusion::error::DataFusionError;
    use datafusion::prelude::{CsvReadOptions, SessionContext};

    #[tokio::test]
    async fn infer_schema_failure() {
        let ctx = SessionContext::new();
        let r = ctx
            .register_csv(
                "test",
                "test.csv",
                CsvReadOptions::new()
                    .has_header(true)
                    .quote(b'"')
                    .escape(b'\\'),
            )
            .await;
        assert!(r.is_ok());
    }
}

The error is Encountered unequal lengths between records on CSV file whilst inferring schema. Expected 4 records, found 5 records.
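The 4-versus-5 count in that message can be reproduced with a small standard-library sketch. This is a deliberately simplified model of CSV quoting, not the csv crate's actual parser: `count_fields` and `RECORD` are hypothetical names introduced here, and doubled quotes ("") inside a quoted field are not modelled. It only illustrates that, when no escape byte is configured, the `\"` inside c2 ends the quoted field early, so the comma that follows starts an extra field.

```rust
/// The data line of test.csv from this report.
const RECORD: &str = r#"2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0"#;

/// Count the fields in one CSV line under a simplified quoting model.
/// A quote only opens a quoted field at the start of a field; inside a
/// quoted field, the escape byte (if any) protects the byte after it.
fn count_fields(line: &str, delimiter: u8, quote: u8, escape: Option<u8>) -> usize {
    let bytes = line.as_bytes();
    let mut fields = 1;
    let mut in_quotes = false;
    let mut at_field_start = true;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if in_quotes {
            if Some(b) == escape && i + 1 < bytes.len() {
                // Skip the escape byte and the byte it protects.
                i += 2;
                continue;
            }
            if b == quote {
                // An unescaped quote closes the quoted field.
                in_quotes = false;
            }
        } else if b == delimiter {
            fields += 1;
            at_field_start = true;
            i += 1;
            continue;
        } else if b == quote && at_field_start {
            in_quotes = true;
        }
        at_field_start = false;
        i += 1;
    }
    fields
}
```

Calling `count_fields(RECORD, b',', b'"', None)` models the reader without an escape byte, while `count_fields(RECORD, b',', b'"', Some(b'\\'))` models the options passed to register_csv above.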

Expected behavior

register_csv should not return Err, because CsvReadOptions specifies the header, quote, and escape characters.

The underlying CSV reader should use these options when inferring the schema.

Additional context

If a schema that matches test.csv is provided to CsvReadOptions, the test passes and the CSV table can be used.

After some debugging, I found that the arrow::csv::reader::Format created in CsvFormat::infer_schema_from_stream does not use the quote and escape settings from CsvFormat, which seems odd to me.

while let Some(chunk) = stream.next().await.transpose()? {
    let mut format = arrow::csv::reader::Format::default()
        .with_header(
            first_chunk
                && self
                    .options
                    .has_header
                    .unwrap_or(state.config_options().catalog.has_header),
        )
        .with_delimiter(self.options.delimiter);
    if let Some(comment) = self.options.comment {
        format = format.with_comment(comment);
    }
    let (Schema { fields, .. }, records_read) =
        format.infer_schema(chunk.reader(), Some(records_to_read))?;

I dug further into the arrow-csv and csv crates, and the quoting and escaping options are all there. I think that if the right options were passed through, infer_schema would be easier to use.

@kolulu23 added the bug label on Oct 24, 2024
@mnorfolk03
Contributor

take
