fix: check overflow numbers while inferring type for csv files #6481
Conversation
I worry this will significantly regress performance as well as regress functionality. The change to use NaiveDateTime will break timestamp inference for timestamps with timezones, for example. Moving away from RegexSet to serially checking Regex expressions will significantly slow down performance. Perhaps we could take a step back and determine what it is we're trying to solve here? Inferring large integers as decimals is not really a like-for-like change.

Edit: a less disruptive way to achieve this might be: on detecting Int64 from the RegexSet, try to parse the value as i64 if the string has enough characters that it could overflow. This would avoid regressing the majority of current workloads.
Got you. I've changed it to a check for overflow now :)
We may have to find a type for large integers, however; maybe utf8 is better?
Yeah, I think utf8 is probably the safest option: float runs the risk of silent truncation, whereas decimal comes with its own complexities and can itself overflow. I think all this needs now is some tests. I don't think this is a breaking change, as the fact that it inferred Int64 for values larger than fit in an Int64 could reasonably be considered a bug.
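To make the suggestion concrete, a minimal sketch of the length-gated overflow check with a Utf8 fallback (the helper name, threshold constant, and structure are illustrative assumptions, not the actual arrow-rs code):

```rust
use arrow_schema::DataType;

// i64::MAX has 19 digits, so any candidate of at most 18 characters
// cannot overflow and the parse can be skipped entirely.
const NO_OVERFLOW_LEN: usize = 18;

/// Hypothetical helper, called once the RegexSet has already matched the
/// Int64 pattern `(^-?(\d+)$)`.
fn int_like_type(s: &str) -> DataType {
    if s.len() <= NO_OVERFLOW_LEN || s.parse::<i64>().is_ok() {
        DataType::Int64
    } else {
        // Looks like an integer but does not fit in i64: fall back to Utf8
        // rather than risking silent truncation.
        DataType::Utf8
    }
}
```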
Thank you @CookiePieWw and @tustvold
The code now looks good to me, and it seems the overhead of checking the string length will be relatively minor compared to regex matching.
I looked for benchmarks, but the only thing I see in https://github.com/apache/arrow-rs/blob/b2458bd686e5bc75397fde4a25f3a8b6c42ab064/arrow/benches/csv_reader.rs is for the actual reading (not type inference)
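If an inference benchmark turns out to be wanted, it could perhaps look something like the sketch below next to csv_reader.rs. This is only an illustration: it assumes the `Format::infer_schema` API and criterion, and the synthetic data layout is made up.

```rust
use std::io::Cursor;

use arrow_csv::reader::Format;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_infer_schema(c: &mut Criterion) {
    // Synthetic rows: an in-range integer, a value that overflows i64,
    // and a string column, to exercise the new overflow check.
    let csv: String = (0..1024)
        .map(|i| format!("{i},9223372036854775808,abc\n"))
        .collect();

    c.bench_function("infer_schema 1024 rows", |b| {
        b.iter(|| {
            let cursor = Cursor::new(csv.as_bytes());
            Format::default().infer_schema(cursor, None).unwrap()
        })
    });
}

criterion_group!(benches, bench_infer_schema);
criterion_main!(benches);
```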
Which issue does this PR close?
Related to #2580. Also related to apache/datafusion#3174
Rationale for this change
Currently we use regex to infer types in .csv files. The regex for `Int64` is `(^-?(\d+)$)`, which accepts all digit strings, even numbers that overflow (this caused apache/datafusion#3174). Initially I thought we could use a regex that only matches numbers in range, but such a regex would be too long (more than 300 characters when I tried). Instead, we can use a function that tries to parse the string to `i64`, which is simple and flexible. The original regexes could be kept, or changed to more efficient functions if needed.

What changes are included in this PR?
Change the regex mentioned above to functions. I only changed the boolean and i64 checks to functions since those are straightforward; the regex for decimal is extended to accept overflowing numbers, and the other regexes are kept. I also added a TODO for further improvements (I'd like to change the rest later once the questions below are addressed).
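Roughly, the parse-based checks might look like this sketch (the function names and the case-insensitive boolean rule are assumptions, not the exact code in this PR):

```rust
/// Possible parse-based boolean check (whether matching should be
/// case-insensitive is an assumption here).
fn string_is_boolean(s: &str) -> bool {
    s.eq_ignore_ascii_case("true") || s.eq_ignore_ascii_case("false")
}

/// Replaces the `(^-?(\d+)$)` regex: only report Int64 when the value
/// actually fits in an i64. Note that `str::parse::<i64>` also accepts a
/// leading '+', which the old regex did not.
fn string_is_int64(s: &str) -> bool {
    s.parse::<i64>().is_ok()
}

fn main() {
    assert!(string_is_int64("-42"));
    // 2^63 does not fit in i64, so it is no longer mis-inferred as Int64.
    assert!(!string_is_int64("9223372036854775808"));
    assert!(string_is_boolean("TRUE"));
}
```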
Some questions here:
1. The regex for timestamps is `^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$`, which accepts some illegal timestamps like `1000-00-00T11:11:11(adewoifas)`. I wonder if it's alright to use `chrono::NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S").is_ok()` to replace it (see the sketch after this list).
2. The existing tests use `uk_cities.csv`, which has meaningful content. I wonder if it's alright to add some meaningless strings to it for testing.
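Regarding question 1, a quick sketch (not part of this PR's diff; assumes the `regex` and `chrono` crates) of why the chrono parse rejects values the current regex accepts:

```rust
use chrono::NaiveDateTime;
use regex::Regex;

fn main() {
    let re = Regex::new(r"^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$").unwrap();
    let s = "1000-00-00T11:11:11(adewoifas)";

    // The regex accepts this value: month 00 / day 00 is not validated and
    // the trailing junk is swallowed by the `(?:[^\d\.].*)?` tail.
    assert!(re.is_match(s));

    // chrono rejects it: parse_from_str validates the calendar fields and
    // fails on unconsumed trailing input. The 'T'-separated format is used
    // here; a real check would also need the space-separated form from the
    // question above.
    assert!(NaiveDateTime::parse_from_str(s, "%Y-%m-%dT%H:%M:%S").is_err());

    // A valid timestamp passes both checks.
    let ok = "2024-01-02T03:04:05";
    assert!(re.is_match(ok));
    assert!(NaiveDateTime::parse_from_str(ok, "%Y-%m-%dT%H:%M:%S").is_ok());
}
```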
Are there any user-facing changes?
The numbers that overflow now will be inferred as decimal type instead of int64.