fix: check overflow numbers while inferring type for csv files #6481
Conversation
I worry this will significantly regress performance as well as regress functionality. The change to use NaiveDateTime will break timestamp inference for timestamps with timezones, for example. Moving away from RegexSet to serially checking Regex expressions will significantly slow down performance. Perhaps we could take a step back and determine what it is we're trying to solve here? Inferring large integers as decimals is not really a like-for-like change.

Edit: a less disruptive way to achieve this might be: on detecting Int64 from the RegexSet, try to parse the value as i64 if the string has enough characters that it could overflow. This would avoid regressing the majority of current workloads.
Got you. I've changed it to a check for overflow now :)
We may have to find a type for large integers, however; maybe utf8 is better?
Yeah, I think utf8 is probably the safest option: float runs the risk of silent truncation, whereas decimal comes with its own complexities and can itself overflow. I think all this needs now is some tests. I don't think this is a breaking change, as the fact that it inferred Int64 for values larger than fit in an Int64 could reasonably be considered a bug.
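To make the suggestion concrete, a minimal sketch of the length-gated overflow check with a Utf8 fallback (the helper name, threshold constant, and structure are illustrative assumptions, not the actual arrow-rs code):

```rust
use arrow_schema::DataType;

// i64::MAX has 19 digits, so any candidate of at most 18 characters
// cannot overflow and the parse can be skipped entirely.
const NO_OVERFLOW_LEN: usize = 18;

/// Hypothetical helper, called once the RegexSet has already matched the
/// Int64 pattern `(^-?(\d+)$)`.
fn int_like_type(s: &str) -> DataType {
    if s.len() <= NO_OVERFLOW_LEN || s.parse::<i64>().is_ok() {
        DataType::Int64
    } else {
        // Looks like an integer but does not fit in i64: fall back to Utf8
        // rather than risking silent truncation.
        DataType::Utf8
    }
}
```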
Thank you @CookiePieWw and @tustvold
The code now looks good to me, and it seems the overhead of checking the string length will be relatively minor compared to regex matching.
I looked for benchmarks, but the only thing I see in https://github.com/apache/arrow-rs/blob/b2458bd686e5bc75397fde4a25f3a8b6c42ab064/arrow/benches/csv_reader.rs is for the actual reading (not type inference)
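If an inference benchmark turns out to be wanted, it could perhaps look something like the sketch below next to csv_reader.rs. This is only an illustration: it assumes the `Format::infer_schema` API and criterion, and the synthetic data layout is made up.

```rust
use std::io::Cursor;

use arrow_csv::reader::Format;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_infer_schema(c: &mut Criterion) {
    // Synthetic rows: an in-range integer, a value that overflows i64,
    // and a string column, to exercise the new overflow check.
    let csv: String = (0..1024)
        .map(|i| format!("{i},9223372036854775808,abc\n"))
        .collect();

    c.bench_function("infer_schema 1024 rows", |b| {
        b.iter(|| {
            let cursor = Cursor::new(csv.as_bytes());
            Format::default().infer_schema(cursor, None).unwrap()
        })
    });
}

criterion_group!(benches, bench_infer_schema);
criterion_main!(benches);
```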
Which issue does this PR close?
Related to #2580. Also related to apache/datafusion#3174
Rationale for this change
Currently we use regex to infer types in .csv files. The regex for `Int64` is `(^-?(\d+)$)`, which accepts all digit strings, even numbers that overflow (this caused apache/datafusion#3174). Initially I thought we could use a regex that only matches numbers in range, but such a regex would be too long (more than 300 characters when I tried). Instead, we can use a function that tries to parse the string to `i64`, which is simple and flexible. The original regexes could be kept, or changed to more efficient functions if needed.

What changes are included in this PR?
Change the regex mentioned above to functions. I only changed the boolean and i64 checks to functions since those are straightforward; the regex for decimal is extended to accept overflowing numbers, and the other regexes are kept. I also added a TODO for further improvements (I'd like to change the rest later once the questions below are addressed).
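Roughly, the parse-based checks might look like this sketch (the function names and the case-insensitive boolean rule are assumptions, not the exact code in this PR):

```rust
/// Possible parse-based boolean check (whether matching should be
/// case-insensitive is an assumption here).
fn string_is_boolean(s: &str) -> bool {
    s.eq_ignore_ascii_case("true") || s.eq_ignore_ascii_case("false")
}

/// Replaces the `(^-?(\d+)$)` regex: only report Int64 when the value
/// actually fits in an i64. Note that `str::parse::<i64>` also accepts a
/// leading '+', which the old regex did not.
fn string_is_int64(s: &str) -> bool {
    s.parse::<i64>().is_ok()
}

fn main() {
    assert!(string_is_int64("-42"));
    // 2^63 does not fit in i64, so it is no longer mis-inferred as Int64.
    assert!(!string_is_int64("9223372036854775808"));
    assert!(string_is_boolean("TRUE"));
}
```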
Some questions here:
1. The regex for timestamps is `^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$`, which accepts some illegal timestamps like `1000-00-00T11:11:11(adewoifas)`. I wonder if it's alright to use `chrono::NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S").is_ok()` to replace it (see the sketch after this list).
2. The existing tests use `uk_cities.csv`, which has meaningful content. I wonder if it's alright to add some meaningless strings to it for testing.
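Regarding question 1, a quick sketch (not part of this PR's diff; assumes the `regex` and `chrono` crates) of why the chrono parse rejects values the current regex accepts:

```rust
use chrono::NaiveDateTime;
use regex::Regex;

fn main() {
    let re = Regex::new(r"^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$").unwrap();
    let s = "1000-00-00T11:11:11(adewoifas)";

    // The regex accepts this value: month 00 / day 00 is not validated and
    // the trailing junk is swallowed by the `(?:[^\d\.].*)?` tail.
    assert!(re.is_match(s));

    // chrono rejects it: parse_from_str validates the calendar fields and
    // fails on unconsumed trailing input. The 'T'-separated format is used
    // here; a real check would also need the space-separated form from the
    // question above.
    assert!(NaiveDateTime::parse_from_str(s, "%Y-%m-%dT%H:%M:%S").is_err());

    // A valid timestamp passes both checks.
    let ok = "2024-01-02T03:04:05";
    assert!(re.is_match(ok));
    assert!(NaiveDateTime::parse_from_str(ok, "%Y-%m-%dT%H:%M:%S").is_ok());
}
```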
Are there any user-facing changes?
The numbers that overflow now will be inferred as decimal type instead of int64.