Support converting large dates (i.e. +10999-12-31) from string to Date32 #7074

Merged
merged 21 commits into from
Feb 12, 2025
21 commits
93934bf
Support converting large dates (i.e. +10999-12-31) from string to Date32
phillipleblanc Feb 4, 2025
4041947
Fix lint
phillipleblanc Feb 4, 2025
a584828
Update arrow-cast/src/parse.rs
phillipleblanc Feb 10, 2025
458773c
fix: issue introduced in #6833 - less than equal check for scale in …
himadripal Feb 5, 2025
3a8a001
minor: re-export `OffsetBufferBuilder` in `arrow` crate (#7077)
alamb Feb 6, 2025
cb54440
Add another decimal cast edge test case (#7078)
findepi Feb 6, 2025
7e91c46
Support both 0x01 and 0x02 as type for list of booleans in thrift met…
jhorstmann Feb 6, 2025
bb5f3ae
Fix LocalFileSystem with range request that ends beyond end of file (…
kylebarron Feb 6, 2025
e199ccc
Introduce `UnsafeFlag` to manage disabling `ArrayData` validation (#7…
alamb Feb 6, 2025
6ec6cd9
Refactor arrow-ipc: Rename `ArrayReader` to `RecodeBatchDecoder` (#7028)
alamb Feb 6, 2025
02ee7d2
Minor: Update release schedule (#7086)
alamb Feb 7, 2025
ec1d17a
Refactor some decimal-related code and tests (#7062)
CurtHagenlocher Feb 8, 2025
706a523
Refactor arrow-ipc: Move `create_*_array` methods into `RecordBatchDe…
alamb Feb 8, 2025
0d943b9
Print Parquet BasicTypeInfo id when present (#7094)
devinrsmith Feb 8, 2025
b339382
Add a custom implementation `LocalFileSystem::list_with_offset` (#7019)
corwinjoy Feb 8, 2025
1738b57
fix: first none/empty list in `ListArray` panics in `cast_with_option…
irenjj Feb 8, 2025
d74be2c
Benchmarks for Arrow IPC writer (#7090)
alamb Feb 8, 2025
5f69b6e
Minor: Clarify documentation on `NullBufferBuilder::allocated_size` (…
alamb Feb 9, 2025
6dcdde6
Add more tests for edge cases
phillipleblanc Feb 10, 2025
05d500f
Add negative test case for incorrectly formatted large dates
phillipleblanc Feb 10, 2025
f0bcaf1
Merge remote-tracking branch 'origin/main' into phillip/250205-handle…
phillipleblanc Feb 10, 2025
42 changes: 42 additions & 0 deletions arrow-cast/src/cast/mod.rs
@@ -4229,6 +4229,48 @@ mod tests {
        }
    }

    #[test]
    fn test_cast_string_with_large_date_to_date32() {
        let array = Arc::new(StringArray::from(vec![
            Some("+10999-12-31"),
            Some("-0010-02-28"),
            Some("0010-02-28"),
            Some("0000-01-01"),
            Some("-0000-01-01"),
            Some("-0001-01-01"),
        ])) as ArrayRef;
        let to_type = DataType::Date32;
        let options = CastOptions {
            safe: false,
            format_options: FormatOptions::default(),
        };
        let b = cast_with_options(&array, &to_type, &options).unwrap();
        let c = b.as_primitive::<Date32Type>();
        assert_eq!(3298139, c.value(0)); // +10999-12-31
        assert_eq!(-723122, c.value(1)); // -0010-02-28
        assert_eq!(-715817, c.value(2)); // 0010-02-28
        assert_eq!(c.value(3), c.value(4)); // Expect 0000-01-01 and -0000-01-01 to be parsed the same
👍

        assert_eq!(-719528, c.value(3)); // 0000-01-01
        assert_eq!(-719528, c.value(4)); // -0000-01-01
        assert_eq!(-719893, c.value(5)); // -0001-01-01
    }
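
The expected values in the assertions above are day counts relative to 1970-01-01 in the proleptic Gregorian calendar with astronomical year numbering, where year 0000 exists and is a leap year. As a cross-check, the minimal sketch below recomputes them with the well-known days-from-civil algorithm. `days_from_civil` is a hypothetical helper written for illustration; it is not part of arrow-cast or chrono, and it assumes the month and day are already valid.

```rust
/// Hypothetical helper (not from arrow-cast or chrono): number of days between
/// 1970-01-01 and the given proleptic Gregorian date. Assumes a valid month/day.
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year,
    // so the leap day falls at the end of the shifted year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400); // 400-year Gregorian cycle index
    let yoe = y.rem_euclid(400); // year within the cycle, 0..=399
    let mp = (month + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day - 1; // day within the shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day within the cycle
    // 146_097 days per 400-year cycle; 719_468 days from 0000-03-01 to 1970-01-01.
    era * 146_097 + doe - 719_468
}
```

Calling `days_from_civil(10999, 12, 31)` gives 3298139 and `days_from_civil(0, 1, 1)` gives -719528, matching the values asserted in the test.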

    #[test]
    fn test_cast_invalid_string_with_large_date_to_date32() {
        // Large dates need to be prefixed with a + or - sign, otherwise they are not parsed correctly
        let array = Arc::new(StringArray::from(vec![Some("10999-12-31")])) as ArrayRef;
        let to_type = DataType::Date32;
        let options = CastOptions {
            safe: false,
            format_options: FormatOptions::default(),
        };
        let err = cast_with_options(&array, &to_type, &options).unwrap_err();
        assert_eq!(
            err.to_string(),
            "Cast error: Cannot cast string '10999-12-31' to value of Date32 type"
        );
    }

    #[test]
    fn test_cast_string_format_yyyymmdd_to_date32() {
        let a0 = Arc::new(StringViewArray::from(vec![
26 changes: 26 additions & 0 deletions arrow-cast/src/parse.rs
@@ -595,6 +595,32 @@ const EPOCH_DAYS_FROM_CE: i32 = 719_163;
const ERR_NANOSECONDS_NOT_SUPPORTED: &str = "The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804";

fn parse_date(string: &str) -> Option<NaiveDate> {
    // Handle dates with an extended (signed) year, such as "+10999-12-31" or "-0012-05-06".
    //
    // According to [ISO 8601], years have:
    //  Four digits or more for the year. Years in the range 0000 to 9999 will be pre-padded by
    //  zero to ensure four digits. Years outside that range will have a prefixed positive or negative symbol.
    //
    // [ISO 8601]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE
    if string.starts_with('+') || string.starts_with('-') {
        // Skip the sign and look for the hyphen that terminates the year digits.
        // According to ISO 8601 the unsigned part must be at least 4 digits.
        let rest = &string[1..];
        let hyphen = rest.find('-')?;
        if hyphen < 4 {
            return None;
        }
        // The year substring is the sign and the digits (but not the separator)
        // e.g. for "+10999-12-31", hyphen is 5 and string[..6] is "+10999"
        let year: i32 = string[..hyphen + 1].parse().ok()?;
        // The remainder should begin with a '-' which we strip off, leaving the month-day part.
        let remainder = string[hyphen + 1..].strip_prefix('-')?;
        let mut parts = remainder.splitn(2, '-');
        let month: u32 = parts.next()?.parse().ok()?;
        let day: u32 = parts.next()?.parse().ok()?;
        return NaiveDate::from_ymd_opt(year, month, day);
    }

    if string.len() > 10 {
        // Try to parse as datetime and return just the date part
        return string_to_datetime(&Utc, string)
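
The signed-year branch added to `parse_date` can be exercised in isolation. The sketch below mirrors its string handling using only the standard library. `split_signed_date` is a hypothetical name chosen for illustration. Unlike the code in the diff, it returns the raw `(year, month, day)` triple instead of calling `NaiveDate::from_ymd_opt`, so it does not check that the month and day form a real calendar date.

```rust
/// Hypothetical stand-in for the signed-year branch of `parse_date`.
/// Returns the raw (year, month, day) fields without calendar validation.
fn split_signed_date(string: &str) -> Option<(i32, u32, u32)> {
    // Only strings with an explicit sign take this path.
    if !(string.starts_with('+') || string.starts_with('-')) {
        return None;
    }
    // Skip the one-byte sign and find the hyphen that ends the year digits.
    let rest = &string[1..];
    let hyphen = rest.find('-')?;
    // The unsigned year must have at least four digits.
    if hyphen < 4 {
        return None;
    }
    // `hyphen` indexes into `rest`, so `hyphen + 1` in `string` covers sign + digits.
    let year: i32 = string[..hyphen + 1].parse().ok()?;
    // What follows must start with the '-' separator before the month.
    let remainder = string[hyphen + 1..].strip_prefix('-')?;
    let mut parts = remainder.splitn(2, '-');
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    Some((year, month, day))
}
```

With this sketch, `split_signed_date("+10999-12-31")` yields `Some((10999, 12, 31))` and `split_signed_date("+999-01-01")` yields `None` because the year has fewer than four digits. An unsigned input such as `"10999-12-31"` also yields `None` here; in the diff it instead falls through to the existing parsing paths, which the negative test shows end in a cast error.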