[BUG] spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored #144
Comments
Need to understand what this config actually controls. The docs for that config describe three modes: EXCEPTION (the default) fails the write when it sees ancient dates/timestamps that are ambiguous between the two calendars, LEGACY rebases values from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar, and CORRECTED writes the values as-is. There is also a corresponding read-side config, spark.sql.legacy.parquet.datetimeRebaseModeInRead, and a function in Spark that decides what to do for reads. Because we already have issues reading/writing dates/times around this period anyway, I'm not sure how critical it is that we fix this.
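As an illustration (my own sketch, not from this issue), the read-side config can be exercised from a spark-shell; the path below is a hypothetical file written by Spark 2.x with the legacy hybrid calendar:

```scala
// Sketch only: assumes a spark-shell session, so `spark` and implicits exist.
// The path is a placeholder for a file written by Spark 2.x.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
val ancient = spark.read.parquet("/tmp/written_by_spark_2x")
ancient.show()

// "CORRECTED" would read the raw values as proleptic Gregorian, and
// "EXCEPTION" (the default) fails the read when it hits ambiguous values.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
```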
So the next step is to decide if this should be a feature of cudf or something we only do in Spark. The reading side feels like it should be part of cudf: you really want cudf to be able to read data from any data source correctly. On the write side we need to understand what cudf currently does for timestamps in this range, and then we can adjust appropriately. We also want to be sure we insert the proper metadata so that the Spark CPU can read the data correctly too.
A little more information: when I look through the parquet reader/writer code in cudf it looks like there is no special-case processing for DATE or TIMESTAMP. This should correspond to the "CORRECTED" mode for Spark. For writes we can probably look at this config and the types in the schema to decide whether we can support the write, and we can also add support for throwing an exception for writes when the config is set to "EXCEPTION". The reader side is going to be a little more difficult. We don't support reading parquet metadata with the current cudf Java API, and arguably cudf would want to be able to read these types of dates/timestamps correctly on any platform, so we might want to open a discussion with cudf on the right way to handle this. @sameerz is this still a P1? If so, I can put in returning metadata from the parquet reader and start to look at what it would take to rebase the timestamps/dates to match Spark.
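To make the metadata point concrete, here is a rough sketch (my assumption about how one might check this with parquet-mr directly; not existing plugin or cudf code) that looks for the footer key Spark's CPU writer adds when it rebased values:

```scala
// Rough sketch using parquet-mr; helper name is hypothetical.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

def writtenWithLegacyRebase(path: String): Boolean = {
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(path), new Configuration()))
  try {
    val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
    // As I understand it, Spark's CPU writer adds this key when it rebased
    // dates/timestamps to the legacy hybrid calendar.
    kv.containsKey("org.apache.spark.legacyDateTime")
  } finally {
    reader.close()
  }
}
```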
For now I am going to propose that we do the following. When writing a parquet file we fall back to the CPU if the config is "LEGACY" and the schema contains dates or timestamps. When it is "EXCEPTION" we scan all of the output for dates and timestamps that are out of the supported range and throw an exception if we see any of them. We then file a follow-on issue to try to support "LEGACY" mode for writes. For the read side, we also file a follow-on issue to go back and look at supporting the rebase logic for reads, which I think is more important than the rebase logic for writes. A sketch of the proposed write-side flow is below.
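Sketch of the proposal (hypothetical names, none of these exist in the plugin; they only illustrate the decision flow described above):

```scala
// Hypothetical sketch of the proposed write-side behavior.
sealed trait WriteDecision
case object WriteOnGpu extends WriteDecision
case object FallBackToCpu extends WriteDecision

def decideParquetWrite(rebaseMode: String, schemaHasDatesOrTimestamps: Boolean): WriteDecision =
  rebaseMode match {
    // The GPU cannot rebase to the hybrid calendar yet, so let the CPU handle it.
    case "LEGACY" if schemaHasDatesOrTimestamps => FallBackToCpu
    // "CORRECTED" matches what cudf writes today; "EXCEPTION" needs a per-batch check.
    case _ => WriteOnGpu
  }

// Per-batch check when the mode is "EXCEPTION": mirror Spark and fail if any
// value falls before the 1582-10-15 cutover and is therefore ambiguous.
def checkExceptionMode(rebaseMode: String, batchHasAncientValues: Boolean): Unit =
  if (rebaseMode == "EXCEPTION" && batchHasAncientValues) {
    throw new IllegalStateException(
      "Found dates/timestamps before 1582-10-15 that differ between the legacy and proleptic Gregorian calendars")
  }
```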
Describe the bug
When writing parquet, the config spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored. We should at least check it and verify which behavior we are supporting.
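A hypothetical repro sketch (assuming a spark-shell with the RAPIDS plugin enabled; not taken from the report):

```scala
// With the plugin enabled, this write should honor the rebase mode but does not.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "EXCEPTION")
Seq(java.sql.Date.valueOf("1400-01-01")).toDF("d")
  .write.mode("overwrite").parquet("/tmp/ancient_dates")
// On the CPU this fails for the ambiguous ancient date; on the GPU the config
// is currently not checked and the file is written regardless.
```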