[BUG] spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored #144
Comments
Need to understand what this config actually controls. The docs for that config describe three modes: EXCEPTION (the default) fails the write when it sees ancient dates/timestamps that are ambiguous between the two calendars, LEGACY rebases values from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar, and CORRECTED writes the values as-is. There is also a corresponding read-side config, spark.sql.legacy.parquet.datetimeRebaseModeInRead, and a function in Spark that decides what to do for reads. Because we already have issues reading/writing dates/times around this period anyway, I'm not sure how critical it is that we fix this.
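As an illustration (my own sketch, not from this issue), the read-side config can be exercised from a spark-shell; the path below is a hypothetical file written by Spark 2.x with the legacy hybrid calendar:

```scala
// Sketch only: assumes a spark-shell session, so `spark` and implicits exist.
// The path is a placeholder for a file written by Spark 2.x.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
val ancient = spark.read.parquet("/tmp/written_by_spark_2x")
ancient.show()

// "CORRECTED" would read the raw values as proleptic Gregorian, and
// "EXCEPTION" (the default) fails the read when it hits ambiguous values.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
```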
So the next step is to decide if this should be a feature of cudf or something we only do in Spark. The reading side feels like it should be part of cudf: you really want cudf to be able to read data from any data source correctly. On the write side we need to understand what cudf currently does for timestamps in this range, and then we can adjust appropriately. We also want to be sure we insert the proper metadata so that the Spark CPU can read the data correctly too.
A little more information: when I look through the parquet reader/writer code in cudf it looks like there is no special-case processing for DATE or TIMESTAMP. This should correspond to the "CORRECTED" mode for Spark. For writes we can probably look at this config and the types in the schema to decide whether we can support the write, and we can also add support for throwing an exception for writes when the config is set to "EXCEPTION". The reader side is going to be a little more difficult. We don't support reading parquet metadata with the current cudf Java API, and arguably cudf would want to be able to read these types of dates/timestamps correctly on any platform, so we might want to open a discussion with cudf on the right way to handle this. @sameerz is this still a P1? If so, I can put in returning metadata from the parquet reader and start to look at what it would take to rebase the timestamps/dates to match Spark.
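To make the metadata point concrete, here is a rough sketch (my assumption about how one might check this with parquet-mr directly; not existing plugin or cudf code) that looks for the footer key Spark's CPU writer adds when it rebased values:

```scala
// Rough sketch using parquet-mr; helper name is hypothetical.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

def writtenWithLegacyRebase(path: String): Boolean = {
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(path), new Configuration()))
  try {
    val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
    // As I understand it, Spark's CPU writer adds this key when it rebased
    // dates/timestamps to the legacy hybrid calendar.
    kv.containsKey("org.apache.spark.legacyDateTime")
  } finally {
    reader.close()
  }
}
```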
For now I am going to propose that we do the following. When writing a parquet file we fall back to the CPU if the config is "LEGACY" and the schema contains dates or timestamps. When it is "EXCEPTION" we scan all of the output for dates and timestamps that are out of the supported range and throw an exception if we see any of them. We then file a follow-on issue to try to support "LEGACY" mode for writes. For the read side, we also file a follow-on issue to go back and look at supporting the rebase logic for reads, which I think is more important than the rebase logic for writes. A sketch of the proposed write-side flow is below.
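Sketch of the proposal (hypothetical names, none of these exist in the plugin; they only illustrate the decision flow described above):

```scala
// Hypothetical sketch of the proposed write-side behavior.
sealed trait WriteDecision
case object WriteOnGpu extends WriteDecision
case object FallBackToCpu extends WriteDecision

def decideParquetWrite(rebaseMode: String, schemaHasDatesOrTimestamps: Boolean): WriteDecision =
  rebaseMode match {
    // The GPU cannot rebase to the hybrid calendar yet, so let the CPU handle it.
    case "LEGACY" if schemaHasDatesOrTimestamps => FallBackToCpu
    // "CORRECTED" matches what cudf writes today; "EXCEPTION" needs a per-batch check.
    case _ => WriteOnGpu
  }

// Per-batch check when the mode is "EXCEPTION": mirror Spark and fail if any
// value falls before the 1582-10-15 cutover and is therefore ambiguous.
def checkExceptionMode(rebaseMode: String, batchHasAncientValues: Boolean): Unit =
  if (rebaseMode == "EXCEPTION" && batchHasAncientValues) {
    throw new IllegalStateException(
      "Found dates/timestamps before 1582-10-15 that differ between the legacy and proleptic Gregorian calendars")
  }
```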
Describe the bug
When writing parquet, the config spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored. We should at least check it and verify which behavior we are supporting.
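A hypothetical repro sketch (assuming a spark-shell with the RAPIDS plugin enabled; not taken from the report):

```scala
// With the plugin enabled, this write should honor the rebase mode but does not.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "EXCEPTION")
Seq(java.sql.Date.valueOf("1400-01-01")).toDF("d")
  .write.mode("overwrite").parquet("/tmp/ancient_dates")
// On the CPU this fails for the ambiguous ancient date; on the GPU the config
// is currently not checked and the file is written regardless.
```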