[SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files #28477
Conversation
@@ -48,5 +48,5 @@ private[spark] case class ExecutorDeadException(message: String)
  * Exception thrown when Spark returns different result after upgrading to a new version.
  */
 private[spark] class SparkUpgradeException(version: String, message: String, cause: Throwable)
-  extends SparkException("You may get a different result due to the upgrading of Spark" +
+  extends RuntimeException("You may get a different result due to the upgrading of Spark" +
We need to throw this exception in the vectorized Parquet reader, which only allows IOException to be thrown, so we change it to an unchecked exception. cc @xuanyuanking
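For reference, a minimal sketch of the changed class; the message tail after the concatenation is truncated in the diff above, so the exact wording here is an assumption:

package org.apache.spark

// Sketch only: extending RuntimeException makes the exception unchecked, so the
// vectorized Parquet reader, whose methods declare only `throws IOException`,
// can still throw it.
private[spark] class SparkUpgradeException(version: String, message: String, cause: Throwable)
  extends RuntimeException("You may get a different result due to the upgrading of Spark" +
    s" $version: $message", cause)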
Copy that.
@@ -102,14 +103,14 @@
 // The timezone conversion to apply to int96 timestamps. Null if no conversion.
 private final ZoneId convertTz;
 private static final ZoneId UTC = ZoneOffset.UTC;
-private final boolean rebaseDateTime;
+private final String datetimeRebaseMode;
It's very hard to use a Scala enum in Java, so I use a string instead.
Did you consider passing the enum's id, like LegacyBehaviorPolicy.EXCEPTION.id? This could be less expensive.
I don't think it affects perf, and the code would be less readable if we saw mode == 0 instead of "LEGACY".equals(mode).
Test build #122407 has finished for PR 28477 at commit
@@ -46,17 +47,40 @@ class AvroSerializer(
 rootCatalystType: DataType,
 rootAvroType: Schema,
 nullable: Boolean,
-    rebaseDateTime: Boolean) extends Logging {
+    datetimeRebaseMode: LegacyBehaviorPolicy.Value) extends Logging {
How about adding a type alias to LegacyBehaviorPolicy:

object LegacyBehaviorPolicy extends Enumeration {
  type LegacyBehaviorPolicy = Value
  val EXCEPTION, LEGACY, CORRECTED = Value
}
This doesn't help much, as we need to access both the object LegacyBehaviorPolicy and the type LegacyBehaviorPolicy, and we still need some prefix to distinguish them.
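To make the trade-off concrete, a sketch of the proposed alias with a hypothetical caller (the describe function is illustrative only):

object LegacyBehaviorPolicy extends Enumeration {
  type LegacyBehaviorPolicy = Value
  val EXCEPTION, LEGACY, CORRECTED = Value
}

// The alias lets a signature use the bare type name, but only after an import,
// which is the prefix mentioned above:
import LegacyBehaviorPolicy._

def describe(mode: LegacyBehaviorPolicy): String = mode match {
  case EXCEPTION => "fail on ambiguous ancient values"
  case LEGACY => "rebase between the hybrid and Proleptic Gregorian calendars"
  case CORRECTED => "read/write the values as-is"
}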
def newRebaseExceptionInRead(format: String): SparkUpgradeException = {
  new SparkUpgradeException("3.0", "reading dates before 1582-10-15 or timestamps before " +
    s"1900-01-01 from $format files can be ambiguous, as the files may be written by Spark 2.x " +
1900-01-01 is a floating time point (a local date). We could point out a concrete timestamp: 1900-01-01 00:00:00Z.
val LEGACY_PARQUET_REBASE_MODE_IN_READ =
  buildConf("spark.sql.legacy.parquet.datetimeRebaseModeInRead")
    .internal()
    .doc("When LEGACY, Spark will rebase dates/timestamps from Proleptic Gregorian calendar " +
Actually, it's the opposite: "from the hybrid to the Proleptic Gregorian calendar".
val LEGACY_AVRO_REBASE_MODE_IN_READ =
  buildConf("spark.sql.legacy.avro.datetimeRebaseModeInRead")
    .internal()
    .doc("When LEGACY, Spark will rebase dates/timestamps from Proleptic Gregorian calendar " +
Please fix this one too.
"from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can " + | ||
s"set ${SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key} to 'LEGACY' to rebase the datetime " + | ||
"values w.r.t. the calendar switch during writing, to get maximum interoperability, or set " + | ||
s"${SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key} to 'CORRECTED' to write the datetime " + |
This is the write-side message, so it should reference the IN_WRITE config key too.
s"1900-01-01 from $format files can be ambiguous, as the files may be written by Spark 2.x " + | ||
"or legacy versions of Hive, which uses a legacy hybrid calendar that is different from " + | ||
"Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set " + | ||
s"${SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key} to 'LEGACY' to rebase the datetime " + |
The function can be called from Avro too, right? The config may not be relevant for Avro.
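A hypothetical way to keep the message format-aware would be to derive the config key from the format; the helper name below is illustrative only, and both keys are the ones added by this PR:

// Assumes org.apache.spark.sql.internal.SQLConf is in scope.
def rebaseModeConfKey(format: String): String = format match {
  case "Parquet" => SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key
  case "Avro" => SQLConf.LEGACY_AVRO_REBASE_MODE_IN_READ.key
  case other => throw new IllegalArgumentException(s"Unknown format: $other")
}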
private val dateRebaseFunc: Int => Int = datetimeRebaseMode match {
  case LegacyBehaviorPolicy.EXCEPTION =>
    days: Int =>
      if (days < RebaseDateTime.lastSwitchGregorianDay) {
        throw DataSourceUtils.newRebaseExceptionInWrite("Parquet")
      }
      days
  case LegacyBehaviorPolicy.LEGACY => RebaseDateTime.rebaseGregorianToJulianDays
  case LegacyBehaviorPolicy.CORRECTED => identity[Int]
}
private val timestampRebaseFunc: Long => Long = datetimeRebaseMode match {
  case LegacyBehaviorPolicy.EXCEPTION =>
    micros: Long =>
      if (micros < RebaseDateTime.lastSwitchGregorianTs) {
        throw DataSourceUtils.newRebaseExceptionInWrite("Parquet")
      }
      micros
  case LegacyBehaviorPolicy.LEGACY => RebaseDateTime.rebaseGregorianToJulianMicros
  case LegacyBehaviorPolicy.CORRECTED => identity[Long]
}
This code is repeated again. Maybe move it to some common place somehow.
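One possible shape for that common place, as a sketch only: the object name and placement are hypothetical, and the bodies simply mirror the two blocks above, parameterized by mode and format:

object RebaseFuncs {
  def dateRebaseFunc(mode: LegacyBehaviorPolicy.Value, format: String): Int => Int =
    mode match {
      case LegacyBehaviorPolicy.EXCEPTION =>
        days: Int =>
          if (days < RebaseDateTime.lastSwitchGregorianDay) {
            throw DataSourceUtils.newRebaseExceptionInWrite(format)
          }
          days
      case LegacyBehaviorPolicy.LEGACY => RebaseDateTime.rebaseGregorianToJulianDays
      case LegacyBehaviorPolicy.CORRECTED => identity[Int]
    }

  def timestampRebaseFunc(mode: LegacyBehaviorPolicy.Value, format: String): Long => Long =
    mode match {
      case LegacyBehaviorPolicy.EXCEPTION =>
        micros: Long =>
          if (micros < RebaseDateTime.lastSwitchGregorianTs) {
            throw DataSourceUtils.newRebaseExceptionInWrite(format)
          }
          micros
      case LegacyBehaviorPolicy.LEGACY => RebaseDateTime.rebaseGregorianToJulianMicros
      case LegacyBehaviorPolicy.CORRECTED => identity[Long]
    }
}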
-      SQLConf.LEGACY_PARQUET_REBASE_DATETIME_IN_WRITE.key -> rebase.toString) {
+      SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> mode.toString) {
mode already has the String type, so there is no need to call toString().
    }
  }
}
-    Seq(false, true).foreach { vectorized =>
+    Seq(true).foreach { vectorized =>
Is it just for debugging?
"When CORRECTED, Spark will not do rebase and read the dates/timestamps as it is. " + | ||
"When EXCEPTION, which is the default, Spark will fail the reading if it sees " + | ||
"ancient dates/timestamps that are ambiguous between the two calendars. This config is " + | ||
"only affective if the writer info (like Spark, Hive) of the Parquet files is unknown.") |
affective -> effective ?
"When CORRECTED, Spark will not do rebase and read the dates/timestamps as it is. " + | ||
"When EXCEPTION, which is the default, Spark will fail the reading if it sees " + | ||
"ancient dates/timestamps that are ambiguous between the two calendars. This config is " + | ||
"only affective if the writer info (like Spark, Hive) of the Avro files is unknown.") |
affective -> effective ?
@cloud-fan I have a question from a user's standpoint. I know that we are not really expecting this exception, as it only happens when the data has really old dates/timestamps. But say I get this error while running an existing workload during a read. What are my options? Should I be choosing the LEGACY or CORRECTED option? Are we in a position to recommend one option over the other?
This needs knowledge from users that Spark doesn't have. If the files were written by Spark 2.x, choose LEGACY. If the files were written by Impala or other systems that use the standard calendar, choose CORRECTED. For the write side, it's similar: it depends on who will read the files later.
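For instance, a minimal usage sketch of that guidance (the path is illustrative and spark is an existing SparkSession; the config key is the one added by this PR):

// Files written by Spark 2.x or legacy Hive: rebase while reading.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
// Files written by systems on the Proleptic Gregorian calendar (e.g. Impala):
// spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
val df = spark.read.parquet("/path/to/files-with-ancient-dates")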
@cloud-fan Thanks for the explanation.
Test build #122431 has finished for PR 28477 at commit
Test build #122435 has finished for PR 28477 at commit
Test build #122489 has finished for PR 28477 at commit
  .options(extraOptions)
  .save(path)
withSQLConf(
  SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> "CORRECTED",
LegacyBehaviorPolicy.CORRECTED.toString ?
@@ -161,9 +161,10 @@ object DateTimeRebaseBenchmark extends SqlBasedBenchmark {
 Seq(true, false).foreach { modernDates =>
   Seq(false, true).foreach { rebase =>
     benchmark.addCase(caseName(modernDates, dateTime, Some(rebase)), 1) { _ =>
+      val mode = if (rebase) "LEGACY" else "CORRECTED"
LEGACY.toString?
Test build #122495 has finished for PR 28477 at commit
@@ -211,24 +212,25 @@ public void readIntegersWithRebase(
 WritableColumnVector c,
 int rowId,
 int level,
-    VectorizedValuesReader data) throws IOException {
+    VectorizedValuesReader data,
+    final boolean failIfRebase) throws IOException {
Just in case, what is the purpose of final here?
It means it's immutable, and it may help the JVM optimize the if (failIfRebase) check.
LGTM
Test build #122494 has finished for PR 28477 at commit
Test build #122496 has finished for PR 28477 at commit
LGTM too. Some nits
Test build #122535 has finished for PR 28477 at commit
retest this please
Test build #122539 has finished for PR 28477 at commit
retest this please
Test build #122541 has finished for PR 28477 at commit
retest this please
Test build #122546 has finished for PR 28477 at commit
retest this please
Test build #122549 has finished for PR 28477 at commit
Test build #122552 has finished for PR 28477 at commit
Test build #5001 has finished for PR 28477 at commit
Test build #122571 has finished for PR 28477 at commit
Test build #122576 has finished for PR 28477 at commit
Test build #5002 has finished for PR 28477 at commit
retest this please
Test build #122603 has finished for PR 28477 at commit
I am just going to merge to unblock Spark 3.0. All tests passed and the last change was only nits. I also manually ran the tests here to double-check.
Merged to master.
@cloud-fan, can you open a backport PR? Seems there are some conflicts.
…me values from/to Parquet/Avro files

When reading/writing datetime values that are before the rebase switch day, from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly rebase or not. Rebasing or not rebasing leads to different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception, as it only affects ancient datetime values. Now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message asking them to set a config. Updated tests.

Closes apache#28477 from cloud-fan/rebase.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
When reading/writing datetime values that are before the rebase switch day, from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly rebase or not.
Why are the changes needed?
Rebasing or not rebasing leads to different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception, as it only affects ancient datetime values.
Does this PR introduce any user-facing change?
Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message asking them to set a config.
How was this patch tested?
Updated tests.