[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps #16781
Conversation
Test build #72287 has finished for PR 16781 at commit
Test build #72288 has finished for PR 16781 at commit
Test build #74031 has finished for PR 16781 at commit
Test build #74042 has finished for PR 16781 at commit
Please update the pull request description, because the one dated Feb 2 does not correspond to the fix any more.
// The conf is sometimes null in tests.
String tzString =
    conf == null ? null : conf.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY());
if (tzString == null || tzString == "") {
This is Java code, not Scala; you probably meant tzString.equals("") instead of tzString == "".
Or even better, isEmpty().
@@ -674,6 +674,12 @@ object SQLConf {
    .stringConf
    .createWithDefault(TimeZone.getDefault().getID())

  val PARQUET_TABLE_INCLUDE_TIMEZONE =
    buildConf("spark.sql.session.parquet.timeZone")
      .doc("""Enables inclusion of parquet timezone property in newly created parquet tables""")
There should be a config option for writing "UTC" to the table property when creating tables, not for writing the local timezone.
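One way to read this suggestion is a string-valued conf whose value is the zone id to record in new tables (e.g. "UTC"), rather than a flag that records the local timezone. A minimal sketch, modelled on the existing buildConf entries above; the name spark.sql.parquet.newTableTimeZone and its wording are hypothetical, not what the patch settled on:

  // Hypothetical conf: the zone id written into the Parquet timezone table property
  // of newly created tables; an empty default means no property is written.
  val PARQUET_NEW_TABLE_TIMEZONE =
    buildConf("spark.sql.parquet.newTableTimeZone")
      .doc("Time zone id (e.g. UTC) recorded in the Parquet timezone table property " +
        "of newly created tables; empty means the property is not written.")
      .stringConf
      .createWithDefault("")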
testParquetHiveCompatibility(
  Row(Seq(Row(1))),
  "ARRAY<STRUCT<array_element: INT>>")
}

test(s"SPARK-12297: Parquet Timestamp & Hive timezone") { |
I think it would be better to have separate test cases for adjustments when reading, adjustments when writing and setting the table property when creating tables.
@ueshin thanks for taking a look. Yes, that understanding is correct. Another way to think about it is to compare those same operations with different file formats, e.g. textfile. Those work more like parquet does after this patch. I had that explanation in a comment on the jira -- I just updated the jira description to include it. I'll address your comments; they also are making me take a closer look at a couple of things. I should push an update tomorrow.
@ueshin I've pushed an update which addresses your comments. I also realized that partitioned tables weren't handled correctly! I fixed that as well.
)
Seq(false, true).foreach { vectorized =>
  withClue(s"vectorized = $vectorized;") {
    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized)
I was initially using SQLTestUtils.withSQLConf, but I discovered that it wasn't actually taking any effect. I don't know if that is because TestHiveSingleton does something strange, or maybe I'm doing something else weird in this test by creating many new Spark sessions. But I did that because it was the only way I could get the conf changes applied consistently.
Since I am creating new sessions, I don't think this has any risk of a failed test not cleaning up and triggering failures in other tests outside of this suite. But it still seems like I might be doing something wrong ...
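For reference, a minimal sketch of the withSQLConf pattern being discussed, assuming the suite mixes in SQLTestUtils; in the test as written the conf is instead set directly on each newly created session, as in the diff above:

  // Sets the vectorized-reader flag for the duration of the block and restores the
  // previous value afterwards -- the behavior that didn't seem to take effect here.
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
    // read the table and check timestamps under this vectorization setting
  }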
Test build #75966 has finished for PR 16781 at commit
@squito Thank you for working on this!
I checked updates and added some comments.
Btw, can you fix the partitioned tables?
// hadoopConf for the Parquet Converters
val storageTzKey = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
val storageTz = relation.tableMeta.properties.getOrElse(storageTzKey, "")
val sessionTz = sparkSession.sessionState.conf.sessionLocalTimeZone
sessionTz isn't used.
val sessionTz = sparkSession.sessionState.conf.sessionLocalTimeZone
Map(
  storageTzKey -> storageTz
)
Should this return Map.empty if the value isn't included in the table properties?
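A sketch of the change being suggested, reusing the names from the diff above (the local val name options is just for illustration): only pass the property through when the table actually defines it.

  // Only propagate the storage timezone when the table property is set, instead of
  // passing an empty string through to the Parquet readers.
  val options = relation.tableMeta.properties.get(storageTzKey) match {
    case Some(storageTz) => Map(storageTzKey -> storageTz)
    case None => Map.empty[String, String]
  }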
val schema = StructType(Seq(
  StructField("display", StringType, true)
))
val df = spark.createDataFrame(rowRdd, schema)
We can use val df = desiredTimestampStrings.toDF("display") after import spark.implicits._.
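Assembled, the suggested simplification looks roughly like this (assuming desiredTimestampStrings is the Seq[String] defined earlier in the test):

  // Build the single-column DataFrame directly from the strings instead of going
  // through an explicit RDD[Row] and StructType.
  import spark.implicits._
  val df = desiredTimestampStrings.toDF("display")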
thanks, appreciate the help simplifying this. I had a feeling it was more complex than it needed to be :)
// is for various "wall-clock" times in different timezones, and then we can compare against those
// in our tests.
val originalTz = TimeZone.getDefault
val timestampTimezoneToMillis = try {
Shall we initialize this in the block like this?
val timestampTimezoneToMillis = {
val originalTz = TimeZone.getDefault
try {
...
} finally {
TimeZone.setDefault(originalTz)
}
}
"UTC" -> "UTC", | ||
"LA" -> "America/Los_Angeles", | ||
"Berlin" -> "Europe/Berlin" | ||
).foreach { case (tableName, zone) => |
Should this be testTimezones.foreach { ... }?
@@ -42,6 +52,15 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
  """.stripMargin)
}

override def afterEach(): Unit = {
Why do we need this?
I was probably just being a little paranoid; perhaps I had missed a withTable somewhere. In the current code, things work just fine if I remove them.
 * the hour.
 * @param storageTz the timezone which was used to store the timestamp. This should come from the
 *                  timestamp table property, or else assume its the same as the sessionTz
 * @return
Can you also add descriptions for @param binary and @return?
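A sketch of what the completed scaladoc might look like; the wording is illustrative, based only on what the surrounding doc already says about the parameters:

  /**
   * ...
   * @param binary    the Parquet int96 value holding the raw timestamp to convert
   * @param sessionTz the session time zone
   * @param storageTz the timezone which was used to store the timestamp; from the table
   *                  property, or else assumed to be the same as the sessionTz
   * @return          the timestamp value, adjusted between the storage and session time zones
   */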
}

private def checkHasTz(table: String, tz: Option[String]): Unit = {
  val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
Should we explicitly pass sparkSession and use it here?
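Roughly the shape being suggested (the assertion line is a guess at what the original body checks):

  // Take the session as a parameter instead of relying on the suite-level `spark`,
  // since this test creates several new sessions of its own.
  private def checkHasTz(spark: SparkSession, table: String, tz: Option[String]): Unit = {
    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
  }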
val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
  val localTz = TimeZone.getDefault()
  val localTzId = localTz.getID()
localTz and localTzId aren't used.
spark.sql(
  raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
checkHasTz(baseTable, Some("America/Los_Angeles"))
spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
nit: remove extra white space, and two more below this.
@ueshin updated per your feedback. I should have explained that the last update did handle partition tables (it added the second call to ...).
Test build #76141 has finished for PR 16781 at commit
class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHiveSingleton {
class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHiveSingleton
    with BeforeAndAfterEach {
We don't need BeforeAndAfterEach anymore.
| display string,
| ts timestamp
|)
|PARTITIONED BY (id bigint)
Should we also test a partitioned table like PARTITIONED BY (ts timestamp)?
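A sketch of the partitioned-by-timestamp variant being asked about, following the DDL style already used in the suite; the table name and the STORED AS clause are assumptions here:

  spark.sql(
    """CREATE TABLE partitioned_by_ts (
      |  display string
      |)
      |PARTITIONED BY (ts timestamp)
      |STORED AS parquet
    """.stripMargin)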
@ueshin sorry it took me a while to figure out how a table partitioned by timestamps works (I didn't even realize that was possible; I don't think it is in Hive?) and I was traveling. The good news is that partitioning by timestamp works just fine. Since the ts is stored as a string anyway, and converted using the session tz already, it already works. I added one minimal test on this -- when the partitioned table is written, the correct partition dirs are created regardless of the timezone combinations. In particular, it doesn't make sense to do tests like the existing ones, where we write or read "unadjusted" data, bypassing the hive tables, and then make sure the right adjustments are applied when you perform the reverse action via the hive table; the partition values are correct whether you use the hive table & adjustment property or not. Let me know if you think more tests are required.
Test build #76391 has finished for PR 16781 at commit
@squito Thank you for working on this!
The behavior looks good to me.
I left some minor comments.
Thanks!
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.scalatest.BeforeAndAfterEach
nit: unnecessary import.
val defaultTz = None
// check that created tables have correct TBLPROPERTIES
val tblProperties = explicitTz.map {
  tz => raw"""TBLPROPERTIES ($key="$tz")"""
Let's use s""
instead of raw""
if possible. And also elsewhere in the same way.
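Concretely, for the TBLPROPERTIES string above the two interpolators produce the same text, since nothing in it relies on raw backslash handling:

  // raw"" only differs from s"" in how backslash escapes are treated; this string has
  // none, so the plain s interpolator is enough.
  val tblProperties = explicitTz.map {
    tz => s"""TBLPROPERTIES ($key="$tz")"""
  }.getOrElse("")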
val timestampTimezoneToMillis = {
  val originalTz = TimeZone.getDefault
  try {
    (for {
Let's use flatMap { .. map { ... } }.
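The shape being suggested, written against hypothetical collections since the actual generators of the for-comprehension aren't shown here:

  // A for-comprehension over two generators desugars to flatMap + map, so the
  // suggestion is simply to write that form directly.
  val pairs = timezones.flatMap { tz =>
    desiredTimestampStrings.map { ts => (tz, ts) }
  }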
@@ -29,6 +29,8 @@ import org.apache.spark.sql.catalyst.{QualifiedTableName, TableIdentifier}
import org.apache.spark.sql.catalyst.catalog._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.execution.datasources._
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.internal.SQLConf
nit: unnecessary import.
tz => raw"""TBLPROPERTIES ($key="$tz")""" | ||
}.getOrElse("") | ||
|
||
|
nit: remove extra line.
baseTable: String,
explicitTz: Option[String],
sessionTzOpt: Option[String]): Unit = {
val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
nit: indent
Test build #76419 has finished for PR 16781 at commit
Jenkins, retest this please.
LGTM, pending Jenkins.
Test build #76556 has finished for PR 16781 at commit
Thanks! Merging to master.
great! thanks @ueshin
Did we conduct any performance tests on this patch?
What changes were proposed in this pull request?
This change allows timestamps in Parquet-based Hive tables to behave as a "floating time", without a timezone, as timestamps are for other file formats. If the storage timezone is the same as the session timezone, this conversion is a no-op. When data is read from a Hive table, the table property is always respected. This allows Spark to keep its behavior unchanged when reading old data, while reading newly written data correctly (whatever the source of the data is).
Spark inherited the original behavior from Hive, but Hive is also updating behavior to use the same scheme in HIVE-12767 / HIVE-16231.
The default for Spark remains unchanged; created tables do not include the new table property.
This will only apply to hive tables; nothing is added to parquet metadata to indicate the timezone, so data that is read or written directly from parquet files will never have any conversions applied.
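For illustration, the table property can be set on an existing table the same way the test suite does it; ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY holds the property key, and the table name here is made up:

  // Record that this table's Parquet timestamps were written relative to UTC; readers
  // then adjust values between UTC and the session time zone.
  val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
  spark.sql(s"""ALTER TABLE my_parquet_table SET TBLPROPERTIES ($key="UTC")""")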
How was this patch tested?
Added a unit test which creates tables, reads and writes data, under a variety of permutations (different storage timezones, different session timezones, vectorized reading on and off).