
[SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty #22721

Closed
wants to merge 14 commits

Conversation

wangyum
Member

@wangyum wangyum commented Oct 15, 2018

What changes were proposed in this pull request?

Since SPARK-21237 we invalidate the table relation cache once table data is changed. But there is one situation where we do not invalidate it (spark.sql.statistics.size.autoUpdate.enabled=false and table.stats.isEmpty):

def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  val catalog = sparkSession.sessionState.catalog
  if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
    val newTable = catalog.getTableMetadata(table.identifier)
    val newSize = CommandUtils.calculateTotalSize(sparkSession, newTable)
    val newStats = CatalogStatistics(sizeInBytes = newSize)
    catalog.alterTableStats(table.identifier, Some(newStats))
  } else if (table.stats.nonEmpty) {
    catalog.alterTableStats(table.identifier, None)
  }
}
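The branch gap above can be illustrated with a minimal, hypothetical Python model (not Spark code; the names relation_cache and update_table_stats are made up for illustration): the cache is invalidated in the first two branches, but when auto size update is off and the stats are empty, neither branch runs.

```python
# Hypothetical model of the updateTableStats branch gap (not Spark code).
relation_cache = {"t": "stale relation"}

def update_table_stats(table, auto_size_update_enabled):
    if auto_size_update_enabled:
        table["stats"] = {"sizeInBytes": 123}
        relation_cache.pop(table["name"], None)  # cache invalidated
    elif table["stats"]:
        table["stats"] = None
        relation_cache.pop(table["name"], None)  # cache invalidated
    # else: auto update off and stats empty -> cache is NOT invalidated

t = {"name": "t", "stats": None}
update_table_stats(t, auto_size_update_enabled=False)
print("t" in relation_cache)  # True: the stale relation is still cached
```

In this sketch the stale entry survives exactly when the real code takes neither branch, which is the case this PR adds an else branch for.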

This will introduce some issues, e.g. SPARK-19784, SPARK-19845, SPARK-25403, SPARK-25332 and SPARK-28413.

Here is an example that reproduces SPARK-19784:

val path = "/tmp/spark/parquet"
spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("INSERT INTO TABLE t VALUES (1)")
spark.range(5).toDF("a").write.parquet(path)
spark.sql(s"ALTER TABLE t SET LOCATION '${path}'")
spark.table("t").count() // return 1
spark.sql("refresh table t")
spark.table("t").count() // return 5

This PR fixes the issue by invalidating the table relation cache in this case as well (spark.sql.statistics.size.autoUpdate.enabled=false and table.stats.isEmpty).

How was this patch tested?

unit tests

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97371 has finished for PR 22721 at commit 8a7f4af.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Oct 15, 2018

retest this please

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97374 has finished for PR 22721 at commit 8a7f4af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Oct 15, 2018

retest this please

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97377 has finished for PR 22721 at commit 8a7f4af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97387 has finished for PR 22721 at commit 37fed41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Oct 17, 2018

cc @cloud-fan

@cloud-fan
Contributor

What's the impact on end users? Wrong statistics?

@wangyum
Member Author

wangyum commented Oct 18, 2018

The answer is here: #22758 (comment)

@SparkQA

SparkQA commented Oct 18, 2018

Test build #97521 has finished for PR 22721 at commit 983c5a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think it's reasonable to follow InsertIntoHiveTable, but it's better to provide more details about what changes in InsertIntoHadoopFsRelationCommand:

  1. What's refreshed? Previously we refreshed the data cache via the path and also refreshed the file index, but the plan cache stayed around. Now we refresh the plan cache. Since the file index lives inside the plan, refreshing the plan cache also covers it, but the data cache still needs to be refreshed separately.
  2. What's the performance impact? The plan cache is very useful when reading partitioned tables, since it avoids listing files repeatedly. But this seems OK, because we already refreshed the file index before this change, so we had to re-list files after an insertion anyway.
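The two cache layers described above can be sketched as a small, hypothetical Python model (the names plan_cache and data_cache are made up for illustration; this is not Spark's implementation): dropping a plan-cache entry takes the embedded file index with it, while the data cache is keyed by path and must be refreshed on its own.

```python
# Hypothetical model of the plan cache vs. data cache distinction.
# The plan cache embeds the file index; the data cache is keyed by path.
plan_cache = {"t": {"plan": "parquet scan", "file_index": ["part-00000"]}}
data_cache = {"/tmp/spark/parquet": "cached rows"}

def refresh_plan_cache(table: str) -> None:
    # Dropping the plan also drops the file index embedded in it.
    plan_cache.pop(table, None)

def refresh_data_cache(path: str) -> None:
    data_cache.pop(path, None)

refresh_plan_cache("t")
# The plan (and its file index) is gone, but the data cache entry
# remains until it is refreshed separately.
```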

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97579 has finished for PR 22721 at commit 6c8a73f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Oct 19, 2018

retest this please

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97597 has finished for PR 22721 at commit 6c8a73f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Oct 29, 2018

Test build #98181 has finished for PR 22721 at commit 6c8a73f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Oct 29, 2018

Retest this please.

@SparkQA

SparkQA commented Oct 29, 2018

Test build #98186 has finished for PR 22721 at commit 6c8a73f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-25403][SQL] Refreshes the table after inserting the table [SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty Nov 12, 2018
@wangyum
Member Author

wangyum commented Sep 10, 2019

cc @cloud-fan

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110863 has finished for PR 22721 at commit c6a1a7d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Sep 18, 2019

retest this please

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110884 has finished for PR 22721 at commit c6a1a7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

github-actions bot commented Jan 5, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 5, 2020
@wangyum wangyum closed this Jan 5, 2020
@cloud-fan
Contributor

does the problem still exist? I think we need to merge this PR.

@wangyum
Member Author

wangyum commented Jan 6, 2020

Yes. It still exists:

scala> spark.version
res0: String = 3.0.0-preview2

scala> val path = "/tmp/spark/parquet"
path: String = /tmp/spark/parquet

scala> spark.sql("CREATE TABLE t (a INT) USING parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("INSERT INTO TABLE t VALUES (1)")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.range(5).toDF("a").write.parquet(path)

scala> spark.sql(s"ALTER TABLE t SET LOCATION '${path}'")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.table("t").count() // return 1
res5: Long = 1

scala> spark.sql("refresh table t")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.table("t").count() // return 5
res7: Long = 5

@wangyum wangyum reopened this Jan 6, 2020
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
@SparkQA

SparkQA commented Jan 6, 2020

Test build #116137 has finished for PR 22721 at commit c6a1a7d.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -50,6 +50,9 @@ object CommandUtils extends Logging {
catalog.alterTableStats(table.identifier, Some(newStats))
} else if (table.stats.nonEmpty) {
catalog.alterTableStats(table.identifier, None)
} else {
// In other cases, we still need to invalidate the table relation cache.
Contributor

to confirm: does catalog.alterTableStats refresh the relation cache?

Member Author

Yes.

/**
 * Alter Spark's statistics of an existing metastore table identified by the provided table
 * identifier.
 */
def alterTableStats(identifier: TableIdentifier, newStats: Option[CatalogStatistics]): Unit = {
  val db = formatDatabaseName(identifier.database.getOrElse(getCurrentDatabase))
  val table = formatTableName(identifier.table)
  val tableIdentifier = TableIdentifier(table, Some(db))
  requireDbExists(db)
  requireTableExists(tableIdentifier)
  externalCatalog.alterTableStats(db, table, newStats)
  // Invalidate the table relation cache
  refreshTable(identifier)
}

@SparkQA

SparkQA commented Jan 6, 2020

Test build #116140 has finished for PR 22721 at commit 817d2de.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Jan 6, 2020

retest this please

@SparkQA

SparkQA commented Jan 6, 2020

Test build #116179 has finished for PR 22721 at commit 817d2de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot closed this Jan 7, 2020
@dongjoon-hyun
Member

Hi, @wangyum. According to @nchammas and @HeartSaVioR, it seems that you need to remove the Stale label to prevent the automatic GitHub Actions bot from closing this again.

@cloud-fan cloud-fan removed the Stale label Jan 7, 2020
@cloud-fan cloud-fan reopened this Jan 7, 2020
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 17881a4 Jan 7, 2020
@wangyum wangyum deleted the SPARK-25403 branch January 7, 2020 03:46
@dongjoon-hyun
Member

Thank you, @wangyum and @cloud-fan .
Can we have this at Apache Spark 2.4.5?

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116205 has finished for PR 22721 at commit 817d2de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -50,6 +50,9 @@ object CommandUtils extends Logging {
catalog.alterTableStats(table.identifier, Some(newStats))
} else if (table.stats.nonEmpty) {
catalog.alterTableStats(table.identifier, None)
} else {
// In other cases, we still need to invalidate the table relation cache.
Member

Could you explain why we need to refresh the table while updating stats, please? In some cases we would do the same work twice. See:

  1. InsertIntoHiveTable:
    sparkSession.sessionState.catalog.refreshTable(table.identifier)
    CommandUtils.updateTableStats(sparkSession, table)
  2. LoadDataCommand :
    catalog.refreshTable(targetTable.identifier)
    CommandUtils.updateTableStats(sparkSession, targetTable)
  3. AlterTableDropPartitionCommand:
    sparkSession.catalog.refreshTable(table.identifier.quotedString)
    CommandUtils.updateTableStats(sparkSession, table)

Member Author

Not all commands have refresh-table logic. AlterTableSetLocationCommand, for example, only does:

CommandUtils.updateTableStats(sparkSession, table)
Seq.empty[Row]
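The point above is why the fix lives inside updateTableStats: commands that call only updateTableStats get a cache refresh only if every branch of it performs one. A hypothetical Python sketch (not Spark code; names are made up for illustration), mirroring the fact that alterTableStats ends with refreshTable and the new else branch invalidates the cache directly:

```python
# Hypothetical model: with the else branch added by this PR, the relation
# cache is invalidated no matter which branch updateTableStats takes.
relation_cache = {"t": "stale relation"}

def alter_table_stats(name):
    # mirrors SessionCatalog.alterTableStats, which ends with refreshTable
    relation_cache.pop(name, None)

def update_table_stats(name, stats, auto_update):
    if auto_update:
        alter_table_stats(name)
    elif stats:
        alter_table_stats(name)
    else:
        # the branch this PR adds: invalidate the cache even with empty stats
        relation_cache.pop(name, None)

update_table_stats("t", stats=None, auto_update=False)
# "t" is no longer cached, even though no explicit refreshTable was called
```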
