[SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty #22721
Conversation
Test build #97371 has finished for PR 22721 at commit
retest this please
Test build #97374 has finished for PR 22721 at commit
retest this please
Review thread on sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (outdated, resolved).
Test build #97377 has finished for PR 22721 at commit
Test build #97387 has finished for PR 22721 at commit
Review thread on ...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala (outdated, resolved).
cc @cloud-fan
what's the impact to end users? wrong statistics?
The answer is here: #22758 (comment)
Test build #97521 has finished for PR 22721 at commit
I think it's reasonable to follow
Test build #97579 has finished for PR 22721 at commit
retest this please
Test build #97597 has finished for PR 22721 at commit
Retest this please.
Test build #98181 has finished for PR 22721 at commit
Retest this please.
Test build #98186 has finished for PR 22721 at commit
Review thread on ...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala (outdated, resolved).
cc @cloud-fan
retest this please
Test build #110863 has finished for PR 22721 at commit
retest this please
Test build #110884 has finished for PR 22721 at commit
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
does the problem still exist? I think we need to merge this PR.
Yes. It still exists:

scala> spark.version
res0: String = 3.0.0-preview2

scala> val path = "/tmp/spark/parquet"
path: String = /tmp/spark/parquet

scala> spark.sql("CREATE TABLE t (a INT) USING parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("INSERT INTO TABLE t VALUES (1)")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.range(5).toDF("a").write.parquet(path)

scala> spark.sql(s"ALTER TABLE t SET LOCATION '${path}'")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.table("t").count() // returns 1 (stale cached relation)
res5: Long = 1

scala> spark.sql("refresh table t")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.table("t").count() // returns 5 after a manual refresh
res7: Long = 5
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
Test build #116137 has finished for PR 22721 at commit
@@ -50,6 +50,9 @@ object CommandUtils extends Logging {
       catalog.alterTableStats(table.identifier, Some(newStats))
     } else if (table.stats.nonEmpty) {
       catalog.alterTableStats(table.identifier, None)
+    } else {
+      // In other cases, we still need to invalidate the table relation cache.
to confirm: does catalog.alterTableStats refresh the relation cache?
Yes.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Lines 409 to 422 in 1743d5b:

  /**
   * Alter Spark's statistics of an existing metastore table identified by the provided table
   * identifier.
   */
  def alterTableStats(identifier: TableIdentifier, newStats: Option[CatalogStatistics]): Unit = {
    val db = formatDatabaseName(identifier.database.getOrElse(getCurrentDatabase))
    val table = formatTableName(identifier.table)
    val tableIdentifier = TableIdentifier(table, Some(db))
    requireDbExists(db)
    requireTableExists(tableIdentifier)
    externalCatalog.alterTableStats(db, table, newStats)
    // Invalidate the table relation cache
    refreshTable(identifier)
  }
Test build #116140 has finished for PR 22721 at commit
retest this please
Test build #116179 has finished for PR 22721 at commit
Hi, @wangyum. According to @nchammas and @HeartSaVioR, it seems that you need to remove the label
thanks, merging to master!
Thank you, @wangyum and @cloud-fan.
Test build #116205 has finished for PR 22721 at commit
@@ -50,6 +50,9 @@ object CommandUtils extends Logging {
       catalog.alterTableStats(table.identifier, Some(newStats))
     } else if (table.stats.nonEmpty) {
       catalog.alterTableStats(table.identifier, None)
+    } else {
+      // In other cases, we still need to invalidate the table relation cache.
Could you explain why we need to refresh the table while updating stats, please? For instance, we can end up doing the same work twice, since some commands already refresh before updating stats. See:

- InsertIntoHiveTable:
  spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
  Lines 113 to 115 in 157b72a:
    sparkSession.sessionState.catalog.refreshTable(table.identifier)
    CommandUtils.updateTableStats(sparkSession, table)
- LoadDataCommand:
  spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
  Lines 394 to 396 in ddc0d51:
    catalog.refreshTable(targetTable.identifier)
    CommandUtils.updateTableStats(sparkSession, targetTable)
- AlterTableDropPartitionCommand:
  spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
  Lines 586 to 587 in e0e06c1:
    sparkSession.catalog.refreshTable(table.identifier.quotedString)
    CommandUtils.updateTableStats(sparkSession, table)
Not all commands have refresh table logic, such as AlterTableSetLocationCommand:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
Lines 836 to 837 in b77d11d:
    CommandUtils.updateTableStats(sparkSession, table)
    Seq.empty[Row]
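For context, a hedged sketch of the whole method after this change, with the truncated body of the new else branch filled in. The refreshTable call is inferred from this discussion and from the SessionCatalog.alterTableStats snippet quoted above; the merged code may differ in details:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}

  // Sketch of CommandUtils.updateTableStats after this PR (inferred, not verbatim).
  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
    val catalog = sparkSession.sessionState.catalog
    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
      // Recompute and persist the new table size when auto-update is enabled.
      val newTable = catalog.getTableMetadata(table.identifier)
      val newSize = CommandUtils.calculateTotalSize(sparkSession, newTable)
      catalog.alterTableStats(table.identifier, Some(CatalogStatistics(sizeInBytes = newSize)))
    } else if (table.stats.nonEmpty) {
      // Stale stats exist but auto-update is off: drop them.
      catalog.alterTableStats(table.identifier, None)
    } else {
      // In other cases, we still need to invalidate the table relation cache.
      // alterTableStats is never reached here, so refresh explicitly (inferred).
      catalog.refreshTable(table.identifier)
    }
  }

In the first two branches, alterTableStats already refreshes the relation cache (see the SessionCatalog snippet quoted earlier), so only the final branch needs an explicit call.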
What changes were proposed in this pull request?

We invalidate the table relation cache once table data is changed, per SPARK-21237. But there is one situation where we do not invalidate it (spark.sql.statistics.size.autoUpdate.enabled=false and table.stats.isEmpty):

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
Lines 44 to 54 in 07c4b9b
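The embedded snippet from that permalink did not carry over here, so below is a hedged reconstruction of the pre-fix logic, pieced together from the diff context quoted in the review discussion above (not verbatim):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}

  // Reconstructed pre-fix shape of CommandUtils.updateTableStats.
  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
    val catalog = sparkSession.sessionState.catalog
    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
      val newTable = catalog.getTableMetadata(table.identifier)
      val newSize = CommandUtils.calculateTotalSize(sparkSession, newTable)
      catalog.alterTableStats(table.identifier, Some(CatalogStatistics(sizeInBytes = newSize)))
    } else if (table.stats.nonEmpty) {
      catalog.alterTableStats(table.identifier, None)
    }
    // No else branch: with auto-update disabled and empty stats, alterTableStats
    // (which refreshes the relation cache) is never called, so a stale cached
    // relation survives.
  }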
This will introduce some issues, e.g. SPARK-19784, SPARK-19845, SPARK-25403, SPARK-25332 and SPARK-28413.

An example that reproduces SPARK-19784 is the spark-shell session shown earlier in this thread.

This PR invalidates the table relation in this case (spark.sql.statistics.size.autoUpdate.enabled=false and table.stats.isEmpty) to fix the issue.

How was this patch tested?
unit tests
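As an illustration, a hypothetical test along these lines would cover the SPARK-19784 scenario from the repro above (the helpers assume a suite mixing in Spark's SharedSparkSession/SQLTestUtils; this is not necessarily the exact test added by the PR):

  test("SPARK-19784: changing the table location should invalidate the relation cache") {
    withTable("t") {
      withTempDir { dir =>
        val path = dir.getCanonicalPath
        sql("CREATE TABLE t (a INT) USING parquet")
        sql("INSERT INTO TABLE t VALUES (1)")
        // Write 5 rows to an external path, then point the table at it.
        spark.range(5).toDF("a").write.mode("overwrite").parquet(path)
        sql(s"ALTER TABLE t SET LOCATION '$path'")
        // With the fix, no manual REFRESH TABLE should be needed here.
        assert(spark.table("t").count() === 5)
      }
    }
  }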