[SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables #31804

yaooqinn · 2021-03-11T03:17:53Z

What changes were proposed in this pull request?

SHOW TABLES has an output column, isTemporary, which only indicates whether a view is a local temp view or not. That is not enough for users who want to build a pipeline of Spark commands, such as SHOW TABLES foreach { case view => DROP VIEW; case table => DROP TABLE }. This PR adds a tableType column to the output of SHOW TABLES and SHOW TABLE EXTENDED.

Why are the changes needed?

To distinguish views from tables.

Usually, most modern databases store tableType as a string column in INFORMATION_SCHEMA.TABLES.

On the reading side, they also return a string value for it.

FYI, https://docs.google.com/spreadsheets/d/1LeHYbGCDjgr-rYwQMlHBhJbeDxjOVuUatozgvKdxq6Y/edit#gid=0

Besides, since SHOW TABLES is not ANSI-standard, it might be good for us to follow the JDBC standard. Then we can make our command and the JDBC meta operation consistent.

Does this PR introduce any user-facing change?

Yes. SHOW TABLES and SHOW TABLE EXTENDED will have one new column, tableType, at the end of their output.

SHOW TABLES ...

# before
struct<namespace:string,tableName:string,isTemporary:boolean>

# after
struct<namespace:string,tableName:string,isTemporary:boolean,tableType:string>

SHOW TABLE EXTENDED ...

# before
struct<namespace:string,tableName:string,isTemporary:boolean,information:string>

# after
struct<namespace:string,tableName:string,isTemporary:boolean,information:string,tableType:string>
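The pipeline mentioned in the description could be sketched roughly as follows. This is a minimal model, not Spark code: `ShowTablesRow` and `DropStatements` are hypothetical names, the `"VIEW"` value follows the TABLE/VIEW values suggested in the review, and the empty namespace for local temp views is an assumption.

```scala
// Hypothetical stand-in for one row of the proposed SHOW TABLES output;
// it models the four-column schema only and is not a Spark class.
final case class ShowTablesRow(
    namespace: String,
    tableName: String,
    isTemporary: Boolean,
    tableType: String)

object DropStatements {
  // Builds a DROP statement for one row, choosing the command by tableType.
  // "VIEW" is an assumed value; every other value is treated as a table.
  def dropStatement(row: ShowTablesRow): String = {
    // Rows with an empty namespace (assumed here for local temp views)
    // are addressed by their bare name.
    val name =
      if (row.namespace.isEmpty) row.tableName
      else s"${row.namespace}.${row.tableName}"
    if (row.tableType == "VIEW") s"DROP VIEW $name" else s"DROP TABLE $name"
  }
}
```

In a real session the rows would presumably come from collecting the result of `spark.sql("SHOW TABLES")` and reading the columns by name; the sketch only shows the branching that the boolean isTemporary column alone cannot support.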

How was this patch tested?

Unit tests were modified and added.

yaooqinn · 2021-03-11T04:00:45Z

sql/core/src/test/resources/sql-tests/inputs/show-tables-legacy.sql

@@ -0,0 +1,2 @@
+--SET spark.sql.legacy.keepCommandOutputSchema=true


This might be useful to verify the legacy schema

Is it relevant to this PR's isView column addition?

The code change is related: spark.sql.legacy.keepCommandOutputSchema=true will omit the isView column.

For instance,
https://github.com/apache/spark/pull/31804/files#diff-a854c4d2d3a463feef2307548a917bae452850a16de730f8a14f84a4eb79a16fR64

https://github.com/apache/spark/pull/31804/files#diff-b6f30759017988fd0963ce840918f541caf610a1793fa73f17e61b03a2acb797R64

yaooqinn · 2021-03-11T04:01:16Z

cc @cloud-fan @dongjoon-hyun @HyukjinKwon @maropu thanks for the review.

SparkQA · 2021-03-11T04:35:10Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40542/

dongjoon-hyun · 2021-03-11T04:46:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

@@ -533,7 +534,8 @@ object ShowTableExtended {
    AttributeReference("namespace", StringType, nullable = false)(),
    AttributeReference("tableName", StringType, nullable = false)(),
    AttributeReference("isTemporary", BooleanType, nullable = false)(),
-    AttributeReference("information", StringType, nullable = false)())
+    AttributeReference("information", StringType, nullable = false)(),
+    AttributeReference("isView", BooleanType, nullable = false)())


Is it added at the end to reduce the impact of the breaking change?

dongjoon-hyun · 2021-03-11T04:54:58Z

sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala

@@ -366,13 +366,13 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager)
        partitionSpec @ (None | Some(UnresolvedPartitionSpec(_, _))),
        output) =>
      val newOutput = if (conf.getConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA)) {
-        assert(output.length == 4)
-        output.head.withName("database") +: output.tail
+        assert(output.length == 5)


Well, this seems inconsistent with the doc. The current documentation says that spark.sql.legacy.keepCommandOutputSchema restores the 3.1 or earlier schema, doesn't it?

Introducing a new legacy conf for this rather trivial behavior change might add cognitive burden for users. So the config is reused for now, and the doc will be updated if this is the right way to go.

Technically, this is not reused. If we reuse the existing conf, this should be output.length == 4, because it disables this PR and the previous commit simultaneously.

So the config is reused for now

this should be output.length == 4

Indeed, this is true. output.head.withName("database") +: output.slice(1, 4) will cut the isView column off.
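To illustrate why the slice expression quoted above yields the four-column legacy schema, here is a small self-contained model. `Attr` and `LegacySchema` are hypothetical stand-ins for Catalyst attributes, used only to show the list manipulation; the column names follow the diff in this thread.

```scala
// Hypothetical stand-in for a Catalyst attribute; only the name matters here.
final case class Attr(name: String) {
  def withName(newName: String): Attr = copy(name = newName)
}

object LegacySchema {
  // Five-column SHOW TABLE EXTENDED output as it appears in the diff above.
  val extendedOutput: Seq[Attr] =
    Seq("namespace", "tableName", "isTemporary", "information", "isView").map(Attr(_))

  // Legacy shape: rename the first column to "database" and keep only
  // indices 1 to 3, which drops the trailing fifth column.
  def toLegacy(output: Seq[Attr]): Seq[Attr] = {
    assert(output.length == 5)
    output.head.withName("database") +: output.slice(1, 4)
  }
}
```

Under these assumptions, `LegacySchema.toLegacy(LegacySchema.extendedOutput)` keeps database, tableName, isTemporary and information, so one legacy flag would undo both the column rename and the new column.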

dongjoon-hyun · 2021-03-11T04:55:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+          if (output.size == 4) {
+            Row(database, tableName, isTemp, isView)
+          } else {
+            Row(database, tableName, isTemp)


Do we still have test coverage for this line?

Yes, the new show-tables-legacy.sql imports the corresponding tests to cover it. I can add some cases in v1.ShowTablesSuite if show-tables-legacy.sql is unintuitive.
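As a rough idea of what a direct check of both branches could look like, the sketch below models the row construction from the diff with plain Scala collections. `ShowTablesRows.buildRow` is a hypothetical helper, not the code in tables.scala, and it uses `Seq[Any]` in place of Spark's `Row`.

```scala
object ShowTablesRows {
  // Builds one result row whose width matches the requested output schema.
  // outputSize == 4 models the new schema; any other size models the legacy one.
  def buildRow(
      outputSize: Int,
      database: String,
      tableName: String,
      isTemp: Boolean,
      isView: Boolean): Seq[Any] =
    if (outputSize == 4) Seq(database, tableName, isTemp, isView)
    else Seq(database, tableName, isTemp)
}
```

A suite-level test would then assert on the row width for both the default and the legacy setting, which is arguably easier to read than relying on the golden file alone.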

SparkQA · 2021-03-11T05:09:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40543/

SparkQA · 2021-03-11T05:10:46Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40542/

cloud-fan · 2021-03-11T05:17:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

@@ -514,7 +514,8 @@ object ShowTables {
  def getOutputAttrs: Seq[Attribute] = Seq(
    AttributeReference("namespace", StringType, nullable = false)(),
    AttributeReference("tableName", StringType, nullable = false)(),
-    AttributeReference("isTemporary", BooleanType, nullable = false)())
+    AttributeReference("isTemporary", BooleanType, nullable = false)(),
+    AttributeReference("isView", BooleanType, nullable = false)())


A new column isView: Boolean is more efficient, but I'm wondering if a new column tableType: String is more user-friendly. The value can be TABLE or VIEW.

Usually, most modern databases store tableType as a string column in INFORMATION_SCHEMA.TABLES.

On the reading side, they also return a string value for it.

FYI, https://docs.google.com/spreadsheets/d/1LeHYbGCDjgr-rYwQMlHBhJbeDxjOVuUatozgvKdxq6Y/edit#gid=0

Besides, since SHOW TABLES is not ANSI standard, it might be good for us to follow the JDBC standard. Then we can make our command and the JDBC meta operation consistent.

The JDBC protocol also uses table type, right?

Yes. Another benefit of using a string is that if we decide to subdivide tables into something like SYSTEM TABLE, BASE TABLE, xxx TABLE, etc., we won't break the schema then.
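The extensibility argument can be sketched with a consumer that keys on the string value. `TableTypeSummary` is a hypothetical helper, and "SYSTEM TABLE" is only the illustrative subtype named in the comment above, not a value Spark is known to emit.

```scala
object TableTypeSummary {
  // Counts entries per tableType value. A value the consumer has never seen
  // simply becomes a new key, so the output schema does not need to change.
  def countByType(tableTypes: Seq[String]): Map[String, Int] =
    tableTypes.groupBy(identity).map { case (tableType, entries) => tableType -> entries.size }
}
```

With a boolean isView column, introducing a third kind of entry would require another column or a changed meaning; with a string column, consumers written like this keep working.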

SparkQA · 2021-03-11T05:18:17Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40543/

SparkQA · 2021-03-11T05:24:55Z

Test build #135958 has finished for PR 31804 at commit 2546812.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-11T16:14:02Z

Test build #135975 has finished for PR 31804 at commit 6bc1614.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ShowTablesLegacyHelper

SparkQA · 2021-03-11T18:31:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40561/

SparkQA · 2021-03-11T18:59:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40561/

maropu · 2021-03-12T00:41:40Z

Could you add some query examples for this behaviour change in the PR description? Also, I think we need to update the migration guide, too.

yaooqinn · 2021-03-12T01:48:59Z

Could you add some query examples for this behaviour change in the PR description? Also, I think we need to update the migration guide, too.

Thanks, @maropu, your suggestions sound good~

SparkQA · 2021-03-12T13:56:01Z

Test build #136002 has finished for PR 31804 at commit 5e9dbcc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-12T14:40:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40589/

SparkQA · 2021-03-12T15:11:04Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40589/

yaooqinn · 2021-03-12T15:45:17Z

retest this please

SparkQA · 2021-03-12T18:06:31Z

Test build #136005 has finished for PR 31804 at commit bee8cbe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2021-03-12T18:11:57Z

sql/core/src/test/resources/log4j.properties

@@ -22,7 +22,7 @@ log4j.rootLogger=INFO, CA, FA
 log4j.appender.CA=org.apache.log4j.ConsoleAppender
 log4j.appender.CA.layout=org.apache.log4j.PatternLayout
 log4j.appender.CA.layout.ConversionPattern=%d{HH:mm:ss.SSS} %p %c: %m%n
-log4j.appender.CA.Threshold = WARN
+log4j.appender.CA.Threshold = FATAL


Oversized logs in the GA console output cause truncation of useful test errors, and unit-tests.log is enough.

If so, is it better to backport this change into the previous branches? cc: @HyukjinKwon @dongjoon-hyun

Yeah, I guess so. We can mute console output for most modules. We call o.s.Assertions.intercept frequently, which produces a lot of unnecessary error logs.

When we mute them, the error stacktraces for failed tests can still be kept.

I can make a separate PR to fix it if it makes sense to the CCers too.

But the current threshold (WARN) can help catch a bug like the following early: #31273 (comment)

But the current threshold (WARN) can help catch a bug like the following early: #31273 (comment)

These warning messages can still be found in unit-tests.log. I don't see much difference, as a warning message can still simply be ignored.

But when test failures are truncated from the CI output, it is hard for us to locate them.

I think it's okay as long as unit-tests.log contains them.

Thanks, I'll send a PR then

maropu · 2021-03-13T06:03:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

@@ -510,11 +510,21 @@ case class ShowTables(
  override def children: Seq[LogicalPlan] = Seq(namespace)
 }

-object ShowTables {
+trait ShowTablesLegacyHelper {
+  def getOutputAttrs: Seq[Attribute]


Is it a good idea for the code handling this legacy behaviour to depend on a trait? If we remove this legacy behaviour in the far future and remove this trait, could the change lead to binary incompatibility?

The callers are bundled in the catalyst module, so I guess it's safe?

(Sorry for my late reply...) How about simply inlining getLegacyOutputAttrs? It seems there are only two places where getLegacyOutputAttrs is used, so I'm not sure that we need this trait.

HyukjinKwon · 2021-03-26T02:09:35Z

retest this please

SparkQA · 2021-03-26T04:24:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41127/

SparkQA · 2021-03-26T04:31:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41128/

SparkQA · 2021-03-26T05:06:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41128/

SparkQA · 2021-03-26T05:34:54Z

Test build #136544 has finished for PR 31804 at commit eaba257.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-26T05:44:30Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41127/

SparkQA · 2021-03-26T05:53:06Z

Test build #136543 has finished for PR 31804 at commit bee8cbe.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-26T06:25:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41129/

SparkQA · 2021-03-26T07:18:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41129/

SparkQA · 2021-03-26T10:36:42Z

Test build #136545 has finished for PR 31804 at commit f50d87b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

Looks fine otherwise.

SparkQA · 2021-03-30T09:30:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41286/

SparkQA · 2021-03-30T09:38:49Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41286/

SparkQA · 2021-03-30T13:37:02Z

Test build #136705 has finished for PR 31804 at commit 3ef5b5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2021-03-31T03:58:14Z

cc @cloud-fan @HyukjinKwon PTAL, thanks

SparkQA · 2021-04-19T09:49:24Z

Test build #137568 has finished for PR 31804 at commit 3ef5b5f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-19T16:08:26Z

Test build #137604 has finished for PR 31804 at commit 3ef5b5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2021-07-29T00:07:40Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

yaooqinn added 3 commits March 11, 2021 11:11

[SPARK-34710][SQL] Add isView for SHOW TABLES to distinguish view and tables

587477b

[SPARK-34710][SQL] Add isView for SHOW TABLES to distinguish view and tables

5090900

[SPARK-34710][SQL] Add isView for SHOW TABLES to distinguish view and tables

2546812

github-actions bot added the SQL label Mar 11, 2021

add legacy test

423796f

yaooqinn commented Mar 11, 2021

dongjoon-hyun reviewed Mar 11, 2021

dongjoon-hyun reviewed Mar 11, 2021

cloud-fan reviewed Mar 11, 2021

use string

6bc1614

yaooqinn changed the title ~~[SPARK-34710][SQL] Add isView for SHOW TABLES to distinguish view and tables~~ [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables Mar 11, 2021

nit

d2bb4f9

github-actions bot added CORE PYTHON labels Mar 11, 2021

doc

0946863

github-actions bot added the DOCS label Mar 12, 2021

LOG and fix test

903adc9

yaooqinn commented Mar 12, 2021

maropu reviewed Mar 13, 2021

Merge branch 'master' into SPARK-34710

eaba257

update golden file

f50d87b

maropu approved these changes Mar 30, 2021

address comments

3ef5b5f

github-actions bot added the Stale label Jul 29, 2021

github-actions bot closed this Jul 30, 2021
