-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables #31804
Conversation
@@ -0,0 +1,2 @@ | |||
--SET spark.sql.legacy.keepCommandOutputSchema=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be useful to verify the legacy schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it relevant to this PR's isView
column addition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code change is related, spark.sql.legacy.keepCommandOutputSchema=true
will omit the isView
column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @cloud-fan @dongjoon-hyun @HyukjinKwon @maropu thanks for the review. |
Kubernetes integration test starting |
@@ -533,7 +534,8 @@ object ShowTableExtended { | |||
AttributeReference("namespace", StringType, nullable = false)(), | |||
AttributeReference("tableName", StringType, nullable = false)(), | |||
AttributeReference("isTemporary", BooleanType, nullable = false)(), | |||
AttributeReference("information", StringType, nullable = false)()) | |||
AttributeReference("information", StringType, nullable = false)(), | |||
AttributeReference("isView", BooleanType, nullable = false)()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it added at the last to reduce the breaking change effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
Outdated
Show resolved
Hide resolved
@@ -366,13 +366,13 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager) | |||
partitionSpec @ (None | Some(UnresolvedPartitionSpec(_, _))), | |||
output) => | |||
val newOutput = if (conf.getConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA)) { | |||
assert(output.length == 4) | |||
output.head.withName("database") +: output.tail | |||
assert(output.length == 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this seems inconsistent with the doc. The current document means spark.sql.legacy.keepCommandOutputSchema
means the 3.1 or earlier schema, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing a new legacy
conf for this behavior change seems kind of trivial and might bring cognition burdens for users. So the config is reused for now and the doc will be updated if it is the right way to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not reused technically. If we reuse the existing conf, this should be output.length == 4
because it disable this PR and the previous commit simultaneously.
So the config is reused for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be output.length == 4
indeed, this is true. output.head.withName("database") +: output.slice(1, 4)
will cut the isView
off
if (output.size == 4) { | ||
Row(database, tableName, isTemp, isView) | ||
} else { | ||
Row(database, tableName, isTemp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still have a test coverage for this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the new show-tables-legacy.sql
will import the corresponding tests to cover. I can add some cases in v1.ShowTablesSuite
if show-tables-legacy.sql
is unintuitive
Kubernetes integration test starting |
Kubernetes integration test status success |
@@ -514,7 +514,8 @@ object ShowTables { | |||
def getOutputAttrs: Seq[Attribute] = Seq( | |||
AttributeReference("namespace", StringType, nullable = false)(), | |||
AttributeReference("tableName", StringType, nullable = false)(), | |||
AttributeReference("isTemporary", BooleanType, nullable = false)()) | |||
AttributeReference("isTemporary", BooleanType, nullable = false)(), | |||
AttributeReference("isView", BooleanType, nullable = false)()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new column isView: Boolean
is more efficient, but I'm wondering if a new column tableType: String
is more user-friendly. The value can be TABLE
or VIEW
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, most modern databases store tableType
as a string column in INFORMATION_SCHEMA.TABLES
.
On the reading side, they also return a string value for it.
FYI, https://docs.google.com/spreadsheets/d/1LeHYbGCDjgr-rYwQMlHBhJbeDxjOVuUatozgvKdxq6Y/edit#gid=0
Besides, as SHOW TABLES
is not ANSI standard, so it might be good for us to follow the JDBC standard. Then we can make our command and JDBC meta operation consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JDBC protocol also use table type, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, another benefit for using string is if we decide to subdivide tables to something like SYSTEM TABLE
/BASE TABLE
, xxx TABLE
e.t.c, we won't break then
Kubernetes integration test status failure |
Test build #135958 has finished for PR 31804 at commit
|
Test build #135975 has finished for PR 31804 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Could you add some query examples for this behaviour change in the PR description? Also, I think we need to update the migration guide, too. |
Thanks, @maropu, your suggestions sound good~ |
Test build #136002 has finished for PR 31804 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
retest this please |
Test build #136005 has finished for PR 31804 at commit
|
@@ -22,7 +22,7 @@ log4j.rootLogger=INFO, CA, FA | |||
log4j.appender.CA=org.apache.log4j.ConsoleAppender | |||
log4j.appender.CA.layout=org.apache.log4j.PatternLayout | |||
log4j.appender.CA.layout.ConversionPattern=%d{HH:mm:ss.SSS} %p %c: %m%n | |||
log4j.appender.CA.Threshold = WARN | |||
log4j.appender.CA.Threshold = FATAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oversized log for GA console output cause truncation fo useful test errors, andthe unit-tests.log is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, Is it better to backport this change into the previous branches? cc: @HyukjinKwon @dongjoon-hyun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I guess so. we can mute console output for most modules. we call o.s.Assertions.intercept
frequently which produces a lot of unnecessary error logs.
When we mute them, the error stacktraces for failed tests can still be kept.
I can make a separate PR to fix it if it makes senses to the CCers too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the current threshold (WARN) can help catch a bug like the following early: #31273 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the current threshold (WARN) can help catch a bug like the following early: #31273 (comment)
These warning messages still can be found in the unit-tests.log
s. I don't see much difference as a warning message can still be simply ignored.
But when we encounter test failures but got omitted by the CI, it is hard for us to locate to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's okay as long as unit-tests.log
contains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll send a PR then
@@ -510,11 +510,21 @@ case class ShowTables( | |||
override def children: Seq[LogicalPlan] = Seq(namespace) | |||
} | |||
|
|||
object ShowTables { | |||
trait ShowTablesLegacyHelper { | |||
def getOutputAttrs: Seq[Attribute] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a good idea that the code to handle this legacy behaviour depends on trait
? If we remove this legacy behaviour in far future and we remove this trait
, the change can lead to binary-incompatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callers are bundled in the catalyst module, I guess it's safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry for my late reply...) How about simply inlining getLegacyOutputAttrs
? It seems there are the only two places where getLegacyOutputAttrs
is used, so I'm not sure that we need this trait.
retest this please |
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136544 has finished for PR 31804 at commit
|
Kubernetes integration test status failure |
Test build #136543 has finished for PR 31804 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136545 has finished for PR 31804 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine otherwise.
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136705 has finished for PR 31804 at commit
|
cc @cloud-fan @HyukjinKwon PTAL, thanks |
Test build #137568 has finished for PR 31804 at commit
|
Test build #137604 has finished for PR 31804 at commit
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
SHOW TABLES has an output column
isTemporary
which only indicates whether a view is a local temp view or not, that is not enough for users if they want a pipeline of Spark commands, such asSHOW TABLES foreach { case view => DROP VIEW; case table => DROP TABLE}
Why are the changes needed?
distinguish view and tables
Usually, most modern databases store tableType as a string column in INFORMATION_SCHEMA.TABLES.
On the reading side, they also return a string value for it.
FYI, https://docs.google.com/spreadsheets/d/1LeHYbGCDjgr-rYwQMlHBhJbeDxjOVuUatozgvKdxq6Y/edit#gid=0
Besides, as SHOW TABLES is not ANSI-standard, so it might be good for us to follow the JDBC standard. Then we can make our command and JDBC meta operation consistent
Does this PR introduce any user-facing change?
yes, show tables and show table extended will have one new column called
tableType
at the endSHOW TABLES ...
SHOW TABLE EXTENDED ...
How was this patch tested?
ut modified and added