[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans #20485
Conversation
Force-pushed from 75950a1 to 3aa0438 (compare)
```
case _ =>
  val nameToAttr = relation.output.map(_.name).zip(relation.output).toMap
  val newOutput = reader.readSchema().map(_.name).map(nameToAttr)
  relation.copy(output = newOutput)
```
@rdblue This is the bug I mentioned before. I finally figured out a way to fix it surgically: always run column pruning, even if no column needs to be pruned. This lets us correct the required schema of the reader in case it was updated by someone else.
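A rough, self-contained sketch of that idea (the classes below are simplified stand-ins for the Spark internals, not the actual code):

```scala
// Simplified model: the optimizer applies column pruning unconditionally,
// so the reader's required schema is re-pushed even when the projection
// keeps every column. This overwrites any stale schema that an earlier
// rule may have left in the mutable reader.
case class Attribute(name: String)

trait DataSourceReader { def readSchema(): Seq[String] }

trait SupportsPushDownRequiredColumns extends DataSourceReader {
  def pruneColumns(required: Seq[String]): Unit
}

class MutableReader(fullSchema: Seq[String]) extends SupportsPushDownRequiredColumns {
  private var required: Seq[String] = fullSchema
  def readSchema(): Seq[String] = required
  def pruneColumns(requiredCols: Seq[String]): Unit =
    required = fullSchema.filter(c => requiredCols.contains(c))
}

case class Relation(output: Seq[Attribute], reader: DataSourceReader)

def pruneColumns(relation: Relation, requiredNames: Seq[String]): Relation =
  relation.reader match {
    case r: SupportsPushDownRequiredColumns =>
      // Always push, even if requiredNames covers the full output.
      r.pruneColumns(requiredNames)
      // Rebuild the relation's output from the reader's corrected schema.
      val nameToAttr = relation.output.map(a => a.name -> a).toMap
      relation.copy(output = r.readSchema().map(nameToAttr))
    case _ => relation
  }
```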
Test build #86979 has finished for PR 20485 at commit
Test build #86981 has finished for PR 20485 at commit
To be clear, the purpose of this commit, like #20476, is just to get something working for the 2.3.0 release? I just want to make sure, since I think we should be approaching these problems with a better initial design for the integration. I'm fine getting this in to unblock a release, but if it isn't for that purpose, then I think we should fix the design problems first.
@rdblue This is another bug we found during the code review. The goal is to ensure Data Source API V2 is usable with at least the same feature set as Data Source API V1. After getting more feedback about Data Source API V2 from the community, we will restart the discussion about the data source API design in the next release.
Sounds fine to me, then. My focus is on the long-term design issues. I still think that making plans immutable and reusing the existing push-down code as much as possible is the best way to get a reliable 2.3.0, but it's fine if those changes don't make the release.
jenkins retest this please
Test build #86999 has finished for PR 20485 at commit
retest this please
Test build #87013 has finished for PR 20485 at commit
retest this please
Test build #87086 has finished for PR 20485 at commit
@gatorsmile @rdblue Please review and LGTM this. It will unblock my PR, #20445.
```
case relation: DataSourceV2Relation => relation.reader match {
  case reader: SupportsPushDownRequiredColumns =>
    // TODO: Enable the below assert after we make `DataSourceV2Relation` immutable. Fow now
```
Typo: "Fow" should be "For".
LGTM. Thanks! Merged to master/2.3.
…e v2 related plans

## What changes were proposed in this pull request?

`DataSourceV2Relation` keeps a `fullOutput` and resolves the real output on demand by column name lookup, i.e.

```
lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
  fullOutput.find(_.name == name).get
}
```

This will be broken after we canonicalize the plan, because all attribute names become "None"; see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42

To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning.

## How was this patch tested?

A new test case.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20485 from cloud-fan/canonicalize.

(cherry picked from commit b96a083)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
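To make the failure mode concrete, here is a self-contained sketch (simplified stand-ins for the Catalyst classes; canonicalization is reduced to a bare name rewrite) that reproduces the lookup failure described above:

```scala
// Simplified model of the bug: canonicalization rewrites attribute names
// to a placeholder, so an output that is lazily resolved by name can no
// longer be found once the lookup is forced after canonicalization.
case class Attribute(name: String)

case class Relation(fullOutput: Seq[Attribute], readSchemaNames: Seq[String]) {
  // Mirrors the old DataSourceV2Relation.output: resolved on demand, by name.
  lazy val output: Seq[Attribute] =
    readSchemaNames.map(name => fullOutput.find(_.name == name).get)

  // Mirrors canonicalization: every attribute name is normalized away.
  def canonicalized: Relation =
    copy(fullOutput = fullOutput.map(_ => Attribute("none")))
}

val rel = Relation(Seq(Attribute("a"), Attribute("b")), Seq("a"))
rel.output               // Seq(Attribute("a")) -- fine before canonicalization
rel.canonicalized.output // throws NoSuchElementException: find(...).get is None
```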
## What changes were proposed in this pull request?

`DataSourceV2Relation` keeps a `fullOutput` and resolves the real output on demand by column name lookup (the `lazy val output` snippet quoted in the commit message above). This will be broken after we canonicalize the plan, because all attribute names become "None"; see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42

To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning.

## How was this patch tested?

A new test case.
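The description doesn't inline the new test; a plausible shape for such a regression test, using the simplified model from the sketch above (a hypothetical ScalaTest-style suite — the suite name and model classes are assumptions, not the actual test added by this PR):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical regression test: once the relation stores `output` directly
// instead of resolving it by name on demand, forcing the output of a
// canonicalized plan must not throw.
class Spark23315Suite extends AnyFunSuite {
  case class Attribute(name: String)

  // Fixed shape: `output` is stored, so canonicalization rewriting names
  // no longer invalidates a name-based lookup.
  case class Relation(output: Seq[Attribute]) {
    def canonicalized: Relation =
      copy(output = output.map(_ => Attribute("none")))
  }

  test("SPARK-23315: canonicalized plan still exposes its output") {
    val relation = Relation(Seq(Attribute("a"), Attribute("b")))
    // Before the fix, accessing output here threw NoSuchElementException.
    assert(relation.canonicalized.output.length == relation.output.length)
  }
}
```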