Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project #26978

Closed
wants to merge 9 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Dec 22, 2019

What changes were proposed in this pull request?

This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.

Why are the changes needed?

In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

@viirya
Copy link
Member Author

viirya commented Dec 22, 2019

cc @dongjoon-hyun @dbtsai @cloud-fan

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PR description:
to prune necessary -> to prune unnecessary

@viirya
Copy link
Member Author

viirya commented Dec 22, 2019

Thanks! @MaxGekk

@SparkQA
Copy link

SparkQA commented Dec 23, 2019

Test build #115649 has finished for PR 26978 at commit 296293c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Dec 24, 2019

btw, is there any reason not to support the other project-like plans (e.g., aggregate) for nested column pruning?

@viirya
Copy link
Member Author

viirya commented Dec 24, 2019

@maropu I think because nested column pruning is new feature, so some supports are not done yet. We can add more supports later. I thought about it before, but haven't worked on it yet.

@maropu
Copy link
Member

maropu commented Dec 24, 2019

ah, ok. thanks for the info.

@SparkQA
Copy link

SparkQA commented Dec 26, 2019

Test build #115791 has finished for PR 26978 at commit 06d2b80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Shall we hold on this PR a little bit until the bug of #24637 is identified and resolved?

@viirya
Copy link
Member Author

viirya commented Jan 10, 2020

@dongjoon-hyun Yes, I do think so too. Let's see if we can have more details from @cloud-fan.

@cloud-fan
Copy link
Contributor

seem like a merge conflict when we sync with upstream, please go ahead and don't get blocked by me.

@dongjoon-hyun
Copy link
Member

Oh. Thank you for updating, @cloud-fan !

@viirya
Copy link
Member Author

viirya commented Jan 13, 2020

Thanks @cloud-fan!

@SparkQA
Copy link

SparkQA commented Jan 14, 2020

Test build #116670 has finished for PR 26978 at commit f9abd6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 14, 2020

Test build #116690 has started for PR 26978 at commit fd7d9bb.

@viirya
Copy link
Member Author

viirya commented Jan 14, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jan 14, 2020

Test build #116719 has finished for PR 26978 at commit fd7d9bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Also, cc @dbtsai

@SparkQA
Copy link

SparkQA commented Jan 18, 2020

Test build #116968 has finished for PR 26978 at commit 19f7cd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 24, 2020

Test build #117378 has finished for PR 26978 at commit a9f21be.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -301,6 +301,38 @@ abstract class SchemaPruningSuite
checkAnswer(query, Row("Y.", 1) :: Row("X.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
}

testSchemaPruning("select explode of nested field of array of struct") {
// Config combinations
val configs = Seq((true, true), (true, false), (false, true), (false, false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@SparkQA
Copy link

SparkQA commented Jan 25, 2020

Test build #117381 has finished for PR 26978 at commit 35b32ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you so much, @viirya , @cloud-fan , @maropu , @MaxGekk .
(cc @gatorsmile and @dbtsai )

@gatorsmile
Copy link
Member

    withTable("persisted") {
      val jsonStr = """{
         "items": [
           {"itemId": 1, "itemData": "a"},
           {"itemId": 2, "itemData": "b"}
         ]
        }"""
      val df = spark.read.json(Seq(jsonStr).toDS)
      df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

      val read = spark.table("persisted")
      spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
      read.select(explode_outer($"items.itemId"), $"items.itemData").explain(true)
    }

Try the above example. @dongjoon-hyun @viirya

@@ -301,6 +301,38 @@ abstract class SchemaPruningSuite
checkAnswer(query, Row("Y.", 1) :: Row("X.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
}

testSchemaPruning("select explode of nested field of array of struct") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason why we did not capture the bug is our tests are not well designed and reviewed.

We have to be super careful when we review the tests and then it will be much easier to find the bugs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it and pinging me. Let me look at it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #27503 to fix it.

gatorsmile pushed a commit that referenced this pull request Feb 10, 2020
…ate without Project

This reverts commit a0e63b6.

### What changes were proposed in this pull request?

This reverts the patch at #26978 based on gatorsmile's suggestion.

### Why are the changes needed?

Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #27504 from viirya/revert-SPARK-29721.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
@dongjoon-hyun
Copy link
Member

Thank you, @gatorsmile . I'll be more careful.

viirya added a commit to viirya/spark-1 that referenced this pull request Feb 10, 2020
…out Project

### What changes were proposed in this pull request?

This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.

### Why are the changes needed?

In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#26978 from viirya/SPARK-29721.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Feb 10, 2020
…ate without Project

This reverts commit a0e63b6.

### What changes were proposed in this pull request?

This reverts the patch at #26978 based on gatorsmile's suggestion.

### Why are the changes needed?

Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #27504 from viirya/revert-SPARK-29721.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ate without Project

This reverts commit a0e63b6.

### What changes were proposed in this pull request?

This reverts the patch at apache#26978 based on gatorsmile's suggestion.

### Why are the changes needed?

Original patch apache#26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#27504 from viirya/revert-SPARK-29721.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
@viirya viirya deleted the SPARK-29721 branch December 27, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants