[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project #26978

viirya · 2019-12-22T05:33:13Z

What changes were proposed in this pull request?

This patch prunes unnecessary nested fields from a Generate that has no Project on top of it.

Why are the changes needed?

In the optimizer, we can already prune nested columns from Project(projectList, Generate). However, unnecessary columns can still be read by a Generate that has no Project on top of it. We should prune those too.
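To make the idea concrete, here is a minimal sketch of nested pruning on a toy schema model. It is not Catalyst's implementation: `DataType`, `Struct`, `ArrayOf`, `Leaf`, and `prune` are hypothetical names invented for this illustration, and the schema (`items` as an array of structs with `itemId` and `itemData`, plus an unrelated `other` column) is only an assumed example. Given the nested paths a query actually reads, `prune` keeps just those parts of the schema, which is roughly what the scan under a Generate should see.

```scala
// Toy model of nested schema pruning. Hypothetical types for illustration only;
// these are NOT Catalyst's data types or optimizer rules.
sealed trait DataType
case object Leaf extends DataType                                   // stands in for any primitive type
final case class Struct(fields: List[(String, DataType)]) extends DataType
final case class ArrayOf(element: DataType) extends DataType

object NestedPruningSketch {
  /** Keep only the parts of `dt` reachable through at least one required path. */
  def prune(dt: DataType, paths: List[List[String]]): DataType =
    if (paths.exists(_.isEmpty)) dt // some consumer needs the whole value
    else dt match {
      case Struct(fields) =>
        Struct(fields.flatMap { case (name, child) =>
          // Paths that start with this field, with the field name stripped off.
          val rest = paths.collect { case `name` :: tail => tail }
          if (rest.isEmpty) None else Some(name -> prune(child, rest))
        })
      case ArrayOf(element) =>
        // A path such as items.itemId looks through the array to its element struct.
        ArrayOf(prune(element, paths))
      case Leaf => Leaf
    }

  def main(args: Array[String]): Unit = {
    val schema = Struct(List(
      "items" -> ArrayOf(Struct(List("itemId" -> Leaf, "itemData" -> Leaf))),
      "other" -> Leaf))

    // A query like explode(items.itemId) only needs this one nested path.
    val required = List(List("items", "itemId"))
    println(prune(schema, required))
  }
}
```

With only `items.itemId` required, the pruned schema drops both `itemData` and `other`. Every path read by any expression in the query has to be in the list, so a query that also reads `items.itemData` would pass both paths and keep both fields.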

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

viirya · 2019-12-22T05:34:01Z

cc @dongjoon-hyun @dbtsai @cloud-fan

MaxGekk

In PR description:
to prune necessary -> to prune unnecessary

viirya · 2019-12-22T17:25:20Z

Thanks! @MaxGekk

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

SparkQA · 2019-12-23T23:11:18Z

Test build #115649 has finished for PR 26978 at commit 296293c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-12-24T01:35:48Z

btw, is there any reason not to support the other project-like plans (e.g., aggregate) for nested column pruning?

viirya · 2019-12-24T04:14:42Z

@maropu I think it's because nested column pruning is a new feature, so support for some plans isn't done yet. We can add more later. I thought about it before, but haven't worked on it yet.

maropu · 2019-12-24T04:37:11Z

ah, ok. thanks for the info.

SparkQA · 2019-12-26T07:31:32Z

Test build #115791 has finished for PR 26978 at commit 06d2b80.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-01-10T07:46:44Z

Shall we hold off on this PR for a bit until the bug in #24637 is identified and resolved?

viirya · 2020-01-10T07:52:01Z

@dongjoon-hyun Yes, I do think so too. Let's see if we can have more details from @cloud-fan.

cloud-fan · 2020-01-13T14:18:25Z

Seems like it was a merge conflict from when we synced with upstream. Please go ahead and don't get blocked by me.

dongjoon-hyun · 2020-01-13T20:48:30Z

Oh. Thank you for updating, @cloud-fan !

viirya · 2020-01-13T21:22:06Z

Thanks @cloud-fan!

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

SparkQA · 2020-01-14T05:49:43Z

Test build #116670 has finished for PR 26978 at commit f9abd6d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-14T08:02:43Z

Test build #116690 has started for PR 26978 at commit fd7d9bb.

viirya · 2020-01-14T16:55:02Z

retest this please

SparkQA · 2020-01-14T21:05:52Z

Test build #116719 has finished for PR 26978 at commit fd7d9bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-01-14T22:01:12Z

Also, cc @dbtsai

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

SparkQA · 2020-01-18T04:04:24Z

Test build #116968 has finished for PR 26978 at commit 19f7cd4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-24T22:44:26Z

Test build #117378 has finished for PR 26978 at commit a9f21be.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-01-25T01:18:37Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -301,6 +301,38 @@ abstract class SchemaPruningSuite
    checkAnswer(query, Row("Y.", 1) :: Row("X.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
  }

+  testSchemaPruning("select explode of nested field of array of struct") {
+    // Config combinations
+    val configs = Seq((true, true), (true, false), (false, true), (false, false))


SparkQA · 2020-01-25T03:11:16Z

Test build #117381 has finished for PR 26978 at commit 35b32ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you so much, @viirya , @cloud-fan , @maropu , @MaxGekk .
(cc @gatorsmile and @dbtsai )

gatorsmile · 2020-02-09T02:47:41Z

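    // Assumes a Spark SQL test suite where `withTable`, `spark`, `spark.implicits._`,
    // and `org.apache.spark.sql.functions._` are already in scope.
    withTable("persisted") {
      val jsonStr = """{
         "items": [
           {"itemId": 1, "itemData": "a"},
           {"itemId": 2, "itemData": "b"}
         ]
        }"""
      val df = spark.read.json(Seq(jsonStr).toDS)
      df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

      val read = spark.table("persisted")
      spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
      // The generator reads items.itemId while the select list also reads items.itemData.
      read.select(explode_outer($"items.itemId"), $"items.itemData").explain(true)
    }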

Try the above example. @dongjoon-hyun @viirya

gatorsmile · 2020-02-09T02:54:51Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -301,6 +301,38 @@ abstract class SchemaPruningSuite
    checkAnswer(query, Row("Y.", 1) :: Row("X.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
  }

+  testSchemaPruning("select explode of nested field of array of struct") {


I think the reason we did not catch the bug is that our tests were not well designed and reviewed.

We have to be super careful when we review the tests; then it will be much easier to find the bugs.

viirya · 2020-02-09

Thanks for catching it and pinging me. Let me look at it.

viirya · 2020-02-09

Opened #27503 to fix it.

…ate without Project

This reverts commit a0e63b6.

### What changes were proposed in this pull request?

This reverts the patch at #26978 based on gatorsmile's suggestion.

### Why are the changes needed?

Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #27504 from viirya/revert-SPARK-29721.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

dongjoon-hyun · 2020-02-10T04:26:16Z

Thank you, @gatorsmile . I'll be more careful.

…out Project

### What changes were proposed in this pull request?

This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.

### Why are the changes needed?

In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#26978 from viirya/SPARK-29721.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Prune unnecessary nested fields.

86a4cfa

MaxGekk reviewed Dec 22, 2019

dongjoon-hyun added the SQL label Dec 22, 2019

maropu reviewed Dec 23, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala (outdated)

Address comment.

296293c

Should be nestedSchemaPruning.

06d2b80

Fix conflict between nested pruning cases.

f9abd6d

viirya commented Jan 14, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Clean code and add test.

fd7d9bb

dongjoon-hyun reviewed Jan 14, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

Add few doc.

19f7cd4

Add more test combination.

a9f21be

viirya added 2 commits January 24, 2020 14:55

Merge remote-tracking branch 'upstream/master' into SPARK-29721

3e4218d

Fix style error.

35b32ec

dongjoon-hyun reviewed Jan 25, 2020

dongjoon-hyun approved these changes Jan 25, 2020

dongjoon-hyun closed this in a0e63b6 Jan 25, 2020

gatorsmile reviewed Feb 9, 2020

viirya mentioned this pull request Feb 9, 2020

Revert "[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project #27504

Closed

viirya deleted the SPARK-29721 branch December 27, 2023 18:38
