
[SPARK-43838][SQL][FOLLOWUP] Improve DeduplicateRelations performance #48053

Conversation

mihailotim-db
Contributor

What changes were proposed in this pull request?

Reverting to the old way of handling `DeduplicateRelations` in order to improve performance. Instead of checking attribute IDs linearly, we use `HashSet.contains()`.
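A minimal sketch of the kind of change described, using illustrative stand-in types rather than the actual Catalyst code; the assumption here is simply that a `HashSet` of expression IDs replaces a per-candidate linear scan:

```scala
import scala.collection.mutable

object DeduplicationSketch {
  // Illustrative stand-ins; not the actual Catalyst classes.
  case class ExprId(id: Long)
  case class Attribute(name: String, exprId: ExprId)

  // Before: every duplicate check scans the existing attributes linearly,
  // so n checks cost O(n^2) overall.
  def isDuplicateLinear(existing: Seq[Attribute], candidate: Attribute): Boolean =
    existing.exists(_.exprId == candidate.exprId)

  // After: collect the IDs once into a HashSet; each check is then O(1) on average.
  def collectIds(existing: Seq[Attribute]): mutable.HashSet[ExprId] =
    mutable.HashSet(existing.map(_.exprId): _*)

  def isDuplicateHashed(existingIds: mutable.HashSet[ExprId], candidate: Attribute): Boolean =
    existingIds.contains(candidate.exprId)
}
```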

Why are the changes needed?

Improving performance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 10, 2024
@mihailotim-db mihailotim-db changed the title Improve DeduplicateRelations performance [SPARK-43838][SQL][FOLLOWUP] Improve DeduplicateRelations performance Sep 10, 2024
@cloud-fan
Contributor

We need to re-generate the golden files

[info] *** 4 TESTS FAILED ***
[error] Failed: Total 3559, Failed 4, Errors 0, Passed 3555, Ignored 4
[error] Failed tests:
[error] 	org.apache.spark.sql.TPCDSV2_7_PlanStabilityWithStatsSuite
[error] 	org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d72e8f9 Sep 11, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
Reverting to the old way of handling `DeduplicateRelations` in order to improve performance. Instead of checking attribute IDs linearly, we use `HashSet.contains()`.

### Why are the changes needed?
Improving performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48053 from mihailotim-db/deduplicate_relations_perf_improvement.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Oct 9, 2024
…ove performance of `DeduplicateRelations`

### What changes were proposed in this pull request?
This PR replaces the `HashSet` that is currently used with a `HashMap` to improve `DeduplicateRelations` performance.
Additionally, this PR reverts #48053, as that change is no longer needed.

### Why are the changes needed?
The current implementation doesn't use the `HashSet` properly; instead, it performs multiple linear searches on the set, creating O(n^2) complexity.
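A minimal sketch of the difference, again with illustrative stand-in types rather than the actual Spark code: a `HashSet` of whole attributes can only be searched linearly when looking up by `ExprId`, while a `HashMap` keyed by `ExprId` gives an average O(1) lookup.

```scala
import scala.collection.mutable

object LookupSketch {
  // Illustrative stand-ins; not the actual Catalyst classes.
  case class ExprId(id: Long)
  case class Attribute(name: String, exprId: ExprId)

  // A HashSet of attributes only answers "is this exact attribute present?";
  // finding an attribute by its ExprId still degenerates into a linear scan.
  def findInSet(attrs: mutable.HashSet[Attribute], id: ExprId): Option[Attribute] =
    attrs.find(_.exprId == id)

  // A HashMap keyed by ExprId answers the same question in O(1) on average.
  def findInMap(attrs: mutable.HashMap[ExprId, Attribute], id: ExprId): Option[Attribute] =
    attrs.get(id)
}
```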

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?

Closes #48392 from mihailotim-db/mihailotim-db/master.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>