[SPARK-43838][SQL][FOLLOWUP] Replace `HashSet` with `HashMap` to improve performance of `DeduplicateRelations` #48392

mihailotim-db · 2024-10-09T09:04:25Z

What changes were proposed in this pull request?

This PR replaces the `HashSet` currently used in `DeduplicateRelations` with a `HashMap` to improve performance.
Additionally, this PR reverts #48053, as that change is no longer needed.

Why are the changes needed?

The current implementation doesn't use the `HashSet` properly: instead of hash lookups, it performs multiple linear searches on the set, resulting in O(n^2) complexity.
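To illustrate the difference, here is a minimal Java sketch of the general pattern, not the actual Spark code. Every name in it (`DedupSketch`, `Relation`, `kind`, `attrIds`, `conflictsByScan`, `conflictsByIndex`, `remember`, `newIndex`) is hypothetical, and it assumes the lookup is "has a relation of the same kind with an overlapping attribute id already been seen?". Scanning a set with a predicate visits every element, while a map keyed by kind answers the same question with hash lookups.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; these names are hypothetical and are not Spark's classes.
final class DedupSketch {
    // Hypothetical stand-in for a relation: a kind plus the ids of its output attributes.
    static final class Relation {
        final String kind;
        final Set<Long> attrIds;

        Relation(String kind, Set<Long> attrIds) {
            this.kind = kind;
            this.attrIds = attrIds;
        }
    }

    // Set-based check: each call walks the whole set, so checking n relations
    // one after another costs O(n^2) element visits in total.
    static boolean conflictsByScan(Set<Relation> seen, Relation candidate) {
        for (Relation r : seen) {
            if (!r.kind.equals(candidate.kind)) {
                continue;
            }
            for (Long id : candidate.attrIds) {
                if (r.attrIds.contains(id)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Creates an empty index from kind to all attribute ids seen for that kind.
    static Map<String, Set<Long>> newIndex() {
        return new HashMap<>();
    }

    // Adds a relation's attribute ids to the index under its kind.
    static void remember(Map<String, Set<Long>> seenByKind, Relation r) {
        seenByKind.computeIfAbsent(r.kind, k -> new HashSet<>()).addAll(r.attrIds);
    }

    // Map-based check: one hash lookup by kind, then one hash lookup per attribute id.
    static boolean conflictsByIndex(Map<String, Set<Long>> seenByKind, Relation candidate) {
        Set<Long> ids = seenByKind.get(candidate.kind);
        if (ids == null) {
            return false;
        }
        for (Long id : candidate.attrIds) {
            if (ids.contains(id)) {
                return true;
            }
        }
        return false;
    }
}
```

Both checks return the same answer for the same inputs. The difference is that the indexed version's cost per check depends on the candidate's attribute count rather than on how many relations have been seen so far.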

Does this PR introduce any user-facing change?

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

HyukjinKwon · 2024-10-09T23:17:14Z

Merged to master.

Closes apache#48392 from mihailotim-db/mihailotim-db/master.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added the SQL label Oct 9, 2024

mihailotim-db force-pushed the mihailotim-db/master branch from dedc8ca to 0a8e039 Compare October 9, 2024 12:51

cloud-fan approved these changes Oct 9, 2024

Replace HashSet with HashMap

8219901

mihailotim-db force-pushed the mihailotim-db/master branch from 456e759 to 8219901 Compare October 9, 2024 14:07

Regenerate golden files

f14de97

HyukjinKwon approved these changes Oct 9, 2024

HyukjinKwon closed this in f69d03e Oct 9, 2024
