[SPARK-43838][SQL][FOLLOWUP] Replace HashSet with HashMap to improve performance of DeduplicateRelations #48392

Closed


mihailotim-db
Contributor

What changes were proposed in this pull request?

This PR replaces the HashSet currently used in DeduplicateRelations with a HashMap to improve the rule's performance.
Additionally, this PR reverts #48053, as that change is no longer needed.

Why are the changes needed?

The current implementation doesn't use the HashSet as a hash structure; instead, it performs multiple linear searches over the set, resulting in O(n^2) complexity.
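
To illustrate the difference, here is a minimal sketch, not the actual `DeduplicateRelations` code. The `Relation` case class, the `kind` key, and the helper names are hypothetical and chosen only for this example. It contrasts scanning a set on every check with looking up a key in a map:

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a relation record; not Spark's actual class.
final case class Relation(kind: String, attrIds: Seq[Long])

object DedupSketch {
  // Linear-scan pattern: every check walks the whole set,
  // so n checks over n stored relations cost O(n^2) overall.
  def conflictsViaScan(seen: mutable.HashSet[Relation], r: Relation): Boolean =
    seen.exists { s =>
      s.kind == r.kind && s.attrIds.exists(id => r.attrIds.contains(id))
    }

  // Keyed pattern: attribute ids are grouped by kind, so a check is a
  // hash lookup followed by membership tests on the matching bucket only.
  def conflictsViaMap(
      seen: mutable.HashMap[String, mutable.HashSet[Long]],
      r: Relation): Boolean =
    seen.get(r.kind).exists(ids => r.attrIds.exists(id => ids.contains(id)))

  // Records a relation's attribute ids under its kind.
  def record(
      seen: mutable.HashMap[String, mutable.HashSet[Long]],
      r: Relation): Unit =
    seen.getOrElseUpdate(r.kind, mutable.HashSet.empty[Long]) ++= r.attrIds
}
```

Both checks answer the same question in this sketch; the map-based version just avoids touching entries that cannot match.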

Does this PR introduce any user-facing change?

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the SQL label Oct 9, 2024
@HyukjinKwon
Member

Merged to master.

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…ove performance of `DeduplicateRelations`

Closes apache#48392 from mihailotim-db/mihailotim-db/master.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>