
[SPARK-43838][SQL][FOLLOWUP] Improve DeduplicateRelations performance #48053

Conversation

mihailotim-db
Contributor

What changes were proposed in this pull request?

Reverting to the old way of handling `DeduplicateRelations` in order to improve performance. Instead of checking attribute IDs linearly, we use `HashSet.contains()`.
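A minimal sketch of the kind of change described, using illustrative stand-in types rather than the actual Catalyst code; the assumption here is simply that a `HashSet` of expression IDs replaces a per-candidate linear scan:

```scala
import scala.collection.mutable

object DeduplicationSketch {
  // Illustrative stand-ins; not the actual Catalyst classes.
  case class ExprId(id: Long)
  case class Attribute(name: String, exprId: ExprId)

  // Before: every duplicate check scans the existing attributes linearly,
  // so n checks cost O(n^2) overall.
  def isDuplicateLinear(existing: Seq[Attribute], candidate: Attribute): Boolean =
    existing.exists(_.exprId == candidate.exprId)

  // After: collect the IDs once into a HashSet; each check is then O(1) on average.
  def collectIds(existing: Seq[Attribute]): mutable.HashSet[ExprId] =
    mutable.HashSet(existing.map(_.exprId): _*)

  def isDuplicateHashed(existingIds: mutable.HashSet[ExprId], candidate: Attribute): Boolean =
    existingIds.contains(candidate.exprId)
}
```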

Why are the changes needed?

Improving performance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 10, 2024
@mihailotim-db mihailotim-db changed the title Improve DeduplicateRelations performance [SPARK-43838][SQL][FOLLOWUP] Improve DeduplicateRelations performance Sep 10, 2024
@cloud-fan
Contributor

We need to re-generate the golden files

[info] *** 4 TESTS FAILED ***
[error] Failed: Total 3559, Failed 4, Errors 0, Passed 3555, Ignored 4
[error] Failed tests:
[error] 	org.apache.spark.sql.TPCDSV2_7_PlanStabilityWithStatsSuite
[error] 	org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d72e8f9 Sep 11, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
Reverting to the old way of handling `DeduplicateRelations` in order to improve performance. Instead of checking attribute IDs linearly, we use `HashSet.contains()`.

### Why are the changes needed?
Improving performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48053 from mihailotim-db/deduplicate_relations_perf_improvement.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Oct 9, 2024
…ove performance of `DeduplicateRelations`

### What changes were proposed in this pull request?
This PR replaces the `HashSet` that is currently used with a `HashMap` to improve `DeduplicateRelations` performance.
Additionally, this PR reverts #48053, as that change is no longer needed.

### Why are the changes needed?
The current implementation doesn't use the `HashSet` properly; instead, it performs multiple linear searches on the set, creating O(n^2) complexity.
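A minimal sketch of the difference, again with illustrative stand-in types rather than the actual Spark code: a `HashSet` of whole attributes can only be searched linearly when looking up by `ExprId`, while a `HashMap` keyed by `ExprId` gives an average O(1) lookup.

```scala
import scala.collection.mutable

object LookupSketch {
  // Illustrative stand-ins; not the actual Catalyst classes.
  case class ExprId(id: Long)
  case class Attribute(name: String, exprId: ExprId)

  // A HashSet of attributes only answers "is this exact attribute present?";
  // finding an attribute by its ExprId still degenerates into a linear scan.
  def findInSet(attrs: mutable.HashSet[Attribute], id: ExprId): Option[Attribute] =
    attrs.find(_.exprId == id)

  // A HashMap keyed by ExprId answers the same question in O(1) on average.
  def findInMap(attrs: mutable.HashMap[ExprId, Attribute], id: ExprId): Option[Attribute] =
    attrs.get(id)
}
```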

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?

Closes #48392 from mihailotim-db/mihailotim-db/master.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>