[FEA] Understand why sort aggs are faster than hash aggs and ideally put it to good use #8620

Open
revans2 opened this issue Jun 27, 2023 · 0 comments
Labels: feature request, performance

revans2 (Collaborator) commented Jun 27, 2023

Is your feature request related to a problem? Please describe.
#8618 adds a heuristic that switches to a sort-based aggregation in some cases. While testing it, I found a number of cases (with as few as 16 decimal sum aggregations) where the sort-based aggregation was faster than the hash-based version. I would really like to understand why this happens and, once we know, design or extend the heuristic in #8618 to take advantage of it so that we can speed up large numbers of aggregations.

A few things to note.

  1. #8618 (Heuristic to speed up partial aggregates that get larger) only does this for partial aggregations. Sort-based aggregation could speed up all aggregations, so if the results look good we should apply it to all types of aggregations.
  2. We don't necessarily have to sort all of the input data. We could sort each batch individually instead. It would be good to measure whether a full sort gives a big improvement over per-batch sorting.
  3. Spark already falls back to sort-based aggs in a lot of cases. Only a handful of aggs are hash based (https://github.com/rapidsai/cudf/blob/aed7174eae6c6eb38fbf186938df44f88787cf29/cpp/src/groupby/hash/groupby.cu#L83-L94), and even then the data being worked on has to be "atomic", which in practice means 64 bits or smaller.
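
To make the options in the notes above concrete, here is a minimal CPU-side sketch in plain Python of the three strategies being compared: a hash-based sum, a sort-based sum over all of the input, and a sort of each batch individually followed by a merge of the partial results. The function names (`hash_agg`, `sort_agg`, `per_batch_sort_agg`) are hypothetical and are not part of the plugin or of cuDF. The sketch only shows that the three approaches give the same answer; it says nothing about their relative speed on a GPU, which is the open question here.

```python
from collections import defaultdict
from decimal import Decimal
from itertools import groupby
from operator import itemgetter


def hash_agg(rows):
    """Hash-based sum: one hash-table lookup and update per input row."""
    sums = defaultdict(Decimal)  # Decimal() is Decimal('0')
    for key, value in rows:
        sums[key] += value
    return dict(sums)


def sort_agg(rows):
    """Sort-based sum: sort by key, then reduce each contiguous run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))
    return {
        key: sum((value for _, value in run), Decimal(0))
        for key, run in groupby(ordered, key=itemgetter(0))
    }


def per_batch_sort_agg(batches):
    """Sort and aggregate each batch on its own, then merge the partial results."""
    partials = []
    for batch in batches:
        partials.extend(sort_agg(batch).items())
    # The merge pass is itself an aggregation over the (smaller) partial results.
    return sort_agg(partials)
```

For a sum, the per-batch variant never needs a global sort of the raw input, only of the partial results, which is why it is a candidate worth benchmarking against the full sort.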

revans2 added the "feature request" and "? - Needs Triage" labels on Jun 27, 2023
mattahrens added the "performance" label and removed the "? - Needs Triage" label on Jul 5, 2023