Sort experiment #31
Draft: raphlinus wants to merge 17 commits into main from sort
Updates dependencies to latest published crates, including wgpu 0.18 and winit 0.28.
Reduce and scan are written but not yet tested.
The shaders are mostly written, with some TODOs, but haven't been tested.
Starting to wire up sizes, buffer bindings, etc., in preparation for actually running the pipeline.
The count stage seems to be generating correct output. Next step is wiring up sums.
It seems like reduce works.
Fix some sizing issues, seems to get to top-level scan correctly now.
The prefix sum stages seem to be generating correct output.
This seems to be a working scatter, which means that the core of the algorithm is done. Also just starting to look at performance characteristics. That's why there's a simpler count stage: the huge shared array seemed to be a problem.
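For reference, the count, prefix-sum, and scatter stages mentioned in these commits correspond to the three steps of a single least-significant-digit radix pass. The following is a minimal CPU sketch of that structure, not the shader code from this branch. The names `radix_pass` and `radix_sort` are hypothetical, and 4 bits per pass is an assumption. Something like this could serve as a reference when checking GPU output.

```rust
/// Bits consumed per pass. Assumed to be 4 for this sketch.
const BITS_PER_PASS: u32 = 4;
const BINS: usize = 1 << BITS_PER_PASS;

/// Extract the digit of `key` for the pass starting at bit `shift`.
fn digit(key: u32, shift: u32) -> usize {
    ((key >> shift) as usize) & (BINS - 1)
}

/// One stable pass over the keys: count, exclusive scan, scatter.
/// (Hypothetical helper; on the GPU each step is a separate dispatch.)
fn radix_pass(input: &[u32], output: &mut [u32], shift: u32) {
    // Count: build a histogram of digit values.
    let mut counts = [0usize; BINS];
    for &key in input {
        counts[digit(key, shift)] += 1;
    }

    // Scan: an exclusive prefix sum turns counts into start offsets.
    let mut offsets = [0usize; BINS];
    let mut running = 0;
    for bin in 0..BINS {
        offsets[bin] = running;
        running += counts[bin];
    }

    // Scatter: write each key to the next free slot of its bin.
    // Iterating in input order keeps the pass stable.
    for &key in input {
        let d = digit(key, shift);
        output[offsets[d]] = key;
        offsets[d] += 1;
    }
}

/// Sort 32-bit keys by running one pass per digit, low digit first.
fn radix_sort(keys: &mut Vec<u32>) {
    let mut scratch = vec![0u32; keys.len()];
    let mut shift = 0;
    while shift < 32 {
        radix_pass(&keys[..], &mut scratch, shift);
        // The output of this pass becomes the input of the next.
        std::mem::swap(keys, &mut scratch);
        shift += BITS_PER_PASS;
    }
}
```

Calling `radix_sort(&mut keys)` on a `Vec<u32>` should leave it in the same order as `keys.sort()`, which gives a simple oracle for sizes such as 2^16 and 2^17.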
The sort pipeline is wired up, and results are close to being sorted, but there are zero elements in the output.
This sorts up to 2^16, but fails at 2^17.
Multisplit appears to work in isolation, we'll see whether that holds up.
It works (and doesn't seem to have the same problem as the scatter from FidelityFX), but seems to be a bit slower than that. Perhaps that can be improved (subgroups would obviously help a lot), and it's also possible it would unlock going to 8 bits per pass.
Just iterate over all the keys; it's faster. This also suggests that a substantial fraction of the total time is going into the ballot.
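To make the role of the ballot concrete, here is a CPU emulation of a ballot-based warp-local multisplit in the style described in the Onesweep paper. It is a sketch under assumptions, not the shader in this branch: `ballot` and `warp_multisplit` are hypothetical names, the warp size is fixed at 16, and digits are 4 bits wide. Each lane runs one ballot per digit bit to find the lanes holding the same digit, then derives its rank within that group.

```rust
/// Emulated warp width and digit width. Both are assumptions for this sketch.
const WARP_SIZE: usize = 16;
const BITS_PER_PASS: u32 = 4;

/// Emulated ballot: bit `lane` of the result is set when that lane's predicate is true.
fn ballot(pred: &[bool; WARP_SIZE]) -> u32 {
    let mut mask = 0u32;
    for (lane, &p) in pred.iter().enumerate() {
        if p {
            mask |= 1 << lane;
        }
    }
    mask
}

/// For each lane, return (rank among same-digit lanes with a lower index,
/// total number of lanes in the warp holding the same digit).
fn warp_multisplit(digits: &[u32; WARP_SIZE]) -> [(u32, u32); WARP_SIZE] {
    // Start with every lane as a potential peer of every other lane.
    let all_lanes = u32::MAX >> (32 - WARP_SIZE);
    let mut peers = [all_lanes; WARP_SIZE];

    // One ballot per digit bit narrows each lane's peer mask to the
    // lanes that agree with it on that bit.
    for bit in 0..BITS_PER_PASS {
        let mut pred = [false; WARP_SIZE];
        for lane in 0..WARP_SIZE {
            pred[lane] = (digits[lane] >> bit) & 1 == 1;
        }
        let votes = ballot(&pred);
        for lane in 0..WARP_SIZE {
            peers[lane] &= if pred[lane] { votes } else { !votes };
        }
    }

    // Rank is the number of peers in lower lanes; count is all peers.
    let mut result = [(0u32, 0u32); WARP_SIZE];
    for lane in 0..WARP_SIZE {
        let lower_lanes = (1u32 << lane) - 1;
        result[lane] = (
            (peers[lane] & lower_lanes).count_ones(),
            peers[lane].count_ones(),
        );
    }
    result
}
```

Without subgroup operations, each ballot has to go through workgroup shared memory, which is consistent with the ballot accounting for a large share of the time.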
We can use either 16 or 32 for warp size. The former is faster (on M1 Max). 8 is also a possibility but then the size of the histogram array would exceed the workgroup, so threads would need to deal with multiple histogram values. Quick experiments with ELEMENTS_PER_THREAD show no gains for values other than 4.
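The sizing trade-off can be checked with a little arithmetic. The sketch below assumes one histogram per warp, 4 bits (16 bins) per pass, and a workgroup of 256 threads; none of these values is stated in the commit message, and the function names are hypothetical. Under those assumptions a warp size of 16 gives exactly one histogram entry per thread, while a warp size of 8 gives two.

```rust
/// Total histogram entries held by one workgroup, assuming one
/// histogram of 2^bits_per_pass bins per warp. (Hypothetical helper.)
fn histogram_entries(workgroup_size: u32, warp_size: u32, bits_per_pass: u32) -> u32 {
    let warps_per_workgroup = workgroup_size / warp_size;
    warps_per_workgroup << bits_per_pass
}

/// How many histogram entries each thread must handle, rounded up.
fn entries_per_thread(workgroup_size: u32, warp_size: u32, bits_per_pass: u32) -> u32 {
    let entries = histogram_entries(workgroup_size, warp_size, bits_per_pass);
    (entries + workgroup_size - 1) / workgroup_size
}
```

With the assumed workgroup size of 256 and 4 bits per pass, `histogram_entries` gives 128, 256, and 512 entries for warp sizes 32, 16, and 8 respectively, so only the warp size of 8 exceeds the thread count.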
This branch contains an experiment in sorting. It is not intended to be merged, but having a draft PR gives the branch a stable identifier.
The tip contains an implementation mostly adapted from FidelityFX sort, but with a version of warp-local multi-split (WLMS) inspired by Onesweep. In all cases, subgroup operations have been replaced by workgroup shared memory. There are numerous checkpoints, including a mostly-working version without the WLMS that is closer to the original FidelityFX. Note, however, that this checkpoint exhibits failures consistent with a missing barrier. The tip appears to pass correctness tests, but none of this has been carefully validated.
Sort throughput is approximately 1 G elements/s on M1 Max.