
Add tuning to reduction kernels and improve tuning #991

Closed
wants to merge 1 commit from the reduction_tuning branch

Conversation

@upsj upsj (Member) commented Mar 21, 2022

This uses the tuning parameter from #692 to enable tuning of the oversubscription parameter for kernel reductions, and adds vendor BLAS reductions to the benchmark for comparison (see the sketch after the TODO list).

TODO:

  • Tune CUDA
    • Pascal
    • Volta
    • Turing
    • Ampere
  • Tune ROCm
    • Radeon VII
    • MI100
  • Tune DPC++
    • CPU
    • DG1
    • ...
  • Tune OpenMP?
  • Tune multiple RHS
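
For the vendor BLAS comparison, here is a minimal sketch of what a cuBLAS reduction reference could look like in a standalone driver; the dot-product choice, array size, and setup are illustrative assumptions, not the actual benchmark code:

```cuda
// Hedged sketch (not the actual benchmark harness): using cuBLAS as a vendor
// reference for a reduction, here a dot product over n doubles.
// Compile with: nvcc -lcublas reduction_reference.cu
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<double> host(n, 1.0);

    double* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(double));
    cudaMemcpy(dev, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // vendor reference: sum of squares via dot(x, x); with the default host
    // pointer mode this call blocks until the result is available
    double result = 0.0;
    cublasDdot(handle, n, dev, 1, dev, 1, &result);
    std::printf("cuBLAS dot(x, x) = %f\n", result);

    cublasDestroy(handle);
    cudaFree(dev);
    return 0;
}
```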

@upsj upsj self-assigned this Mar 21, 2022
@upsj upsj added the 1:ST:ready-for-review (This PR is ready for review) label Mar 21, 2022
@upsj upsj added this to the Ginkgo 1.5.0 milestone Mar 21, 2022
@ginkgo-bot ginkgo-bot added the mod:all (This touches all Ginkgo modules.), reg:benchmarking (This is related to benchmarking.), and type:matrix-format (This is related to the Matrix formats) labels Mar 21, 2022
@upsj upsj (Member, Author) commented Mar 21, 2022

Here are some first results from our Titan X vs. cuBLAS.
The tuning parameter is the oversubscription factor, i.e. the number of launched warps divided by the maximum number of active warps (see the launch-size sketch after this comment).

[Plot: tuning-blas, performance over input size for different oversubscription factors vs. cuBLAS (sp_*)]

For small inputs (up to roughly 10k elements), not oversubscribing at all gives the best results; beyond that, cuBLAS starts to take over. The curves for larger oversubscription factors all have a dip at the same input size, which should be easy to eliminate with this knowledge (allocation probably also plays a role). For larger inputs, the more we oversubscribe, the more we win. So in general this shows that we can be on par with cuBLAS; we just need to tweak the parameters a bit. sp_* is cuBLAS, the rest is ours.

EDIT: Note that I am using a log scale here, so small differences may not be visible.
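
To make the knob concrete, here is a minimal sketch of how an oversubscription factor could be mapped to a launch grid for a grid-stride reduction using the CUDA occupancy API; the kernel, the atomicAdd-based accumulation, and the compute_grid_size helper are illustrative assumptions, not the implementation in common/cuda_hip/base/kernel_launch_reduction.hpp.inc:

```cuda
// Hedged sketch: turning an oversubscription factor (launched warps divided by
// the maximum number of resident warps) into a grid size. Requires sm_60+ for
// the double-precision atomicAdd used in the toy kernel below.
#include <cuda_runtime.h>

__global__ void sum_kernel(const double* x, double* result, int n)
{
    double partial = 0.0;
    // grid-stride loop, so any grid size covers the whole input
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        partial += x[i];
    }
    atomicAdd(result, partial);  // simplistic accumulation, for illustration only
}

int compute_grid_size(int block_size, int oversubscription)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks_per_sm = 0;
    // how many blocks of this kernel can be resident per SM at this block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, sum_kernel,
                                                  block_size, 0);
    // oversubscription == 1 launches exactly one full wave of resident blocks;
    // larger factors launch proportionally more blocks.
    return prop.multiProcessorCount * blocks_per_sm * oversubscription;
}
```

In this reading, the "no oversubscription" curve in the plot corresponds to a factor of 1, i.e. a single full wave of resident blocks.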

@upsj upsj changed the title Add tuning to reduction kernels Add tuning to reduction kernels and improve tuning Mar 21, 2022
@upsj upsj added the 1:ST:WIP (This PR is a work in progress. Not ready for review.) label and removed the 1:ST:ready-for-review (This PR is ready for review) label Mar 21, 2022
@upsj upsj force-pushed the reduction_tuning branch from d759be7 to 22c3184 Compare April 11, 2022 14:17
@upsj upsj force-pushed the reduction_tuning branch from 22c3184 to 314ab2d Compare April 11, 2022 14:23
@ginkgo-bot ginkgo-bot (Member) commented:
Error: The following files need to be formatted:

common/cuda_hip/base/kernel_launch_reduction.hpp.inc

You can find a formatting patch under Artifacts here, or run format! if you have write access to Ginkgo.

@tcojean tcojean removed this from the Ginkgo 1.5.0 milestone Oct 20, 2022
@upsj upsj (Member, Author) commented Nov 21, 2023

This deserves some more attention beyond just playing around with tuning parameters. I'll close it for now.

@upsj upsj closed this Nov 21, 2023