-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sycl: Use syclcompat::dp4a #10267
sycl: Use syclcompat::dp4a #10267
Conversation
* Using the syclcompat version allow the compiler to optimize the operation with native function
I think it could be useful to add a note in the SYCL readme pointing out that @airMeng / @NeoZhangJianyu, Are we maintaining support for earlier oneAPI versions? I see that the CI is configured for 2024.1, so I wonder if we have to provide support for earlier versions at least for a bit or if we have to update the compiler version of the runner. |
The oneAPI is updated to 2025.0 in official release. Could you update the SYCL CI yaml file to use oneAPI 2025.0? Thank you! |
uint32_t, int32_t>; | ||
|
||
template <typename T1, typename T2, typename T3> | ||
inline auto dp4a(T1 a, T2 b, T3 c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest replacing the dp4a() implementation by syclcompat::dp4a().
- no code change in other modules.
- easy to optimize for different cases in future if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tried this approach some time ago in a different PR, but it was closed because faster implementations requires asm and intrinsics for every backend, and we agreed to limit ourselves to pure SYCL code. Right now, there is no way to get visibility of int intrinsics (dp4a equivalents), and the syclcompat layer shipped as part of oneAPI is trying to bridge that (and other gaps) until they are made avialable through SYCL or an extension. With this approach, backend specific improvements are removed from the app itself.
do you think we could use this PR to agree what to do with regards to syclcompat? The main problem is that dp4a is a major performance gap with other backends due to the software implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I didn't clarify my idea.
I means the dpct::dp4a() call syclcompat::dp4a() directly.
In other models, they still call dpct::dp4a(). But the code path will be forward to syclcompat::dp4a().
Because there is no test data for Intel GPU. If it's bad, we can add code branch in dpct::dp4a() for Intel GPU with old code.
If all models call syclcompat::dp4a() directly as this PR, it's complex to implement for more branches case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to be careful of branching inside dp4a though, as we would introduce branching inside the kernels. Thanks for the clarification!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as we don't add any branching I'm fine with wrapping syclcompat::dp4a
inside dpct::dp4a
. This is done in 3eff3c3. I hope this is what you meant.
Thanks I have updated the news section in b665ffd. The "recommended release" section above looks a bit outdated now but I'm not confident enough to update it as part of this PR.
Done in ee76375, hopefully this is the right link. |
docs/backend/SYCL.md
Outdated
@@ -41,6 +41,8 @@ The following release is verified with good quality: | |||
|
|||
## News | |||
|
|||
- 2024.11 | |||
- Use syclcompat to improve the performance on some backends. This requires to use oneAPI 2025.0 or more recent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"some backends" is strange to SYCL backend.
Maybe "some platforms" or “some GPUs".
"more recent" -> "newer"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded this in 1c16516
This reverts commit 90cb61d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test it on Intel Arc 770.
The performance is not impacted.
I think this PR is good for the first step to use new lib "syclcompat".
We hope all platforms get benefit from it in the future.
And the oneAPI 2025.0 is updated to CI.
Now, SYCL backend is moved to support 2025.0. Old oneAPI won't be recommended.
Thank you!
Sorry about that, it seems the docker images also need to be updated to use DPC++ 2025.0. I will look into that. |
* sycl: Use syclcompat::dp4a * Using the syclcompat version allow the compiler to optimize the operation with native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit 90cb61d.
* sycl: Use syclcompat::dp4a * Using the syclcompat version allow the compiler to optimize the operation with native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit 90cb61d.
* sycl: Use syclcompat::dp4a * Using the syclcompat version allow the compiler to optimize the operation with native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit 90cb61d.
Using the syclcompat version allow the compiler to optimize the operation with native function. This was tested with #10266.
Setting
sm_80
with this change improves the t/s of TG by 28% with the 70B model. I have not measured any performance difference on PVC.