Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sycl: Use syclcompat::dp4a #10267

Merged
merged 5 commits into from
Nov 15, 2024
Merged

Conversation

Rbiessy
Copy link
Collaborator

@Rbiessy Rbiessy commented Nov 12, 2024

Using the syclcompat version allow the compiler to optimize the operation with native function. This was tested with #10266.
Setting sm_80 with this change improves the t/s of TG by 28% with the 70B model. I have not measured any performance difference on PVC.

* Using the syclcompat version allow the compiler to optimize the
  operation with native function
@github-actions github-actions bot added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Nov 12, 2024
@Alcpz
Copy link
Collaborator

Alcpz commented Nov 12, 2024

I think it could be useful to add a note in the SYCL readme pointing out that syclcompat is only supported from the compiler 2025.0 release. Maybe an entry in the news or in the required release section.
The changes LGTM wrt the Nvidia backend.

@airMeng / @NeoZhangJianyu, Are we maintaining support for earlier oneAPI versions? I see that the CI is configured for 2024.1, so I wonder if we have to provide support for earlier versions at least for a bit or if we have to update the compiler version of the runner.

@NeoZhangJianyu
Copy link
Collaborator

The oneAPI is updated to 2025.0 in official release.
I think we don't support/maintain for old oneAPI.

Could you update the SYCL CI yaml file to use oneAPI 2025.0?
So that the CI for SYCL is passed.

Thank you!

uint32_t, int32_t>;

template <typename T1, typename T2, typename T3>
inline auto dp4a(T1 a, T2 b, T3 c)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest replacing the dp4a() implementation by syclcompat::dp4a().

  1. no code change in other modules.
  2. easy to optimize for different cases in future if needed.

Copy link
Collaborator

@Alcpz Alcpz Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried this approach some time ago in a different PR, but it was closed because faster implementations requires asm and intrinsics for every backend, and we agreed to limit ourselves to pure SYCL code. Right now, there is no way to get visibility of int intrinsics (dp4a equivalents), and the syclcompat layer shipped as part of oneAPI is trying to bridge that (and other gaps) until they are made avialable through SYCL or an extension. With this approach, backend specific improvements are removed from the app itself.

do you think we could use this PR to agree what to do with regards to syclcompat? The main problem is that dp4a is a major performance gap with other backends due to the software implementation.

Copy link
Collaborator

@NeoZhangJianyu NeoZhangJianyu Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I didn't clarify my idea.
I means the dpct::dp4a() call syclcompat::dp4a() directly.
In other models, they still call dpct::dp4a(). But the code path will be forward to syclcompat::dp4a().

Because there is no test data for Intel GPU. If it's bad, we can add code branch in dpct::dp4a() for Intel GPU with old code.

If all models call syclcompat::dp4a() directly as this PR, it's complex to implement for more branches case.

Copy link
Collaborator

@Alcpz Alcpz Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to be careful of branching inside dp4a though, as we would introduce branching inside the kernels. Thanks for the clarification!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we don't add any branching I'm fine with wrapping syclcompat::dp4a inside dpct::dp4a. This is done in 3eff3c3. I hope this is what you meant.

@github-actions github-actions bot added documentation Improvements or additions to documentation devops improvements to build systems and github actions labels Nov 13, 2024
@Rbiessy
Copy link
Collaborator Author

Rbiessy commented Nov 13, 2024

I think it could be useful to add a note in the SYCL readme pointing out that syclcompat is only supported from the compiler 2025.0 release. Maybe an entry in the news or in the required release section. The changes LGTM wrt the Nvidia backend.

Thanks I have updated the news section in b665ffd. The "recommended release" section above looks a bit outdated now but I'm not confident enough to update it as part of this PR.

Could you update the SYCL CI yaml file to use oneAPI 2025.0?

Done in ee76375, hopefully this is the right link.

@@ -41,6 +41,8 @@ The following release is verified with good quality:

## News

- 2024.11
- Use syclcompat to improve the performance on some backends. This requires to use oneAPI 2025.0 or more recent.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"some backends" is strange to SYCL backend.
Maybe "some platforms" or “some GPUs".

"more recent" -> "newer"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded this in 1c16516

Copy link
Collaborator

@NeoZhangJianyu NeoZhangJianyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test it on Intel Arc 770.
The performance is not impacted.

I think this PR is good for the first step to use new lib "syclcompat".
We hope all platforms get benefit from it in the future.

And the oneAPI 2025.0 is updated to CI.
Now, SYCL backend is moved to support 2025.0. Old oneAPI won't be recommended.

Thank you!

@airMeng airMeng merged commit 5a54af4 into ggerganov:master Nov 15, 2024
53 checks passed
@ggerganov
Copy link
Owner

@Rbiessy Rbiessy deleted the romain/use_syclcompat_dp4a branch November 15, 2024 09:41
@Rbiessy
Copy link
Collaborator Author

Rbiessy commented Nov 15, 2024

Should fix the CI: https://github.com/ggerganov/llama.cpp/actions/runs/11851871832/job/33029080834#step:8:352

Sorry about that, it seems the docker images also need to be updated to use DPC++ 2025.0. I will look into that.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 17, 2024
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
devops improvements to build systems and github actions documentation Improvements or additions to documentation SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants