CUDA: MMQ code deduplication + iquant support #8495
Conversation
Performance
GitHub doesn't let you create an OP with >= 65536 characters, but unless I hit that exact number there seems to be no such limit for comments. And if you hit the limit the site just swallows your post and you get to write it a second time. Good design.
One of the CI builds fails with
I don't see why this PR would increase the amount of memory used per compilation job so my assumption is that the problem instead has to do with the total number of compilation jobs increasing and the machine not having enough memory to run all of them in parallel.
I noticed it on GitHub Actions. Congrats on the MMQ performance boost for IQ quants, it's bestial!
On my desktop machine with up to 32 parallel jobs I don't see a noticeable difference between master and this PR when just manually watching the memory use during the compilation with
Seems like CI flaked; I re-ran it manually and it just passed. I also noticed that the CUDA setup took way longer before.
I think there are issues with iq4_nl with some row sizes. It doesn't fail with multiples of 256, but it does with some multiples of 32.
Is this already happening on master?
With llama-3-405B imminent, I disagree.
I am not able to reproduce the iq4_nl issue by manually editing
I used this code to generate random cases. It fails very frequently with iq4_nl.

```cpp
for (int i = 0; i < 1000; i++) {
    for (ggml_type type_a : all_types) {
        for (ggml_type type_b : {GGML_TYPE_F32}) {
            // m = a rows
            // n = b rows
            // k = cols
            std::uniform_int_distribution<> dist_m(1, 128);
            std::uniform_int_distribution<> dist_n(16, 128);
            std::uniform_int_distribution<> dist_k(1, 16);
            int m = dist_m(rng);
            int n = dist_n(rng);
            int k = dist_k(rng) * ggml_blck_size(type_a);
            test_cases.emplace_back(new test_mul_mat(type_a, type_b, m, n, k, {1, 1}, {1, 1}));
        }
    }
}
```
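For illustration, a single deterministic case in the same style (hypothetical, reusing the test_mul_mat constructor from the snippet above) would target the earlier report directly by using a k that is a multiple of 32, the iq4_nl block size, but not of 256:

```cpp
// hypothetical targeted case: iq4_nl with k = 224 (a multiple of 32 but not of 256)
test_cases.emplace_back(new test_mul_mat(GGML_TYPE_IQ4_NL, GGML_TYPE_F32, 16, 16, 32*7, {1, 1}, {1, 1}));
```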
With the provided code the tests already fail on master and notably also on master with
With the reference C implementation of
The editorconfig check should be fixed with a rebase onto master.
Force-pushed from b13084a to 5b17b99.
I have not tested with the changes in this PR, but I left the random tests running for a while with the fix in #8549 and found that the tests often fail with m=1 and n=1.
Looks like the CI build failures will be a problem. We are clearly running out of memory when MSVC is combined with heavy templatisation.
I can reproduce the tests sometimes failing with m=n=1. However, those test cases are definitely unrelated to any MMQ changes because MMQ is only used for batch sizes > 8. What I think is happening: the CPU and GPU implementations are not 100% identical, so you get differences between the two sets of values that are compared via NMSE. If you assume that the difference per operation follows a Gaussian distribution, you can expect the NMSE to scale with
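As a rough illustration of that statistical point (a standalone sketch, not part of the PR; the noise level and threshold below are made-up placeholders, not the values used by test-backend-ops): when only a single value is compared, the NMSE estimate is very noisy and occasionally exceeds a fixed threshold, while with many values it concentrates around the average error.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// NMSE between a reference vector and an approximation of it.
static double nmse(const std::vector<double> & ref, const std::vector<double> & approx) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < ref.size(); i++) {
        const double d = ref[i] - approx[i];
        err  += d*d;
        norm += ref[i]*ref[i];
    }
    return err/norm;
}

int main() {
    std::mt19937 rng(1234);
    std::normal_distribution<double> signal(0.0, 1.0);
    std::normal_distribution<double> noise (0.0, 1e-3); // made-up per-value error

    const double threshold = 5e-4; // made-up pass/fail threshold
    const int    trials    = 10000;

    for (int n : {1, 16, 4096}) {
        int failures = 0;
        for (int t = 0; t < trials; t++) {
            std::vector<double> ref(n), approx(n);
            for (int i = 0; i < n; i++) {
                ref[i]    = signal(rng);
                approx[i] = ref[i] + noise(rng);
            }
            if (nmse(ref, approx) > threshold) {
                failures++;
            }
        }
        // with n = 1 a small but noticeable fraction of trials exceeds the threshold,
        // with larger n essentially none do
        printf("n = %4d: %d/%d trials exceed the threshold\n", n, failures, trials);
    }
    return 0;
}
```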
Force-pushed from 5b17b99 to f0f71a5.
What do we do about the CI failing due to running out of memory? I am not familiar with the setup at all so I don't know how to fix it.
Whatever I proposed. Additionally, I thought of a more sophisticated variant of "lower the number of parallel build executions": we inject a lock that only allows one template-instantiator compilation at a time using https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_LAUNCHER.html. We detect it by file name or something and use some temporary file as the lock. But this is as far as build engineering will go. Another thing would be to experiment with precompiled headers and see how far that brings us. If you want something simple, but slow, we can first invoke cmake and then build only the ggml target with a single job before building the rest:

```
  cmake .. -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=ON
+ cmake --build . --config Release -j 1 -t ggml
  cmake --build . --config Release -j ${env:NUMBER_OF_PROCESSORS}
```
Apart from @Green-Sky's suggestions, maybe we can also try to reduce the parallel jobs by just 1 and see if it fits?

```
cmake --build . --config Release -j $((${env:NUMBER_OF_PROCESSORS}-1))
```

But we do need a longer-term solution - I'm not sure which of the options discussed so far is best.
Looks like "one less" did not cut it. Try my staged-build suggestion: build ggml with 1 job and the rest with the number of cores. edit: or did it not update yet?
Think it hasn't started yet: https://github.com/ggerganov/llama.cpp/actions/runs/10020836844/job/27698859227?pr=8495
Looks like it passed, I'm rerunning one of them to get more samples, since rerunning did work before without modifications too. update: successful again and it takes "only" 5-10 min longer.
I will merge this PR in a few hours unless someone has an issue with it.
I agree that this is likely the case. I ran some tests comparing the CPU, BLAS, dmmv, cuBLAS, and mmvq. Basically: By outliers here I mean some cases where the error is very high. This can be explained by the difference in quantization format of src1 in the CPU and mmvq. Increasing m or n reduces the effect of outliers and produces more stable results. I would expect that the error would disappear almost entirely if the CPU backend and mmvq quantized src1 to the same format.
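To make the src1 quantization point concrete, here is a simplified, self-contained sketch (not the actual ggml code) of a q8_0-style round trip: one int8 value per float plus a single scale per block of 32. A backend path that quantizes src1 this way will differ slightly from one that keeps src1 in f32 or uses a different block layout, which is the kind of small per-element discrepancy discussed above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>

int main() {
    constexpr int QK = 32; // block size assumed here, matching q8_0's 32 values per block
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);

    float x[QK];
    for (int i = 0; i < QK; i++) {
        x[i] = dist(rng);
    }

    // per-block scale: amax / 127, round-to-nearest int8 quantization
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    const float d = amax/127.0f;

    int8_t q[QK];
    double max_err = 0.0;
    for (int i = 0; i < QK; i++) {
        q[i] = (int8_t) std::lround(x[i]/d);
        max_err = std::max(max_err, (double) std::fabs(x[i] - q[i]*d));
    }
    printf("block scale d = %g, max round-trip error = %g\n", d, max_err);
    return 0;
}
```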
We might still want to do what @Green-Sky suggested, combined with the "one less" approach:

```
cmake --build . --config Release -j $((${env:NUMBER_OF_PROCESSORS} - 1)) -t ggml
cmake --build . --config Release -j ${env:NUMBER_OF_PROCESSORS}
```

Though I don't expect it to make a significant difference.
Sorry, I'm too tired right now to wait for another round of CI; let's do the fixup in a separate PR.
* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build
This PR deduplicates the MMQ CUDA code and adds support for all i-quants other than iq1_m. The deduplication is done by converting the data to q8_0 or q8_1 with a block size of 32 or 16 upon loading. For the int8 tensor core kernels this has turned out to be faster. For q6_K and for the kernels using __dp4a it was slower, so I left those unchanged. All i-quants other than iq1_m can then be supported by simply writing functions that load the data and convert it to q8. For iq1_m there is no support, and frankly I don't think it would be worthwhile to add since the quality degradation of that format is very high.
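To illustrate the deduplication idea in the description (a structural sketch only, with made-up names and plain host C++; the PR's actual CUDA kernels are of course far more involved): each quant format supplies a small "load and convert to q8" routine, and a single shared dot-product path consumes the resulting q8 data, so supporting an additional format only requires writing the loader.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Made-up q8 tile: one int8 value per element plus one float scale per block of 32.
struct Q8Tile {
    std::vector<int8_t> qs;
    std::vector<float>  d;
};

// Each quant format only has to provide this: read its own packed blocks and emit q8.
using load_to_q8_fn = void (*)(const void * src, int n, Q8Tile & dst);

// Shared inner product that works purely on q8 data, independent of the source format.
float dot_q8(const Q8Tile & a, const Q8Tile & b, int n) {
    float sum = 0.0f;
    for (int ib = 0; ib < n/32; ib++) {
        int32_t isum = 0;
        for (int i = 0; i < 32; i++) {
            isum += (int32_t) a.qs[ib*32 + i] * (int32_t) b.qs[ib*32 + i];
        }
        sum += a.d[ib] * b.d[ib] * (float) isum;
    }
    return sum;
}

// A generic "kernel" that depends on the source format only through the loader template parameter.
template <load_to_q8_fn load_a>
float mul_vec_generic(const void * a_packed, const Q8Tile & b, int n) {
    Q8Tile a;
    load_a(a_packed, n, a);
    return dot_q8(a, b, n);
}

// Toy "format": plain f32 input quantized to q8 on load, standing in for a real per-format loader.
void load_f32_to_q8(const void * src, int n, Q8Tile & dst) {
    const float * x = (const float *) src;
    dst.qs.resize(n);
    dst.d.resize(n/32);
    for (int ib = 0; ib < n/32; ib++) {
        float amax = 0.0f;
        for (int i = 0; i < 32; i++) {
            amax = std::max(amax, std::fabs(x[ib*32 + i]));
        }
        const float d = amax/127.0f;
        dst.d[ib] = d;
        for (int i = 0; i < 32; i++) {
            dst.qs[ib*32 + i] = (int8_t) std::lround(x[ib*32 + i]/d);
        }
    }
}

int main() {
    const int n = 64;
    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; i++) {
        a[i] = 0.01f*i;
        b[i] = 1.0f - 0.005f*i;
    }
    Q8Tile bq;
    load_f32_to_q8(b.data(), n, bq);
    printf("q8 dot product: %f\n", mul_vec_generic<load_f32_to_q8>(a.data(), bq, n));
    return 0;
}
```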