
Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely? #6841

Closed
Nekotekina opened this issue Apr 23, 2024 · 33 comments · Fixed by #6884

Comments

@Nekotekina (Contributor) commented Apr 23, 2024

Hello. I did some perplexity tests while investigating an issue with Llama-3. Initially the issue was that the Llama-3 base 70B model outputs garbage with small quants that use an importance matrix (imatrix). I can't find any information suggesting that ROCm might cause corruption.

GPU test (RX 7600 + RX 7600 XT)

https://huggingface.co/mradermacher/Meta-Llama-3-70B-i1-GGUF/tree/main
Meta-Llama-3-70B.i1-Q2_K.gguf prints [1]-nan,[2]-nan,[3]-nan,[4]-nan with -ngl 30 or 0 (prints garbage unless -ngl 0)
https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF/tree/main
Meta-Llama-3-70B.Q2_K.gguf - seems OK, [1]4.1839,[2]4.7300,[3]4.2751,[4]4.6444,[5]4.6942,[6]5.0426,[7]5.1405,[8]5.4747
Final estimate: PPL = 5.9315 +/- 0.03553

Pure CPU test

Meta-Llama-3-70B.i1-Q2_K.gguf with pure CPU 'perplexity' build (146 seconds per 512 tokens - ETA 26 hours 55.67 minutes)
[1]6.3962,[2]7.1886,[3]6.9886,[4]7.3853,[5]7.8924,[6]8.2982,[7]8.8956,[8]9.3799, (can't wait for many hours, stopped)
Meta-Llama-3-70B.Q2_K.gguf (static Q2_K):
[1]4.1675,[2]4.6952,[3]4.2374,[4]4.6452,[5]4.6677,[6]5.0459,[7]5.1258,[8]5.4649,^C
It's slightly better than on ROCm but the difference is very small.

I also found strange holes in the imatrix.dat that was used:
[screenshot: Screenshot from 2024-04-23 13-34-09]
But the author seems uninterested in discussing that.

@slaren (Collaborator) commented Apr 23, 2024

I also get nan ppl with that model with CUDA, so it does not seem specific to ROCm.

@Nekotekina (Contributor, Author)

@slaren does the model itself work, in the sense that it doesn't output garbage? Because that was my issue with AMD initially.

@slaren (Collaborator) commented Apr 23, 2024

If it looks ok with CPU, then it is probably an issue in the CUDA/HIP backend. I don't know why the nans are being generated, we would need to investigate.

@JohannesGaessler (Collaborator) commented Apr 23, 2024

I think this is not an issue with the code but rather with the model from that particular repository. The repository says it took the importance matrix from another repository which in turn says nothing about how the importance matrix was calculated. Using a few hours of my 6x RTX 4090 machine I calculated an importance matrix on 2.6 million tokens of Wikitext 103. The q2_K quant created using that particular importance matrix works fine.

> If it looks ok with CPU, then it is probably an issue in the CUDA/HIP backend. I don't know why the nans are being generated, we would need to investigate.

It's because CUDA/HIP use IEEE 754 half-precision floats, which have a reduced numerical range compared to the single-precision floats used in the CPU backend. I very much expect that if the weights were converted to BF16 instead of FP16 then you at least wouldn't get NaN. Alternatively, with LLAMA_CUDA_FORCE_MMQ you may also get non-NaN results since the accumulator is FP32. In any case, look at the perplexity of the quants with/without the importance matrix; clearly there is something wrong with the importance matrix quants even on CPU.
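To illustrate the range difference, a toy sketch (not llama.cpp code; `clamp_to_fp16_range` is a made-up helper that only models the FP16 range limit of 65504, not its reduced precision):

```cpp
// Toy illustration: an accumulator limited to FP16 range can overflow to inf
// -- and subsequently produce NaN -- where an FP32 accumulator does not.
#include <cmath>
#include <cstdio>

// Crude stand-in for FP16 overflow behavior: anything beyond the largest
// finite half-precision value (~65504) overflows to +/-infinity.
static float clamp_to_fp16_range(float x) {
    const float FP16_MAX = 65504.0f;
    return std::fabs(x) > FP16_MAX ? std::copysign(INFINITY, x) : x;
}

int main() {
    float acc_fp32 = 0.0f;   // FP32 accumulator (as on the CPU backend)
    float acc_fp16 = 0.0f;   // accumulator constrained to FP16 range

    for (int i = 0; i < 1000; ++i) {
        acc_fp32 += 100.0f;
        acc_fp16 = clamp_to_fp16_range(acc_fp16 + 100.0f);
    }

    std::printf("fp32 sum: %f\n", acc_fp32);              // 100000.0
    std::printf("fp16 sum: %f\n", acc_fp16);              // inf
    std::printf("inf - inf: %f\n", acc_fp16 - acc_fp16);  // nan
    return 0;
}
```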

@slaren (Collaborator) commented Apr 23, 2024

I already tried disabling FP16 mat mul and using LLAMA_CUDA_FORCE_MMQ, and it still results in nans. The model is likely corrupted; I think there are nans in the block scales, but I don't know why it doesn't affect the CPU backend.

@JohannesGaessler (Collaborator)

Anyways, I've uploaded the importance matrix I generated: https://huggingface.co/JohannesGaessler/llama.cpp_importance_matrices

@Nekotekina (Contributor, Author) commented Apr 23, 2024

Hmm, could this be related?
#6804
I'm not sure where these special tokens are coming from though

@slaren (Collaborator) commented Apr 23, 2024

So this has nothing to do with FP16 precision. The block scales of blk.21.ffn_up.weight have nans, which results in nans when performing the matrix multiplication. In the CPU backend, the result is then quantized again to q8_K for the next matrix multiplication, which dutifully eats the nans and turns them into zeros. And so it looks like it works somewhat, but it really doesn't.

I don't know how the nans got there in the first place, but the model is not valid.

@JohannesGaessler (Collaborator)

Good to know, thanks for investigating.

@Nekotekina (Contributor, Author) commented Apr 24, 2024

Thank you, I managed to generate IQ2_XS and even IQ2_XXS with the imatrix provided by JohannesGaessler. Not sure how bad the perplexity is:
Q2_K without imat: Final estimate: PPL = 5.9315 +/- 0.03553
IQ2_XS: Final estimate: PPL = 6.3082 +/- 0.03853
IQ2_XXS: Final estimate: PPL = 7.2195 +/- 0.04552

@Nekotekina (Contributor, Author) commented Apr 24, 2024

@slaren you mean that imatrix has NaNs, or the original Llama-3 model?

@slaren (Collaborator) commented Apr 24, 2024

I was talking specifically about the Meta-Llama-3-70B.i1-Q2_K.gguf model that produces nans with CUDA and HIP.

@Nekotekina (Contributor, Author) commented Apr 24, 2024

Ah sorry, I misread your message.

Btw, wouldn't it make sense to check for NaNs and other bad values in the model/imatrix when they are loaded/generated? I think it'd only take a fraction of a second for Q quants, and it could help prevent broken models from spreading.
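For illustration, a minimal sketch of the kind of check I mean (a hypothetical helper, not the validation that llama.cpp actually implements; FP32 tensors are a straight scan, while quantized types would additionally need their FP16 block scales decoded and checked):

```cpp
// Minimal sketch of a "bad value" scan over FP32 tensor data.
#include <cmath>
#include <cstddef>
#include <cstdio>

bool tensor_data_is_finite(const float * data, size_t n, const char * name) {
    for (size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            std::fprintf(stderr, "%s: invalid value %f at index %zu\n", name, data[i], i);
            return false;
        }
    }
    return true;
}

int main() {
    const float w[4] = { 0.5f, -1.0f, NAN, 2.0f };
    // Reports the NaN and returns false (tensor name borrowed from this thread).
    std::printf("ok: %d\n", tensor_data_is_finite(w, 4, "blk.21.ffn_up.weight"));
}
```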

@slaren (Collaborator) commented Apr 24, 2024

Yes, I think that would be good.

@schmorp commented Apr 27, 2024

Hi, mradermacher here. I was pointed at this PR, and I get a lot of negative feedback because of the misinformation spread here.

First, I don't know where JohannesGaessler got his wrong info from, but the repository of course does/did not say that the imatrix is from another repository - he made this up. The process is also documented (linked from all my repos). Basically, I create imatrices the same way as everybody else, namely by running imatrix from llama.cpp with default parameters, as documented. The imatrix is not taken from another repository; it is created specifically for each model that I quantise. I have never hidden how I do things, and people who asked me about details have always gotten a constructive answer from me. Please, JohannesGaessler, don't just make things up in the future. My desire to contribute here or on hf has been dampened by this PR and the backlash I got as a result of your careless comment.

I have documented that llama.cpp is crashing, for every affected model, and in the metadata of my quants I also added the exact crash reason. For every class of crash I could identify, I dutifully opened an issue (e.g. #6597 #6825 #6067), but there has been little to no interest on the side of llama.cpp to do anything about it or even investigate (there has been some, notably by slaren).

I have since verified that this is not caused by a hardware issue, by running imatrix generation on multiple different computers and getting identical (and problematic) imatrix output.

The fact is, llama.cpp often generates nans during imatrix generation. I have proposed a patch (in #6018) that would catch this in at least some cases, but so far it has been ignored. It also seems clear that my patch does not catch all cases, resulting in reduced quality or broken quants even when imatrix generation and quantization seemingly succeed. llama.cpp failing to generate an imatrix (with my patch) or crashing is an extremely common occurrence, not an isolated problem with a few models. Most of the time, these quants are not even published.

The original reporter here was very rude on huggingface and was hiding his comments, which is the reason I was uninterested in discussing this further with him; it is not, as he claims, that I was uninterested in discussing any problems, as the numerous other community discussions on my other models show. Especially since llama.cpp had already been notified of these issues and they were documented in the model card.

Dear llama.cpp developers, do not get me wrong - I am not saying you are lazy, doing bad work, don't care, or need to fix bugs or support me in any way. However, it is clear that the imatrix/quant process is very buggy at the moment, either because of bugs in it, or because of bugs in the source models not caught by it, and for whatever (without doubt very good) reasons, there is little activity in getting this fixed or improved.

I am, however, a bit frustrated that I keep opening bug reports and trying to help, get mostly ignored, and then some random person can spread lies about me here, supported (almost certainly unintentionally) by llama.cpp contributors. The very least would be to either acknowledge the problems and bugs in llama.cpp, or at least investigate things before drawing conclusions.

And to everybody else who reads this report and then plans to attack me personally, please check your facts before you do so.

I am probably the person doing by far the most quants with llama.cpp on huggingface currently (more than 500TB in the last months), and it is therefore not surprising that I run into a lot of issues. But so far, all these issues have been either bugs in llama.cpp or (less often) problems with the source model. Blaming the messenger is the wrong reaction.

Thanks for your concern.

@JohannesGaessler (Collaborator)

> First, I don't know where JohannesGaessler got his wrong info from, but the repository of course does/did not say that the imatrix is from another repository - he made this up. The process is also documented (linked from all my repos). Basically, I create imatrices the same way as everybody else, namely by running imatrix from llama.cpp with default parameters, as documented. The imatrix is not taken from another repository; it is created specifically for each model that I quantise. I have never hidden how I do things, and people who asked me about details have always gotten a constructive answer from me. Please, JohannesGaessler, don't just make things up in the future. My desire to contribute here or on hf has been dampened by this PR and the backlash I got as a result.

The repository no longer exists, but I thought I had read on its page that the importance matrix was taken from Nous Research. It is definitely possible that I mixed something up. In any case, it was a mistake made in good faith.

@schmorp commented Apr 27, 2024

@JohannesGaessler I don't doubt (and have not doubted) your good faith. All my model cards have essentially the same text with variations (e.g. https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF). The text said "weighted/imatrix quants of (url)" (as in https://huggingface.co/mradermacher/Eurux-8x22b-nca-i1-GGUF for example).

Your comment in itself is harmless; the problem is that it was wrong.

Of course I'm not holding you responsible for attacks by other people, not even the original reporter (who clearly did not act in good faith); I'm just pointing out the fallout that has happened because of factually wrong comments here.

The real issue is that the problem is very, very common, affecting a large number of models, and in my book these are all bugs in llama.cpp: either imatrix generation itself is buggy, or it crashes as a result of overflows etc. in the source models (which I consider a bug, because it should not crash, but that is in the eye of the beholder).

The second issue is that actual bug reports by the person doing the quantisation elicit very little response, but this bug report, including negative comments about my work, attracts a lot more attention, and even action. This does make me question whether my bug reports are even of value, and whether my time reporting these issues is well spent.

It shouldn't be the case that actual volunteer contributors have to deal with this.

Again, I am not blaming any llama.cpp contributor, but a little respect towards the people using it at large scale and trying to report issues would also not be misplaced. And possibly more communication. Maybe.

We are all volunteers, with limited time and resources. I shouldn't have to write messages correcting misinformation about my work here of all places, wasting your and my time.

@schmorp commented Apr 27, 2024

PS: the repository no longer exists because I delete repos that are demonstrated to be broken. I have done this numerous times, and if the original reporter had actually discussed the issue with me using his and my evidence, I would have deleted it earlier.

@schmorp commented Apr 27, 2024

Oh, and an even better example is https://huggingface.co/mradermacher/llama-3-70B-instruct-uncensored-i1-GGUF where I documented that llama.cpp crashes on some quants, and the metadata documents the crash reason:

no_imatrix: 'GGML_ASSERT: llama.cpp/ggml-quants.c:11239: grid_index >= 0'

Models where the imatrix generation aborts because of my patch in #6018 are not even uploaded; sometimes I use a smaller training set instead, which often succeeds, but I think it is clear by now that this is no guarantee that llama.cpp generates a correct imatrix.

@JohannesGaessler (Collaborator)

It's already close to 2 AM where I live but I think the 160K tokens you use as input may simply not be enough. I'll do some related testing tomorrow.

@slaren (Collaborator) commented Apr 27, 2024

The main reason this received more attention is that it pointed to a bug in a backend, which is something I am interested in and can act upon immediately. And of course, base models will always receive more attention than merges, which have very little chance of being useful in the first place. Nonetheless, all bug reports are welcome.

Hopefully the --check-tensors flag added in response to this report, and the additional checks during quantization, will make these errors less likely in the future.

@schmorp commented Apr 28, 2024

@JohannesGaessler I have never seen imatrix recover from nans just by using more tokens. In fact, fewer tokens have a higher chance of not running into the problem in the first place. E.g. once the perplexity calculation results in nans, imatrix has never recovered for me (and I have done many hundreds of imatrix calculations, quite a few of which failed like this). Honestly, this smells like another ad-hoc claim (and unrelated, I probably live in the same city as you currently, greetings!).

@slaren not sure your comment about merges having little chance of being useful will be well received by the sizable community on hf which creates and uses them (probably quite overrepresented among llama.cpp users), but thanks for stating the llama.cpp position (and priority) about it clearly.

Also, while the checks might make the assertions go away, they still don't diagnose the problem, and do nothing towards fixing it. I totally agree that they are useful, but they are mostly just the same crash with a different message.

@slaren (Collaborator) commented Apr 28, 2024

I only speak for myself. If you want the "llama.cpp position", ask @ggerganov. Please, don't make me have to prefix everything I say with "in my opinion".

The main goal of the checks is to avoid the distribution of plainly wrong models, so that this situation doesn't happen again. It should also help detect when a base model (meaning the source model used for quantization) has issues.

@schmorp commented Apr 28, 2024

@slaren thanks for the clarification. However, you made a statement of fact, and I have to go by what you actually write, so don't make this my fault somehow :) Anyway, your opinion about priorities is really appreciated.

Regarding the actual problem at hand, the question is whether models that do something (i.e. work, to some degree, with transformers) should be completely refused by llama.cpp. The quants in question here did give totally reasonable answers if you use the correct template, so they were obviously not totally broken or unusable, at least when used with CUDA or the CPU.

What I am saying is that there is a class of models that would work totally fine, but cannot be converted to gguf anymore.

@slaren (Collaborator) commented Apr 28, 2024

Specifically, this model had nan values in the block scales. nans propagate through every operation, so at the very least this would cause the entire layer with nan values to be useless. In this case, it worked on CPU due to a peculiarity of the implementation that prevents nans from being propagated in some cases, but it didn't work with CUDA. I don't think there is any situation in which a model having nan values is acceptable; it suggests that something went terribly wrong at some point.

@schmorp commented Apr 28, 2024

@slaren: I think the confusion here is between the source model and the quants (which are also models). Currently, both are being refused. Also, I simply think it goes a bit far to dictate to everybody what is acceptable output for models that (with transformers, or even broken quants) give reasonable output. As a concrete example of the second class, I have asked the "unacceptable" quant that this report is about many questions, such as "What is the capital of France?" and "What is the problem with the definition of the second Bernoulli number?", and got perfectly fine answers (all with CUDA on my RTX 4090).

Something might have gone terribly wrong, but that's still simply what llama.cpp does at the moment for some inputs.

And yes, I am mostly concerned about refusing unquantized models (which is happening).

In any case, I am not against better diagnostics, but the check that is implemented does not actually diagnose the problem. It diagnoses what could be a symptom of a problem, but it also might misfire. I am not sure this is automatically a positive thing - sure, some possibly problematic models get refused, but we don't know if this causes actual problems (clearly, it depends on the input, so you are refusing models that might be perfectly fine except for some inputs).

@slaren (Collaborator) commented Apr 28, 2024

There is simply no way that llama.cpp can guarantee that a model with nans will work at all. In most cases it will result in the entire activation being turned into nan, which will then result in all-nan logits. In this specific case, and purely by chance, the nans were converted to zero before they had a chance to be propagated. If you really want to do this, your best bet would be to write a tool to filter the nan values out of the base model and replace them with zeroes or any other finite value.
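As a rough illustration of what such a tool could look like (a hypothetical sketch for FP32 source tensors, not something llama.cpp provides):

```cpp
// Hedged sketch of a "filter tool": scrub non-finite values from an FP32
// weight buffer before quantization. Whether zero is a sensible replacement
// is up to the user.
#include <cmath>
#include <cstddef>
#include <cstdio>

size_t scrub_non_finite(float * data, size_t n, float replacement = 0.0f) {
    size_t replaced = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            data[i] = replacement;
            ++replaced;
        }
    }
    return replaced;  // caller can log or abort if this is non-zero
}

int main() {
    float w[4] = { 0.5f, NAN, INFINITY, 2.0f };
    std::printf("replaced %zu values\n", scrub_non_finite(w, 4));  // prints 2
}
```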

@JohannesGaessler (Collaborator)

Using Wikitext-103 train as input and the models that I already have available I am so far not able to provoke imatrix into producing NaN values.

imatrix calculates sums of squares of src1. These sums are stored in float accumulators. Intuitively I would think that the precision/numerical range of float is sufficient.
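Roughly, the accumulation looks like this (a simplified sketch, not the actual imatrix.cpp code; the names are made up):

```cpp
// Per column of src1, keep a running FP32 sum of squared activations plus a
// call count. FP32 reaches ~3.4e38, so the accumulator itself is unlikely to
// overflow; a NaN in src1, however, would poison its column sum permanently.
#include <cstddef>
#include <vector>

struct ImatrixStats {
    std::vector<float> sum_sq;  // one accumulator per column of src1
    int ncall = 0;
};

void accumulate(ImatrixStats & stats, const float * src1, size_t n_cols, size_t n_rows) {
    if (stats.sum_sq.size() != n_cols) {
        stats.sum_sq.assign(n_cols, 0.0f);
    }
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < n_cols; ++c) {
            const float x = src1[r*n_cols + c];
            stats.sum_sq[c] += x*x;
        }
    }
    stats.ncall++;
}

int main() {
    ImatrixStats stats;
    const float src1[6] = { 1.0f, -2.0f, 0.5f, 3.0f, 0.0f, -1.5f };
    accumulate(stats, src1, /*n_cols=*/3, /*n_rows=*/2);
}
```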

The data for an importance matrix on disk uses types such as int that do not have a fixed width across platforms. This could in principle cause issues, but only if the importance matrix calculation and the quantization were to happen on different machines, and I think in that case the program would simply crash.

> The real issue is that the problem is very, very common, affecting a large number of models, and in my book these are all bugs in llama.cpp: either imatrix generation itself is buggy, or it crashes as a result of overflows etc. in the source models (which I consider a bug, because it should not crash, but that is in the eye of the beholder).

Generally speaking, it is not feasible to guarantee good results for a numerical calculation for all input values or to determine whether specific input values will produce bad results.
This is why I suspected that the issue may be the amount of input data: based on just reading the short imatrix algorithm description I thought the process would have poor numerical stability.
Looking at the source code I think this is not the case however.

> The second issue is that actual bug reports by the person doing the quantisation elicit very little response, but this bug report, including negative comments about my work, attracts a lot more attention, and even action. This does make me question whether my bug reports are even of value, and whether my time reporting these issues is well spent.

I primarily work on the things that I am interested in/familiar with.
I consider importance matrices a non-essential, nice-to-have addon and I have only today looked at the internal workings of importance matrix calculation.
Issues along the lines of "problem with imatrix" are therefore much less likely to get attention from me compared to "discrepancy between HIPBLAS and CPU" which - depending on the specifics - I would consider a high-priority issue with the things that I have directly worked on.

Sorry, but that's just how it is with an open-source project; you either have to fix things yourself or you have to get a dev to care enough to fix it.

> I think the confusion here is between the source model and the quants (which are also models). Currently, both are being refused. Also, I simply think it goes a bit far to dictate to everybody what is acceptable output for models that (with transformers, or even broken quants) give reasonable output. As a concrete example of the second class, I have asked the "unacceptable" quant that this report is about many questions, such as "What is the capital of France?" and "What is the problem with the definition of the second Bernoulli number?", and got perfectly fine answers (all with CUDA on my RTX 4090).
>
> Something might have gone terribly wrong, but that's still simply what llama.cpp does at the moment for some inputs.
>
> And yes, I am mostly concerned about refusing unquantized models (which is happening).
>
> In any case, I am not against better diagnostics, but the check that is implemented does not actually diagnose the problem. It diagnoses what could be a symptom of a problem, but it also might misfire. I am not sure this is automatically a positive thing - sure, some possibly problematic models get refused, but we don't know if this causes actual problems (clearly, it depends on the input, so you are refusing models that might be perfectly fine except for some inputs).

I 100% agree with slaren on this: models with NaN values should produce a hard error.
If the models end up producing correct or at least coherent results, this is essentially just due to undefined behavior.
And I cannot stress this enough: whatever you do, do not distribute models with known NaN values.

@schmorp commented Apr 28, 2024

@JohannesGaessler so how would a sum of something plus NaNs turn into something that is not a NaN, which is something you claimed would happen? Nobody has seen this happen, and numerically it cannot happen. What is seen is that once the perplexity calculations report a nan, this never changes, even with lots of tokens.

Also, I am not sure why you react so defensively or want to educate me on how open source projects work (which I am well aware of, having contributed to free software for more than 30 years now); I have made it clear multiple times that I don't expect you to fix bugs I report, and that we are all volunteers.

What I do expect is that bugs and bug reporters are taken seriously, even if there are no resources to investigate. What you are doing is continuously making up ad-hoc claims that somehow show that I am doing something wrong (e.g. using the wrong imatrix, not using enough tokens), without backing them up. Not a good thing to do.

BTW, I have never distributed models with known nans, but I don't think I want you to order me around like this.

Anyway, what I get from all your responses is that llama.cpp does not care much about the correctness of imatrix quants, and the tendency is to blame the reporter for doing something wrong without being able to specify what "right" would be.

I came here to set the record straight, and I will not bother you again with reports from now on. Good luck and have fun.

@JohannesGaessler (Collaborator)

> so how would a sum of something plus NaNs turn into something that is not a NaN, which is something you claimed would happen? Nobody has seen this happen, and numerically it cannot happen. What is seen is that once the perplexity calculations report a nan, this never changes, even with lots of tokens.

My understanding is that, due to conditional statements involving NaNs being evaluated as false, the overall result of a matrix multiplication in main can end up as 0 rather than NaN, depending on the exact code path.
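A toy illustration of that effect (the actual llama.cpp quantization code differs in detail; `quantize_block_scale` is a made-up stand-in):

```cpp
// Comparisons with NaN evaluate to false, so a max-abs reduction over a block
// of NaNs leaves amax at 0, and the "all-zero block" shortcut then turns the
// whole block into zeros instead of propagating NaN.
#include <cmath>
#include <cstdio>

float quantize_block_scale(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float ax = std::fabs(x[i]);
        if (ax > amax) amax = ax;   // comparison is false for NaN, amax stays 0
    }
    if (amax == 0.0f) {
        return 0.0f;                // block treated as all zeros; the NaNs vanish
    }
    return amax / 127.0f;           // otherwise a normal scale is produced
}

int main() {
    const float block[4] = { NAN, NAN, NAN, NAN };
    std::printf("scale: %f\n", quantize_block_scale(block, 4));  // prints 0.000000
}
```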

And my assumption about the issue was that the NaNs are not in the imatrix itself but rather occur when the model is later quantized due to poor numerical stability.
I have never encountered this issue and was unaware that the NaN values are likely already occurring in the perplexity values.
(It could also be due to exponentiation but I think that this is unlikely.)

> Also, I am not sure why you react so defensively or want to educate me on how open source projects work (which I am well aware of, having contributed to free software for more than 30 years now); I have made it clear multiple times that I don't expect you to fix bugs I report, and that we are all volunteers.
>
> What I do expect is that bugs and bug reporters are taken seriously, even if there are no resources to investigate. What you are doing is continuously making up ad-hoc claims that somehow show that I am doing something wrong (e.g. using the wrong imatrix, not using enough tokens), without backing them up. Not a good thing to do.

I am taking this seriously; that is why I'm trying to reproduce the issue, which so far I have not been able to do.
As I said, things can get fixed if you can get a dev to care, and right now I do care.

@sdmorrey

Perhaps I'm on the summit of Mount Stupid here due to the Dunning-Kruger effect, but from my perspective this issue would be resolved simply by setting NaN to 0 when it's found. The validation phase seems like a good place to do it.

Would it be possible to add it as an option? Maybe a good first issue for an aspiring contributor.

@slaren (Collaborator) commented May 17, 2024

I would rather avoid creating bad models in the first place than mask the issue. The validation can be costly and is only run when requested. Also, when using mmap, the model cannot be modified. My view is that if you know what you are doing and you are OK with the nans, you can write a program yourself to replace the nan values with whatever value you want.

During quantization both the source and the quantized data are validated, so that should prevent creating new models with nans from this point on.

@JohannesGaessler (Collaborator)

Setting any values to 0 is equivalent to just removing them from the model. Experiments have shown that you can remove up to ~10% of model weights at random, but this severely degrades quality. Rather than accepting a faulty model with NaNs and producing degraded outputs, I think a hard error makes more sense, since this is likely not what the user wants.
