Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely? #6841
Comments
I also get
@slaren does the model itself work, in the sense that it doesn't output garbage? Because that was my issue with AMD initially.
If it looks ok with CPU, then it is probably an issue in the CUDA/HIP backend. I don't know why the NaNs are being generated; we would need to investigate.
I think this is not an issue with the code but rather with the model from that particular repository. The repository says it took the importance matrix from another repository, which in turn says nothing about how the importance matrix was calculated. Using a few hours of my 6x RTX 4090 machine, I calculated an importance matrix on 2.6 million tokens of Wikitext 103. The q2_K quant created using that particular importance matrix works fine.
It's because CUDA/HIP use IEEE 754 half-precision floats, which have a reduced numerical range compared to the single-precision floats used by the CPU backend. I very much expect that if the weights were converted to BF16 instead of FP16 then you at least wouldn't get NaN. Alternatively, with
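As a quick illustration of the range difference, a minimal sketch in PyTorch (the value 70000 is arbitrary and not taken from this model):

```python
import torch

# IEEE 754 half precision (FP16) tops out at ~65504, so a moderately large
# intermediate value overflows to inf (and inf - inf later becomes NaN),
# while BF16 keeps roughly the FP32 exponent range at reduced precision.
x = torch.tensor(70000.0, dtype=torch.float32)
print(x.to(torch.float16))   # tensor(inf, dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor(70144., dtype=torch.bfloat16) - finite, coarsely rounded
```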
I already tried disabling FP16 mat mul and
Anyways, I've uploaded the importance matrix I generated: https://huggingface.co/JohannesGaessler/llama.cpp_importance_matrices
Hmm, could this be related?
So this has nothing to do with FP16 precision. The block scales of the quantized model contain NaNs. I don't know how the NaNs got there in the first place, but the model is not valid.
Good to know, thanks for investigating.
Thank you, I managed to generate IQ2_XS and even IQ2_XXS with the imatrix provided by JohannesGaessler. Not sure how bad the perplexity is:
@slaren you mean that the imatrix has NaNs, or the original Llama-3 model?
I was talking specifically about the
Ah sorry, I misread your message. Btw, wouldn't it make sense to check for NaNs and other bad values in the model/imatrix when they are loaded/generated? I think it'd only take a fraction of a second for Q quants, but it could be beneficial to prevent spreading broken models?
Yes, I think that would be good.
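A minimal sketch of what such a check could look like, assuming the gguf-py package that ships with llama.cpp. It only inspects tensors stored as F16/F32, since quantized types would need their blocks unpacked first; this is only an illustration, not the validation that llama.cpp actually implements:

```python
import numpy as np
from gguf.gguf_reader import GGUFReader
from gguf.constants import GGMLQuantizationType

def tensors_are_finite(path: str) -> bool:
    """Return False if any F16/F32 tensor in the GGUF file contains NaN or inf."""
    ok = True
    for t in GGUFReader(path).tensors:
        # Quantized types store packed blocks; only plain float tensors are checked here.
        if t.tensor_type in (GGMLQuantizationType.F16, GGMLQuantizationType.F32):
            if not np.isfinite(np.asarray(t.data, dtype=np.float32)).all():
                print(f"{t.name}: contains NaN/inf")
                ok = False
    return ok

print("ok" if tensors_are_finite("model.gguf") else "bad values found")
```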
Hi, mradermacher here. I was pointed at this PR, and I get a lot of negative feedback because of the misinformation spread here.

First, I don't know where JohannesGaessler got his wrong info from, but the repository of course does/did not say that the imatrix is from another repository - he made this up. The process is also documented (linked from all my repos). Basically, I create imatrices the same way as everybody else, namely by running imatrix from llama.cpp with default parameters, as documented. The imatrix is not taken from another repository; it is created specifically for each model that I quantise. I have never hidden how I do things, and people who asked me about details have always gotten a constructive answer from me. Please, JohannesGaessler, don't just make things up in the future. My desire to contribute here or on hf has been dampened by this PR and the backlash I got as a result of your careless comment.

I have documented that llama.cpp is crashing, for every affected model, and in the metadata of my quants I also added the exact crash reason. For every class of crash I could identify, I dutifully opened an issue (e.g. #6597 #6825 #6067), but there has been little to no interest on the side of llama.cpp to do anything about it or even investigate (there has been some, notably by slaren). I have since verified that this is not caused by a hardware issue, by running imatrix generation on multiple different computers and getting the same identical (and problematic) imatrix output.

The fact is, llama.cpp often generates NaNs during imatrix generation. I have proposed a patch (#6018) that would catch this in at least some cases, but so far it has been ignored. It also seems clear that my patch does not catch all cases, resulting in reduced quality or broken quants even when imatrix generation and quantization are seemingly successful. llama.cpp failing to generate an imatrix (with my patch) or crashing is an extremely common occurrence, not an isolated problem with a few models. Most of the time, these quants are not even being published.

The original reporter here was very rude on huggingface and was hiding his comments, which is the reason I was uninterested in discussing this further with him - not, as he claims, that I was uninterested in discussing any problems, as the numerous other community discussions on my other models show. Especially since llama.cpp had already been notified of these issues and they were documented in the model card.

Dear llama.cpp developers, do not get me wrong - I am not saying you are lazy, doing bad work, don't care, or need to fix bugs or support me in any way. However, it is clear that the imatrix/quant process is very buggy at the moment, either because of bugs in it, or because of bugs in the source models not caught by it, and for whatever (without doubt very good) reasons, there is little activity in getting this fixed or improved. I am, however, a bit frustrated that I keep opening bug reports and trying to help, get mostly ignored, and then some random person can spread lies about me here, supported (almost certainly unintentionally) by llama.cpp contributors. The very least would be to either acknowledge the problems and bugs in llama.cpp, or at least investigate things before drawing conclusions. And to everybody else who reads this report and then plans to attack me personally: please check your facts before you do so.

I am probably the person doing by far the most quants with llama.cpp on huggingface currently (more than 500TB in the last months), and it is therefore not surprising that I run into a lot of issues. But so far, all these issues have been either bugs in llama.cpp or (less often) problems with the source model. Blaming the messenger is the wrong reaction. Thanks for your concern.
The repository no longer exists, but I thought I had read on its page that the importance matrix was taken from Nous Research. It is entirely possible that I mixed something up. In any case, it was a mistake made in good faith.
@JohannesGaessler I don't doubt (and have not doubted) your good faith. All my model cards have essentially the same text with variations (e.g. https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF). The text said "weighted/imatrix quants of (url)" (as in https://huggingface.co/mradermacher/Eurux-8x22b-nca-i1-GGUF for example). Your comment in itself is harmless; the problem was that it was wrong. I'm not making you responsible for attacks by other people, of course, not even the original reporter (who clearly did not act in good faith), just pointing out the fallout that has happened because of factually wrong comments here.

The real issue is that the problem is very, very common, affecting a large number of models, and in my book these are all bugs in llama.cpp, either because imatrix generation itself is buggy, or because it crashes as a result of overflows etc. in the source models (which I consider a bug, because it should not crash, but that is in the eye of the beholder).

The second issue is that actual bug reports by the person doing the quantisation elicit very little response, while this bug report, including negative comments about my work, attracts a lot more attention, and even action. This does make me question whether my bug reports are even of value, and whether my time reporting these issues is well spent. It shouldn't be the case that actual volunteer contributors have to deal with this. Again, I am not blaming any llama.cpp contributor, but a little respect towards the people using it at large scale and trying to report issues would also not be misplaced. And possibly more communication. Maybe. We are all volunteers, with limited time and resources. I shouldn't have to write messages correcting misinformation about my work here of all places, wasting your and my time.
PS: the repository no longer exists because I delete repos that are demonstrated to be broken. I have done this numerous times, and if the original reporter had actually discussed the issue with me, using his and my evidence, I would have deleted it earlier.
Oh, and an even better example is https://huggingface.co/mradermacher/llama-3-70B-instruct-uncensored-i1-GGUF where I documented that llama.cpp crashes on some quants, and the metadata documents the crash reason:
Models where the imatrix generation aborts because of my patch (#6018) are not even uploaded, or sometimes I use a smaller training set, which often succeeds, but I think it is clear by now that this is no guarantee that llama.cpp generates a correct one.
It's already close to 2 AM where I live but I think the 160K tokens you use as input may simply not be enough. I'll do some related testing tomorrow.
The main reason this received more attention is because it pointed to a bug in a backend, which is something I am interested in and can act upon immediately. And of course, base models will always receive more attention than merges, which have very little chance of being useful in the first place. Nonetheless, all bug reports are welcome. Hopefully the
@JohannesGaessler I have never seen imatrix recover from NaNs just by using more tokens. In fact, fewer tokens have a higher chance of not running into the problem in the first place. E.g. once the perplexity calculation results in NaNs, imatrix has never recovered for me (and I did many hundred imatrix calculations, quite a few of which failed like this). Honestly, this smells like another ad-hoc claim (and unrelated, I probably live in the same city as you currently, greetings!).

@slaren I am not sure your comment about merges having little chance of being useful will be well received by the sizable community on hf which creates and uses them (probably quite overrepresented among llama.cpp users), but thanks for stating the llama.cpp position (and priority) about it clearly. Also, while the checks might make the assertions go away, they still don't diagnose the problem, and do nothing towards fixing it. I totally agree that they are useful, though, but they are mostly just the same crash with a different message.
I only speak for myself. If you want the "llama.cpp position", ask @ggerganov. Please, don't make me have to prefix everything I say with "in my opinion". The main goal of the checks is to avoid the distribution of plainly wrong models, so that this situation doesn't happen again. It should also help detect when a base model (meaning the source model used to quantize) has issues.
@slaren thanks for the clarification. However, you made a statement of fact, and I have to go by what you actually write, so don't make this my fault somehow :) Anyway, your opinion about priorities is really appreciated.

Regarding the actual problem at hand, the question is whether models that do something (i.e. work, to some degree, with transformers) should be completely refused by llama.cpp. The quants in question here did give totally reasonable answers if you use the correct template, so they were obviously not totally broken or unusable, at least when used with CUDA or the CPU. What I am saying is that there is a class of models that would work totally fine, but cannot be converted to GGUF anymore.
Specifically, this model had
@slaren: I think the confusion here is between the source model and the quants (which are also models). Currently, both are being refused. Also, I simply think it goes a bit far to dictate to everybody what is acceptable output for models that (with transformers, or even broken quants) give reasonable output. As a concrete example of the second class, I have asked the "unacceptable" quant that this report is about many questions, such as "What is the capital of France" and "What is the problem with the definition of the second Bernoulli number?", and got perfectly fine answers (all with CUDA on my RTX 4090). Something might have gone terribly wrong, but that's still simply what llama.cpp does at the moment for some inputs. And yes, I am mostly concerned about refusing unquantized models (which is happening).

In any case, I am not against better diagnostics, but the check that is implemented does not actually diagnose the problem. It diagnoses what could be a symptom of a problem, but it also might misfire. I am not sure this is automatically a positive thing - sure, some possibly problematic models get refused, but we don't know if this causes actual problems (clearly, it depends on the input, so you are refusing models that might be perfectly fine except for some inputs).
There is simply no way that llama.cpp can guarantee that a model with NaNs will work at all. In most cases it will result in the entire activation being turned into NaN, which will then result in all-NaN logits. In this specific case, and purely by chance, the NaNs were converted to zero before they had a chance to be propagated. If you really want to do this, your best bet would be to write a tool to filter the NaN values out of the base model and replace them with zeroes or any other finite value.
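For completeness, a minimal sketch of such a filter, assuming the source model is stored as safetensors (the file names are placeholders; sharded models would need each shard processed, with the index left intact):

```python
import torch
from safetensors.torch import load_file, save_file

def scrub_non_finite(path_in: str, path_out: str) -> None:
    """Replace NaN/inf weights with 0.0 and write a new shard. Use at your own risk."""
    tensors = load_file(path_in)  # dict: tensor name -> torch.Tensor
    for name, t in tensors.items():
        bad = ~torch.isfinite(t.float())
        if bad.any():
            print(f"{name}: zeroing {int(bad.sum())} non-finite values")
            tensors[name] = torch.nan_to_num(t, nan=0.0, posinf=0.0, neginf=0.0)
    save_file(tensors, path_out)

scrub_non_finite("model-00001-of-00002.safetensors",
                 "model-00001-of-00002.clean.safetensors")
```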
Using Wikitext-103 train as input and the models that I already have available I am so far not able to provoke
The data for an importance matrix on disk uses types such as
Generally speaking, it is not feasible to guarantee good results for a numerical calculation for all input values or to determine whether specific input values will produce bad results.
I primarily work on the things that I am interested in/familiar with. Sorry, but that's just how it is with an open-source project; you either have to fix things yourself or you have to get a dev to care enough to fix it.
I 100% agree with slaren on this: models with NaN values should produce a hard error.
@JohannesGaessler so how would a sum of something + NaNs turn out to be something other than a NaN, which is something you claimed would happen? Nobody has seen this happen, and numerically it cannot happen. What is seen is that once the perplexity calculations report a NaN, this will never change, even with lots of tokens.

Also, I am not sure why you react so defensively or want to educate me on how open-source projects work (which I am well aware of, having contributed to free software for more than 30 years now). I have made it clear multiple times that I don't expect you to fix bugs I report, and that we are all volunteers. What I do expect is that bugs and bug reporters are taken seriously, even if there are no resources to investigate. What you are doing is continuously making up ad-hoc claims that somehow show that I am doing something wrong (e.g. using the wrong imatrix, not using enough tokens), without backing them up. Not a good thing to do. BTW, I have never distributed models with known NaNs, but I don't think I want you to order me around like this.

Anyway, what I get from all your responses is that llama.cpp does not care much about the correctness of imatrix quants, and the tendency is to blame the reporter for doing something not right, without being able to specify what "right" would be. I came here to set the record straight, and I will not bother you again with reports from now on. Good luck and have fun.
My understanding is that, due to conditional statements involving NaNs being evaluated as false, the overall result of the matrix multiplications can end up not being NaN. And my assumption about the issue was that the NaNs are not in the imatrix itself but rather occur when the model is later quantized due to poor numerical stability.
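To make the first point concrete, a tiny sketch (plain Python, not llama.cpp code) of how a reduction guarded by a comparison can silently drop a NaN, whereas a plain sum propagates it:

```python
import math

# Every ordered comparison involving NaN evaluates to false, so a maximum that is
# tracked via "if v > best" never selects the NaN and it vanishes from the result.
vals = [1.0, float("nan"), 3.0]
best = -math.inf
for v in vals:
    if v > best:   # nan > best is False, so the NaN is skipped
        best = v
print(best)        # 3.0

# A plain sum, by contrast, propagates it:
print(sum(vals))   # nan
```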
I am taking this seriously; that is why I'm trying to reproduce the issue, which so far I have not been able to do.
Perhaps I'm on the summit of Mount Stupid here due to the Dunning-Kruger effect, but from my perspective this issue would be resolved simply by setting NaN to 0 when it's found. The validation phase seems like a good place to do it. Would it be possible to add it as an option? Maybe a good first issue for an aspiring contributor.
I would rather avoid creating bad models in the first place than mask the issue. The validation can be costly and is only run when requested. Also, when using mmap, the model cannot be modified. My view is that if you know what you are doing and you are ok with the NaNs, you can write a program to replace the NaN values yourself with whatever value you want in their place. During quantization both the source and quant data are validated, so that should prevent creating new models with NaNs from this point on.
Setting any values to 0 is equivalent to just removing them from the model. Experiments have shown that you can remove up to ~10% of model weights at random, but this severely degrades quality. Rather than accepting a faulty model with NaNs and producing degraded outputs, I think a hard error makes more sense since this is likely not what the user wants.
Hello. I did some perplexity tests while investigating an issue with Llama-3. Initially the issue was that the Llama-3 base 70B model outputs garbage with small quants made with an imatrix. I can't find any information indicating that ROCm possibly causes corruption.
GPU test (RX 7600 + RX 7600 XT)
https://huggingface.co/mradermacher/Meta-Llama-3-70B-i1-GGUF/tree/main
Meta-Llama-3-70B.i1-Q2_K.gguf prints [1]-nan,[2]-nan,[3]-nan,[4]-nan with -ngl 30 or 0 (prints garbage unless -ngl 0)
https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF/tree/main
Meta-Llama-3-70B.Q2_K.gguf - seems OK, [1]4.1839,[2]4.7300,[3]4.2751,[4]4.6444,[5]4.6942,[6]5.0426,[7]5.1405,[8]5.4747
Final estimate: PPL = 5.9315 +/- 0.03553
Pure CPU test
Meta-Llama-3-70B.i1-Q2_K.gguf with pure CPU 'perplexity' build (146 seconds per 512 tokens - ETA 26 hours 55.67 minutes)
[1]6.3962,[2]7.1886,[3]6.9886,[4]7.3853,[5]7.8924,[6]8.2982,[7]8.8956,[8]9.3799, (can't wait for many hours, stopped)
Meta-Llama-3-70B.Q2_K.gguf (static Q2_K):
[1]4.1675,[2]4.6952,[3]4.2374,[4]4.6452,[5]4.6677,[6]5.0459,[7]5.1258,[8]5.4649,^C
It's slightly better than on ROCm but the difference is very small.
I also found strange holes in the imatrix.dat that was used:
[Screenshot from 2024-04-23 13-34-09 showing the holes in imatrix.dat]
But the author seems uninterested in discussing that.