add basic tensor data validation function #6884
Conversation
I'm testing on M2 Ultra, and the 70B model load time increases noticeably.
There is no ARM SIMD optimization in the F16 branch, so this can help, but I'm also not sure if we want to implement it for all instruction sets. Maybe this validation should be opt-in? Probably we can do it every time with `quantize`, though.
It is not as bad for me on x86: with a purely CPU backend it doubles the load time, but the load time with mmap is always very fast. When offloading, the overhead becomes much less significant. The AVX2 implementation helps a lot; it makes checking FP16 models about as fast as quant models. Still, I agree that it is too slow to leave this enabled unconditionally, so I will make it optional except for `quantize`.
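For context on why a SIMD check is cheap: an IEEE-754 half is NaN or Inf exactly when all five exponent bits (`0x7C00`) are set, so no conversion to float is needed. A minimal AVX2 sketch of that idea (the helper name is hypothetical; this is not necessarily how the PR implements it):

```cpp
#include <immintrin.h>
#include <cstdint>

// Returns true if any of the 16 FP16 values at p is NaN or Inf.
// A half is NaN/Inf iff its exponent bits (0x7C00) are all set.
static bool fp16x16_any_nan_or_inf(const uint16_t * p) {
    const __m256i exp_mask = _mm256_set1_epi16(0x7C00);
    __m256i v   = _mm256_loadu_si256((const __m256i *) p);
    __m256i exp = _mm256_and_si256(v, exp_mask);
    __m256i bad = _mm256_cmpeq_epi16(exp, exp_mask); // 0xFFFF where NaN/Inf
    return _mm256_movemask_epi8(bad) != 0;
}
```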
tensor validation is disabled by default and can be enabled by adding `--check-tensors` to the command line arguments. quantize always validates tensors.
Hmm, quantize won't validate tensors which are being written?
This should be a lot faster now in most cases. There is a NEON implementation for FP16 validation, and the model validation is multithreaded. If this proves to be useful, it may be a good idea to enable it by default in the server: since it is not a process that is meant to be restarted very often, the increase in load time should be less important.
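A rough sketch of the two ingredients mentioned here, a NEON lane check plus a thread-level split of the work (helper names are hypothetical and this is not the PR's actual code; `vmaxvq_u16` requires AArch64):

```cpp
#include <arm_neon.h>

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// NEON: test 8 FP16 values at once; a half is NaN/Inf iff exponent bits 0x7C00 are all set.
static bool fp16x8_any_nan_or_inf(const uint16_t * p) {
    const uint16x8_t exp_mask = vdupq_n_u16(0x7C00);
    uint16x8_t v   = vld1q_u16(p);
    uint16x8_t bad = vceqq_u16(vandq_u16(v, exp_mask), exp_mask); // 0xFFFF where NaN/Inf
    return vmaxvq_u16(bad) != 0;
}

// Split an n-element FP16 buffer across worker threads (n assumed a multiple of 8 here).
static bool validate_fp16_parallel(const uint16_t * data, size_t n, size_t n_threads) {
    std::atomic<bool> ok{true};
    std::vector<std::thread> workers;
    const size_t chunk = ((n / 8 + n_threads - 1) / n_threads) * 8; // keep chunks 8-aligned
    for (size_t first = 0; first < n; first += chunk) {
        const size_t last = std::min(first + chunk, n);
        workers.emplace_back([&, first, last] {
            for (size_t i = first; i + 8 <= last; i += 8) {
                if (fp16x8_any_nan_or_inf(data + i)) {
                    ok = false;
                    return;
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return ok;
}
```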
The 70B F16 case is now down to 20s (from 50s).
Yes, we can add it to server. Also, let's note the ggml-ci time increase, since we do a lot of quantizations there and it might lead to a significant increase. If so, we might want to add an option to disable it during quantize in the CI, since the model data is not changing there. But only if the time increase is significant.
It doesn't seem to affect the CI times significantly; in fact, the cuda-v100 time was a bit lower than the latest time on master, so I think that it is within the normal variation between runs.
* add basic tensor data validation function
* add --check-tensors command line argument
Adds `ggml_validate_row_data` to validate tensor data, and adds this validation to llama.cpp during model loading. For floating point tensors it checks all the data; for quant types it checks the scales only. The validation consists of checking for `nan` and `inf` values in the tensors. Should help detect issues with models such as the one reported in #6841.
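A minimal sketch of the scalar idea for F32 data (an assumption of how it might look, not the actual implementation; the real `ggml_validate_row_data` dispatches on tensor type, and for quant types would check only the per-block scales):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>

// Check every F32 value for nan/inf. For quantized types the same idea applies,
// but only the per-block scale fields would be converted and checked.
static bool validate_f32_data(const char * name, const float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            fprintf(stderr, "%s: invalid value %f at index %zu\n", name, data[i], i);
            return false;
        }
    }
    return true;
}
```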
Hopefully it won't increase load time too much so that it can be left enabled permanently, but more testing is necessary.