Allow s390x to load little endian models unmodified #11234
base: master
Conversation
I don't have any big endian systems for testing or maintenance. Do you pledge to help with maintaining big endian support long-term?
Yes, we intend to maintain big endian support long-term. We could also help to integrate a Z VM guest as a test system into CI if you'd like.
ggml/src/ggml.c
Outdated
#if defined(__gnu_linux__)
#include <endian.h>
#else
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif

// endianness conversion
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
    *((uint16_t*)value) = le16toh(*((uint16_t*)value));
}

static inline void convert_from_le32(void * value) {
    *((uint32_t*)value) = le32toh(*((uint32_t*)value));
}

static inline void convert_from_le64(void * value) {
    *((uint64_t*)value) = le64toh(*((uint64_t*)value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
This should be defined once in ggml-impl.h. Also check the newly updated contributing guidelines regarding the closing of preprocessor directives.
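(A minimal sketch of how that could look once moved into ggml-impl.h, assuming the helper names from the diff above and GGML_UNUSED from ggml.h; the closing preprocessor directives are annotated with their opening conditions, which is presumably what the contributing guideline refers to:)

// ggml-impl.h -- sketch only, assuming the helpers from the diff are relocated here
#pragma once

#include <stdint.h>

#include "ggml.h" // for GGML_UNUSED

#if defined(__gnu_linux__)
#    include <endian.h>
#else
#    define le64toh(x) (x)
#    define le32toh(x) (x)
#    define le16toh(x) (x)
#endif // defined(__gnu_linux__)

// endianness conversion: no-ops on little endian hosts, byte swaps on big endian hosts
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#    define convert_from_le16(x) GGML_UNUSED(x)
#    define convert_from_le32(x) GGML_UNUSED(x)
#    define convert_from_le64(x) GGML_UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
    *((uint16_t *) value) = le16toh(*((uint16_t *) value));
}
static inline void convert_from_le32(void * value) {
    *((uint32_t *) value) = le32toh(*((uint32_t *) value));
}
static inline void convert_from_le64(void * value) {
    *((uint64_t *) value) = le64toh(*((uint64_t *) value));
}
#else
#    error Unexpected or undefined __BYTE_ORDER__
#endif // __BYTE_ORDER__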
ggml/src/ggml.c
Outdated
static void ggml_byteswap_q4_0(void * restrict buffer, size_t elements) {
    block_q4_0 *data_ptr = (block_q4_0*) buffer;
    for (size_t i = 0; i < elements; ++i) {
        convert_from_le16(&(data_ptr[i].d));
    }
}
Why is only the endianness of the scale d but not the quantized values q being changed? More generally, quantized models require a lot of bit manipulation in order to work. Did you test this code with any quantized models or only with models using conventional, scalar datatypes such as FP16?
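(For reference, a rough sketch of the block_q4_0 layout, roughly as defined in ggml-common.h, with ggml_half spelled out here for self-containment — the scale d is the only multi-byte field, while the quants qs are a plain byte array:)

#include <stdint.h>

typedef uint16_t ggml_half; // FP16 value stored as a 16-bit integer

#define QK4_0 32
typedef struct {
    ggml_half d;             // delta (scale): 16-bit, so it is endianness-sensitive
    uint8_t   qs[QK4_0 / 2]; // nibble-packed quants: single bytes, byte-order independent
} block_q4_0;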
I didn't check this function, but ggml_byteswap_q4_k and ggml_byteswap_q6_k are used; I implemented this function similarly.
I can disable all functions which are not explicitly tested yet, or I could test them if you could point me to models that use them.
More generally speaking, I've thought about the problem of endianness some more and there are quite a lot of potential pitfalls because we do a lot of low-level bitwise manipulations. There's also e.g. the question of how to handle interaction with backends other than CPU which would likely still assume little endianness. Quite frankly I'm not sure whether this is a feature that would be worth the complexity to support given how rare big endian processors are. My personal opinion is that something like this would be fine to merge if it does not interfere with any little endian code and it is clear that big endian issues are wholly your responsibility to determine and resolve. @ggerganov @slaren your input would be appreciated.
I think it would be hard to justify the maintenance cost. I am not sure how this interacts with the existing support for big endian gguf files in the gguf python library and convert_hf_to_gguf.py. It might make more sense to instead improve on that, for example by adding a tool to convert a gguf file to a different endianness. Model repositories like ollama can deal with this by serving both versions on demand, if they wish.
I think you're correct about existing big endian files. I'll add a command-line switch to disable byteswapping on s390x so that such files can still be loaded.
Unfortunately, it's unlikely that everyone will upload two copies of each model, one little endian and one big endian. That's why we'd like a solution for loading little endian models on big endian systems.
Of course, byteswapping in advance would help and would also enable using mmap, for example. But an additional big-endian-specific step isn't always possible, as in the ollama case, and it can instead be done while loading the model, as in this PR. While I didn't measure performance, I expect loading the model to take considerably less time than actually running it, and I don't expect byteswapping during loading to change that noticeably.
As for the gguf python library, if there is a test suite beyond the one described here (https://github.com/ggerganov/llama.cpp/blob/master/ci/README.md), I could also check it.
If I may chime in on the consideration of support: this PR is a pretty big step forward as umbrella support for zSystems and problematic/unsupported models while we actively work on implementing SIMD support individually (taronaeo/llama.cpp-s390x@891922f). At current, even a big endian converted F32 model produces broken output:

$ build/bin/llama-cli -m /opt/hf_models/llama-3.2-1b-be.F32.gguf -n 50 -t 8 -p "Write me a dog walking business idea 1." --seed 1568795874
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1568795874
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1
Write me a dog walking business idea 1.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
llama_perf_sampler_print: sampling time = 3.97 ms / 62 runs ( 0.06 ms per token, 15625.00 tokens per second)
llama_perf_context_print: load time = 702.51 ms
llama_perf_context_print: prompt eval time = 459.48 ms / 12 tokens ( 38.29 ms per token, 26.12 tokens per second)
llama_perf_context_print: eval time = 9103.29 ms / 49 runs ( 185.78 ms per token, 5.38 tokens per second)
llama_perf_context_print: total time = 9602.03 ms / 61 tokens

If support to convert little-endian to big-endian is provided on-the-fly, albeit understandably with performance penalties, we can see models such as Llama-3.2 start running properly using little-endian while we work on a patch to make it work using big-endian.
I've implemented a '--no-byteswap' switch and tested it with a model converted with the help of #11349.
Introduce byteswap functions for ggml data. Implement some of them. Currently tested on llama3.2.
Downloaded models are usually little-endian, so byteswapping is needed. Byteswapping is not implemented for mmap model loading.
Bitfields are allocated in a different order on s390x.
Add ggml_ prefix to their names.
Force-pushed from 43544fb to a9db9b0
Rebased PR to resolve merge conflicts.
I've added an option to disable byteswapping on s390x. While byteswapping could also be implemented for little-endian systems, I believe most models are little-endian and there's no demand for such an option at the moment. I'd like to proceed with this PR. What's needed for that?
I don't think this can be merged in its current state. As it is, it is just going to be a maintenance burden and will increase the technical debt that we will need to pay when we eventually come back to work on this code and have to figure out a way to deal with this mess.

Fundamentally, the problem is that the gguf loader is not designed to support byteswapping. It just wasn't a consideration when it was created, and it would take a major effort to redesign the system to allow transparent byteswapping during loading.

Additionally, you need to figure out how to deal with big endian GGUF files. Adding a command line flag that just dumps the responsibility of figuring that out onto the user is not a good way to deal with this. I can see two options: remove support for big endian GGUF files entirely, or add the necessary logic to support both types.

Alternatively, you could instead extend the existing support for big endian GGUF files. This would be by far the easiest route, and the most likely to be merged.
I've checked gguf-py, and it determines the byte order based on the version field. I'll remove the command-line switch and rework the C++ code to use the same logic. As for the byteswapping itself, I'll try reworking it.
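(A minimal sketch of that detection logic, mirroring the gguf-py approach of treating an implausible version value as a sign of the opposite byte order; the function and variable names here are illustrative, not part of the existing loader:)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Swap the bytes of a 32-bit value.
static uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) | ((x << 8) & 0x00FF0000u) | (x << 24);
}

// Read the GGUF version field and decide whether the rest of the file needs
// byteswapping on this host. A real GGUF version is a small number (currently 3),
// so if the low 16 bits read as zero the file was almost certainly written with
// the opposite endianness.
static bool gguf_needs_byteswap(FILE * f, uint32_t * version_out) {
    uint32_t version = 0;
    fseek(f, 4, SEEK_SET);                  // skip the 4-byte "GGUF" magic
    fread(&version, sizeof(version), 1, f); // error handling omitted for brevity
    bool needs_byteswap = (version & 0x0000FFFFu) == 0;
    if (needs_byteswap) {
        version = bswap32(version);
    }
    *version_out = version;
    return needs_byteswap;
}

The resulting flag could then drive the per-type byteswap functions from this PR while the tensor data is being read, instead of relying on a command-line switch.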
Allow loading little endian models on big endian systems.
This would allow using any models downloaded via ollama unmodified.