
Allow s390x to load little endian models unmodified #11234

Open · wants to merge 14 commits into master

Conversation

AlekseiNikiforovIBM

Allow loading little endian models on big endian systems.
This would allow using any models downloaded via ollama unmodified.

@github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 14, 2025
@JohannesGaessler
Collaborator

I don't have any big endian systems for testing or maintenance. Do you pledge to help with maintaining big endian support long-term?

@AlekseiNikiforovIBM
Author

Yes, we intend to maintain big endian support long term.

We could also help integrate a z/VM guest into the CI as a test system, if you'd like.

ggml/src/ggml.c Outdated
Comment on lines 39 to 66
#if defined(__gnu_linux__)
#include <endian.h>
#else
// non-glibc platforms: assume a little endian host, where these conversions are no-ops
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif

// endianness conversion
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
*((uint16_t*)value) = le16toh(*((uint16_t*)value));
}

static inline void convert_from_le32(void * value) {
*((uint32_t*)value) = le32toh(*((uint32_t*)value));
}

static inline void convert_from_le64(void * value) {
*((uint64_t*)value) = le64toh(*((uint64_t*)value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
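
For context, a minimal sketch of how these helpers might be used when parsing a GGUF header on a big endian host; the struct and function names are hypothetical, not part of the actual loader, and <stdint.h> types are assumed to be available as elsewhere in ggml.c.

// Sketch only: hypothetical header struct, not the actual gguf loader code.
struct sketch_gguf_header {
    uint32_t magic;      // 'GGUF'
    uint32_t version;    // stored little endian on disk
    uint64_t n_tensors;  // stored little endian on disk
    uint64_t n_kv;       // stored little endian on disk
};

// After reading the raw header bytes, convert each multi-byte field in place.
// On little endian hosts the convert_from_le* calls compile to no-ops.
static void sketch_convert_header(struct sketch_gguf_header * hdr) {
    convert_from_le32(&hdr->version);
    convert_from_le64(&hdr->n_tensors);
    convert_from_le64(&hdr->n_kv);
}
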
ggml/src/ggml.c Outdated
Comment on lines 6548 to 6590
static void ggml_byteswap_q4_0(void * restrict buffer, size_t elements) {
    block_q4_0 *data_ptr = (block_q4_0*) buffer;
    for (size_t i = 0; i < elements; ++i) {
        // byteswap the 16-bit fp16 scale of each block
        convert_from_le16(&(data_ptr[i].d));
    }
}
Collaborator

Why is only the endianness for the scale d but not the quantized values q being changed? More generally, quantized models require a lot of bit manipulation in order to work. Did you test this code with any quantized models or only with models using conventional, scalar datatypes such as FP16?


I didn't test this function specifically, but ggml_byteswap_q4_k and ggml_byteswap_q6_k are used, and I implemented this function the same way.

I can disable all functions which are not explicitly tested yet, or I could test them if you point me to models that use them.
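
For illustration, here is a rough sketch of what a k-quant byteswap such as ggml_byteswap_q4_k could look like, assuming block_q4_K stores two fp16 super-block scales (d and dmin) followed by uint8_t scales and quants that need no swapping; this is a hedged sketch, not the PR's actual implementation.

// Sketch only: assumes a layout like
//   typedef struct { ggml_half d; ggml_half dmin; uint8_t scales[...]; uint8_t qs[...]; } block_q4_K;
// Only the 16-bit scale fields need byteswapping; the uint8_t arrays are endian-neutral.
static void sketch_byteswap_q4_k(void * restrict buffer, size_t elements) {
    block_q4_K * data_ptr = (block_q4_K *) buffer;
    for (size_t i = 0; i < elements; ++i) {
        convert_from_le16(&(data_ptr[i].d));
        convert_from_le16(&(data_ptr[i].dmin));
    }
}
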

Collaborator

More generally speaking, I've thought about the problem of endianness some more and there are quite a lot of potential pitfalls because we do a lot of low-level bitwise manipulations. There's also e.g. the question of how to handle interaction with backends other than CPU which would likely still assume little endianness. Quite frankly I'm not sure whether this is a feature that would be worth the complexity to support given how rare big endian processors are. My personal opinion is that something like this would be fine to merge if it does not interfere with any little endian code and it is clear that big endian issues are wholly your responsibility to determine and resolve. @ggerganov @slaren your input would be appreciated.

Collaborator

I think it would be hard to justify the maintenance cost. I am not sure how this interacts with the existing support for big endian gguf files in the gguf python library and convert_hf_to_gguf.py. It might make more sense to instead improve on that, for example by adding a tool to convert a gguf file to a different endianness. Model repositories like ollama can deal with this by serving both versions on demand, if they wish.


I think you're correct about existing big endian files. I'll add a cmdline switch to disable byteswapping on s390x to allow loading such existing files.

Unfortunately, it is unlikely that everyone will upload two copies of each model, little endian and big endian. That's why we'd like a way to load little endian models on big endian systems.

Of course, byteswapping in advance would help, and would enable, for example, using mmap. But an additional big endian specific conversion step isn't always possible, as in the ollama case, while it can simply be done while loading the model, as in this PR. And while I haven't measured performance, I expect loading a model to take considerably less time than actually running it, and I don't expect byteswapping during loading to change that noticeably.

As for the gguf Python library, if there is a test suite beyond the one described here (https://github.com/ggerganov/llama.cpp/blob/master/ci/README.md), I could check that as well.

@taronaeo

If I may chime in on the consideration of support: this PR is a pretty big step forward as umbrella support for zSystems and problematic/unsupported models, while we actively work on implementing SIMD support individually (taronaeo/llama.cpp-s390x@891922f).

Currently, even with --bigendian conversion using the convert_hf_to_gguf.py tool, Llama-3.2-1B fails to load correctly. This behaviour is similar to what is seen in #3298 (comment), as demonstrated here:

$ build/bin/llama-cli -m /opt/hf_models/llama-3.2-1b-be.F32.gguf -n 50 -t 8 -p "Write me a dog walking business idea  1." --seed 1568795874

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 1568795874
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1

Write me a dog walking business idea  1.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

llama_perf_sampler_print:    sampling time =       3.97 ms /    62 runs   (    0.06 ms per token, 15625.00 tokens per second)
llama_perf_context_print:        load time =     702.51 ms
llama_perf_context_print: prompt eval time =     459.48 ms /    12 tokens (   38.29 ms per token,    26.12 tokens per second)
llama_perf_context_print:        eval time =    9103.29 ms /    49 runs   (  185.78 ms per token,     5.38 tokens per second)
llama_perf_context_print:       total time =    9602.03 ms /    61 tokens

If on-the-fly conversion from little endian to big endian is supported, albeit understandably with performance penalties, models such as Llama-3.2 can start running properly from little endian files while we work on a patch to make them run as big endian.

@AlekseiNikiforovIBM
Author

I've implemented the '--no-byteswap' switch and tested it with a model converted with the help of #11349.

@AlekseiNikiforovIBM
Author

Rebased the PR to resolve merge conflicts.

@AlekseiNikiforovIBM
Author

I've added an option to disable byteswapping on s390x. While byteswapping could also be implemented for little-endian systems, I believe most models are little-endian and there is no demand for such an option at the moment.

I'd like to proceed with this PR. What's needed for that?

@slaren
Collaborator

slaren commented Jan 30, 2025

I don't think this can be merged in its current state. As it is, it is just going to be a maintenance burden and will increase the technical debt that we will need to pay when we eventually come back to work on this code and have to figure out a way to deal with this mess.

Fundamentally, the problem is that the gguf loader is not designed to support byteswapping. It just wasn't a consideration when it was created, and it would take a major effort to redesign the system to allow transparent byteswapping during loading.

Additionally, you need to figure out how to deal with big endian GGUF files. Adding a command line flag that just dumps the responsibility of figuring that out onto the user is not a good way to deal with this. I can see two options: remove support for big endian GGUF files entirely, or add the necessary logic to support both types.

Alternatively, you could instead extend the existing support for big endian GGUF files. This would be by far the easiest route, and the most likely to be merged.

@AlekseiNikiforovIBM
Author

I've checked gguf-py and it determines the byte order based on the version field. I'll remove the cmdline switch and rework the C++ code to use the same logic.

As for byteswapping itself, I'll try reworking it.
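
For reference, a minimal sketch of what that version-based detection could look like on the C side (assuming <stdint.h>/<stdbool.h>); the function name and the sanity bound are illustrative, not actual loader code. The idea is simply that a version value which only looks sane after byteswapping indicates a file written with the other endianness.

// Sketch only: decide whether a GGUF file needs byteswapping from its version field.
static bool sketch_gguf_needs_byteswap(uint32_t version_as_read, bool * ok) {
    uint32_t swapped = ((version_as_read & 0x000000FFu) << 24) |
                       ((version_as_read & 0x0000FF00u) <<  8) |
                       ((version_as_read & 0x00FF0000u) >>  8) |
                       ((version_as_read & 0xFF000000u) >> 24);
    const uint32_t max_known_version = 1024; // arbitrary sanity bound for this sketch
    if (version_as_read != 0 && version_as_read < max_known_version) {
        *ok = true;
        return false; // file matches host endianness, load as-is
    }
    if (swapped != 0 && swapped < max_known_version) {
        *ok = true;
        return true;  // opposite endianness: byteswap fields and tensor data while loading
    }
    *ok = false;      // not a recognizable version either way
    return false;
}
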
