
Allow s390x to load little endian models unmodified #11234

Open · wants to merge 14 commits into master

Conversation

AlekseiNikiforovIBM

Allow loading little endian models on big endian systems.
This would allow using any models downloaded via ollama unmodified.

@github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 14, 2025
@JohannesGaessler
Collaborator

I don't have any big endian systems for testing or maintenance. Do you pledge to help with maintaining big endian support long-term?

@AlekseiNikiforovIBM
Author

Yes, we intend to maintain big endian support long term.

We could also help integrate a z/VM guest into the CI as a test system, if you'd like.

ggml/src/ggml.c Outdated
Comment on lines 39 to 66
#if defined(__gnu_linux__)
#include <endian.h>
#else
// non-glibc platforms: assume a little endian host, where these conversions are no-ops
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif

// endianness conversion
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
*((uint16_t*)value) = le16toh(*((uint16_t*)value));
}

static inline void convert_from_le32(void * value) {
*((uint32_t*)value) = le32toh(*((uint32_t*)value));
}

static inline void convert_from_le64(void * value) {
*((uint64_t*)value) = le64toh(*((uint64_t*)value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
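
For context, a minimal sketch of how these helpers might be used when parsing a GGUF header on a big endian host; the struct and function names are hypothetical, not part of the actual loader, and <stdint.h> types are assumed to be available as elsewhere in ggml.c.

// Sketch only: hypothetical header struct, not the actual gguf loader code.
struct sketch_gguf_header {
    uint32_t magic;      // 'GGUF'
    uint32_t version;    // stored little endian on disk
    uint64_t n_tensors;  // stored little endian on disk
    uint64_t n_kv;       // stored little endian on disk
};

// After reading the raw header bytes, convert each multi-byte field in place.
// On little endian hosts the convert_from_le* calls compile to no-ops.
static void sketch_convert_header(struct sketch_gguf_header * hdr) {
    convert_from_le32(&hdr->version);
    convert_from_le64(&hdr->n_tensors);
    convert_from_le64(&hdr->n_kv);
}
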
ggml/src/ggml.c Outdated
Comment on lines 6548 to 6590
static void ggml_byteswap_q4_0(void * restrict buffer, size_t elements) {
    block_q4_0 *data_ptr = (block_q4_0*) buffer;
    for (size_t i = 0; i < elements; ++i) {
        // byteswap the 16-bit fp16 scale of each block
        convert_from_le16(&(data_ptr[i].d));
    }
}
Collaborator

Why is only the endianness for the scale d but not the quantized values q being changed? More generally, quantized models require a lot of bit manipulation in order to work. Did you test this code with any quantized models or only with models using conventional, scalar datatypes such as FP16?


I didn't test this function specifically, but ggml_byteswap_q4_k and ggml_byteswap_q6_k are used, and I implemented this function the same way.

I can disable all functions which are not explicitly tested yet, or I could test them if you point me to models that use them.
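
For illustration, here is a rough sketch of what a k-quant byteswap such as ggml_byteswap_q4_k could look like, assuming block_q4_K stores two fp16 super-block scales (d and dmin) followed by uint8_t scales and quants that need no swapping; this is a hedged sketch, not the PR's actual implementation.

// Sketch only: assumes a layout like
//   typedef struct { ggml_half d; ggml_half dmin; uint8_t scales[...]; uint8_t qs[...]; } block_q4_K;
// Only the 16-bit scale fields need byteswapping; the uint8_t arrays are endian-neutral.
static void sketch_byteswap_q4_k(void * restrict buffer, size_t elements) {
    block_q4_K * data_ptr = (block_q4_K *) buffer;
    for (size_t i = 0; i < elements; ++i) {
        convert_from_le16(&(data_ptr[i].d));
        convert_from_le16(&(data_ptr[i].dmin));
    }
}
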

Collaborator

More generally speaking, I've thought about the problem of endianness some more and there are quite a lot of potential pitfalls because we do a lot of low-level bitwise manipulations. There's also e.g. the question of how to handle interaction with backends other than CPU which would likely still assume little endianness. Quite frankly I'm not sure whether this is a feature that would be worth the complexity to support given how rare big endian processors are. My personal opinion is that something like this would be fine to merge if it does not interfere with any little endian code and it is clear that big endian issues are wholly your responsibility to determine and resolve. @ggerganov @slaren your input would be appreciated.

Collaborator

I think it would be hard to justify the maintenance cost. I am not sure how this interacts with the existing support for big endian gguf files in the gguf python library and convert_hf_to_gguf.py. It might make more sense to instead improve on that, for example by adding a tool to convert a gguf file to a different endianness. Model repositories like ollama can deal with this by serving both versions on demand, if they wish.


I think you're correct about existing big endian files. I'll add a cmdline switch to disable byteswapping on s390x to allow loading such existing files.

Unfortunately, it is unlikely that everyone will upload two copies of each model, little endian and big endian. That's why we'd like a way to load little endian models on big endian systems.

Of course, byteswapping in advance would help, and would enable, for example, using mmap. But an additional big endian specific conversion step isn't always possible, as in the ollama case, while it can simply be done while loading the model, as in this PR. And while I haven't measured performance, I expect loading a model to take considerably less time than actually running it, and I don't expect byteswapping during loading to change that noticeably.

As for the gguf Python library, if there is a test suite beyond the one described here (https://github.com/ggerganov/llama.cpp/blob/master/ci/README.md), I could check that as well.

@taronaeo

If I may chime in on the consideration of support: this PR is a pretty big step forward as umbrella support for zSystems and problematic/unsupported models, while we actively work on implementing SIMD support individually (taronaeo/llama.cpp-s390x@891922f).

Currently, even with --bigendian conversion using the convert_hf_to_gguf.py tool, Llama-3.2-1B fails to load correctly. This behaviour is similar to what is seen in #3298 (comment), as demonstrated here:

$ build/bin/llama-cli -m /opt/hf_models/llama-3.2-1b-be.F32.gguf -n 50 -t 8 -p "Write me a dog walking business idea  1." --seed 1568795874

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 1568795874
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1

Write me a dog walking business idea  1.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

llama_perf_sampler_print:    sampling time =       3.97 ms /    62 runs   (    0.06 ms per token, 15625.00 tokens per second)
llama_perf_context_print:        load time =     702.51 ms
llama_perf_context_print: prompt eval time =     459.48 ms /    12 tokens (   38.29 ms per token,    26.12 tokens per second)
llama_perf_context_print:        eval time =    9103.29 ms /    49 runs   (  185.78 ms per token,     5.38 tokens per second)
llama_perf_context_print:       total time =    9602.03 ms /    61 tokens

If on-the-fly conversion from little endian to big endian is supported, albeit understandably with performance penalties, models such as Llama-3.2 can start running properly from little endian files while we work on a patch to make them run as big endian.

@AlekseiNikiforovIBM
Author

I've implemented the '--no-byteswap' switch and tested it with a model converted with the help of #11349.

@AlekseiNikiforovIBM
Author

Rebased the PR to resolve merge conflicts.

@AlekseiNikiforovIBM
Author

I've added an option to disable byteswapping on s390x. While byteswapping could also be implemented for little-endian systems, I believe most models are little-endian and there is no demand for such an option at the moment.

I'd like to proceed with this PR. What's needed for that?

@slaren
Collaborator

slaren commented Jan 30, 2025

I don't think this can be merged in its current state. As it is, it is just going to be a maintenance burden and will increase the technical debt that we will need to pay when we eventually come back to work on this code and have to figure out a way to deal with this mess.

Fundamentally, the problem is that the gguf loader is not designed to support byteswapping. It just wasn't a consideration when it was created, and it would take a major effort to redesign the system to allow transparent byteswapping during loading.

Additionally, you need to figure out how to deal with big endian GGUF files. Adding a command line flag that just dumps the responsibility of figuring that out onto the user is not a good way to deal with this. I can see two options: remove support for big endian GGUF files entirely, or add the necessary logic to support both types.

Alternatively, you could instead extend the existing support for big endian GGUF files. This would be by far the easiest route, and the most likely to be merged.

@AlekseiNikiforovIBM
Author

I've checked gguf-py and it determines the byte order based on the version field. I'll remove the cmdline switch and rework the C++ code to use the same logic.

As for byteswapping itself, I'll try reworking it.
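
For reference, a minimal sketch of what that version-based detection could look like on the C side (assuming <stdint.h>/<stdbool.h>); the function name and the sanity bound are illustrative, not actual loader code. The idea is simply that a version value which only looks sane after byteswapping indicates a file written with the other endianness.

// Sketch only: decide whether a GGUF file needs byteswapping from its version field.
static bool sketch_gguf_needs_byteswap(uint32_t version_as_read, bool * ok) {
    uint32_t swapped = ((version_as_read & 0x000000FFu) << 24) |
                       ((version_as_read & 0x0000FF00u) <<  8) |
                       ((version_as_read & 0x00FF0000u) >>  8) |
                       ((version_as_read & 0xFF000000u) >> 24);
    const uint32_t max_known_version = 1024; // arbitrary sanity bound for this sketch
    if (version_as_read != 0 && version_as_read < max_known_version) {
        *ok = true;
        return false; // file matches host endianness, load as-is
    }
    if (swapped != 0 && swapped < max_known_version) {
        *ok = true;
        return true;  // opposite endianness: byteswap fields and tensor data while loading
    }
    *ok = false;      // not a recognizable version either way
    return false;
}
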
