
Use vdotq_s32 to improve performance #67

Merged: 2 commits from vdotq_s32 into master on Mar 13, 2023

Conversation

@ggerganov (Owner) commented Mar 12, 2023

I observe a 10% performance improvement on M1 Pro with 8 threads.

However, it seems to cause an illegal instruction on M1 Air:

https://twitter.com/miolini/status/1635055060316200960

Need to figure out why. Would be nice to confirm.
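
For context, here is a minimal sketch (not the PR's actual kernel; the function name and the fallback path are illustrative) of how an int8 dot product can use vdotq_s32 where the ARMv8.2 dot-product extension is available, with a plain NEON widening path otherwise:

#include <arm_neon.h>
#include <stdint.h>

// Illustrative int8 dot product; n is assumed to be a multiple of 16.
static int32_t dot_i8(const int8_t * x, const int8_t * y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        const int8x16_t a = vld1q_s8(x + i);
        const int8x16_t b = vld1q_s8(y + i);
#if defined(__ARM_FEATURE_DOTPROD)
        // One instruction accumulates four 4-element int8 dot products.
        acc = vdotq_s32(acc, a, b);
#else
        // Fallback: widen to int16 products, then pairwise-accumulate into int32.
        const int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));
        const int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
#endif
    }
    return vaddvq_s32(acc);
}

The vdotq path needs a build target that advertises the dotprod feature (e.g. something like -march=armv8.2-a+dotprod on generic AArch64, or a compiler default that includes it), so executing it on a target or core without the extension would be one way to end up with an illegal-instruction crash.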

@turbo commented Mar 13, 2023

Data point: no fault on M1 Ultra (16 threads, 65B), same bump:

  • this branch: 248.88 ms per token
  • default: 269.74 ms per token

@thomasantony

Can confirm that there are no errors on M1 Max either.

@miolini commented Mar 13, 2023

I cannot run ./main on a MacBook Air M1 anymore, but it works on a Raspberry Pi 4.

(llama.cpp) @mio: llama.cpp $ ./main -m ./models/7B/ggml-model-q4_0.bin \
    -t 8 \
    -n 128 \
    -p 'Best 3 designs for compact portable nuclear fusion reactor is '
main: seed = 1678684772
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
zsh: illegal hardware instruction ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p

@ggerganov (Owner, Author)

@miolini Just in case, can you do a make clean and also regenerate the 7B ggml model from scratch?

@miolini commented Mar 13, 2023

@ggerganov I did that many times with no luck. But I think I found the problem: after a git pull, the build now produces an x86_64 binary.

(llama.cpp) @mio: llama.cpp $ file ./main
./main: Mach-O 64-bit executable x86_64

@miolini commented Mar 13, 2023

Build log:

(llama.cpp) @mio: llama.cpp $ make
Makefile:24: Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: ggerganov/whisper.cpp#66 (comment)
sysctl: unknown oid 'machdep.cpu.leaf7_features'
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: i386
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mf16c -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mf16c -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]

options:
-h, --help show this help message and exit
-i, --interactive run in interactive mode
--interactive-start run in interactive mode and poll user input at startup
-r PROMPT, --reverse-prompt PROMPT
in interactive mode, poll user input upon seeing PROMPT
--color colorise output to distinguish prompt and user input from generations
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: random)
-f FNAME, --file FNAME
prompt file to start generation.
-n N, --n_predict N number of tokens to predict (default: 128)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--repeat_last_n N last n tokens to consider for penalize (default: 64)
--repeat_penalty N penalize repeat sequence of tokens (default: 1.3)
--temp N temperature (default: 0.8)
-b N, --batch_size N batch size for prompt processing (default: 8)
-m FNAME, --model FNAME
model path (default: models/llama-7B/ggml-model.bin)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize -framework Accelerate

@miolini commented Mar 13, 2023

Found the source of the problem: pipenv creates a shell with the wrong environment. I will try to fix it on my side.
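
That diagnosis matches the build log above: UNAME_M came back as x86_64, so an x86_64 Mach-O was produced and then run under Rosetta 2 on Apple Silicon, where x86 vector extensions such as the F16C code enabled by -mf16c may not be translated. A tiny illustrative check (not part of the repo) of which target a binary was compiled for:

#include <stdio.h>

int main(void) {
    // Compile-time target check, mirroring what `file ./main` reports.
#if defined(__aarch64__) || defined(__arm64__)
    puts("built for arm64: native on Apple Silicon, NEON/vdotq paths compiled in");
#elif defined(__x86_64__)
    puts("built for x86_64: runs under Rosetta 2 on Apple Silicon; "
         "AVX/F16C code paths may not be translated");
#else
    puts("built for another target");
#endif
    return 0;
}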

@ggerganov merged commit 84d9015 into master on Mar 13, 2023
@ggerganov deleted the vdotq_s32 branch on Mar 13, 2023 at 16:36
@Mestrace mentioned this pull request on Mar 14, 2023
@sbassi commented Mar 22, 2023

@miolini Could you please share how you fixed it? I am having the same issue on a MacBook with conda, but it works on a Mac mini.
