
Use vdotq_s32 to improve performance #67

Merged: 2 commits from vdotq_s32 into master on Mar 13, 2023

Conversation

@ggerganov (Owner) commented Mar 12, 2023

I observe a 10% performance improvement on M1 Pro with 8 threads.

However, it seems to cause an illegal instruction on M1 Air:

https://twitter.com/miolini/status/1635055060316200960

Need to figure out why. Would be nice to confirm.
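
For context, here is a minimal sketch (not the PR's actual kernel; the function name and the fallback path are illustrative) of how an int8 dot product can use vdotq_s32 where the ARMv8.2 dot-product extension is available, with a plain NEON widening path otherwise:

#include <arm_neon.h>
#include <stdint.h>

// Illustrative int8 dot product; n is assumed to be a multiple of 16.
static int32_t dot_i8(const int8_t * x, const int8_t * y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        const int8x16_t a = vld1q_s8(x + i);
        const int8x16_t b = vld1q_s8(y + i);
#if defined(__ARM_FEATURE_DOTPROD)
        // One instruction accumulates four 4-element int8 dot products.
        acc = vdotq_s32(acc, a, b);
#else
        // Fallback: widen to int16 products, then pairwise-accumulate into int32.
        const int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));
        const int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
#endif
    }
    return vaddvq_s32(acc);
}

The vdotq path needs a build target that advertises the dotprod feature (e.g. something like -march=armv8.2-a+dotprod on generic AArch64, or a compiler default that includes it), so executing it on a target or core without the extension would be one way to end up with an illegal-instruction crash.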

@turbo commented Mar 13, 2023

Data point: no fault on M1 Ultra (16 threads, 65B), same bump:

  • this branch: 248.88 ms per token
  • default: 269.74 ms per token

@thomasantony

Can confirm that there are no errors on M1 Max either.

@miolini commented Mar 13, 2023

I cannot run ./main on a MacBook Air M1 anymore, but it works on a Raspberry Pi 4.

(llama.cpp) @mio: llama.cpp $ ./main -m ./models/7B/ggml-model-q4_0.bin \
    -t 8 \
    -n 128 \
    -p 'Best 3 designs for compact portable nuclear fusion reactor is '
main: seed = 1678684772
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
zsh: illegal hardware instruction ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p

@ggerganov (Owner, Author)

@miolini Just in case, can you do a make clean and also regenerate the 7B ggml model from scratch?

@miolini commented Mar 13, 2023

@ggerganov I did that many times with no luck. But I think I found the problem: after a git pull, the build now produces an x86_64 binary.

(llama.cpp) @mio: llama.cpp $ file ./main
./main: Mach-O 64-bit executable x86_64

@miolini commented Mar 13, 2023

Build log:

(llama.cpp) @mio: llama.cpp $ make
Makefile:24: Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: ggerganov/whisper.cpp#66 (comment)
sysctl: unknown oid 'machdep.cpu.leaf7_features'
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: i386
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mf16c -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mf16c -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]

options:
-h, --help show this help message and exit
-i, --interactive run in interactive mode
--interactive-start run in interactive mode and poll user input at startup
-r PROMPT, --reverse-prompt PROMPT
in interactive mode, poll user input upon seeing PROMPT
--color colorise output to distinguish prompt and user input from generations
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: random)
-f FNAME, --file FNAME
prompt file to start generation.
-n N, --n_predict N number of tokens to predict (default: 128)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--repeat_last_n N last n tokens to consider for penalize (default: 64)
--repeat_penalty N penalize repeat sequence of tokens (default: 1.3)
--temp N temperature (default: 0.8)
-b N, --batch_size N batch size for prompt processing (default: 8)
-m FNAME, --model FNAME
model path (default: models/llama-7B/ggml-model.bin)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize -framework Accelerate

@miolini commented Mar 13, 2023

Found the source of the problem: pipenv creates a shell with the wrong environment. I will try to fix it on my side.
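
That diagnosis matches the build log above: UNAME_M came back as x86_64, so an x86_64 Mach-O was produced and then run under Rosetta 2 on Apple Silicon, where x86 vector extensions such as the F16C code enabled by -mf16c may not be translated. A tiny illustrative check (not part of the repo) of which target a binary was compiled for:

#include <stdio.h>

int main(void) {
    // Compile-time target check, mirroring what `file ./main` reports.
#if defined(__aarch64__) || defined(__arm64__)
    puts("built for arm64: native on Apple Silicon, NEON/vdotq paths compiled in");
#elif defined(__x86_64__)
    puts("built for x86_64: runs under Rosetta 2 on Apple Silicon; "
         "AVX/F16C code paths may not be translated");
#else
    puts("built for another target");
#endif
    return 0;
}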

@ggerganov merged commit 84d9015 into master on Mar 13, 2023
@ggerganov deleted the vdotq_s32 branch on Mar 13, 2023 at 16:36
@Mestrace mentioned this pull request on Mar 14, 2023
@sbassi commented Mar 22, 2023

@miolini Could you please share how you fixed it? I am having the same issue on a MacBook with conda, but it works on a Mac mini.
