
Phi-2 completely broken on Vulkan #5243

Closed
stduhpf opened this issue Jan 31, 2024 · 6 comments · Fixed by #5260
Assignees: 0cc4m
Labels: bug Something isn't working

stduhpf (Contributor) commented Jan 31, 2024

I get garbage output when offloading any layers to the GPU while running Phi-2 models with the Vulkan backend. The issue mostly seems to involve the first and last layers.

.\buildVulkan\bin\Release\main.exe -m .\models\phi\phi-2.Q4_K_M.gguf -t 12 -tb 6 -p "Here is a reciepe for tomato soup:\n" -e -s 0 --temp 0 -n 128 -ngl X

(main: build = 2035 (7977a2a0))

-ngl 0 (control)

Here is a reciepe for tomato soup:

Ingredients:
- 4 cups of chicken broth
- 2 tablespoons of butter
- 1 onion, chopped
- 2 cloves of garlic, minced
- 2 tomatoes, peeled and diced
- Salt and pepper to taste
- Parsley for garnish

Directions:
- In a large pot, melt the butter over medium heat. Add the onion and garlic and cook until soft, about 10 minutes.
- Stir in the chicken broth and bring to a boil. Reduce the heat and simmer for 15 minutes, stirring occasionally.
- Add the tomatoes and season with salt and pepper. Cook for another 10 minutes,
llama_print_timings:        load time =     329.52 ms
llama_print_timings:      sample time =      29.20 ms /   128 runs   (    0.23 ms per token,  4382.96 tokens per second)
llama_print_timings: prompt eval time =     310.37 ms /    11 tokens (   28.22 ms per token,    35.44 tokens per second)
llama_print_timings:        eval time =    8578.80 ms /   127 runs   (   67.55 ms per token,    14.80 tokens per second)
llama_print_timings:       total time =    8949.84 ms /   138 tokens
Log end

-ngl 1

Here is a reciepe for tomato soup:

Ingredients:- "
 [end of text]

llama_print_timings:        load time =     641.73 ms
llama_print_timings:      sample time =       1.47 ms /     7 runs   (    0.21 ms per token,  4768.39 tokens per second)
llama_print_timings: prompt eval time =     312.33 ms /    11 tokens (   28.39 ms per token,    35.22 tokens per second)
llama_print_timings:        eval time =     666.72 ms /     6 runs   (  111.12 ms per token,     9.00 tokens per second)
llama_print_timings:       total time =     983.36 ms /    17 tokens
Log end

Starts out OK, but glitches after a few generated tokens. (In this case it generated an EOS token, so generation ended early, but with a different prompt or higher temperature the output is just noisy gibberish.)

Using `-p "Here is a reciepe for tomato soup:\n\n"`:
Here is a reciepe for tomato soup:

 - "Tomato SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe----------------------------- "-- SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe-------------------------

-ngl 2

Here is a reciepe for tomato soup:

- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe-----------------------------------------------------
llama_print_timings:        load time =     562.83 ms
llama_print_timings:      sample time =      27.43 ms /   128 runs   (    0.21 ms per token,  4665.91 tokens per second)
llama_print_timings: prompt eval time =     304.00 ms /    11 tokens (   27.64 ms per token,    36.18 tokens per second)
llama_print_timings:        eval time =    8149.43 ms /   127 runs   (   64.17 ms per token,    15.58 tokens per second)
llama_print_timings:       total time =    8507.07 ms /   138 tokens
Log end

(-ngl 2 through 32 all produce the same output; only the inference speed changes.)

-ngl 32

Here is a reciepe for tomato soup:
- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe------------------------------------------------------
llama_print_timings:        load time =    1180.39 ms
llama_print_timings:      sample time =      32.76 ms /   128 runs   (    0.26 ms per token,  3906.97 tokens per second)
llama_print_timings: prompt eval time =     184.50 ms /    11 tokens (   16.77 ms per token,    59.62 tokens per second)
llama_print_timings:        eval time =    2464.90 ms /   127 runs   (   19.41 ms per token,    51.52 tokens per second)
llama_print_timings:       total time =    2707.77 ms /   138 tokens
Log end

-ngl 33 (all layers)

Here is a reciepe for tomato soup:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
llama_print_timings:        load time =    1076.60 ms
llama_print_timings:      sample time =      34.70 ms /   128 runs   (    0.27 ms per token,  3688.97 tokens per second)
llama_print_timings: prompt eval time =     168.66 ms /    11 tokens (   15.33 ms per token,    65.22 tokens per second)
llama_print_timings:        eval time =    1424.31 ms /   127 runs   (   11.21 ms per token,    89.17 tokens per second)
llama_print_timings:       total time =    1652.01 ms /   138 tokens
Log end

(It always repeats a single token, mostly '!', 'G', or 'o'.)

0cc4m self-assigned this Jan 31, 2024
0cc4m (Collaborator) commented Jan 31, 2024

You seem to have a knack for finding issues with the Vulkan code, and I just fixed Phi... I guess there's another matmul issue with the Windows AMD driver? I'll try to find it.

0cc4m (Collaborator) commented Jan 31, 2024

ERROR: Invalid value in UNARY i3=0 i2=0 i1=0 i0=7914 result=-nan correct=20.625 nmse=6.81852e-08
tensor=0x60b8b73eb6b0 tensor->name=ffn_gelu-29 tensor->backend: 10 tensor->type: f32 ne0=10240 nb0=4 ne1=2 nb1=40960 ne2=1 nb2=81920 ne3=1 nb3=81920 offset=0
src0=0x60b8b73eb520 src0->name=ffn_up_b-29 op=ADD type=f32 backend=10 ne0=10240 nb0=4 ne1=2 nb1=40960 ne2=1 nb2=81920 ne3=1 nb3=81920 offset=0
First error: result=-1 correct=-1 i3=-1 i2=-1 i1=-1 i0=-1

Result:
               0       1       2       3       4       5       6       7       8       9
   7909:    0,99    0,99
   7910:   -0,15   -0,15
   7911:   -0,17   -0,17
   7912:   -0,15   -0,15
   7913:   -0,17   -0,17
   7914:    -nan    -nan
   7915:   -0,03   -0,03
   7916:   -0,14   -0,14
   7917:   -0,14   -0,14
   7918:   -0,14   -0,14

Correct:
               0       1       2       3       4       5       6       7       8       9
   7909:    0,99    0,99
   7910:   -0,15   -0,15
   7911:   -0,17   -0,17
   7912:   -0,15   -0,15
   7913:   -0,17   -0,17
   7914:   20,62   20,62
   7915:   -0,03   -0,03
   7916:   -0,14   -0,14
   7917:   -0,14   -0,14
   7918:   -0,14   -0,14

Seems to be the GELU shader this time, which AMD's proprietary driver doesn't like. Let me know if anyone spots the likely cause of that NaN.
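
For anyone following along, here is a rough C sketch (purely illustrative, not the actual Vulkan shader or driver code) of how a tanh built naively from exponentials can blow up on exactly this kind of input: the "correct" value above is about 20.6, so the tanh argument inside GELU is in the hundreds and a single-precision exp() overflows. The constants below mirror the standard ggml GELU-approximation constants.

#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

/* Naive tanh written as (e^{2v} - 1) / (e^{2v} + 1). For the failing
 * element x is roughly 20.6, so v is on the order of a few hundred,
 * expf(2*v) overflows to +inf, and inf/inf yields a NaN like the one
 * in the dump above. */
static float tanh_naive(float v) {
    float e = expf(2.0f * v);
    return (e - 1.0f) / (e + 1.0f);
}

static float gelu_naive(float x) {
    float v = SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x);
    return 0.5f * x * (1.0f + tanh_naive(v));
}

If the driver's built-in tanh is implemented roughly along these lines, that would explain the NaN, but that is only a guess.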

0cc4m added the "bug" label and removed the "bug-unconfirmed" label on Feb 1, 2024
ggerganov (Owner) commented:

For the Metal kernel, we had to explicitly use a more precise tanh call: precise::tanh

llama.cpp/ggml-metal.metal, lines 266 to 277 at commit 1cfb537:

kernel void kernel_gelu(
    device const float4 * src0,
    device float4 * dst,
    uint tpig[[thread_position_in_grid]]) {
    device const float4 & x = src0[tpig];
    // BEWARE !!!
    // Simply using "tanh" instead of "precise::tanh" will sometimes results in NaNs!
    // This was observed with Falcon 7B and 40B models
    //
    dst[tpig] = 0.5f*x*(1.0f + precise::tanh(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}

Without this change, it was producing NaNs.

0cc4m (Collaborator) commented Feb 1, 2024

For the Metal kernel, we had to explicitly use a more precise tanh call: precise::tanh

Thank you, that's helpful. But I don't think I have any other implementation of tanh available; the GPU driver provides the implementation, and it works fine on most drivers. I'll try to think of a workaround for the proprietary AMD driver.

stduhpf (Contributor, Author) commented Feb 1, 2024

I'm not sure how bad this would be for performance (because of the branching and all that) and accuracy, but what about using sign(x) when abs(x) gets big enough for the difference to be insignificant, and tanh(x) only for smaller absolute values of x?
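
A minimal C sketch of that idea, assuming an untuned, illustrative cutoff of 10, beyond which tanh is already within float rounding of +/-1:

#include <math.h>

/* Clamp-style tanh: for |v| above the cutoff, tanhf(v) is equal to
 * +/-1 at float precision anyway, so returning the sign avoids
 * evaluating the exponentials that can overflow. The cutoff of 10.0f
 * is only an illustrative guess, not a tuned value. */
static float tanh_clamped(float v) {
    if (fabsf(v) > 10.0f) {
        return v > 0.0f ? 1.0f : -1.0f;
    }
    return tanhf(v);
}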

0cc4m (Collaborator) commented Feb 1, 2024

I'm not sure how bad this would be for performance (because of the branching and all that) and accuracy, but what about using sign(x) when abs(x) gets big enough for the difference to be insignificant, and tanh(x) only for smaller absolute values of x?

Thanks for the suggestion. I found an even better one that seems to work: tanh(x) = 1 - 2 / (exp(2 * x) + 1). I think that might also be what Metal is doing when you use precise::tanh. I'll open a PR in a few minutes; it would be nice if you could test it, @stduhpf.
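
Here is a quick C sketch of that identity applied to the GELU formula (illustrative only, not the actual shader change from the PR). When exp(2*v) overflows to +inf, the expression degrades gracefully to 1 - 0 = 1 instead of producing inf/inf = NaN:

#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

/* tanh(v) = 1 - 2 / (exp(2*v) + 1): if expf(2*v) overflows to +inf,
 * 2 / inf becomes 0 and the result saturates cleanly at 1. For very
 * negative v, expf(2*v) underflows to 0 and the result is -1. */
static float gelu_stable(float x) {
    float v = SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x);
    float t = 1.0f - 2.0f / (expf(2.0f * v) + 1.0f);
    return 0.5f * x * (1.0f + t);
}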
