
Phi-2 completely broken on Vulkan #5243

Closed
stduhpf opened this issue Jan 31, 2024 · 6 comments · Fixed by #5260
Assignees: 0cc4m
Labels: bug Something isn't working

stduhpf (Contributor) commented Jan 31, 2024

I get garbage output when offloading any layers to the GPU while running Phi-2 models with the Vulkan backend. The issue mostly seems to involve the first and last layers.

.\buildVulkan\bin\Release\main.exe -m .\models\phi\phi-2.Q4_K_M.gguf -t 12 -tb 6 -p "Here is a reciepe for tomato soup:\n" -e -s 0 --temp 0 -n 128 -ngl X

(main: build = 2035 (7977a2a0))

-ngl 0 (control)

Here is a reciepe for tomato soup:

Ingredients:
- 4 cups of chicken broth
- 2 tablespoons of butter
- 1 onion, chopped
- 2 cloves of garlic, minced
- 2 tomatoes, peeled and diced
- Salt and pepper to taste
- Parsley for garnish

Directions:
- In a large pot, melt the butter over medium heat. Add the onion and garlic and cook until soft, about 10 minutes.
- Stir in the chicken broth and bring to a boil. Reduce the heat and simmer for 15 minutes, stirring occasionally.
- Add the tomatoes and season with salt and pepper. Cook for another 10 minutes,
llama_print_timings:        load time =     329.52 ms
llama_print_timings:      sample time =      29.20 ms /   128 runs   (    0.23 ms per token,  4382.96 tokens per second)
llama_print_timings: prompt eval time =     310.37 ms /    11 tokens (   28.22 ms per token,    35.44 tokens per second)
llama_print_timings:        eval time =    8578.80 ms /   127 runs   (   67.55 ms per token,    14.80 tokens per second)
llama_print_timings:       total time =    8949.84 ms /   138 tokens
Log end

-ngl 1

Here is a reciepe for tomato soup:

Ingredients:- "
 [end of text]

llama_print_timings:        load time =     641.73 ms
llama_print_timings:      sample time =       1.47 ms /     7 runs   (    0.21 ms per token,  4768.39 tokens per second)
llama_print_timings: prompt eval time =     312.33 ms /    11 tokens (   28.39 ms per token,    35.22 tokens per second)
llama_print_timings:        eval time =     666.72 ms /     6 runs   (  111.12 ms per token,     9.00 tokens per second)
llama_print_timings:       total time =     983.36 ms /    17 tokens
Log end

Starts out OK, but glitches after a few generated tokens. (In this case it generated an EOS token, so generation ended early, but with a different prompt or higher temperature the output is just noisy gibberish.)

Using `-p "Here is a reciepe for tomato soup:\n\n"`:
Here is a reciepe for tomato soup:

 - "Tomato SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe----------------------------- "-- SOUP
 Mince 1 onion and 2 cloves of garlic in a large pot over medium heat. Dump in 4 B cans of crushed tomatoes Pinch TThe-------------------------

-ngl 2

Here is a reciepe for tomato soup:

- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe-----------------------------------------------------
llama_print_timings:        load time =     562.83 ms
llama_print_timings:      sample time =      27.43 ms /   128 runs   (    0.21 ms per token,  4665.91 tokens per second)
llama_print_timings: prompt eval time =     304.00 ms /    11 tokens (   27.64 ms per token,    36.18 tokens per second)
llama_print_timings:        eval time =    8149.43 ms /   127 runs   (   64.17 ms per token,    15.58 tokens per second)
llama_print_timings:       total time =    8507.07 ms /   138 tokens
Log end

(-ngl 2 through 32 all produce the same output; only the inference speed changes.)

-ngl 32

Here is a reciepe for tomato soup:
- " S M D B P TThe-------------------------------------------------------
- " S M D B P TThe------------------------------------------------------
llama_print_timings:        load time =    1180.39 ms
llama_print_timings:      sample time =      32.76 ms /   128 runs   (    0.26 ms per token,  3906.97 tokens per second)
llama_print_timings: prompt eval time =     184.50 ms /    11 tokens (   16.77 ms per token,    59.62 tokens per second)
llama_print_timings:        eval time =    2464.90 ms /   127 runs   (   19.41 ms per token,    51.52 tokens per second)
llama_print_timings:       total time =    2707.77 ms /   138 tokens
Log end

-ngl 33 (all layers)

Here is a reciepe for tomato soup:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
llama_print_timings:        load time =    1076.60 ms
llama_print_timings:      sample time =      34.70 ms /   128 runs   (    0.27 ms per token,  3688.97 tokens per second)
llama_print_timings: prompt eval time =     168.66 ms /    11 tokens (   15.33 ms per token,    65.22 tokens per second)
llama_print_timings:        eval time =    1424.31 ms /   127 runs   (   11.21 ms per token,    89.17 tokens per second)
llama_print_timings:       total time =    1652.01 ms /   138 tokens
Log end

(It always repeats a single token, mostly '!', 'G', or 'o'.)

0cc4m self-assigned this Jan 31, 2024
0cc4m (Collaborator) commented Jan 31, 2024

You seem to have a knack for finding issues with the Vulkan code, and I just fixed Phi... I guess there's another matmul issue with the Windows AMD driver? I'll try to find it.

0cc4m (Collaborator) commented Jan 31, 2024

ERROR: Invalid value in UNARY i3=0 i2=0 i1=0 i0=7914 result=-nan correct=20.625 nmse=6.81852e-08
tensor=0x60b8b73eb6b0 tensor->name=ffn_gelu-29 tensor->backend: 10 tensor->type: f32 ne0=10240 nb0=4 ne1=2 nb1=40960 ne2=1 nb2=81920 ne3=1 nb3=81920 offset=0
src0=0x60b8b73eb520 src0->name=ffn_up_b-29 op=ADD type=f32 backend=10 ne0=10240 nb0=4 ne1=2 nb1=40960 ne2=1 nb2=81920 ne3=1 nb3=81920 offset=0
First error: result=-1 correct=-1 i3=-1 i2=-1 i1=-1 i0=-1

Result:
               0       1       2       3       4       5       6       7       8       9
   7909:    0,99    0,99
   7910:   -0,15   -0,15
   7911:   -0,17   -0,17
   7912:   -0,15   -0,15
   7913:   -0,17   -0,17
   7914:    -nan    -nan
   7915:   -0,03   -0,03
   7916:   -0,14   -0,14
   7917:   -0,14   -0,14
   7918:   -0,14   -0,14

Correct:
               0       1       2       3       4       5       6       7       8       9
   7909:    0,99    0,99
   7910:   -0,15   -0,15
   7911:   -0,17   -0,17
   7912:   -0,15   -0,15
   7913:   -0,17   -0,17
   7914:   20,62   20,62
   7915:   -0,03   -0,03
   7916:   -0,14   -0,14
   7917:   -0,14   -0,14
   7918:   -0,14   -0,14

Seems to be the GELU shader this time, which AMD's proprietary driver doesn't like. Let me know if anyone spots the likely cause of that NaN.
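
For anyone following along, here is a rough C sketch (purely illustrative, not the actual Vulkan shader or driver code) of how a tanh built naively from exponentials can blow up on exactly this kind of input: the "correct" value above is about 20.6, so the tanh argument inside GELU is in the hundreds and a single-precision exp() overflows. The constants below mirror the standard ggml GELU-approximation constants.

#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

/* Naive tanh written as (e^{2v} - 1) / (e^{2v} + 1). For the failing
 * element x is roughly 20.6, so v is on the order of a few hundred,
 * expf(2*v) overflows to +inf, and inf/inf yields a NaN like the one
 * in the dump above. */
static float tanh_naive(float v) {
    float e = expf(2.0f * v);
    return (e - 1.0f) / (e + 1.0f);
}

static float gelu_naive(float x) {
    float v = SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x);
    return 0.5f * x * (1.0f + tanh_naive(v));
}

If the driver's built-in tanh is implemented roughly along these lines, that would explain the NaN, but that is only a guess.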

0cc4m added the "bug" label and removed the "bug-unconfirmed" label on Feb 1, 2024
ggerganov (Owner) commented:

For the Metal kernel, we had to explicitly use a more precise tanh call: precise::tanh

llama.cpp/ggml-metal.metal, lines 266 to 277 at commit 1cfb537:

kernel void kernel_gelu(
    device const float4 * src0,
    device float4 * dst,
    uint tpig[[thread_position_in_grid]]) {
    device const float4 & x = src0[tpig];
    // BEWARE !!!
    // Simply using "tanh" instead of "precise::tanh" will sometimes results in NaNs!
    // This was observed with Falcon 7B and 40B models
    //
    dst[tpig] = 0.5f*x*(1.0f + precise::tanh(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}

Without this change, it was producing NaNs.

0cc4m (Collaborator) commented Feb 1, 2024

For the Metal kernel, we had to explicitly use a more precise tanh call: precise::tanh

Thank you, that's helpful. But I don't think I have any other implementation of tanh available; the GPU driver provides the implementation, and it works fine on most drivers. I'll try to think of a workaround for the proprietary AMD driver.

stduhpf (Contributor, Author) commented Feb 1, 2024

I'm not sure how bad this would be for performance (because of the branching and all that) and accuracy, but what about using sign(x) when abs(x) gets big enough for the difference to be insignificant, and tanh(x) only for smaller absolute values of x?
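
A minimal C sketch of that idea, assuming an untuned, illustrative cutoff of 10, beyond which tanh is already within float rounding of +/-1:

#include <math.h>

/* Clamp-style tanh: for |v| above the cutoff, tanhf(v) is equal to
 * +/-1 at float precision anyway, so returning the sign avoids
 * evaluating the exponentials that can overflow. The cutoff of 10.0f
 * is only an illustrative guess, not a tuned value. */
static float tanh_clamped(float v) {
    if (fabsf(v) > 10.0f) {
        return v > 0.0f ? 1.0f : -1.0f;
    }
    return tanhf(v);
}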

0cc4m (Collaborator) commented Feb 1, 2024

I'm not sure how bad this would be for performance (because of the branching and all that) and accuracy, but what about using sign(x) when abs(x) gets big enough for the difference to be insignificant, and tanh(x) only for smaller absolute values of x?

Thanks for the suggestion. I found an even better one that seems to work: tanh(x) = 1 - 2 / (exp(2 * x) + 1). I think that might also be what Metal is doing when you use precise::tanh. I'll open a PR in a few minutes; it would be nice if you could test it, @stduhpf.
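
Here is a quick C sketch of that identity applied to the GELU formula (illustrative only, not the actual shader change from the PR). When exp(2*v) overflows to +inf, the expression degrades gracefully to 1 - 0 = 1 instead of producing inf/inf = NaN:

#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

/* tanh(v) = 1 - 2 / (exp(2*v) + 1): if expf(2*v) overflows to +inf,
 * 2 / inf becomes 0 and the result saturates cleanly at 1. For very
 * negative v, expf(2*v) underflows to 0 and the result is -1. */
static float gelu_stable(float x) {
    float v = SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x);
    float t = 1.0f - 2.0f / (expf(2.0f * v) + 1.0f);
    return 0.5f * x * (1.0f + t);
}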
