Wrong estimation of how many GPU layers fit in VRAM on RX 7900 XT? #88
-
I'll look into it, but my guess is that going from 16k context to 64k context would bring it too close to the 20 GB limit of your GPU. The layer-estimation feature has a built-in safety net to leave room for other programs running on your PC (like the graphics themselves), so it sees 64k as too much.
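Not the project's actual code, just a minimal sketch of the kind of check described above; the function name, the margin size, and the per-layer cost are made-up illustration values:

```python
# Hypothetical sketch of a safety-margin layer estimate. The names and the
# default margin are assumptions for illustration, not this project's code.

def layers_that_fit(free_vram_mb: float, layer_cost_mb: float,
                    ctx_cost_mb: float, safety_margin_mb: float = 1500) -> int:
    """How many layers fit after reserving room for the context buffers
    and for other programs using the GPU (desktop, browser, etc.)."""
    usable = free_vram_mb - safety_margin_mb - ctx_cost_mb
    return max(0, int(usable // layer_cost_mb))

# On a 20 GB card, a 64k context buffer plus the margin can shrink the usable
# budget below what a full offload needs, while 16k still leaves headroom.
```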
-
Update: This time the detection seems to work correctly. I did notice, however, that I had to disable the iGPU built into the CPU for it to work, otherwise I would get this error:
even if I selected the right GPU in the GUI.
-
Thank you, and I wish you some nice Christmas holidays too - hopefully you are able to take some days off... Good to know that the fix for the disable-iGPU issue comes from AMD. Thanks to the 'sick' system it runs quite nicely for now. It should support two GPUs in the future...
-
I use the current version 1.79.1.yr0 for Windows.
When using it with my Radeon RX 7900 XT, which has 20 GB of VRAM, the estimation of how many layers fit in VRAM seems wrong.
For my partner, who uses the same settings and the same file with a 7900 XTX (24 GB of VRAM), the estimation seems more adequate.
For instance, if I choose L3-8B-Stheno-v3.2-Q6_K-imat.gguf with 16k context (at 24k context it tells me not all layers fit), just under 13 GB of the 20 GB of VRAM is used, so quite a bit is left free.
For him, a 64k context size is allowed and all layers are still selected by default. After loading it like this, almost exactly 20 GB of his VRAM is used up.
Maybe the program thinks this card has just 16 GB of VRAM?
How does the estimation work?
How much is needed for the model and how much per K of context?
I realize a 24 GB card can fit more than a 20 GB card, but the estimation seems off.
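For a rough idea of the arithmetic involved (a generic sketch, not necessarily how this program estimates it): the KV cache grows linearly with context, so assuming a Llama-3-8B-style architecture (32 layers, 8 KV heads, head dimension 128, fp16 cache) the cache alone grows from about 2 GiB at 16k to about 8 GiB at 64k, on top of the model weights:

```python
# Back-of-the-envelope KV-cache arithmetic. The architecture numbers below are
# assumptions for a Llama-3-8B-style model, not measurements of this GGUF file.

def kv_cache_bytes(ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # One K and one V vector (n_kv_heads * head_dim each) per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx

for ctx in (16_384, 65_536):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB KV cache")
#  16384 tokens -> 2.0 GiB KV cache
#  65536 tokens -> 8.0 GiB KV cache
```

That jump, plus compute buffers and whatever safety margin the estimator reserves, might explain why 64k is accepted on a 24 GB card but rejected on a 20 GB one.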
I could load the model Mistral-Small-22B-ArliAI-RPMax-v1.1-Q4_K_M.gguf with 16k context fully into VRAM and start a conversation where the initial 5782 tokens were processed properly and quickly.
This used up 18940 MB of VRAM (as displayed in GPU-Z).
Maybe there are issues when more context is used, but the estimation still seems off.
Can you check?
Maybe an improvement is necessary (the authors may not have tested this tool with this card)?