bug: [0.5.12] Cannot disable GPU offloading #4369
Comments
@pguyennet Thanks for reporting this! Let me explain what's happening: This makes sense technically, but I totally agree the UX could be clearer!
Thanks for your answer @imtuyethan! Can you tell me how to truly disable GPU acceleration, then? Because in my screenshot it is disabled, but the option is still there? If you mean that this option doesn't do anything when GPU acceleration is disabled, can you tell me why it affects my inference speed? Here are the figures I am talking about: ngl = 1 -> 17.8 tok/s. Note: with Ollama's AVX-512 CPU runner (ngl = 0!) I get 32 tok/s. Thanks!
@pguyennet Could you please share the log files and the settings.json file located in the app data folder? We'll investigate then.
Hey, sure, here are the files as requested: Thanks again! I love your work; the sole thing that prevents me from switching is the lower inference speed compared to Ollama.
Hi @pguyennet, there's another log file named cortex.log; could you share that one as well?
Hi @louis-jan, here you go: cortex.log
@pguyennet Can you help me find the model yml file in the app data folder (models/source/author/repo..) and remove the ngl line? What quantized version of the model are you using, and what inference parameters (context_length, cpu_threads, etc.) are set on both sides? It seems you don't have AVX-512 support, only AVX2 (though yes, it's backward compatible), cmiiw.
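For anyone following along, a rough sketch of what the relevant part of a model.yml might look like; only the ngl field is confirmed by this thread, the other field names and values are illustrative and your actual file may differ:

```yaml
# Hypothetical model.yml excerpt -- only ngl is confirmed in this thread.
name: granite-moe-3b   # illustrative
engine: llama-cpp      # illustrative
ctx_len: 4096          # illustrative context length
ngl: 33                # layers offloaded to GPU; deleting this line lets Jan fall back to its default
```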
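As a quick way to verify the AVX point above, on Linux you can list the CPU's AVX-family flags (a generic check, not specific to Jan):

```sh
# Print the unique AVX-family flags; avx512* entries appear only on AVX-512 capable CPUs.
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
```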
Thanks!
Hey @louis-jan, hope you had a nice weekend! So I removed the ngl line in the model.yml file and then created a new thread. The ngl option is still there, but at 0! Output rate is still around 17-18 tokens/s, but the big change is that when I modify the ngl value it doesn't affect the output rate anymore. I tried deleting and reimporting the model, but I can't reproduce the affected output rate problem. Also, Jan auto-updated to 0.5.13, so I reinstalled 0.5.12 to try again, and I still can't reproduce the problem. I don't know what changed, but it seems like there was a misconfiguration somewhere. I should have started with that; sorry for taking your time. Do you still want the model info and inference parameters? Anyway, thanks for your time!
Hi @pguyennet, I'd like to close this, but could you share some details about the model quantization version you use when running on Jan and the model you use with Ollama? I'd like to reproduce it myself here.
Hey, sure @louis-jan, the model is granite-moe-3b at q8_0, from here: huggingface. I have a Ryzen 7 6850U CPU (8C/16T) and 16 GB of RAM. Here are my Ollama settings (ollama show info):
And here are my Jan settings (idk how to export settings, so here is a screenshot): Hope this helps!
Awesome, thanks @pguyennet
Jan version
0.5.12
Describe the Bug
Can you help me disable GPU offloading? I am talking about this setting:
I want to set it to 0.
In settings, GPU is disabled:
Thanks!
Steps to Reproduce
1. Load a model from GGUF
2. Locate the ngl slider (bottom right, in the model tab)
3. Try to disable it / set it to 0 (see the sketch below)
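For context, this slider corresponds to the number of GPU-offloaded layers in the llama.cpp-based engine Jan runs on. A minimal sketch of the equivalent with a llama.cpp CLI build, assuming a local GGUF file at ./model.gguf (path and prompt are placeholders):

```sh
# -ngl 0 (--n-gpu-layers 0) keeps every layer on the CPU, i.e. GPU offloading fully disabled.
./llama-cli -m ./model.gguf -ngl 0 -p "Hello"
```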
Screenshots / Logs
No response
What is your OS?