Small optimizations to CPU utilization #130
Hi, thank you for sharing this optimization. I don't know a lot about hyperthreading, so would it be possible for you to write the modifications for a general CPU architecture, if that is even possible? Thanks!
Hi, thank you for your answer! I don't think I can write this for a general architecture right now (I don't know the Python APIs well enough to know where to find the number of virtual and physical cores). I hope that this can give someone with experience with the API the required pointers. What I basically did:
The 0.8 is just empirical.
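To address the general-architecture question above, here is a minimal sketch of how one might compute a thread budget without hard-coding a local core count. The `psutil` dependency and the helper names (`core_counts`, `thread_budget`) are assumptions for illustration, not part of the shared patch:

```python
import os

def core_counts():
    """Return (physical, logical) core counts, falling back gracefully."""
    logical = os.cpu_count() or 1
    try:
        import psutil  # optional third-party dependency
        physical = psutil.cpu_count(logical=False) or logical
    except ImportError:
        physical = logical  # conservative fallback: assume no hyperthreading
    return physical, logical

def thread_budget(factor=0.8):
    """Scale the logical count by `factor` only when SMT/HT is present."""
    physical, logical = core_counts()
    if logical > physical:  # hyperthreaded system
        return max(1, int(logical * factor))
    return logical          # no HT: use all cores
```

On a machine without hyperthreading this returns all cores unchanged; on an HT machine it applies the empirical 0.8 factor.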
Sidenote: I run this with a file of prompts (TODO_prompts.txt) and then call this:
Not bad! I put this at the beginning of txt2img_gradio.py and everything runs much faster.
It returns logical CPUs (threads); in my case I have no HT.
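A quick way to check whether a machine has hyperthreading is to compare the two counts the snippets in this thread rely on (the `torch` import is guarded in case it isn't installed):

```python
import os

logical = os.cpu_count()
print("logical CPUs (os.cpu_count):", logical)

try:
    import torch
    # torch.get_num_threads() defaults to the physical-core count;
    # if it equals os.cpu_count(), the machine has no hyperthreading.
    print("physical cores (torch default):", torch.get_num_threads())
except ImportError:
    pass  # torch not installed; only the logical count is available
```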
How did you check that it runs faster? Please benchmark the full creation! In my case the full CPU count just consumed more CPU but was slower; I had to reduce the number of CPUs because using the full number was actually slower (I guess it was competing too much with other processes on my system, and maybe with itself).
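One simple way to benchmark the full creation is to time whole runs and keep the best of several attempts. This is a generic sketch; `run_generation` is a hypothetical stand-in for one complete txt2img call, not a function from the repository:

```python
import time

def best_of(n, fn, *args):
    """Return the best wall-clock time of n runs, in seconds."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return min(times)

def run_generation():
    # placeholder workload; replace with the real sampling call
    sum(i * i for i in range(100_000))

print(f"best of 3: {best_of(3, run_generation):.3f}s")
```

Taking the minimum over several runs reduces noise from other processes competing for the CPU.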
To test CPU-only I run it. Best times of some attempts:
Well, I have an i5 2500K with 4 cores and no HT; one core less means a performance drop of up to 25% per task. That's why I say it runs much faster with all cores in my case, and in the Gradio UI there is a progress animation that wants to take a CPU core if I leave one idle. I think it could be an optional parameter, with the default left as it is: taking all cores.
Does os.cpu_count() return 4 in your case? Then this would be expected. I have 6 physical CPU cores but set 10, because hyperthreading allows the CPU limited optimizations in cases where one process would otherwise have idle time, since the parallelism of the code isn't an exact match to the possibilities in the chip. I reduce by 20% because hyperthreading can overcommit CPUs, and the processes can then block each other. I use only the physical cores - 1 for
One way to only adjust threading on hyperthreaded systems is:

```python
import os
import torch

# get_num_threads defaults to physical cores, while os.cpu_count reports
# logical cores. Only adjust thread count on hyperthreaded systems:
if opt.device == "cpu" and torch.get_num_threads() != os.cpu_count():
    torch.set_num_threads(int(os.cpu_count() * 0.8))
    torch.set_num_interop_threads(int(os.cpu_count() * 0.8))
```

Works on Windows, which I was worried about; it should work on Linux as well. I'm only seeing about a 10% increase over just physical cores on a Ryzen 5950X -- seems to be memory bound.
@bitRAKE Thank you! I would keep the interop-threads lower. These are likely not operating on the same memory regions, so they do not benefit as much from potentially shared caching on the same physical CPU. My setup would rather be:

```python
import os
import torch

# get_num_threads defaults to physical cores, while os.cpu_count reports
# logical cores. Only adjust thread count on hyperthreaded systems:
if opt.device == "cpu" and torch.get_num_threads() != os.cpu_count():
    physical_cores = torch.get_num_threads()
    torch.set_num_threads(int(os.cpu_count() * 0.8))
    # reduced interop-threads to leave one physical CPU for other tasks
    # like filesystem IO
    torch.set_num_interop_threads(int(physical_cores * 0.8))
```
Unfortunately, I'm seeing drastic memory thrashing under Windows: massive allocation swings of 10+ GB, and this consumes most of the time on larger images. PyTorch should (IMHO) maintain its own memory pool. When memory is released to Windows, Windows wants to clear it before any application can have it, which is an absurd security requirement here; since PyTorch is just going to use the memory again, it should hold on to it. I'll need to research further to see if settings exist to prevent this kind of memory thrashing. I'm new to all this Python stuff, but motivated. :)
I tweaked the CPU code to reduce runtime by about 20%. This is not ready for merge, because it is tuned to my local CPU cores and only helps with hyperthreading, but I wanted to share it anyway: