Small optimizations to CPU utilization #130
Hi, thank you for sharing this optimization. I don't know a lot about hyperthreading, so would it be possible for you to write the modifications for a general CPU architecture, if that is even possible? Thanks!
Hi, thank you for your answer! I don't think I can write this for a general architecture right now (I don't know the Python APIs well enough to know where to find the number of virtual and physical cores). I hope that this can give someone with experience with the API the required pointers. What I basically did:
The 0.8 is just empirical.
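To address the general-architecture question above, here is a minimal sketch of how one might compute a thread budget without hard-coding a local core count. The `psutil` dependency and the helper names (`core_counts`, `thread_budget`) are assumptions for illustration, not part of the shared patch:

```python
import os

def core_counts():
    """Return (physical, logical) core counts, falling back gracefully."""
    logical = os.cpu_count() or 1
    try:
        import psutil  # optional third-party dependency
        physical = psutil.cpu_count(logical=False) or logical
    except ImportError:
        physical = logical  # conservative fallback: assume no hyperthreading
    return physical, logical

def thread_budget(factor=0.8):
    """Scale the logical count by `factor` only when SMT/HT is present."""
    physical, logical = core_counts()
    if logical > physical:  # hyperthreaded system
        return max(1, int(logical * factor))
    return logical          # no HT: use all cores
```

On a machine without hyperthreading this returns all cores unchanged; on an HT machine it applies the empirical 0.8 factor.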
Sidenote: I run this with a file of prompts (TODO_prompts.txt) and then call this:
Not bad! I put this at the beginning of txt2img_gradio.py and everything runs much faster.
It returns logical CPUs (threads); in my case I have no HT.
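A quick way to check whether a machine has hyperthreading is to compare the two counts the snippets in this thread rely on (the `torch` import is guarded in case it isn't installed):

```python
import os

logical = os.cpu_count()
print("logical CPUs (os.cpu_count):", logical)

try:
    import torch
    # torch.get_num_threads() defaults to the physical-core count;
    # if it equals os.cpu_count(), the machine has no hyperthreading.
    print("physical cores (torch default):", torch.get_num_threads())
except ImportError:
    pass  # torch not installed; only the logical count is available
```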
How did you check that it runs faster? Please benchmark the full creation! In my case the full CPU count just consumed more CPU but was slower; I had to reduce the number of CPUs because using the full number was actually slower (I guess it was competing too much with other processes on my system, and maybe with itself).
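One simple way to benchmark the full creation is to time whole runs and keep the best of several attempts. This is a generic sketch; `run_generation` is a hypothetical stand-in for one complete txt2img call, not a function from the repository:

```python
import time

def best_of(n, fn, *args):
    """Return the best wall-clock time of n runs, in seconds."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return min(times)

def run_generation():
    # placeholder workload; replace with the real sampling call
    sum(i * i for i in range(100_000))

print(f"best of 3: {best_of(3, run_generation):.3f}s")
```

Taking the minimum over several runs reduces noise from other processes competing for the CPU.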
To test CPU-only I run it. Best times of some attempts:
Well, I have an i5 2500K with 4 cores and no HT; one core less means a performance drop of up to 25% per task. That's why I say it runs much faster with all cores in my case, and in the Gradio UI there is a progress animation that wants to take a CPU core if I leave one idle. I think it could be an optional parameter, with the default left as it is: taking all cores.
Does os.cpu_count() return 4 in your case? Then this would be expected. I have 6 physical CPU cores but set 10, because hyperthreading allows the CPU limited optimizations in cases where one process would otherwise have idle time, since the parallelism of the code isn't an exact match to the possibilities in the chip. I reduce by 20% because hyperthreading can overcommit CPUs, and the processes can then block each other. I use only the physical cores - 1 for
One way to only adjust threading on hyperthreaded systems is:

```python
import os
import torch

# get_num_threads defaults to physical cores, while os.cpu_count reports
# logical cores. Only adjust thread count on hyperthreaded systems:
if opt.device == "cpu" and torch.get_num_threads() != os.cpu_count():
    torch.set_num_threads(int(os.cpu_count() * 0.8))
    torch.set_num_interop_threads(int(os.cpu_count() * 0.8))
```

Works on Windows, which I was worried about; it should work on Linux as well. I'm only seeing about a 10% increase over just physical cores on a Ryzen 5950X -- seems to be memory bound.
@bitRAKE Thank you! I would keep the interop-threads lower. These are likely not operating on the same memory regions, so they do not benefit as much from potentially shared caching on the same physical CPU. My setup would rather be:

```python
import os
import torch

# get_num_threads defaults to physical cores, while os.cpu_count reports
# logical cores. Only adjust thread count on hyperthreaded systems:
if opt.device == "cpu" and torch.get_num_threads() != os.cpu_count():
    physical_cores = torch.get_num_threads()
    torch.set_num_threads(int(os.cpu_count() * 0.8))
    # reduced interop-threads to leave one physical CPU for other tasks
    # like filesystem IO
    torch.set_num_interop_threads(int(physical_cores * 0.8))
```
Unfortunately, I'm seeing drastic memory thrashing under Windows: massive allocation swings of 10+ GB, and this consumes most of the time on larger images. PyTorch should (IMHO) maintain its own memory pool. When memory is released to Windows, Windows wants to clear it before any application can have it, which is an absurd security requirement here; since PyTorch is just going to use the memory again, it should hold on to it. I'll need to research further to see if settings exist to prevent this kind of memory thrashing. I'm new to all this Python stuff, but motivated. :)
I tweaked the CPU code to reduce runtime by about 20%. This is not ready for merge, because it is tuned to my local CPU cores and only helps with hyperthreading, but I wanted to share it anyway: