
Keep VRAM usage and faster slicing consistent in attention.py #582

Merged: 1 commit merged into invoke-ai:development on Sep 17, 2022

Conversation

@mh-dm (Contributor) commented Sep 14, 2022

EDIT:
Refactor attention.CrossAttention to remove 4 copies of einsum_op_compvis code.

Remove the zeros tensor that's useless when we have enough VRAM.
For CUDA, keep VRAM usage and faster slicing consistent.
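For readers skimming the diff, here is a minimal sketch of the shared helper this refactor factors out of the four copies, written as a CrossAttention method. It's an approximation based on this thread, not the exact merged code; per the commit notes further down, the * self.scale multiply is applied to q before this op is called, so it doesn't appear here.

    from torch import einsum

    # Sketch of the single einsum_op_compvis helper replacing the four copies.
    def einsum_op_compvis(self, q, k, v):
        s = einsum('b i d, b j d -> b i j', q, k)    # attention scores
        s = s.softmax(dim=-1)                        # softmax is the main bottleneck (see below)
        return einsum('b i j, b j d -> b i d', s, v)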

@mh-dm (Contributor, Author) commented Sep 14, 2022

I think this should also fix the report from @Kolaer in #486, where it was necessary to remove the raise RuntimeError line.

@mh-dm (Contributor, Author) commented Sep 14, 2022

FYI @Any-Winter-4079 since you edited this recently.

@i3oc9i commented Sep 15, 2022

I have tested this PR on macOS:

512x512 -s50 / OK, but there is no improvement in speed
896x576 -s50 / OK, but there is no improvement in speed
1024x1024 -s50 / FAIL with error: product of dimension sizes > 2**31

@Any-Winter-4079 (Contributor) commented Sep 15, 2022

Yeah, this is going to be tricky :)

For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal), and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report various results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.

I do agree there is probably a unified way to do this, though. I'm trying to get Textual Inversion working on M1, but I'll try to take a look. Thanks!

@Vargol (Contributor) commented Sep 15, 2022

It's going to be a while before I can get around to testing. I notice a lot of the VRAM reduction fixes have been removed, which concerns me, but I guess the move to functions means they should fall out of scope and get collected, so it might not be a disaster.

@netsvetaev (Contributor):
> For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal), and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report various results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.

My problems with RAM may be related to non-ARM brew and installation process differences, so I need to reinstall it. Right now a fresh install does cause some errors, though.

@0t0m0 commented Sep 15, 2022

@Any-Winter-4079
Thumbs down from me :(
The 512x512 image was created without problems in the normal time.
However, the 1024x1024 image took half the expected time and was colorful noise afterwards.

dream> banana sushi -W512 -H512 -Ak_lms -S50
100%|███████████████████████████████████████████| 50/50 [00:27<00:00,  1.83it/s]
Generating: 100%|█████████████████████████████████| 1/1 [00:31<00:00, 31.24s/it]
>> Usage stats:
>>   1 image(s) generated in 31.37s
Outputs:
[2] outputs/img-samples/000115.50.png: "banana sushi" -s50 -W512 -H512 -C7.5 -Ak_lms -S50

dream> banana sushi -W1024 -H1024 -Ak_lms -S50
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
100%|███████████████████████████████████████████| 50/50 [07:34<00:00,  9.09s/it]
Generating: 100%|████████████████████████████████| 1/1 [07:52<00:00, 472.54s/it]
>> Usage stats:
>>   1 image(s) generated in 472.67s
Outputs:
[3] outputs/img-samples/000116.50.png: "banana sushi" -s50 -W1024 -H1024 -C7.5 -Ak_lms -S50

Edited:
The fast processing is probably not the problem.
Even on the development branch, the working time is now only about 5.5 minutes for a 1024x1024 image.
Three days ago it was still about 22 minutes 😮

@jroxendal (Contributor):
Non-scientific findings (i.e. I shut down some of the worst memory hogs but didn't do a full restart to clear up memory).
32GB M1 Max:
512x512: same as development, so about 35s.
896x576: hovering around 20s/it.
1024x1024: suspiciously fast at around 5s/it, and generated colourful noise.

@Vargol (Contributor) commented Sep 15, 2022

Sorry, can't test; I seem to be in dependency hell. I think something has been upgraded to use a newer version of protobuf that's incompatible with a bunch of other dependencies.

@netsvetaev (Contributor):

> For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal), and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report various results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.
>
> My problems with RAM may be related to non-ARM brew and installation process differences, so I need to reinstall it. Right now a fresh install does cause some errors, though.

After a fresh install I've got 1.38s/it at 512, 6.20s/it at 768, and 20.5s/it at 1024 (main branch). So it was my problem with something non-ARM.

@netsvetaev (Contributor):

> It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.

Development:
512 at 1.04it/s (less than a second per it). 50s total.
768 at 6.20s/it. 1024 at 22-24s/it.

This attention.py version:
512 at 1.04s/it (slower)
768 at 10s/it
1024 crashed dream.py with this:

/AppleInternal/Library/BuildRoots/5381bdfb-27e8-11ed-bdc1-96898e02b808/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:705: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: product of dimension sizes > 2**31'
Abort trap: 6
/Users/artur/miniconda3/envs/ldm10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@lstein (Collaborator) commented Sep 15, 2022

Tag me when this is ready for review. You might want to convert this PR into a draft if you're still actively working on it.

@mh-dm marked this pull request as a draft on September 15, 2022, 11:23
@Vargol (Contributor) commented Sep 15, 2022

Sorted my dependency hell.

512x512: around normal speed.
1024x1024: was running a little slow (67s/it vs. 55s/it), then threw an error on the 35th sample:
Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

@mh-dm (Contributor, Author) commented Sep 15, 2022

Thank you for testing this!
I see there's an issue with 1024x1024 / large sizes (product of dimension sizes > 2**31) on mps (note that 1024x1024 runs fine on just cpu), and I'll address that. It looks like it was handled through slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1])) before my change.
@netsvetaev I'm surprised about any slowdown. This change is mostly a refactoring, especially for 512x512, where it should trigger the same algorithm with no slicing. I read you have 16GB, so for the 768x768 size it should use larger slices. I'm guessing larger slices are slower, which is okay, as I have to limit them to fix the above anyway.
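A minimal sketch of that kind of guard, written as another CrossAttention helper. The function name and the einsum_op_slice_1 helper (the dim-1 analogue of einsum_op_slice_0) are illustrative assumptions, not necessarily what was merged; only the slice_size formula is taken from this thread.

    import math

    def einsum_op_mps_capped(self, q, k, v):
        # Small images: run the plain CompVis op. 4096 tokens corresponds to a
        # 64x64 latent, i.e. roughly a 512x512 image.
        if q.shape[1] <= 4096:
            return self.einsum_op_compvis(q, k, v)
        # Larger images: cap the slice so the intermediate attention tensor
        # stays under MPS's 2**31-element limit.
        slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1]))
        return self.einsum_op_slice_1(q, k, v, slice_size)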

@Vargol (Contributor) commented Sep 15, 2022

Oops, I was testing the wrong thing; that's what you get for dealing with dependency problems during breaks from a training course, lol.

It's actually a disaster... 512x512 is currently doing 18s/it, down from 5.5-6.0s/it. I'll let it run a while, as the first few samples can be a bit slow, but it normally settles down by where I've gotten to.

@krummrey (Contributor):

@netsvetaev I'm in the office right now, will try it later this evening.

@Vargol (Contributor) commented Sep 15, 2022

Darn, just realised that means I'm getting those memory issues from main; I wasn't getting any from the 1.14.3 test repo.

@i3oc9i commented Sep 15, 2022

@mh-dm
After the force-push of the attention branch from 297e70d to b93338a:

banana split -s50 -W1024 -H1024 works, but it is almost two times slower than the 1.14.1 release (2m55s)

50/50 [05:43<00:00,  6.87s/it]
1 image(s) generated in 345.72s

banana split -s50 -W576 -H896 works and is just a bit slower (50s in the 1.14.1 release)

50/50 [00:54<00:00,  1.09s/it]
1 image(s) generated in 55.06s

banana split -s50 -W512 -H512 works at the same speed as the 1.14.1 release

50/50 [00:22<00:00,  2.25it/s]
1 image(s) generated in 22.83s

@Vargol (Contributor) commented Sep 15, 2022

Oh, I can confirm it now: for me, 512x512 is 400% slower.

And the reason for that is it's going into the CompVis calc when I basically need slice 0 for any size image. I'm amazed I actually get anything at all.

Hopefully @mh-dm is beginning to appreciate that the code was complex for a very good reason :-)

@lstein (Collaborator) commented Sep 15, 2022

Folks, also keep in mind that we have to maintain this code over the long term. Even a 10% performance increase may not be worth it if it causes a 50% increase in maintenance time to track down bugs or add new features. At some point this exercise becomes subject to diminishing returns.

@mh-dm (Contributor, Author) commented Sep 15, 2022

@i3oc9i @Vargol Thank you, very good data. It's widely different from my experience, but I'm starting to understand what's going on. For my cpu I see only minor variation between slicing or not, at different sizes, while testing 512->1024. My i7 has a measly 8MB L3 cache, fully blown even at just 512x512 with slicing. Whereas "M1 Pro and M1 Max have 24 MB and 48 MB respectively of system level cache (SLC). The M1 Ultra combines two M1 Max chips in one package for a total of 20 CPU cores and 96 MB system level cache (SLC)".
Does anyone know an easy way to get the L3 cache size in python/torch? It might also help to run lscpu | grep cache and post the output if it's significantly different from what's already been posted by others.

L2 cache:                        1 MiB
L3 cache:                        8 MiB
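Not part of the PR, but as a hedged sketch of one way to answer the question above from Python: shell out to the OS tools mentioned in this thread. The sysctl key is an assumption and may report 0 on Apple Silicon, where the SLC size isn't always exposed.

    import platform
    import subprocess

    def l3_cache_bytes() -> int:
        """Best-effort last-level cache size in bytes; 0 if unknown."""
        if platform.system() == "Darwin":
            out = subprocess.run(["sysctl", "-n", "hw.l3cachesize"],
                                 capture_output=True, text=True)
        else:
            out = subprocess.run(["getconf", "LEVEL3_CACHE_SIZE"],
                                 capture_output=True, text=True)
        value = out.stdout.strip()
        return int(value) if value.isdigit() else 0

    print(l3_cache_bytes())  # e.g. 8388608 for the i7 above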

@i3oc9i commented Sep 15, 2022

There is no lscpu on macOS M1, sorry.

@mh-dm (Contributor, Author) commented Sep 15, 2022

Can you try getconf -a | grep CACHE_SIZE? I get

LEVEL1_ICACHE_SIZE                 32768
LEVEL1_DCACHE_SIZE                 32768
LEVEL2_CACHE_SIZE                  262144
LEVEL3_CACHE_SIZE                  8388608
LEVEL4_CACHE_SIZE                  0

@krummrey (Contributor):

Can someone give me a hand? How do I check out a pull request?
git pull origin pull/582/head:development didn't work.

@mh-dm (Contributor, Author) commented Sep 15, 2022

Maybe just git pull origin pull/582/head?

@krummrey (Contributor):

I ran it 3 times each (-n3):
512x512 - 1:02, 1:04, 1:04
640x640 - 3:33, 3:39, 3:27
No improvement here, even a slight increase in render times.

@Vargol (Contributor) commented Sep 17, 2022

Yes, ignoring that the dev branch was a 400% slowdown from main due to that typo :-)

But the point of that post was to prove to myself, even if no one else, that the issues I'm seeing with -W704 -H512 have nothing to do with this PR.

@Any-Winter-4079 (Contributor) commented Sep 17, 2022

@mh-dm Yep, you were right that it is almost exactly the same code for 16GB.

This returns your memory, e.g. 16 for 16GB.

Screenshot 2022-09-17 at 18 48 10

With 16GB, you'd go through einsum_op_mps_v2

Screenshot 2022-09-17 at 18 48 46

For 512x512 images, it runs einsum_op_compvis

Screenshot 2022-09-17 at 18 57 54

which is

Screenshot 2022-09-17 at 18 50 22

Otherwise, it uses einsum_op_slice_0 with slice_size 1

Screenshot 2022-09-17 at 18 52 12

Compared to the current dev branch code (with the >= 8 typo, which should read > 8)

Screenshot 2022-09-17 at 18 53 38

Maybe the available RAM at any point affects performance (e.g. one day may report faster results than a different day because of other apps, etc. running).

If the different results from the dev branch were from the same day, I'm not sure what it could be. Maybe it's calling the einsum_op_compvis function with portions of the data, e.g. r[i:end] = self.einsum_op_compvis(q[i:end], k[i:end], v[i:end]) (which I don't know how it works; maybe the slice of data is copied using ranges?).

Other than that, I really don't see any difference.
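For readers who can't see the screenshots, here is a rough reconstruction of the two pieces being described, as CrossAttention methods. It's an approximation based on this discussion, not the exact merged code; the 4096-token check is an assumption standing in for "images up to 512x512".

    import torch

    def einsum_op_mps_v2(self, q, k, v):
        # Machines above 8GB with small images take the plain CompVis path;
        # everything else falls back to slicing along dim 0.
        # (The > 8 vs >= 8 threshold is the typo discussed above.)
        if self.mem_total_gb > 8 and q.shape[1] <= 4096:
            return self.einsum_op_compvis(q, k, v)
        return self.einsum_op_slice_0(q, k, v, 1)

    def einsum_op_slice_0(self, q, k, v, slice_size):
        r = torch.zeros(q.shape[0], q.shape[1], v.shape[2],
                        device=q.device, dtype=q.dtype)
        for i in range(0, q.shape[0], slice_size):
            end = i + slice_size
            # q[i:end] etc. are views, not copies (see the reply below).
            r[i:end] = self.einsum_op_compvis(q[i:end], k[i:end], v[i:end])
        return r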

@mh-dm (Contributor, Author) commented Sep 17, 2022

Thanks for looking. Can I get the GitHub review approval so I can submit, then?

Oh, and to answer the question: slices like q[i:end] are not copies; they're views.

@Any-Winter-4079 (Contributor) left a comment

This commit gives 6-10% faster results for 64-128GB M1 (10% at larger sizes) and fixes a typo for 8GB M1. Performance on 32GB M1 has not been tested (maybe I missed it in the conversation), but it will run the same code as 64-128GB, which is the behavior already present on the dev branch. For 16GB M1, there are mixed results (more testing would be needed).

@Any-Winter-4079 (Contributor):

> Thanks for looking. Can I get the GitHub review approval so I can submit, then?
>
> Oh, and to answer the question: slices like q[i:end] are not copies; they're views.

By the way, have you tested on CUDA?

@mh-dm (Contributor, Author) commented Sep 17, 2022

Yeah, tested ~6% faster on an 8GB NVIDIA card for 512x512. I tested various sizes up to 1024x1024, at which point it was ~10% faster. The Intel i7 cpu got the highest improvement though, I think ~25% overall for 512x512, down to 7-8s/it.

@mh-dm merged commit e0951f2 into invoke-ai:development on Sep 17, 2022
@Any-Winter-4079 (Contributor) commented Sep 17, 2022

@krummrey @netsvetaev
By the way, because 16GB machines have reported mixed results: if anyone with 16GB RAM tests (same time of day, same apps open/closed, exiting dream> after every test, leaving a few minutes so as not to overheat, which is a real thing for me at 64GB, impacting performance) and there seems to be a consistent slowdown, you can always open a pull request to add your own code to run for 16GB Macs. For example, here:
Screenshot 2022-09-17 at 19 25 49
You'd just need to add an if, e.g. if self.mem_total_gb == 16, and then call the current code you are running (or an even more improved version), as sketched below. The current code (before this merge) is explained here: #582 (comment) 👍
By this I mean that we don't want to leave anyone behind :) If you don't know how to do a pull request (I didn't, a few days ago), just open an issue, report your experiments/results, and share the version of the code that works best, and I can even submit a pull request to merge it into the development branch.
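A minimal sketch of that suggestion, assuming it goes into einsum_op_mps_v2; the einsum_op_slice_0 call in the 16GB branch is only a placeholder for whatever turns out to run best on those machines.

    def einsum_op_mps_v2(self, q, k, v):
        if self.mem_total_gb == 16:
            # Placeholder: put the 16GB-specific code here once testing settles it.
            return self.einsum_op_slice_0(q, k, v, 1)
        if self.mem_total_gb > 8 and q.shape[1] <= 4096:
            return self.einsum_op_compvis(q, k, v)
        return self.einsum_op_slice_0(q, k, v, 1)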

@mh-dm (Contributor, Author) commented Sep 17, 2022

One thing I'd try when experimenting with optimizations is changing > 8 into > 16 to always call slice_0.

Another thing to try would be:

    def einsum_op_mps_v2(self, q, k, v):
        if self.mem_total_gb <= 8:
            # Low-memory machines: slice along dim 0, one slice at a time.
            return self.einsum_op_slice_0(q, k, v, 1)
        # Everyone else: dispatch based on a budget for the intermediate tensor
        # (64 here; see the values to try below).
        return self.einsum_op_tensor_mem(q, k, v, 64)

where you would try different values, e.g. 64, 48, 32, 24, 16.
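For context, here is a rough idea of what a memory-budgeted dispatcher like einsum_op_tensor_mem can look like. This is an approximation for illustration, not necessarily the exact merged code, and it assumes an einsum_op_slice_1 helper that slices along dim 1, analogous to einsum_op_slice_0 above.

    def einsum_op_tensor_mem(self, q, k, v, max_tensor_mb):
        # Size (in MB) of the intermediate attention tensor the CompVis op would build.
        size_mb = q.shape[0] * q.shape[1] * k.shape[1] * q.element_size() // (1 << 20)
        if size_mb <= max_tensor_mb:
            return self.einsum_op_compvis(q, k, v)
        # Otherwise pick a power-of-two divisor large enough to fit the budget.
        div = 1 << int((size_mb - 1) / max_tensor_mb).bit_length()
        if div <= q.shape[0]:
            return self.einsum_op_slice_0(q, k, v, q.shape[0] // div)
        return self.einsum_op_slice_1(q, k, v, max(q.shape[1] // div, 1))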

@mh-dm deleted the attention branch on September 17, 2022, 17:49
@netsvetaev (Contributor) commented Sep 17, 2022

It seems it's gotten faster on the latest dev:
1.11it/s at 512 (was 1.04 maximum), 45s total
5.25s/it at 768 (was 6.20), 4:22 total
That's impressive!

For 1024 I have the same numbers, from 23s increasing with every sample up to 25-26s (19s was the best result earlier). So +5 minutes in total. Strange.

@mh-dm @Any-Winter-4079 I'll try to edit the code, thank you for the tips.

@i3oc9i commented Sep 18, 2022

On the last commit e040c35:

"banana split" -s50 -W512 -H512 -C7.5 -Ak_lms (OK 0:20) same as previous report
"banana split" -s50 -W896 -H576 -C7.5 -Ak_lms (OK 0:50) same as previous report
"banana split" -s50 -W1024 -H1024 -C7.5 -Ak_lms (OK 2:44) same as previous report

Birch-san added a commit to Birch-san/stable-diffusion that referenced this pull request Sep 18, 2022
@Birch-san commented Sep 18, 2022

I've tried this plus #519 out on my fork:
Birch-san/stable-diffusion@18bb5f8...optimized-attention

Using a very recent PyTorch nightly (1.13.0.dev20220917) on an M1 Max (64GB).
8 steps, Heun sampler.
And I always ran the "optimized" test first, to give it the best chance (i.e. lower temperature).

1 sample:

optimized:
11.4 secs

non-optimized:
10.4 secs

3 samples:

optimized:
37.8 secs

non-optimized:
30.2 secs

Conclusion

Optimized attention is 10~25% slower than "CompVis original code + some dels".
This is not necessarily due to this PR; I got similar results trying out the algorithm from the previous PR:
#540 (comment)

@Vargol (Contributor) commented Sep 18, 2022

@Birch-san Have you tried it with a release PyTorch? I see a significant slowdown with the nightlies.

@Birch-san commented Sep 18, 2022

using PyTorch stable 1.12.1

1 sample

optimized:
10.1 secs

non-optimized:
9.8 secs

3 samples

optimized:
31.2 secs

non-optimized:
28.7 secs

Conclusion

Optimized attention is 3~9% slower than "CompVis original code + some dels".
PyTorch stable is 5~6% faster than recent nightly.

The multi-sample tests may be a bit unfair, because the heuristic is designed around pixel dimensions rather than the number of samples or conditions. But still, the single-sample tests are fair game.

@Birch-san:

Interesting, it seems I'm being sent down the M1 16–32GB code path. My memory_available was measured to be 31.89 GB. 😛

@Birch-san commented Sep 18, 2022

using PyTorch stable 1.12.1

1 sample

optimized:
9.8 secs

non-optimized:
9.8 secs

3 samples

optimized:
26.9 secs

non-optimized:
28.7 secs

Conclusion

If you go down the right code path (i.e. quitting apps to free up 32GB of memory), then optimized attention is identical in speed for 1 sample, and 6.7% faster for a batch of 3 samples.
That's 9.0 secs per sample, so batch-of-3 is 8.8% faster per sample than batch-of-1.

@Birch-san:

Btw, I think I see a way to make inference faster: instead of allocating new tensors each time, we could retain and re-use tensors. It'd be a bit of a mess, though, and it potentially increases peak VRAM use (because you'd have a few large tensors that you'd never free).

I also wonder whether slapping .contiguous() on any of these tensors would make it cheaper to do arithmetic on them, due to improved locality.

@mh-dm (Contributor, Author) commented Sep 18, 2022

> Btw, I think I see a way to make inference faster: instead of allocating new tensors each time, we could retain and re-use tensors. It'd be a bit of a mess, though, and it potentially increases peak VRAM use (because you'd have a few large tensors that you'd never free).

PyTorch really doesn't like modifying/re-using tensors. I had to re-land #569 with a .clone() fix so that training still works, and even a simple *= scalar seemed to introduce issues for 1024x1024 on M1. Finally, there's unfortunately not that much left to speed up when the slowest operation/bottleneck is s = s.softmax(dim=-1), for which there's no in-place version available in PyTorch. If anyone reading has contributed to PyTorch before and would like a nice rewarding challenge, try adding support for softmax_ or softmax(inplace=True).
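A small illustration of the training issue (not from the PR, just a generic PyTorch example): modifying a tensor in place after autograd has saved it for the backward pass raises an error at backward time, which is why the .clone() fix was needed.

    import torch

    x = torch.randn(4, requires_grad=True)
    y = x.exp()      # autograd saves exp's output to compute its backward
    y *= 2.0         # in-place edit of a tensor the backward pass still needs
    try:
        y.sum().backward()
    except RuntimeError as err:
        print("autograd error:", err)  # "...modified by an inplace operation"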

> I also wonder whether slapping .contiguous() on any of these tensors would make it cheaper to do arithmetic on them, due to improved locality.

There's less chance of seeing significant improvements there, as you'd be introducing a copy. Plus, I think reshape() already results in a contiguous copy.

> Interesting, it seems I'm being sent down the M1 16–32GB code path. My memory_available was measured to be 31.89 GB.

You have the configuration to try out the M1 16GB+ specific optimizations that Any-Winter-4079 and I suggested just a few messages up.

@Any-Winter-4079 (Contributor) commented Sep 19, 2022

> Interesting, it seems I'm being sent down the M1 16–32GB code path. My memory_available was measured to be 31.89 GB. 😛

> If you go down the right code path (i.e. quitting apps to free up 32GB of memory), then optimized attention is identical in speed for 1 sample, and 6.7% faster for a batch of 3 samples.

@Birch-san This new commit doesn't use mem_available on M1, though. It uses mem_total. You shouldn't need to quit apps or get 31.89 GB of available memory (because it's mem_total that's measured).

See: self.mem_total_gb = psutil.virtual_memory().total // (1 << 30)
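To make the distinction concrete, a tiny illustrative comparison of the two psutil readings (not from the PR itself):

    import psutil

    gb = 1 << 30
    print(psutil.virtual_memory().total // gb)      # total installed RAM: what the dispatch keys off
    print(psutil.virtual_memory().available // gb)  # free right now: fluctuates with open apps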

For me (64GB M1 Max), this code is about 6% faster for 512x512 and about 10% faster for 1024x1024.
As an example, for 512x512 I can go past 1.9it/s, where before I'd be lucky to get 1.8.

afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 19, 2022
…optimizations

Apply ~6% speedup by moving * self.scale to earlier on a smaller tensor.
When we have enough VRAM don't make a useless zeros tensor.
Switch between cuda/mps/cpu based on q.device.type to allow cleaner per architecture future optimizations.
For cuda and cpu keep VRAM usage and faster slicing consistent.
For cpu use smaller slices. Tested ~20% faster on i7, 9.8 to 7.7 s/it.
Fix = typo to self.mem_total >= 8 in einsum_op_mps_v2 as per invoke-ai#582 discussion.
afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 19, 2022
afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 19, 2022
afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 21, 2022
austinbrown34 pushed a commit to cognidesign/InvokeAI that referenced this pull request Dec 30, 2022