
Keep VRAM usage and faster slicing consistent in attention.py #582

Merged: 1 commit merged into invoke-ai:development on Sep 17, 2022

Conversation

@mh-dm (Contributor) commented Sep 14, 2022

EDIT:
Refactor attention.CrossAttention to remove 4 copies of einsum_op_compvis code.

Remove the zeros tensor that's useless when we have enough VRAM.
For CUDA, keep VRAM usage and faster slicing consistent.
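For readers skimming the diff, here is a minimal sketch of the shared helper this refactor factors out of the four copies, written as a CrossAttention method. It's an approximation based on this thread, not the exact merged code; per the commit notes further down, the * self.scale multiply is applied to q before this op is called, so it doesn't appear here.

    from torch import einsum

    # Sketch of the single einsum_op_compvis helper replacing the four copies.
    def einsum_op_compvis(self, q, k, v):
        s = einsum('b i d, b j d -> b i j', q, k)    # attention scores
        s = s.softmax(dim=-1)                        # softmax is the main bottleneck (see below)
        return einsum('b i j, b j d -> b i d', s, v)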

@mh-dm (Contributor, Author) commented Sep 14, 2022

I think this should also fix the report from @Kolaer in #486, where it was necessary to remove the raise RuntimeError line.

@mh-dm (Contributor, Author) commented Sep 14, 2022

FYI @Any-Winter-4079 since you edited this recently.

@i3oc9i commented Sep 15, 2022

I have tested this PR on macOS:

512x512 -s50 / OK, but there is no improvement in speed
896x576 -s50 / OK, but there is no improvement in speed
1024x1024 -s50 / FAIL with error: product of dimension sizes > 2**31

@Any-Winter-4079 (Contributor) commented Sep 15, 2022

Yeah, this is going to be tricky :)

For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal), and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report various results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.

I do agree there is probably a unified way to do this, though. I'm trying to get Textual Inversion working on M1, but I'll try to take a look. Thanks!

@Vargol (Contributor) commented Sep 15, 2022

It's going to be a while before I can get around to testing. I notice a lot of the VRAM reduction fixes have been removed, which concerns me, but I guess the move to functions means they should fall out of scope and get collected, so it might not be a disaster.

@netsvetaev (Contributor):
> For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal), and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report various results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.

My problems with RAM may be related to non-ARM brew and installation process differences, so I need to reinstall it. Right now a fresh install does cause some errors, though.

@0t0m0 commented Sep 15, 2022

@Any-Winter-4079
Thumbs down from me :(
The 512x512 image was created without problems in the normal time.
However, the 1024x1024 image took half the expected time and was colorful noise afterwards.

dream> banana sushi -W512 -H512 -Ak_lms -S50
100%|███████████████████████████████████████████| 50/50 [00:27<00:00,  1.83it/s]
Generating: 100%|█████████████████████████████████| 1/1 [00:31<00:00, 31.24s/it]
>> Usage stats:
>>   1 image(s) generated in 31.37s
Outputs:
[2] outputs/img-samples/000115.50.png: "banana sushi" -s50 -W512 -H512 -C7.5 -Ak_lms -S50

dream> banana sushi -W1024 -H1024 -Ak_lms -S50
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
100%|███████████████████████████████████████████| 50/50 [07:34<00:00,  9.09s/it]
Generating: 100%|████████████████████████████████| 1/1 [07:52<00:00, 472.54s/it]
>> Usage stats:
>>   1 image(s) generated in 472.67s
Outputs:
[3] outputs/img-samples/000116.50.png: "banana sushi" -s50 -W1024 -H1024 -C7.5 -Ak_lms -S50

Edited:
The fast processing is probably not the problem.
Even on the development branch, the working time is now only about 5.5 minutes for a 1024x1024 image.
Three days ago it was still about 22 minutes 😮

@jroxendal (Contributor):
Non-scientific findings (i.e. I shut down some of the worst memory hogs but didn't do a full restart to clear up memory).
32GB M1 Max:
512x512: same as development, so about 35s.
896x576: hovering around 20s/it.
1024x1024: suspiciously fast at around 5s/it, and generated colourful noise.

@Vargol (Contributor) commented Sep 15, 2022

Sorry, can't test; I seem to be in dependency hell. I think something has been upgraded to use a newer version of protobuf that's incompatible with a bunch of other dependencies.

@netsvetaev (Contributor):

> For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal), and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report various results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.
>
> My problems with RAM may be related to non-ARM brew and installation process differences, so I need to reinstall it. Right now a fresh install does cause some errors, though.

After a fresh install I've got 1.38s/it at 512, 6.20s/it at 768, and 20.5s/it at 1024 (main branch). So it was my problem with something non-ARM.

@netsvetaev (Contributor):

> It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter.

Development:
512 at 1.04it/s (less than a second per it). 50s total.
768 at 6.20s/it. 1024 at 22-24s/it.

This attention.py version:
512 at 1.04s/it (slower)
768 at 10s/it
1024 crashed dream.py with this:

/AppleInternal/Library/BuildRoots/5381bdfb-27e8-11ed-bdc1-96898e02b808/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:705: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: product of dimension sizes > 2**31'
Abort trap: 6
/Users/artur/miniconda3/envs/ldm10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@lstein (Collaborator) commented Sep 15, 2022

Tag me when this is ready for review. You might want to convert this PR into a draft if you're still actively working on it.

@mh-dm marked this pull request as a draft on September 15, 2022, 11:23
@Vargol (Contributor) commented Sep 15, 2022

Sorted my dependency hell.

512x512: around normal speed.
1024x1024: was running a little slow (67s/it vs. 55s/it), then threw an error on the 35th sample:
Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

@mh-dm (Contributor, Author) commented Sep 15, 2022

Thank you for testing this!
I see there's an issue with 1024x1024 / large sizes (product of dimension sizes > 2**31) on mps (note that 1024x1024 runs fine on just cpu), and I'll address that. It looks like it was handled through slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1])) before my change.
@netsvetaev I'm surprised about any slowdown. This change is mostly a refactoring, especially for 512x512, where it should trigger the same algorithm with no slicing. I read you have 16GB, so for the 768x768 size it should use larger slices. I'm guessing larger slices are slower, which is okay, as I have to limit them to fix the above anyway.
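A minimal sketch of that kind of guard, written as another CrossAttention helper. The function name and the einsum_op_slice_1 helper (the dim-1 analogue of einsum_op_slice_0) are illustrative assumptions, not necessarily what was merged; only the slice_size formula is taken from this thread.

    import math

    def einsum_op_mps_capped(self, q, k, v):
        # Small images: run the plain CompVis op. 4096 tokens corresponds to a
        # 64x64 latent, i.e. roughly a 512x512 image.
        if q.shape[1] <= 4096:
            return self.einsum_op_compvis(q, k, v)
        # Larger images: cap the slice so the intermediate attention tensor
        # stays under MPS's 2**31-element limit.
        slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1]))
        return self.einsum_op_slice_1(q, k, v, slice_size)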

@Vargol (Contributor) commented Sep 15, 2022

Oops, I was testing the wrong thing; that's what you get for dealing with dependency problems during breaks from a training course, lol.

It's actually a disaster... 512x512 is currently doing 18s/it, down from 5.5-6.0s/it. I'll let it run a while, as the first few samples can be a bit slow, but it normally settles down by where I've gotten to.

@krummrey (Contributor):

@netsvetaev I'm in the office right now, will try it later this evening.

@Vargol (Contributor) commented Sep 15, 2022

Darn, just realised that means I'm getting those memory issues from main; I wasn't getting any from the 1.14.3 test repo.

@i3oc9i commented Sep 15, 2022

@mh-dm
After the force-push of the attention branch from 297e70d to b93338a:

banana split -s50 -W1024 -H1024 works, but it is almost two times slower than the 1.14.1 release (2m55s)

50/50 [05:43<00:00,  6.87s/it]
1 image(s) generated in 345.72s

banana split -s50 -W576 -H896 works and is just a bit slower (50s in the 1.14.1 release)

50/50 [00:54<00:00,  1.09s/it]
1 image(s) generated in 55.06s

banana split -s50 -W512 -H512 works at the same speed as the 1.14.1 release

50/50 [00:22<00:00,  2.25it/s]
1 image(s) generated in 22.83s

@Vargol (Contributor) commented Sep 15, 2022

Oh, I can confirm it now: for me, 512x512 is 400% slower.

And the reason for that is it's going into the CompVis calc when I basically need slice 0 for any size image. I'm amazed I actually get anything at all.

Hopefully @mh-dm is beginning to appreciate that the code was complex for a very good reason :-)

@lstein (Collaborator) commented Sep 15, 2022

Folks, also keep in mind that we have to maintain this code over the long term. Even a 10% performance increase may not be worth it if it causes a 50% increase in maintenance time to track down bugs or add new features. At some point this exercise becomes subject to diminishing returns.

@mh-dm (Contributor, Author) commented Sep 15, 2022

@i3oc9i @Vargol Thank you, very good data. It's widely different from my experience, but I'm starting to understand what's going on. For my cpu I see only minor variation between slicing or not, at different sizes, while testing 512->1024. My i7 has a measly 8MB L3 cache, fully blown even at just 512x512 with slicing. Whereas "M1 Pro and M1 Max have 24 MB and 48 MB respectively of system level cache (SLC). The M1 Ultra combines two M1 Max chips in one package for a total of 20 CPU cores and 96 MB system level cache (SLC)".
Does anyone know an easy way to get the L3 cache size in python/torch? It might also help to run lscpu | grep cache and post the output if it's significantly different from what's already been posted by others.

L2 cache:                        1 MiB
L3 cache:                        8 MiB
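Not part of the PR, but as a hedged sketch of one way to answer the question above from Python: shell out to the OS tools mentioned in this thread. The sysctl key is an assumption and may report 0 on Apple Silicon, where the SLC size isn't always exposed.

    import platform
    import subprocess

    def l3_cache_bytes() -> int:
        """Best-effort last-level cache size in bytes; 0 if unknown."""
        if platform.system() == "Darwin":
            out = subprocess.run(["sysctl", "-n", "hw.l3cachesize"],
                                 capture_output=True, text=True)
        else:
            out = subprocess.run(["getconf", "LEVEL3_CACHE_SIZE"],
                                 capture_output=True, text=True)
        value = out.stdout.strip()
        return int(value) if value.isdigit() else 0

    print(l3_cache_bytes())  # e.g. 8388608 for the i7 above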

@i3oc9i commented Sep 15, 2022

There is no lscpu on macOS M1, sorry.

@mh-dm (Contributor, Author) commented Sep 15, 2022

Can you try getconf -a | grep CACHE_SIZE? I get

LEVEL1_ICACHE_SIZE                 32768
LEVEL1_DCACHE_SIZE                 32768
LEVEL2_CACHE_SIZE                  262144
LEVEL3_CACHE_SIZE                  8388608
LEVEL4_CACHE_SIZE                  0

@krummrey (Contributor):

Can someone give me a hand? How do I check out a pull request?
git pull origin pull/582/head:development didn't work.

@mh-dm (Contributor, Author) commented Sep 15, 2022

Maybe just git pull origin pull/582/head?

@krummrey (Contributor):

I ran it 3 times each (-n3):
512x512 - 1:02, 1:04, 1:04
640x640 - 3:33, 3:39, 3:27
No improvement here, even a slight increase in render times.

@Vargol (Contributor) commented Sep 17, 2022

Yes, ignoring that the dev branch was a 400% slowdown from main due to that typo :-)

But the point of that post was to prove to myself, even if no one else, that the issues I'm seeing with -W704 -H512 have nothing to do with this PR.

@Any-Winter-4079 (Contributor) commented Sep 17, 2022

@mh-dm Yep, you were right that it is almost exactly the same code for 16GB.

This returns your memory, e.g. 16 for 16GB.

Screenshot 2022-09-17 at 18 48 10

With 16GB, you'd go through einsum_op_mps_v2

Screenshot 2022-09-17 at 18 48 46

For 512x512 images, it runs einsum_op_compvis

Screenshot 2022-09-17 at 18 57 54

which is

Screenshot 2022-09-17 at 18 50 22

Otherwise, it uses einsum_op_slice_0 with slice_size 1

Screenshot 2022-09-17 at 18 52 12

Compared to the current dev branch code (with the >= 8 typo, which should read > 8)

Screenshot 2022-09-17 at 18 53 38

Maybe the available RAM at any point affects performance (e.g. one day may report faster results than a different day because of other apps, etc. running).

If the different results from the dev branch were from the same day, I'm not sure what it could be. Maybe it's calling the einsum_op_compvis function with portions of the data, e.g. r[i:end] = self.einsum_op_compvis(q[i:end], k[i:end], v[i:end]) (which I don't know how it works; maybe the slice of data is copied using ranges?).

Other than that, I really don't see any difference.
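For readers who can't see the screenshots, here is a rough reconstruction of the two pieces being described, as CrossAttention methods. It's an approximation based on this discussion, not the exact merged code; the 4096-token check is an assumption standing in for "images up to 512x512".

    import torch

    def einsum_op_mps_v2(self, q, k, v):
        # Machines above 8GB with small images take the plain CompVis path;
        # everything else falls back to slicing along dim 0.
        # (The > 8 vs >= 8 threshold is the typo discussed above.)
        if self.mem_total_gb > 8 and q.shape[1] <= 4096:
            return self.einsum_op_compvis(q, k, v)
        return self.einsum_op_slice_0(q, k, v, 1)

    def einsum_op_slice_0(self, q, k, v, slice_size):
        r = torch.zeros(q.shape[0], q.shape[1], v.shape[2],
                        device=q.device, dtype=q.dtype)
        for i in range(0, q.shape[0], slice_size):
            end = i + slice_size
            # q[i:end] etc. are views, not copies (see the reply below).
            r[i:end] = self.einsum_op_compvis(q[i:end], k[i:end], v[i:end])
        return r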

@mh-dm (Contributor, Author) commented Sep 17, 2022

Thanks for looking. Can I get the GitHub review approval so I can submit, then?

Oh, and to answer the question: slices like q[i:end] are not copies; they're views.

@Any-Winter-4079 (Contributor) left a comment

This commit gives 6-10% faster results for 64-128GB M1 (10% at larger sizes) and fixes a typo for 8GB M1. Performance on 32GB M1 has not been tested (maybe I missed it in the conversation), but it will run the same code as 64-128GB, which is the behavior already present on the dev branch. For 16GB M1, there are mixed results (more testing would be needed).

@Any-Winter-4079 (Contributor):

> Thanks for looking. Can I get the GitHub review approval so I can submit, then?
>
> Oh, and to answer the question: slices like q[i:end] are not copies; they're views.

By the way, have you tested on CUDA?

@mh-dm (Contributor, Author) commented Sep 17, 2022

Yeah, tested ~6% faster on an 8GB NVIDIA card for 512x512. I tested various sizes up to 1024x1024, at which point it was ~10% faster. The Intel i7 cpu got the highest improvement though, I think ~25% overall for 512x512, down to 7-8s/it.

@mh-dm merged commit e0951f2 into invoke-ai:development on Sep 17, 2022
@Any-Winter-4079 (Contributor) commented Sep 17, 2022

@krummrey @netsvetaev
By the way, because 16GB machines have reported mixed results: if anyone with 16GB RAM tests (same time of day, same apps open/closed, exiting dream> after every test, leaving a few minutes so as not to overheat, which is a real thing for me at 64GB, impacting performance) and there seems to be a consistent slowdown, you can always open a pull request to add your own code to run for 16GB Macs. For example, here:
Screenshot 2022-09-17 at 19 25 49
You'd just need to add an if, e.g. if self.mem_total_gb == 16, and then call the current code you are running (or an even more improved version), as sketched below. The current code (before this merge) is explained here: #582 (comment) 👍
By this I mean that we don't want to leave anyone behind :) If you don't know how to do a pull request (I didn't, a few days ago), just open an issue, report your experiments/results, and share the version of the code that works best, and I can even submit a pull request to merge it into the development branch.
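A minimal sketch of that suggestion, assuming it goes into einsum_op_mps_v2; the einsum_op_slice_0 call in the 16GB branch is only a placeholder for whatever turns out to run best on those machines.

    def einsum_op_mps_v2(self, q, k, v):
        if self.mem_total_gb == 16:
            # Placeholder: put the 16GB-specific code here once testing settles it.
            return self.einsum_op_slice_0(q, k, v, 1)
        if self.mem_total_gb > 8 and q.shape[1] <= 4096:
            return self.einsum_op_compvis(q, k, v)
        return self.einsum_op_slice_0(q, k, v, 1)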

@mh-dm (Contributor, Author) commented Sep 17, 2022

One thing I'd try when experimenting with optimizations is changing > 8 into > 16 to always call slice_0.

Another thing to try would be:

    def einsum_op_mps_v2(self, q, k, v):
        if self.mem_total_gb <= 8:
            # Low-memory machines: slice along dim 0, one slice at a time.
            return self.einsum_op_slice_0(q, k, v, 1)
        # Everyone else: dispatch based on a budget for the intermediate tensor
        # (64 here; see the values to try below).
        return self.einsum_op_tensor_mem(q, k, v, 64)

where you would try different values, e.g. 64, 48, 32, 24, 16.
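For context, here is a rough idea of what a memory-budgeted dispatcher like einsum_op_tensor_mem can look like. This is an approximation for illustration, not necessarily the exact merged code, and it assumes an einsum_op_slice_1 helper that slices along dim 1, analogous to einsum_op_slice_0 above.

    def einsum_op_tensor_mem(self, q, k, v, max_tensor_mb):
        # Size (in MB) of the intermediate attention tensor the CompVis op would build.
        size_mb = q.shape[0] * q.shape[1] * k.shape[1] * q.element_size() // (1 << 20)
        if size_mb <= max_tensor_mb:
            return self.einsum_op_compvis(q, k, v)
        # Otherwise pick a power-of-two divisor large enough to fit the budget.
        div = 1 << int((size_mb - 1) / max_tensor_mb).bit_length()
        if div <= q.shape[0]:
            return self.einsum_op_slice_0(q, k, v, q.shape[0] // div)
        return self.einsum_op_slice_1(q, k, v, max(q.shape[1] // div, 1))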

@mh-dm deleted the attention branch on September 17, 2022, 17:49
@netsvetaev (Contributor) commented Sep 17, 2022

It seems it's gotten faster on the latest dev:
1.11it/s at 512 (was 1.04 maximum), 45s total
5.25s/it at 768 (was 6.20), 4:22 total
That's impressive!

For 1024 I have the same numbers, from 23s increasing with every sample up to 25-26s (19s was the best result earlier). So +5 minutes in total. Strange.

@mh-dm @Any-Winter-4079 I'll try to edit the code, thank you for the tips.

@i3oc9i commented Sep 18, 2022

On the last commit e040c35:

"banana split" -s50 -W512 -H512 -C7.5 -Ak_lms (OK 0:20) same as previous report
"banana split" -s50 -W896 -H576 -C7.5 -Ak_lms (OK 0:50) same as previous report
"banana split" -s50 -W1024 -H1024 -C7.5 -Ak_lms (OK 2:44) same as previous report

Birch-san added a commit to Birch-san/stable-diffusion that referenced this pull request Sep 18, 2022
@Birch-san commented Sep 18, 2022

I've tried this plus #519 out on my fork:
Birch-san/stable-diffusion@18bb5f8...optimized-attention

Using a very recent PyTorch nightly (1.13.0.dev20220917) on an M1 Max (64GB).
8 steps, Heun sampler.
And I always ran the "optimized" test first, to give it the best chance (i.e. lower temperature).

1 sample:

optimized:
11.4 secs

non-optimized:
10.4 secs

3 samples:

optimized:
37.8 secs

non-optimized:
30.2 secs

Conclusion

Optimized attention is 10~25% slower than "CompVis original code + some dels".
This is not necessarily due to this PR; I got similar results trying out the algorithm from the previous PR:
#540 (comment)

@Vargol (Contributor) commented Sep 18, 2022

@Birch-san Have you tried it with a release PyTorch? I see a significant slowdown with the nightlies.

@Birch-san commented Sep 18, 2022

using PyTorch stable 1.12.1

1 sample

optimized:
10.1 secs

non-optimized:
9.8 secs

3 samples

optimized:
31.2 secs

non-optimized:
28.7 secs

Conclusion

Optimized attention is 3~9% slower than "CompVis original code + some dels".
PyTorch stable is 5~6% faster than recent nightly.

The multi-sample tests may be a bit unfair, because the heuristic is designed around pixel dimensions rather than the number of samples or conditions. But still, the single-sample tests are fair game.

@Birch-san:

Interesting, it seems I'm being sent down the M1 16–32GB code path. My memory_available was measured to be 31.89 GB. 😛

@Birch-san commented Sep 18, 2022

using PyTorch stable 1.12.1

1 sample

optimized:
9.8 secs

non-optimized:
9.8 secs

3 samples

optimized:
26.9 secs

non-optimized:
28.7 secs

Conclusion

If you go down the right code path (i.e. quitting apps to free up 32GB of memory), then optimized attention is identical in speed for 1 sample, and 6.7% faster for a batch of 3 samples.
That's 9.0 secs per sample, so batch-of-3 is 8.8% faster per sample than batch-of-1.

@Birch-san:

Btw, I think I see a way to make inference faster: instead of allocating new tensors each time, we could retain and re-use tensors. It'd be a bit of a mess, though, and it potentially increases peak VRAM use (because you'd have a few large tensors that you'd never free).

I also wonder whether slapping .contiguous() on any of these tensors would make it cheaper to do arithmetic on them, due to improved locality.

@mh-dm (Contributor, Author) commented Sep 18, 2022

> Btw, I think I see a way to make inference faster: instead of allocating new tensors each time, we could retain and re-use tensors. It'd be a bit of a mess, though, and it potentially increases peak VRAM use (because you'd have a few large tensors that you'd never free).

PyTorch really doesn't like modifying/re-using tensors. I had to re-land #569 with a .clone() fix so that training still works, and even a simple *= scalar seemed to introduce issues for 1024x1024 on M1. Finally, there's unfortunately not that much left to speed up when the slowest operation/bottleneck is s = s.softmax(dim=-1), for which there's no in-place version available in PyTorch. If anyone reading has contributed to PyTorch before and would like a nice rewarding challenge, try adding support for softmax_ or softmax(inplace=True).
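A small illustration of the training issue (not from the PR, just a generic PyTorch example): modifying a tensor in place after autograd has saved it for the backward pass raises an error at backward time, which is why the .clone() fix was needed.

    import torch

    x = torch.randn(4, requires_grad=True)
    y = x.exp()      # autograd saves exp's output to compute its backward
    y *= 2.0         # in-place edit of a tensor the backward pass still needs
    try:
        y.sum().backward()
    except RuntimeError as err:
        print("autograd error:", err)  # "...modified by an inplace operation"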

> I also wonder whether slapping .contiguous() on any of these tensors would make it cheaper to do arithmetic on them, due to improved locality.

There's less chance of seeing significant improvements there, as you'd be introducing a copy. Plus, I think reshape() already results in a contiguous copy.

> Interesting, it seems I'm being sent down the M1 16–32GB code path. My memory_available was measured to be 31.89 GB.

You have the configuration to try out the M1 16GB+ specific optimizations that Any-Winter-4079 and I suggested just a few messages up.

@Any-Winter-4079 (Contributor) commented Sep 19, 2022

> Interesting, it seems I'm being sent down the M1 16–32GB code path. My memory_available was measured to be 31.89 GB. 😛

> If you go down the right code path (i.e. quitting apps to free up 32GB of memory), then optimized attention is identical in speed for 1 sample, and 6.7% faster for a batch of 3 samples.

@Birch-san This new commit doesn't use mem_available on M1, though. It uses mem_total. You shouldn't need to quit apps or get 31.89 GB of available memory (because it's mem_total that's measured).

See: self.mem_total_gb = psutil.virtual_memory().total // (1 << 30)
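To make the distinction concrete, a tiny illustrative comparison of the two psutil readings (not from the PR itself):

    import psutil

    gb = 1 << 30
    print(psutil.virtual_memory().total // gb)      # total installed RAM: what the dispatch keys off
    print(psutil.virtual_memory().available // gb)  # free right now: fluctuates with open apps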

For me (64GB M1 Max), this code is about 6% faster for 512x512 and about 10% faster for 1024x1024.
As an example, for 512x512 I can go past 1.9it/s, where before I'd be lucky to get 1.8.

afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 19, 2022
…optimizations

Apply ~6% speedup by moving * self.scale to earlier on a smaller tensor.
When we have enough VRAM don't make a useless zeros tensor.
Switch between cuda/mps/cpu based on q.device.type to allow cleaner per architecture future optimizations.
For cuda and cpu keep VRAM usage and faster slicing consistent.
For cpu use smaller slices. Tested ~20% faster on i7, 9.8 to 7.7 s/it.
Fix = typo to self.mem_total >= 8 in einsum_op_mps_v2 as per invoke-ai#582 discussion.
afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 19, 2022
afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 19, 2022
afiaka87 pushed a commit to afiaka87/lstein-stable-diffusion that referenced this pull request Sep 21, 2022
austinbrown34 pushed a commit to cognidesign/InvokeAI that referenced this pull request Dec 30, 2022