Keep VRAM usage and faster slicing consistent in attention.py #582
Conversation
FYI @Any-Winter-4079 since you edited this recently.
I have tested this PR on macOS, 512x512, -s50. OK, but there is no improvement in speed.
Yeah, this is going to be tricky :) For ongoing development of the solution, I would seek a thumbs up from 8GB (@Vargol), 16GB (@netsvetaev, @krummrey), 32GB (@0t0m0, @jroxendal) and 64GB (@Any-Winter-4079) or 128GB (@i3oc9i) M1 machines before merging, as they seem to report varying results. It should maintain the same speed for 512x512 and larger images (e.g. 1024x1024) and not crash on the latter. I do agree there is probably a unified way to do this, though. I'm trying to get Textual Inversion working on M1, but I'll try to take a look. Thanks!
Going to be a while before I can get around to testing. I notice a lot of the VRAM reduction fixes have been removed, which concerns me, but I guess the move to functions means they should fall out of scope and get collected, so it might not be a disaster.
My problems with RAM may be related to non-arm brew and installation process differences, so I need to reinstall it. Right now a fresh install does cause some errors, though.
@Any-Winter-4079
Edited:
Non-scientific findings (i.e. I shut down some of the worst memory hogs but didn't do a full restart to clear up memory)
Sorry, can't test, seem to be in dependency hell. I think something has been upgraded to use a newer version of protobuf.
After a fresh install I've got 1.38s/it at 512, 6.20s/it at 768 and 20.5s/it at 1024 (main branch). So it was my problem with something non-arm.
Development: This attention.py version:
Tag me when this is ready for review. You might want to convert this PR into a draft if you're still actively working on it.
Sorted my dependency hell. 512x512, around normal speed.
Thank you for testing this!
Oops, was testing the wrong thing, that's what you get for dealing with dependency problems during breaks from a training course lol. It's actually a disaster... 512x512 is currently doing 18s/it, down from 5.5-6.0s/it. I'll let it run a while as the first few samples can be a bit slow, but it normally settles down by where I've got to.
@netsvetaev I'm in the office right now, will try it later this evening.
Darn, just realised that means I'm getting those memory issues from main; I wasn't getting any from the 1.14.3 test repo.
@mh-dm banana split -s50 -W1024 -H1024 works but it is almost two times slower than the 1.14.1 release (2m55s)
banana split -s50 -W576 -H896 works and it's just a bit slower (50s in the 1.14.1 release)
banana split -s50 -W512 -H512 works at the same speed as the 1.14.1 release
Oh, I can confirm it now: for me 512x512 is 400% slower. And the reason for that is it's going into the CompVis calc when I basically need to be at slice 0 for any size image. I'm amazed I actually get anything at all. Hopefully @mh-dm is beginning to appreciate the code was complex for a very good reason :-)
Folks, also keep in mind that we have to maintain this code over the long term. Even a 10% performance increase may not be worth it if it causes a 50% increase in maintenance time to track down bugs or add new features. At some point this exercise becomes subject to diminishing returns.
@i3oc9i @Vargol Thank you, very good data. Widely different from my experience, but I'm starting to understand what's going on. For my CPU I see only minor variation between slicing or not, and at different sizes, while testing 512->1024. My i7 has a measly 8MB L3 cache, fully blown even at just 512x512 with slicing. Whereas "M1 Pro and M1 Max have 24 MB and 48 MB respectively of system level cache (SLC). The M1 Ultra combines two M1 Max chips in one package[15] for a total of 20 CPU cores and 96 MB system level cache (SLC)".
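To ground the cache discussion, here is a minimal sketch of sliced attention (illustrative only, not the PR's exact code; scaling is omitted for brevity): processing the query rows in chunks keeps the intermediate score matrix small enough to stay in cache or VRAM.

```python
import torch

def einsum_op_sliced(q, k, v, slice_size):
    # Hypothetical sliced attention: chunk the query rows so the (i x j)
    # score matrix never materializes all at once.
    r = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device, dtype=q.dtype)
    for i in range(0, q.shape[1], slice_size):
        s = torch.einsum('b i d, b j d -> b i j', q[:, i:i + slice_size], k)
        s = s.softmax(dim=-1)
        r[:, i:i + slice_size] = torch.einsum('b i j, b j d -> b i d', s, v)
    return r
```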
There is no lscpu on macOS M1, sorry
Can you try
Can someone give me a hand, how do I check out a pull request? |
Maybe just |
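(For reference, one common way is to fetch the PR ref into a local branch, e.g. `git fetch origin pull/582/head:pr-582` followed by `git checkout pr-582`; the remote name `origin` is assumed here.)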
I ran it 3 times each (-n3):
Yes, ignoring that the dev branch was a 400% slowdown from main due to that typo :-) But the point of that post was to prove to myself, even if no one else, that the issue I'm seeing with '-W704 -H512' has nothing to do with this PR.
@mh-dm Yep, you were right that it is almost exactly the same code for 16GB. […] returns your memory, e.g. 16 for 16GB. With 16GB, you'd go through […]. For 512x512 images, run […], which is […]. Else, use […]. Compared to the current dev branch code (with the >=8 typo, which should read >8): […]. Maybe the available RAM at any point affects performance (e.g. one day may report faster results than a different day, because of other apps, etc. running). If the different results from the dev branch were from the same day, I'm not sure what it could be. Maybe calling the […]. Other than that, I really don't see any difference.
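To make the paths concrete, a minimal sketch of the memory-based dispatch under discussion (the 4096-token threshold and the helper below are illustrative assumptions, not the PR's exact code; mem_total follows the naming used in the commit message further down):

```python
import psutil

# Total system memory in GB; on Apple Silicon unified memory doubles as VRAM,
# so the heuristic keys off total RAM rather than currently-available RAM.
mem_total = psutil.virtual_memory().total / (1 << 30)

def choose_attention_path(q_tokens: int) -> str:
    # The ">= 8" vs "> 8" point from this thread: with ">= 8", an 8GB machine
    # takes the unsliced CompVis path and slows down badly; with "> 8" it
    # falls back to the sliced path instead.
    if mem_total > 8 and q_tokens <= 4096:  # ~4096 query tokens == 64x64 latents (a 512x512 image)
        return "compvis (unsliced)"
    return "sliced"
```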
Thanks for looking. Can I get the GitHub review approval so I can submit, then? Oh, and to answer the question, slices like
This commit gives 6-10% faster results on 64-128GB M1 (10% at larger sizes) and fixes a typo for 8GB M1. Performance on 32GB M1 has not been tested (maybe I missed it in the Conversation), but it will run the same code as 64-128GB, which is the behavior already present on the dev branch. For 16GB M1, there are mixed results (more testing would be needed).
By the way, have you tested on CUDA? |
Yeah, tested ~6% faster on an 8GB Nvidia card for 512x512. I tested various sizes up to 1024x1024, at which point it was ~10% faster. The Intel i7 CPU got the highest improvement, I think ~25% overall for 512x512, down to 7-8s/it though.
@krummrey @netsvetaev
Seems it's got faster on the latest dev. For 1024 I have the same numbers, starting from 23s and increasing with every sample up to 25-26s (19s was the best result earlier). So +5 minutes in total. Strange. @mh-dm @Any-Winter-4079 I'll try to edit the code, thank you for the tips.
On the last commit e040c35: "banana split" -s50 -W512 -H512 -C7.5 -Ak_lms (OK, 0:20), same as the previous report.
I've tried this plus #519 out on my fork, using a very recent PyTorch nightly (1.13.0.dev20220917) on M1 Max (64GB).
1 sample: optimized: […] / non-optimized: […]
3 samples: optimized: […] / non-optimized: […]
Conclusion: optimized attention is […]
@Birch-san Have you tried it with a release PyTorch? I see a significant slowdown with the nightlies.
Using PyTorch stable 1.12.1.
1 sample: optimized: […] / non-optimized: […]
3 samples: optimized: […] / non-optimized: […]
Conclusion: optimized attention is […]. The multi-sample tests may be a bit unfair, because the heuristic is designed to study pixel dimensions rather than number of samples or conditions. But still, the single-sample tests are fair game.
Interesting, it seems I'm being sent down the M1 16–32GB code path. My
Using PyTorch stable 1.12.1.
1 sample: optimized: […] / non-optimized: […]
3 samples: optimized: […] / non-optimized: […]
Conclusion: if you go down the right code path (i.e. quitting apps to free up 32GB of memory), then optimized attention is identical in speed for 1 sample, and 6.7% faster for a batch of 3 samples.
btw, I think I see a way to make inference faster. I also wonder whether slapping
PyTorch really doesn't like modifying/re-using tensors. I had to re-land #569 with a .clone() fix so that training still works, and even a simple
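A tiny illustration of the pitfall being referenced (generic PyTorch behavior, not the actual #569 change): editing a tensor in place that autograd still needs breaks backward(), while cloning before the edit keeps training intact.

```python
import torch

q = torch.randn(4, requires_grad=True)
s = q.exp()                 # exp's backward re-uses its own output

# s.mul_(2)                 # modifying s in place would make backward() raise a RuntimeError
out = s.clone().mul_(2)     # clone first, then modify: the tensor autograd saved stays intact

out.sum().backward()
print(q.grad)               # == 2 * q.exp(), as expected
```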
Less chance of seeing significant improvements, as you'd be introducing a copy. Plus I think reshape() already results in a contiguous copy.
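A quick check of that reshape() point (generic PyTorch behavior, not PR-specific code): when a view isn't possible, reshape() already returns a contiguous copy, so adding .contiguous() on top would be a no-op there.

```python
import torch

x = torch.randn(2, 3).t()    # transpose -> a non-contiguous view
print(x.is_contiguous())     # False
y = x.reshape(-1)            # can't be expressed as a view, so reshape() copies
print(y.is_contiguous())     # True: the copy is already contiguous
```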
You have the configuration to try out the M1 16GB+ specific optimizations that Any-Winter-4079 and I have suggested, just a few messages up.
@Birch-san This new commit doesn't use mem_available on M1 though. It uses mem_total. You shouldn't need to quit apps or have 31.89GB of available memory (because it's mem_total that's measured). See: […]
For me (64GB M1 Max), this code is about 6% faster for 512x512 and about 10% faster for 1024x1024.
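For anyone reproducing this, the distinction in question, shown with a generic psutil call (the actual attribute names inside attention.py may differ):

```python
import psutil

vm = psutil.virtual_memory()
print(vm.total // (1 << 30))      # mem_total: fixed per machine, e.g. 64 on a 64GB M1 Max
print(vm.available // (1 << 30))  # mem_available: varies with whatever else is running
```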
…optimizations Apply ~6% speedup by moving * self.scale to earlier on a smaller tensor. When we have enough VRAM don't make a useless zeros tensor. Switch between cuda/mps/cpu based on q.device.type to allow cleaner per architecture future optimizations. For cuda and cpu keep VRAM usage and faster slicing consistent. For cpu use smaller slices. Tested ~20% faster on i7, 9.8 to 7.7 s/it. Fix = typo to self.mem_total >= 8 in einsum_op_mps_v2 as per invoke-ai#582 discussion.
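To see why moving the scale earlier helps (a sketch of the algebra, not the PR's exact code): (q @ kᵀ) * scale equals (q * scale) @ kᵀ, and q has far fewer elements than the full attention-score matrix, so scaling it first touches less memory.

```python
import torch

def scores_scale_late(q, k, scale):
    # Scales the full (i x j) score matrix after the matmul.
    return torch.einsum('b i d, b j d -> b i j', q, k) * scale

def scores_scale_early(q, k, scale):
    # Mathematically identical, but scales only q (i x d), a much smaller tensor.
    return torch.einsum('b i d, b j d -> b i j', q * scale, k)

# Sanity check on random data:
q, k = torch.randn(1, 4096, 64), torch.randn(1, 77, 64)
assert torch.allclose(scores_scale_late(q, k, 0.125), scores_scale_early(q, k, 0.125), atol=1e-5)
```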
EDIT:
Refactor attention.CrossAttention to remove 4 copies of einsum_op_compvis code.
Remove zeros tensor that's useless when we have enough VRAM.
For cuda keep VRAM usage and faster slicing consistent.