
xformers attention #1851

Merged · merged 21 commits · Oct 8, 2022

Conversation

C43H66N12O12S2
Collaborator

C43H66N12O12S2 commented Oct 7, 2022

This PR adds xformers-optimized cross-attention, a flag to disable it and use the split-attention optimization instead, and a _maybe_init function that, for some reason, seems to be necessary for xformers to work in this instance. It also enables functorch in xformers, which further increased performance on my machine.

We still need a way for easy distribution of xformers. Otherwise, this PR is good to go (barring bugs I've not been able to perceive).
cc. @Doggettx @Thomas-MMJ @ArrowM @consciencia

PS. Much thanks to @fmassa @danthe3rd @yocabon and many others for their generous efforts to bring xformers to Windows.

I've seen a 15% improvement with batch size 1, 100 steps, 512x512 and euler_a. xFormers allows me to output 2048x2048, whereas I would previously OOM.

closes #576

modules/sd_hijack.py: review comment (outdated, resolved)
@rabidcopy

I'm having trouble finding information on this, but does this inadvertently kill Linux AMD support as a new default? I'm not certain xformers can be compiled for ROCm.

@C43H66N12O12S2
Collaborator Author

Yeah, it likely would. We could add another check to the if statement for ROCm. Not sure PyTorch exposes that; I'll look into it, though.
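For what it's worth, PyTorch's ROCm builds do expose a marker one could test against. A minimal sketch (not the check that landed in this PR):

import torch

# torch.version.hip is a version string on ROCm builds of PyTorch
# and None on CUDA builds, so it can gate the xformers code path.
def has_rocm():
    return getattr(torch.version, "hip", None) is not None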

@danthe3rd

We still need a way for easy distribution of xformers

Yeah, totally agree. We are working on something here for Linux, though there are no plans at the moment for Windows. cc @bottler

I'm not certain xformers can be compiled for ROCM.

That's not something we are supporting indeed.

danthe3rd left a comment

Just added a few comments to simplify the code - this looks great otherwise :)

modules/sd_hijack_optimizations.py: review comment (outdated, resolved)
modules/sd_hijack_optimizations.py: review comment (outdated, resolved)
@SafentisFox
Contributor

Something I've seen in some Colabs is downloading a pre-compiled version of xformers. Is this a viable way to distribute xformers here too?

@C43H66N12O12S2
Collaborator Author

It is. We can't use those exact ones as they were built for Linux, but it still has to be built and distributed, something I have no experience in.

@wsippel

wsippel commented Oct 7, 2022

Has anyone tried running xFormers through hipify yet? Google gave me nothing, and I don't have CUDA set up to try myself right now.

@Thomas-MMJ

Here are Windows xformers wheels for Python 3.9: https://github.com/neonsecret/xformers/releases/tag/v0.14

To get a wheel just do

python setup.py bdist_wheel

@x02Sylvie

x02Sylvie commented Oct 7, 2022

I wonder if xformers could be combined with AITemplate #1625 for a 15% × 200% × 250% speed boost

@C43H66N12O12S2
Collaborator Author

@Thomas-MMJ I think separate wheels are needed for different GPU archs. Official builds of xformers build separate wheels. Example: https://app.circleci.com/pipelines/github/facebookresearch/xformers/2900/workflows/5c5de2be-9557-4684-9d10-34cd3835663e

I could provide the compute (build it locally) if somebody's willing to setup the workflow.

@C43H66N12O12S2
Collaborator Author

Could somebody please test this?

def xformers_attention_forward(self, x, context=None, mask=None):
    h = self.heads
    q_in = self.to_q(x)
    context = default(context, x)
    k_in = self.to_k(context)
    v_in = self.to_v(context)
    # split the projections into per-head slices: (batch, tokens, heads, dim_head)
    q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b n h d', h=h), (q_in, k_in, v_in))
    del q_in, k_in, v_in
    # memory-efficient attention kernel from xformers
    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)
    # merge the heads back: (batch, tokens, heads * dim_head)
    out = rearrange(out, 'b n h d -> b n (h d)', h=h)
    return self.to_out(out)

This exact same code, which produced broken images yesterday, now works for some reason... still no clue why it failed yesterday or why it suddenly works now.
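For context, the webui applies a function like this by monkey-patching CrossAttention. A rough sketch of what modules/sd_hijack.py does (the actual file also wires up the split-attention fallback and the command-line flags):

import ldm.modules.attention as attention

# replace the model's default cross-attention forward with the xformers one
attention.CrossAttention.forward = xformers_attention_forward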

@ArrowM
Contributor

ArrowM commented Oct 7, 2022

Could somebody please test this?

Works for me

@C43H66N12O12S2
Collaborator Author

I can now reach 22it/s with the newest version and batch size 8 with a 3080 12GB. Just need to find a way to distribute packages to people and we can ship this to everyone.


@Thomas-MMJ

Thomas-MMJ commented Oct 7, 2022

@Thomas-MMJ I think separate wheels are needed for different GPU archs. Official builds of xformers build separate wheels. Example: https://app.circleci.com/pipelines/github/facebookresearch/xformers/2900/workflows/5c5de2be-9557-4684-9d10-34cd3835663e

I could provide the compute (build it locally) if somebody's willing to setup the workflow.

Looks like conda will be added to their continuous integration,

https://github.com/facebookresearch/xformers/pull/466/files

@SafentisFox
Contributor

SafentisFox commented Oct 7, 2022

@C43H66N12O12S2 I can now reach 22it/s with the newest version and batch size 8 with a 3080 12GB. Just need to find a way to distribute packages to people and we can ship this to everyone.

22it/s?!? With batch size 8?! You did not misspell, right? You didn't mean batch count 8 or 2.2it/s?
Because 22it/s with batch size 8 is insane lol

@Thomas-MMJ

Thomas-MMJ commented Oct 7, 2022

This exact same code, which produced broken images yesterday, now works for some reason... still no clue why it failed yesterday or why it suddenly works now.

So that is without the init? I thought it was the lack of init that was the issue yesterday. (Of course that was pure speculation...)

@ArrowM
Contributor

ArrowM commented Oct 7, 2022

22it/s?!? With batch size 8?! You did not misspell, right? You didn't mean batch count 8 or 2.2it/s? Because 22it/s with batch size 8 is insane lol

With a 3080, they definitely meant one or the other, not both at the same time.

@C43H66N12O12S2
Collaborator Author

C43H66N12O12S2 commented Oct 8, 2022

@SafentisFox
Sorry, should've clarified 😅. It's 2.77it/s with batch size 8, which basically amounts to 22it/s in single-image terms (2.77 × 8 ≈ 22.2).

@Thomas-MMJ
Yep, without the init function.

@AUTOMATIC1111
It'd be great to get some help with this. xformers should basically build out-of-the-box now; we just need to distribute built packages. We can't have users build it themselves, as it requires the VC++ build tools and nvcc.
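For reference, the manual build that users would otherwise have to do looks roughly like this (a sketch, assuming the VC++ build tools and a CUDA toolkit providing nvcc are already on PATH):

git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install ninja              # optional, parallelizes the build considerably
python setup.py bdist_wheel    # the wheel ends up in dist/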

@C43H66N12O12S2
Collaborator Author

@htkg We limited it to Ampere as my wheels only work with Ampere. Hopefully Meta will distribute wheels for Windows, and we can remove a lot (nearly all, actually) of these checks.
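For reference, the Ampere gate can be expressed with PyTorch's compute-capability query. A sketch (the exact condition in the PR may be stricter):

import torch

# Ampere GPUs report compute capability 8.x, e.g. (8, 6) for a 3080/3090
# and (8, 0) for an A100; older cards report 7.x or lower.
major, minor = torch.cuda.get_device_capability()
if major == 8:
    print("Ampere detected, prebuilt xformers wheel should work")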

@chekaaa
Contributor

chekaaa commented Oct 8, 2022

My tests so far using a 3070:

Euler a - 20 steps

xFormers off, batch count 10: [screenshot xformer_off-10c]

xFormers off, batch count 5, batch size 2: [screenshot xformer_off-5c-2s]

xFormers on, batch count 10: [screenshot xformer_on-10c]

xFormers on, batch count 5, batch size 2: [screenshot xformer_on-5c-2s]

@C43H66N12O12S2
Collaborator Author

@chekaaa ramp up the batch size for larger gains

@chekaaa
Contributor

chekaaa commented Oct 8, 2022

xFormers off, batch count 5, batch size 6: [screenshot xformer_off-5c-6s]

xFormers on, batch count 5, batch size 6: [screenshot xformer_on-5c-6s]

@leohumnew

What do I need to do to get this to work? Just add "--xformers" to COMMANDLINE_ARGS in the .bat file?

@chekaaa
Contributor

chekaaa commented Oct 8, 2022

@leohumnew yes
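
For reference, the relevant line in webui-user.bat would then look like this (assuming an otherwise stock file):

set COMMANDLINE_ARGS=--xformers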

@kaneda2004
Contributor

Doesn't appear to auto-install xformers
GPU is RTX 3090

Console log:

venv "F:\StableD\stable-diffusion-automatic1111\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Commit hash: 3061cdb
Installing xformers
Installing requirements for Web UI
Launching Web UI with arguments: --xformers
Cannot import xformers
Traceback (most recent call last):
File "F:\StableD\stable-diffusion-automatic1111\modules\sd_hijack_optimizations.py", line 15, in
import xformers.ops
ModuleNotFoundError: No module named 'xformers'

@wstrinz

wstrinz commented Oct 8, 2022

Doesn't appear to auto-install xformers [... console log quoted above ...]

Same here, #1851 (comment) got it working for me

@kaneda2004
Contributor

Sounds like you've got Python 3.9 installed. That whl won't work for me, but this one did:

https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/b/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

And it's crazy fast. @C43H66N12O12S2 thank you for your work on this PR. Keep flying high :)

@leohumnew

Is there any decrease in quality with this? Or should it be equivalent to without, but just a bit faster?

@kaneda2004
Contributor

Is there any decrease in quality with this? Or should it be equivalent to without, but just a bit faster?

So far my testing shows the same quality (I've only tested a handful of samplers with it), and I'm getting approx. a 50% speedup, more if I batch smaller images together to max out my VRAM, in which case I'm seeing over a 100% speedup. (Eight 512x512 images batched take 8 seconds, i.e. 1 sec per image.)

@JustMaier
Contributor

I'm running a 3090 and noticing about a 20-30% speedup, good stuff.

I have, however, noticed a strange issue: repeat generations with the same params can give different results. I've created an issue about it (#1999). I wonder if it's just me or if others have noticed the same thing.

@ifffrt

ifffrt commented Oct 8, 2022

Is there a guide on how to DIY the wheels for this on your local computer? I'm running an outdated Maxwell GPU but I still want to try this out anyway.

@kaneda2004
Contributor

Is there a guide on how to DIY the wheels for this on your local computer? I'm running an outdated Maxwell GPU but I still want to try this out anyway.

I've built it for T4 and for P100, but only in a Colab environment.

It's literally a pip install command once you have the build environment set up. I expect it's not too different on Windows.

Hope you have a lot of time, though. It took about 45 minutes to compile, and the first time it failed lol.

@qJake

qJake commented Oct 9, 2022

Trying to get a prebuilt xformers running on Windows x64 / Python 3.9 with a 3070.

Running pip install https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/b/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl --prefer-binary yields:

ERROR: xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl is not a supported wheel on this platform.

Running pip install https://github.com/neonsecret/xformers/releases/download/v0.14/xformers-0.0.14.dev0-cp39-cp39-win_amd64.whl --prefer-binary installs the wheel successfully, but SD Web (with --xformers) does not load it.

Running import xformers in a console yields:

>>> import xformers
Could not find module 'C:\Users\USERNAME\AppData\Roaming\Python\Python39\site-packages\xformers\_C.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
WARNING:root:WARNING: Could not find module 'C:\Users\USERNAME\AppData\Roaming\Python\Python39\site-packages\xformers\_C.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop

Any fixes for this?

@ifffrt

ifffrt commented Oct 9, 2022

Is there a guide on how to DIY the wheels for this on your local computer? [...]

I've built it for T4 and for P100, but only in a Colab environment. [...]

Actually, I just found a guide for Windows on Reddit. It's a little bit more involved than that, but it sounds doable.
https://www.reddit.com/r/StableDiffusion/comments/xz26lq/automatic1111_xformers_cross_attention_with_on/

@C43H66N12O12S2
Collaborator Author

C43H66N12O12S2 commented Oct 10, 2022

could you please test this wheel? https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/c/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

@duckness

duckness commented Oct 10, 2022

could you please test this wheel? https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/c/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

it works for me (1070)

@ilcane87

@C43H66N12O12S2
Works for me (1060) after:

pip uninstall xformers
pip install xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Still no speed difference with or without --force-enable-xformers, but that was the case even with my own built wheel.

@salieri-dev

salieri-dev commented Oct 10, 2022

@C43H66N12O12S2 I've built the wheels myself, so I can't test it, I think...

P.S. Building the wheels took 15 minutes on an RTX 2060 Super.

I got a 30-40% boost, which is awesome.

@salieri-dev

salieri-dev commented Oct 10, 2022

I'll try it when I'm back at my PC, if nobody tests on an RTX 2060 before then.

@qJake

qJake commented Oct 10, 2022

Trying to get a prebuilt xformers running on Windows x64 / Python 3.9 with a 3070. [... quoted above ...]

Closing the loop on this... ended up following the instructions to build xformers locally... was trying to avoid the 3GB of CUDA / 7GB of VC++ dev libraries, but oh well.

Worked first time after pip install -e . finished, took about 45 minutes on a 9th-gen i7.

@Thomas-MMJ

To create a wheel, do

python setup.py bdist_wheel

and a wheel will be put in your dist folder. You can share it and/or keep it around to reinstall later.

@ghost

ghost commented Oct 11, 2022

I just wanted to say that the new updates solved my problems, and I am really grateful for that. It was a frustrating experience... thanks to the devs who made it possible. If I had only waited long enough, I wouldn't have had to battle with the cmd all day long...

@Thomas-MMJ

Thomas-MMJ commented Oct 11, 2022

There are now official conda Linux xformers builds.

danthe3rd commented in huggingface/diffusers#532 (comment):

Hi, I'm a maintainer of xFormers,
Just wanted to clarify a few things:
(1) xFormers supports Linux & Windows
(2) We don't have official binaries for windows, but we now (since today) have binaries for linux! You can get them with "conda install xformers -c xformers/label/dev", but they are only available for Python 3.9 or 3.10, CUDA 11.3 or 11.6, and PyTorch 1.12.1
(3) If you don't use binaries, the build can be very long indeed - however it can be significantly faster if you install ninja before, as it can be parallelised. It still takes a dozen minutes on GPUs where we build flash attention (compute capability > 7.5)


@Renaldas111

Working with Python 3.10.8; it was not working with 3.9.5, failing with ModuleNotFoundError: No module named 'xformers'.

@Thomas-MMJ

The xformers wheel you download has to match your Python version (3.8/3.9/3.10) and your CUDA version (11.6/11.7/11.8). If either mismatches, it won't work.
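
A quick way to check both from the webui's venv (a sketch; torch.version.cuda reports the CUDA version the installed PyTorch build was compiled against):

import sys
import torch

# the wheel's cpXX tag must match this, e.g. (3, 10) -> cp310
print(sys.version_info[:2])
# the wheel's CUDA build must match this, e.g. '11.6'
print(torch.version.cuda)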

@salieri-dev

any success running xformers under WSL?

@Thomas-MMJ

Thomas-MMJ commented Oct 16, 2022

any success running xformers under WSL?

Yeah, xformers works great for me under WSL. If you don't want to build from source, you can use the official builds for some Python, CUDA, and PyTorch combinations:

Just wanted to clarify a few things: [...] we now (since today) have binaries for linux! You can get them with "conda install xformers -c xformers/label/dev", but they are only available for Python 3.9 or 3.10, CUDA 11.3 or 11.6, and PyTorch 1.12.1 [...]

Originally posted by @danthe3rd in huggingface/diffusers#532 (comment)

Note that to use DeepSpeed pinning (used for DreamBooth) under WSL you need Windows 22H2 (released a week ago) and an updated WSL (wsl --update); otherwise it is limited to pinning 2 GB of RAM (for DreamBooth it wants to pin 16 GB).


Successfully merging this pull request may close these issues:

New memory efficient cross attention (#576)