Use MPS backend on Apple Silicon devices if it's available. (Updated) #113
Hey, I tried applying your patch locally, but I keep getting this error when trying to generate. Can you please help resolve this? I'm running it on an M1 Mac with 32 GB. |
@rahulbudhrani01 You need to update torch and torchvision to the nightly build. Even the latest released version does not work properly, especially the VAE encoder. |
Hello - I've been working on trying to get this to work on the MPS backend too. My system is an M2 with 24GB RAM. I set up Python 3.10.13 with nightly PyTorch (2.6.0). I'm using the 384 model. I also made modifications throughout the code base, such as changing "CUDA" device references to "MPS", replacing float64 with bfloat16 wherever it raised errors (rope), and adding an extra `with torch.autocast("mps", dtype=torch.bfloat16):` around the text encoder block in the pipeline (otherwise, the prompt_embeds would be NaN). So, I can generate a 1 frame video... BUT, if I try to generate more than 1 frame, I get this... 32fecb56-f0d6-4fba-9022-f794da28b1df_text_to_video_sample.mp4 I'm wondering if you might have an idea what the problem is. This looks like what you get when you try to create a Pyramid Flow video with a resolution that is not 640x384. |
@YAY-3M-TA3 See this pull request for more details, but what you need to do is the following.
I believe MPS doesn't support bfloat16, and torch autocast may not work on MPS either, so you probably need to use float32 and disable autocast.
In my case, I've also seen this colorful output at the VAE encoding step, and it was caused by an old PyTorch. With the latest nightly PyTorch 2.6.0.dev, it worked. I haven't really traced which part of the VAE encoding causes the issue, though. So I recommend checking which version of PyTorch is actually used, and also trying float32 instead. |
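One way to check which builds are actually installed in the active environment, as a minimal sketch. It relies only on the standard library; the helper names `installed_version` and `is_nightly` are made up for illustration, and the `.dev` check assumes nightly version strings look like the `2.6.0.dev20241011` mentioned in this thread.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_nightly(version: str) -> bool:
    """Assumption: nightly builds carry a `.dev<date>` segment, e.g. 2.6.0.dev20241011."""
    return ".dev" in version


# Report what this interpreter would import, not what another env has.
for name in ("torch", "torchvision"):
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed in this environment")
    else:
        kind = "nightly" if is_nightly(version) else "release"
        print(f"{name}: {version} ({kind})")
```

Running this with the same interpreter that launches `app.py` (for example `.venv/bin/python3`) avoids checking a different environment by accident.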
Yeah, I grabbed your code and ran it - the model inits to float32, but on my small 24GB M2 it goes OOM (looks like it needs at least 27GB). However, if I force model_dtype = "bf16" in app.py, then it can run and output that 1 frame. (I believe torch 2.6.0 does support bfloat16 on MPS.) However, if I try to do more than 1 frame, then I get that other RGB video. |
@YAY-3M-TA3 You're correct! I just checked recent PyTorch changes and yeah, the nightly supports bfloat16 on Sonoma and later, and also autocast. I am now testing c14de13e-e835-4813-8883-b68bf7a6e008.mp4
|
Ok - I tried your changes (adding them to your app.py) - I did get an error. (Did you get this at all?)
So, I made this change in video_vae/modeling_causal_conv.py to solve it... (caching the x dtype, then casting, then casting back)
While this did fix the error, I still get that weird video for videos longer than 1 frame. |
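The actual change to `modeling_causal_conv.py` is not shown in this thread. The "cache the dtype, cast, cast back" pattern described above can be sketched generically as follows; `call_in_dtype` is a hypothetical helper, not code from the repository, and it only assumes the input has a `.dtype` attribute and a `.to(dtype)` method, as a `torch.Tensor` does.

```python
from typing import Any, Callable


def call_in_dtype(fn: Callable[[Any], Any], x: Any, compute_dtype: Any) -> Any:
    """Run `fn` on `x` cast to `compute_dtype`, then restore the original dtype."""
    original_dtype = x.dtype            # cache the incoming dtype
    y = fn(x.to(compute_dtype))         # cast, then run the op that rejects the original dtype
    return y.to(original_dtype)         # cast the result back so callers see no change
```

With PyTorch this might be used as `call_in_dtype(conv, x, torch.float32)`, where `conv` stands for whichever layer raised the dtype error.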
@YAY-3M-TA3 I didn't see such an error while I was testing. What if you use https://github.com/niw/Pyramid-Flow/blob/add_simple_cli_command/generate.py, which is a simple version of the script I use for testing this implementation? |
Yeah, I also have my own test script which is based on generate.py to simply render a video with hardcoded values. I get the same video issue.
Here are my specs: Python: 3.10.13 Here is a cherry-picked list of modules for this conda env
With this setup, I have been able to run things like flux dev with Q8 GGUFs. (Both with mflux and ComfyUI)
Haha - I've been reluctant to upgrade my OS to 15 because I heard it was broken with torch... (I also follow this torch MPS thread... pytorch MPS issue) @feifeiobama told me that the causal VAE that they are using can only be modified for 1, 9, and 17 frame conditioned video generation. I tried each of these frame values and also 8 and 16. All of these frame values also result in this color-warped video. (I also set tile_sample_min_size=64 to try to reduce memory. I noticed that your test video had no tiling artifacts... what tile_sample_min_size are you using, 256? Or is your save_memory = false?) 5478316a-95ca-44c1-b3b0-47b61448745a_text_to_video_sample.mp4 |
Interesting... I may want to try on macOS 14 and see if I can repro (I need to find someone nearby who has such a machine; a VM is likely not an option for GPU work). It sounds likely that it's related. I know that the M2 SoC has some unexpected behavior with specific math graphs on CoreML, but that should be unrelated. I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet... |
Also, I noticed I am using a slightly newer version of the nightly build, but even after I downgraded to 2.6.0.dev20241011, I couldn't repro the problem.
|
On your small test video, how many frames did you do? 8? Are you able to create a video with only 2 frames? (I can at least use a 2 frame video as a confirmed positive case, which will make the VAE tracing faster...) |
@YAY-3M-TA3 I am using duration=2 for testing. And... with help from @kagemiku, it's identified that macOS 14 caused the issue but macOS 15 seems okay. Next step is understanding "why," but at least that is the reason I think. |
Looking in modeling_causal_vae.py, in def tiled_decode: for a 1 frame video, the decode tensor values look like this before image processing:
However, on a 2 frame video, the decode tensor values look like this:
I am assuming the value ranges should be more like those in the 1 frame tensor... |
Okay, I’ve identified the problem. It's a kind of PyTorch bug, or a mismatch between what PyTorch expects and how MPS behaves, and this mismatch only happens prior to macOS 15 because PyTorch 2.5.0 on macOS 15 uses native strides. I've updated this pull request with a fix, but I can't test it on macOS 14. |
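To tell whether a given machine falls in the affected range, the macOS major version can be read with the standard library. This is a minimal sketch: `macos_major` and `affected_by_stride_issue` are hypothetical names, and the "before 15" cutoff is taken directly from the reports in this thread rather than from any official compatibility table.

```python
import platform
from typing import Optional


def macos_major(release: str) -> Optional[int]:
    """Parse the major version from a string such as '14.7'; None if it is not numeric."""
    head = release.split(".")[0]
    return int(head) if head.isdigit() else None


def affected_by_stride_issue(release: str) -> bool:
    """True for macOS releases before 15, which this thread reports as affected."""
    major = macos_major(release)
    return major is not None and major < 15


# platform.mac_ver()[0] is an empty string on non-macOS systems.
print(affected_by_stride_issue(platform.mac_ver()[0]))
```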
Yes, I'm very happy to help you! - you have done a fantastic job! So far I have just tested 2 frames, and it worked! baf388f1-c4b6-4134-8bac-aac67129df2b_text_to_video_sample.mp4 I'm now going to do 9 frames and 16 frames. FYI: It also still works with torch nightly (2.6.0)... I will update you in a couple of hours as I finish the other video renders! |
OK! All confirmed - it's working on Sonoma 14.7 with Torch nightly (2.6.0) and the Pyramid 384 model. cdf94cae-f409-459d-b112-75c12479e3eb_text_to_video_sample-9frames.mp4 16 frames (~85 minutes to render) 9efcd4bd-477c-4aaf-9169-8caaec57aaf5_text_to_video_sample-16frames.mp4 Considering no modern video diffusion model was working on Macs until now! And now we can even render a 16 frame, 640x384 video with less than 24GB - this is quite a milestone! I never would have seen that |
@YAY-3M-TA3 Thanks for the confirmation! I'm glad that it solved the problem. |
Similar to #123049; however, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous on versions prior to macOS 15.0. Originally the problem was found at jy0205/Pyramid-Flow#113. Pull Request resolved: #139006 Approved by: https://github.com/malfet
158494c to 70e5e1a
Thanks for supporting miniFLUX @niw |
I'm still testing now, though it seems to be working as expected. I think the patch works well on Apple Macs, so if you can test that it doesn't break in a CUDA environment (and if you are comfortable with the change, of course!), feel free to merge it into main. |
The output seems better than the sd3 ones! It's really impressive that such a video can be generated quickly on a laptop locally.
a289615a-348f-4683-9add-fdb933e93cb8_text_to_video_sample.mp4 |
thank you all for the help. do you have any idea why the memory would be fine during the generation and then as soon as it hits 100% (generated all frames), it suddenly spikes to like 4x of what it was running at during gen? |
This is likely due to VAE decoding, see #5 (comment). |
@feifeiobama thanks for the reply. I tried reducing the tiling to a very low value of 32, and I am still getting the following out-of-memory error:
I am generating only 3 frames with both guidances at 1, and this is on an M3 Max with 48GB RAM, Sonoma 14.6.1. It just seems like the memory usage should be much much lower. I've got to be misunderstanding something. |
Hi, any plans to merge this in so mac users can also use this? Or is there a reason why it's not merged in yet? |
70e5e1a to c5fffe8
@cocktailpeanut nice change! let me do that. |
- Use pytorch 2.5.0 instead of nightly.
- FIX: activation error on MPS. MPS can't apply the SiLU activation correctly and produces randomly broken results if the tensor memory format is not contiguous. This does not happen on macOS 15 and later because they use native strides, but macOS 14 is affected.
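The diff for this commit is not reproduced here. The general workaround the message describes, making the tensor contiguous before the activation runs, can be sketched as below. `on_contiguous` is a hypothetical helper, not the code from the commit; it only assumes the input offers `is_contiguous()` and `contiguous()`, which `torch.Tensor` does.

```python
from typing import Any, Callable


def on_contiguous(fn: Callable[[Any], Any], x: Any) -> Any:
    """Apply `fn` to `x`, first copying `x` into contiguous memory if it is not already."""
    if not x.is_contiguous():
        # e.g. the result of a transpose or permute; .contiguous() makes a dense copy
        x = x.contiguous()
    return fn(x)
```

With PyTorch this might be called as `on_contiguous(torch.nn.functional.silu, x)`. The copy costs extra memory and time, so skipping it when the tensor is already contiguous keeps the common path unchanged.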
c5fffe8 to 7abf210
@niw I have cloned use_mps_on_apple_silicon, but after generation reached 100% I got this error about FFmpeg not being installed, or about looking for the FFmpeg EXE.
Any help with this issue?
@PakanAngel You likely need to install ffmpeg, for example with Homebrew.
|
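A quick way to confirm whether an `ffmpeg` executable is visible to Python before starting a long render, as a minimal sketch using only the standard library (`find_ffmpeg` is a name chosen here for illustration):

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")


path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; with Homebrew: brew install ffmpeg")
else:
    print(f"ffmpeg found at {path}")
```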
@cocktailpeanut I just found that |
@niw couldn't we create a separate file TBH even the current See liveportrait as an example: https://github.com/KwaiVGI/LivePortrait/blob/main/requirements_macOS.txt |
Oh hmm. At least it worked (on amd64/linux) for me, though. Also, I addressed the pytorch 2.5.1 changes in the previous commit, so these requirements should work on both cuda and mps. |
This is a slightly updated version of #108. Since #108 was accidentally merged and reverted and I can no longer update it with new changes, this pull request was newly created.
For MacBook users who want to try it, follow these steps. It needs enough memory, probably about 32GB.
1. `brew install python@3.10`
2. `python3.10 -m venv .venv && .venv/bin/pip3 install -r requirements.txt`
3. Install `gradio` for `app.py`, like `.venv/bin/pip3 install gradio`
4. Run `.venv/bin/python3 app.py`, then open `http://127.0.0.1:7860/`.

To generate a video, try the minimum settings first. It takes a loooong time anyway on a MacBook (about 10 minutes for a 3-second video, for example; well, it's still remarkable, though!)
Problems
Inference runs faster with the MPS backend on Apple Silicon devices, but it's not enabled by default and requires some modifications to the code, which only considers CUDA availability.
Solution
Use MPS backend if it's available.
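The selection logic can be sketched roughly as follows. This is an illustrative outline, not the code in this pull request: `choose_device` and `detect_device` are hypothetical names, and the CUDA-first, then MPS, then CPU order is an assumption chosen so that existing CUDA setups keep behaving as before.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Pick a device name: CUDA first, then MPS, then CPU as the fallback."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for the available backends and pick one."""
    # Imported lazily so choose_device stays usable without PyTorch installed.
    import torch

    return choose_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```

Code that currently hard-codes `"cuda"` could then use something like `torch.device(detect_device())` instead.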
NOTE: This patch does not take training into account at all; it is only for inference. I tried to make it work as well as on CUDA with this patch, but because it includes, for example, dependency updates, which may not be preferred, I don't expect this pull request to be mergeable into `main` for now. However, I'm posting it here anyway because I think it’s worth having for those who want to try inference easily on, for example, their MacBook Pro.