[Core] AnimateDiff: Long context video generation #8275
Conversation
Given the issues with the context scheduler (not having a good understanding of ordered halving is an issue IMO, and the scheduler code is quite confusing to read), perhaps we look into FIFO as our first option for longer context video?
I'm quite confident its role is similar to that of a PRNG. The context scheduler is essentially just trying to iterate through different batches of 16 frames (or whatever is set as the context length). From my testing, I think a simple schedule suffices, because it gives decent results as well.
FIFODiffusion currently supports VideoCrafter, Open-Sora and Zeroscope. We should definitely try integrating the VideoCrafter family of models, given how much research builds on it and the methods around it. But, given the success of AnimateDiff in the community, integrating existing methods should be something to look at too? Maybe FreeNoise would be a better first candidate to take up.
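For reference, here is a minimal sketch of the kind of "simple schedule" discussed above: a plain sliding window over frame indices with a fixed stride, instead of the ordered_halving-based scheduler. The names (`uniform_context_windows`, `context_length`, `context_stride`) are illustrative only, not the actual diffusers API.

```python
# Hypothetical sketch, not the diffusers implementation: a simple sliding-window
# context schedule that covers every frame index at least once.
from typing import Iterator, List


def uniform_context_windows(
    num_frames: int,
    context_length: int = 16,
    context_stride: int = 4,
    loop: bool = False,
) -> Iterator[List[int]]:
    """Yield overlapping windows of frame indices of size `context_length`."""
    if num_frames <= context_length:
        yield list(range(num_frames))
        return
    if loop:
        # Wrap around so the last frames share a window with the first ones.
        for start in range(0, num_frames, context_stride):
            yield [(start + i) % num_frames for i in range(context_length)]
    else:
        last_start = num_frames - context_length
        for start in range(0, last_start + 1, context_stride):
            yield list(range(start, start + context_length))
        # If the stride overshoots the tail, emit one final window so that no
        # frame index is left out (otherwise its count stays 0 downstream).
        if last_start % context_stride != 0:
            yield list(range(last_start, num_frames))
```

Each window would then be denoised together and the per-frame predictions averaged by how often each index appeared across windows.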
Just finished reading the FIFODiffusion paper and have a somewhat decent understanding of the implementation now, and I understand that the proposed inference method is agnostic to the underlying video model. I'll have a PR with AnimateDiff hopefully by the weekend 🤞
@DN6 @sayakpaul I'm working on adding support for FIFODiffusion here: https://github.com/a-r-r-o-w/diffusers/tree/animatediff/fifodiffusion. It is not yet at a working stage. The idea is quite straightforward to implement, but I'm facing difficulties with the internal modeling code and am not sure it supports what I'm trying to do. Is there a way to pass a list of timesteps to the unet instead of a single value, i.e. to use a different timestep value per frame of the video? A separate forward pass per frame, each with its own timestep, would work, but then the motion model breaks because it requires all frames to be passed together. I tried modifying it to my needs and failed: if it doesn't break in one place, it breaks elsewhere. For the scheduler, I'd also like to use per-frame timesteps (a list of timestep values) instead of a single value, but that is more easily doable with a loop and no modifications required.
As an alternative implementation idea, I think you can also make it work by denoising the first 16 frames completely, then using their 15th, 14th, 13th, ..., timestep latent predictions for the next 16 frames, and so on. You would have to maintain a memory per denoising step per frame, updated every iteration, which seems tricky to implement and adds extra memory.
Edit: My bad, I thought I was commenting on the FIFODiffusion issue by clarence but replied here instead :(
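To make the alternative idea above concrete, here is a rough, model-agnostic sketch of the FIFO-style latent queue. `step_window` is a hypothetical stand-in for "UNet forward + scheduler step with per-frame timesteps", which is exactly the piece that is hard to express with the current UNet/motion-module internals.

```python
# Illustrative only: the queue holds one latent per position, each at a
# different noise level; every iteration advances the whole window by one
# denoising step, pops the cleanest frame and pushes fresh noise.
import torch


def fifo_generate(queue, timesteps, num_new_frames, step_window):
    """
    queue: list of per-frame latents; queue[0] is the least noisy frame,
        queue[-1] the noisiest. len(queue) == len(timesteps).
    timesteps: 1D tensor where timesteps[i] is the noise level of queue[i],
        increasing from front to back of the queue.
    step_window: callable(latents of shape (window, C, H, W), timesteps) that
        returns latents advanced by one denoising step each.
    """
    finished = []
    for _ in range(num_new_frames):
        latents = torch.stack(queue)               # (window, C, H, W)
        latents = step_window(latents, timesteps)  # one "diagonal" step
        queue = list(latents)
        # The front frame is now fully denoised: pop it as an output frame and
        # push a fresh pure-noise latent at the back of the queue.
        finished.append(queue.pop(0))
        queue.append(torch.randn_like(finished[-1]))
    return torch.stack(finished)
```

The timestep assigned to each queue position never changes, so the "memory per denoising step per frame" mentioned above reduces to this single window of latents.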
cc @jjihwan Would love to hear your thoughts and whether what I said above looks correct.
@a-r-r-o-w
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
Support for long context and infinite-length video generation has been present for a long time in UIs and custom implementations. This is an attempt at adding the same to diffusers.
Partially fixes #6521 and a few other unclosed discussions.
Code
There are a couple of problems with the implementation at the moment, and, I believe, even with the original reference repositories:
1. With `loop=False`, the context scheduler does not cover every frame index for all kinds of configurations. This results in `total_counts` having 0'ed values when certain indices are never processed, causing the latents to get filled with NaNs (see the sketch below).
2. `ordered_halving` has some kind of sorcery going on. From what I can tell, it is acting as a pseudo-random number generator, so we can replace it with something more understandable (see my commented code for an example).
Additionally, I'm noticing much better results when applying this to vid2vid/controlnet, but I will push those changes once we finalize the design and fix any other bugs.
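As a sketch of where the NaNs in point 1 come from (hypothetical names, not the pipeline code): the per-window noise predictions are accumulated per frame and divided by how many windows contained that frame, so a frame the scheduler never visits ends up dividing 0 by 0.

```python
# Illustrative merge of overlapping context-window predictions.
import torch


def merge_window_predictions(window_preds, windows, num_frames):
    """
    window_preds: list of (context_length, C, H, W) noise predictions.
    windows: list of lists of frame indices, aligned with window_preds.
    """
    sample = window_preds[0]
    accum = torch.zeros(num_frames, *sample.shape[1:], dtype=sample.dtype)
    counts = torch.zeros(num_frames, 1, 1, 1, dtype=sample.dtype)
    for pred, indices in zip(window_preds, windows):
        for i, frame_idx in enumerate(indices):
            accum[frame_idx] += pred[i]
            counts[frame_idx] += 1
    # If the scheduler skipped an index, counts is 0 there and this division
    # yields NaN, which then propagates through the rest of the latents.
    return accum / counts
```

With a schedule like the uniform one sketched earlier, every count is at least 1 and the division stays finite.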
Other methods to potentially look into in the future:
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@DN6 @sayakpaul @yiyixuxu