[Core] AnimateDiff: Long context video generation #8275
Conversation
Given the issues with the context scheduler (not having a good understanding of ordered halving is an issue IMO, and the scheduler code is quite confusing to read), perhaps we look into FIFO as our first option for longer context video?
I'm quite confident its role is similar to that of a PRNG. The context scheduler is essentially just trying to iterate through different batches of 16 frames (or whatever is set as the context length). From my testing, I think a simple schedule suffices, because it gives decent results as well.
FIFODiffusion currently supports VideoCrafter, Open-Sora and Zeroscope. We should definitely try integrating the VideoCrafter family of models, given how much research builds on it and the methods around it. But, given the success of AnimateDiff in the community, integrating existing methods should be something to look at too? Maybe FreeNoise would be a better first candidate to take up.
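For reference, here is a minimal sketch of the kind of "simple schedule" discussed above: a plain sliding window over frame indices with a fixed stride, instead of the ordered_halving-based scheduler. The names (`uniform_context_windows`, `context_length`, `context_stride`) are illustrative only, not the actual diffusers API.

```python
# Hypothetical sketch, not the diffusers implementation: a simple sliding-window
# context schedule that covers every frame index at least once.
from typing import Iterator, List


def uniform_context_windows(
    num_frames: int,
    context_length: int = 16,
    context_stride: int = 4,
    loop: bool = False,
) -> Iterator[List[int]]:
    """Yield overlapping windows of frame indices of size `context_length`."""
    if num_frames <= context_length:
        yield list(range(num_frames))
        return
    if loop:
        # Wrap around so the last frames share a window with the first ones.
        for start in range(0, num_frames, context_stride):
            yield [(start + i) % num_frames for i in range(context_length)]
    else:
        last_start = num_frames - context_length
        for start in range(0, last_start + 1, context_stride):
            yield list(range(start, start + context_length))
        # If the stride overshoots the tail, emit one final window so that no
        # frame index is left out (otherwise its count stays 0 downstream).
        if last_start % context_stride != 0:
            yield list(range(last_start, num_frames))
```

Each window would then be denoised together and the per-frame predictions averaged by how often each index appeared across windows.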
Just finished reading the FIFODiffusion paper and have a somewhat decent understanding of the implementation now, and I understand that the proposed inference method is agnostic to the underlying video model. I'll have a PR with AnimateDiff hopefully by the weekend 🤞
@DN6 @sayakpaul I'm working on adding support for FIFODiffusion here: https://github.com/a-r-r-o-w/diffusers/tree/animatediff/fifodiffusion. It is not yet at a working stage. The idea is quite straightforward to implement, but I'm facing difficulties with the internal modeling code and am not sure it supports what I'm trying to do. Is there a way to pass a list of timesteps to the unet instead of a single value, i.e. to use a different timestep value per frame of the video? A separate forward pass per frame, each with its own timestep, would work, but then the motion model breaks because it requires all frames to be passed together. I tried modifying it to my needs and failed: if it doesn't break in one place, it breaks elsewhere. For the scheduler, I'd also like to use per-frame timesteps (a list of timestep values) instead of a single value, but that is more easily doable with a loop and no modifications required.
As an alternative implementation idea, I think you can also make it work by denoising the first 16 frames completely, then using their 15th, 14th, 13th, ..., timestep latent predictions for the next 16 frames, and so on. You would have to maintain a memory per denoising step per frame, updated every iteration, which seems tricky to implement and adds extra memory.
Edit: My bad, I thought I was commenting on the FIFODiffusion issue by clarence but replied here instead :(
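To make the alternative idea above concrete, here is a rough, model-agnostic sketch of the FIFO-style latent queue. `step_window` is a hypothetical stand-in for "UNet forward + scheduler step with per-frame timesteps", which is exactly the piece that is hard to express with the current UNet/motion-module internals.

```python
# Illustrative only: the queue holds one latent per position, each at a
# different noise level; every iteration advances the whole window by one
# denoising step, pops the cleanest frame and pushes fresh noise.
import torch


def fifo_generate(queue, timesteps, num_new_frames, step_window):
    """
    queue: list of per-frame latents; queue[0] is the least noisy frame,
        queue[-1] the noisiest. len(queue) == len(timesteps).
    timesteps: 1D tensor where timesteps[i] is the noise level of queue[i],
        increasing from front to back of the queue.
    step_window: callable(latents of shape (window, C, H, W), timesteps) that
        returns latents advanced by one denoising step each.
    """
    finished = []
    for _ in range(num_new_frames):
        latents = torch.stack(queue)               # (window, C, H, W)
        latents = step_window(latents, timesteps)  # one "diagonal" step
        queue = list(latents)
        # The front frame is now fully denoised: pop it as an output frame and
        # push a fresh pure-noise latent at the back of the queue.
        finished.append(queue.pop(0))
        queue.append(torch.randn_like(finished[-1]))
    return torch.stack(finished)
```

The timestep assigned to each queue position never changes, so the "memory per denoising step per frame" mentioned above reduces to this single window of latents.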
cc @jjihwan Would love to hear your thoughts and whether what I said above looks correct.
@a-r-r-o-w
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
Support for long context and infinite-length video generation has been present for a long time in UIs and custom implementations. This is an attempt at adding the same to diffusers.
Partially fixes #6521 and a few other unclosed discussions.
Code
There are a couple of problems with the implementation at the moment, and, I believe, even with the original reference repositories:
1. With `loop=False`, the context scheduler does not cover every frame index for all kinds of configurations. This results in `total_counts` having 0'ed values when certain indices are never processed, causing the latents to get filled with NaNs (see the sketch below).
2. `ordered_halving` has some kind of sorcery going on. From what I can tell, it is acting as a pseudo-random number generator, so we can replace it with something more understandable (see my commented code for an example).
Additionally, I'm noticing much better results when applying this to vid2vid/controlnet, but I will push those changes once we finalize the design and fix any other bugs.
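As a sketch of where the NaNs in point 1 come from (hypothetical names, not the pipeline code): the per-window noise predictions are accumulated per frame and divided by how many windows contained that frame, so a frame the scheduler never visits ends up dividing 0 by 0.

```python
# Illustrative merge of overlapping context-window predictions.
import torch


def merge_window_predictions(window_preds, windows, num_frames):
    """
    window_preds: list of (context_length, C, H, W) noise predictions.
    windows: list of lists of frame indices, aligned with window_preds.
    """
    sample = window_preds[0]
    accum = torch.zeros(num_frames, *sample.shape[1:], dtype=sample.dtype)
    counts = torch.zeros(num_frames, 1, 1, 1, dtype=sample.dtype)
    for pred, indices in zip(window_preds, windows):
        for i, frame_idx in enumerate(indices):
            accum[frame_idx] += pred[i]
            counts[frame_idx] += 1
    # If the scheduler skipped an index, counts is 0 there and this division
    # yields NaN, which then propagates through the rest of the latents.
    return accum / counts
```

With a schedule like the uniform one sketched earlier, every count is at least 1 and the division stays finite.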
Other methods to potentially look into in the future:
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@DN6 @sayakpaul @yiyixuxu