VQ-diffusion #658
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from 12e0d2a to f242bf8
Wooow - this looks amazing! @patil-suraj mind taking a look here?
Thanks! Would love to know whether the minimum mergeable unit is the completed pipeline, or whether smaller chunks like this are acceptable to merge.
Hey @williamberman and @345ishaan, sorry for the delay - I have more time going forward and think we can merge this model by next week! It's a super nice PR - extremely easy to understand and to follow, thanks a bunch! I've fiddled a bit with the PR to make it a bit more lightweight :-) We don't really need a new attention layer: Conv2D layers, when used for attention, are just like linear layers, for which an attention class already exists. That's why I removed it. Now I think the next step is to implement the U-Net and scheduler for the forward pass, no? Do you need help/guidance here?
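(Illustration only, not code from the PR: a quick sanity check that a 1x1 Conv2d projection is numerically equivalent to a Linear layer applied over flattened spatial positions, which is why a separate conv-based attention class isn't needed.)

```python
import torch
import torch.nn as nn

channels, height, width = 4, 8, 8
x = torch.randn(1, channels, height, width)

conv = nn.Conv2d(channels, channels, kernel_size=1)
linear = nn.Linear(channels, channels)

# Copy the conv weights into the linear layer so both compute the same projection.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(channels, channels))
    linear.bias.copy_(conv.bias)

out_conv = conv(x)

# Flatten spatial dims to (batch, height * width, channels), apply the linear
# layer, then reshape back to the conv layout.
out_linear = (
    linear(x.flatten(2).transpose(1, 2))
    .transpose(1, 2)
    .reshape(1, channels, height, width)
)

assert torch.allclose(out_conv, out_linear, atol=1e-6)
```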
Thanks! Pinged on Discord as well, but the model uses a transformer (just the encoder, IIRC) for the reverse diffusion process instead of a U-Net. I have the transformer ported on another branch. I think the open question is: would you prefer to have it in this PR, or to merge this PR first and then merge the transformer in a separate PR?
Hey @williamberman, great to hear that you already have it on a branch. Could you maybe add it directly to this PR? Then, as a next step, we could verify that a forward pass through the transformer (the replacement for the U-Net) gives identical results to the official implementation. If that works, we can integrate the scheduler, and then in a last step make the whole denoising process work :-) Overall, everything should ideally be in this PR. Since VQ-diffusion will be one of our first community pipeline contributions, there are lots of new things in this PR and I'm more than happy to help you with it (don't hesitate to ping me :-))
SG! Follow up:
Force-pushed from c564b04 to 496524c
This sounds like a great plan!
The PR description is updated to reflect progress on merging the full model. A notebook that compares outputs from the autoencoder, transformer, and text embedder is linked in the PR description: https://github.com/williamberman/vq-diffusion-notebook. Once the scheduler is complete, I will also add it to the notebook :)
Force-pushed from a8c9074 to d14dff7
@williamberman let me know if you'd like me to review already now or better once the scheduler is integrated as well :-)
Great progress!
Let's wait until the scheduler is integrated. I cleaned up some non-scheduler components while working on it that I'd like to add to this branch first :)
Force-pushed from 9d33a33 to 6b2bc69
```python
torch.cuda.empty_cache()

def test_vq_diffusion(self):
    expected_image = load_image(
```
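The diff context above is truncated. Purely as a hedged sketch (the class name `VQDiffusionPipeline`, URL, prompt, and tolerances below are illustrative assumptions, not the test actually added in this PR), a slow integration test of this kind typically continues roughly like:

```python
# Illustrative sketch only. Assumes: import torch; import numpy as np;
# from diffusers import VQDiffusionPipeline; plus diffusers' test utilities
# load_image and torch_device, inside an integration test class.
def test_vq_diffusion(self):
    expected_image = load_image(
        "https://example.com/vq_diffusion/expected_output.png"  # placeholder URL
    )

    pipeline = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")
    pipeline = pipeline.to(torch_device)
    pipeline.set_progress_bar_config(disable=None)

    generator = torch.Generator(device=torch_device).manual_seed(0)
    output = pipeline(
        "teddy bear playing in the pool",  # placeholder prompt
        generator=generator,
        output_type="np",
    )
    image = output.images[0]

    # Compare the generated image against the stored reference within a tolerance.
    assert image.shape == (256, 256, 3)
    assert np.abs(np.asarray(expected_image) / 255.0 - image).max() < 1e-2
```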
added a test here @williamberman - pipeline works like a charm and fits nicely with the existing API :-)
Great, exciting!
Hey @williamberman, you've really done an amazing job here! Very impressed by how you were able to add such a complex new model into the existing API! The conversion script and your notebook are very easy to follow :-)

I've uploaded the ithq model to the microsoft org here: https://huggingface.co/microsoft/vq-diffusion-ithq and added a slow test that makes sure the model works as expected. The failing tests are unrelated, and we can merge this PR more or less as is. If it's ok with you, I would do some final renaming changes tomorrow morning (Paris time) to make it fit a bit better with existing configuration names (will have to sync with @patil-suraj @anton-l and @pcuenca), and then we can merge this to be in the next release IMO.

@patil-suraj @pcuenca @anton-l could you maybe already review this PR? IMO, besides some renaming, it's ready! I've also made sure that all existing slow tests are passing!

@williamberman, if you're interested, we could do the following follow-up projects to promote this model a bit more:
Obviously no need to do any of the above points if you don't want to or are too busy! Regarding this PR, I think we can merge it tomorrow morning! Really great job here 🚀
Awesome, that all sounds good! I think a blog post sounds great; sent you an email :)
@williamberman I am wondering, did you try training it, or did you just verify it in inference mode?
Just inference using the weights Microsoft published. Training would have been a good amount more work 😅
Awesome integration @williamberman, thank you for working on it!
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Amazing work!
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Just happened to have this card on my Windows machine and verified that the SD demo works on it.

```
Average step time: 144.26142692565918ms/it
Clip Inference Avg time (ms) = (205.001 + 44.000) / 2 = 124.501
VAE Inference time (ms): 281.001
Total image generation time: 7.856997728347778sec
```

I'd love to add an API upstream to derive compiler tuning flags from a host device.
* Changes for VQ-diffusion VQVAE

  Add the ability to specify the dimension of embeddings to VQModel: `VQModel` will by default set the dimension of embeddings to the number of latent channels. The VQ-diffusion VQVAE has a smaller embedding dimension (128) than its number of latent channels (256).

  Add AttnDownEncoderBlock2D and AttnUpDecoderBlock2D to the up and down unet block helpers. VQ-diffusion's VQVAE uses those two block types.

* Changes for VQ-diffusion transformer

  Modify attention.py so SpatialTransformer can be used for VQ-diffusion's transformer.

  SpatialTransformer:
  - Can now operate over discrete inputs (classes of vector embeddings) as well as continuous.
  - `in_channels` was made optional in the constructor, so two locations where it was passed as a positional arg were moved to kwargs.
  - Modified forward pass to take optional timestep embeddings.

  ImagePositionalEmbeddings:
  - Added to provide positional embeddings to discrete inputs for latent pixels.

  BasicTransformerBlock:
  - Norm layers were made configurable so that VQ-diffusion can use AdaLayerNorm with timestep embeddings.
  - Modified forward pass to take optional timestep embeddings.

  CrossAttention:
  - May now optionally take a bias parameter for its query, key, and value linear layers.

  FeedForward:
  - Internal layers are now configurable.

  ApproximateGELU:
  - Activation function used in VQ-diffusion's feedforward layer.

  AdaLayerNorm:
  - Norm layer modified to incorporate timestep embeddings.

* Add VQ-diffusion scheduler
* Add VQ-diffusion pipeline
* Add VQ-diffusion convert script to diffusers
* Add VQ-diffusion dummy objects
* Add VQ-diffusion markdown docs
* Add VQ-diffusion tests
* some renaming
* some fixes
* more renaming
* correct
* fix typo
* correct weights
* finalize
* fix tests
* Apply suggestions from code review (Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>)
* Apply suggestions from code review (Co-authored-by: Pedro Cuenca <pedro@huggingface.co>)
* finish
* finish
* up

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
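For readers skimming the commit message, here is a minimal sketch of two of the smaller pieces named above, reconstructed from their descriptions rather than copied from the PR (the exact formulas and signatures are my reading of the commit message, not the library's definitive implementation):

```python
import torch
import torch.nn as nn

# Minimal sketches, reconstructed from the commit message above.

class ApproximateGELU(nn.Module):
    """Sigmoid-based GELU approximation: x * sigmoid(1.702 * x)."""

    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)


class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from a timestep embedding."""

    def __init__(self, embedding_dim, num_embeddings):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings, embedding_dim)
        self.silu = nn.SiLU()
        self.linear = nn.Linear(embedding_dim, embedding_dim * 2)
        # The affine parameters come from the timestep, so the norm itself has none.
        self.norm = nn.LayerNorm(embedding_dim, elementwise_affine=False)

    def forward(self, x, timestep):
        # x: (batch, seq_len, embedding_dim); timestep: (batch,) long tensor
        emb = self.linear(self.silu(self.emb(timestep)))  # (batch, 2 * embedding_dim)
        scale, shift = torch.chunk(emb, 2, dim=-1)        # (batch, embedding_dim) each
        return self.norm(x) * (1 + scale[:, None]) + shift[:, None]
```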
VQ-diffusion
original issue: #319
So far, this PR only focuses on the vq-diffusion model for the ITHQ dataset. It is possible that the models for the other datasets require additional changes.
The commits are broken out by the model component they port. See the individual commit messages for more in-depth write-ups on the changes.
Example usage
See the linked notebook for how to produce the pretrained model for diffusers.
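For orientation, a minimal usage sketch of the finished pipeline with the ithq checkpoint mentioned earlier in the thread; the class name, call signature, and prompt here are assumptions for illustration, not quoted from the PR:

```python
import torch
from diffusers import VQDiffusionPipeline  # assumed class name

# Checkpoint uploaded to the microsoft org, as mentioned earlier in the thread.
pipeline = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")
pipeline = pipeline.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipeline("a teddy bear playing in the pool", generator=generator).images[0]
image.save("teddy_bear.png")
```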
Progress
Comparing against original model
I put together a Python notebook that compares the currently ported components against each other. It can be run on Google Colab and will handle installing dependencies and model weights. See its README for more details.
https://github.com/williamberman/vq-diffusion-notebook