
Is there a reference for the model/architecture used by diffusers anywhere? It doesn't seem to match the original stable-diffusion repo #901

Closed
tonetechnician opened this issue Oct 19, 2022 · 4 comments

tonetechnician commented Oct 19, 2022

What API design would you like to have changed or added to the library? Why?

Hey there!

I'm not sure if this is the right section to post this, but I have a request/question for a write-up on the inference configuration used by diffusers, similar to the config.yaml found in other model repos.

Recently I have been digging into diffusers quite a bit and comparing its outputs with other Stable Diffusion implementations (see post here).

I've noticed that there are quite noticeable differences (both in output and in code) between diffusers and the standard stable-diffusion inference config https://github.com/CompVis/stable-diffusion/blob/main/configs/stable-diffusion/v1-inference.yaml as implemented in both Automatic1111 and SD-GUI, which give the same results as one another; diffusers is the outlier.

I dug deeper into the model architecture in diffusers and noticed a few differences in the default values set for each block, across just about all steps of the Stable Diffusion process. However, my knowledge of the architecture itself isn't as good as I'd like it to be, so I'm mostly comparing the original Stable Diffusion architecture and trying to match it with diffusers. That said, I did try to match the settings as best I could in order to get a one-to-one result. Modifying parameters in the VAE encoder seems to have quite an effect on what image gets output, which has led me to believe there must be a fundamental difference between the diffusers inference model and the base Stable Diffusion model.

I did find this script https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py and ran through the procedure, but I wasn't entirely sure what it actually does or how its variables map onto the models used in diffusers; I do see that some defaults differ. I figure @patil-suraj may have a bit more info on the architecture within diffusers and how it differs from the original stable-diffusion repo.
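
As a minimal sketch of what I mean (the v1-4 model id is only an example), the diffusers-side configuration can be dumped and compared against v1-inference.yaml like this:

```python
from diffusers import StableDiffusionPipeline

# Load the converted weights (v1-4 is used here purely as an example checkpoint).
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Each sub-model carries the config it was instantiated with, which is the
# closest diffusers equivalent of the original v1-inference.yaml.
print(pipe.unet.config)       # block types, channels, attention head dims, ...
print(pipe.vae.config)        # encoder/decoder channels, latent channels, ...
print(pipe.scheduler.config)  # beta schedule, number of train timesteps, ...
```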

I've noticed the largest differences seem to be in the img2img pipelines, where I believe the output is not as crisp and sharp as that of the base stable-diffusion library, and I feel this is something that should probably be resolved one way or another.

Would love to know if a config file, or a write-up on the usage of the conversion scripts in the /scripts directory, would be possible!

patrickvonplaten (Contributor) commented

Hey @tonetechnician,

Thanks a lot for the write-up! Could you by any chance add a reproducible code snippet that compares the two? It would be extremely useful to have some snippets with which I could compare diffusers to the original code base :-)
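
For example, something along the lines of the following would already help - just a minimal sketch, where the model id, prompt and seed are purely illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# A fixed generator makes the diffusers run deterministic, so the exact same
# settings (seed, steps, guidance scale) can be replayed in the CompVis code base.
generator = torch.Generator(device=pipe.device.type).manual_seed(42)
image = pipe(
    "a photograph of an astronaut riding a horse",  # illustrative prompt
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("diffusers_seed42.png")
```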

In our experiments two months ago, diffusers matched https://github.com/CompVis/stable-diffusion 1-to-1, as shown in this PR: #182.
It might be that we currently have a regression on master, as stated in #902.

Just to be clear - we want to ensure 100% 1-to-1 the same output as the original repo, and if that's not currently the case it's clearly a bug!
As noted in #914, our integration tests are not strong enough at the moment, which we're trying to fix asap.

Thanks for bringing this to our attention!

patrickvonplaten (Contributor) commented

Regarding the architecture - it's one-to-one the same as the original architecture. We just renamed some keys to make the naming clearer and the overall design more general, so that we're able to add more model architectures than just stable diffusion in the future :-)

Apart from this there are two main differences:

  • We use nn.Linear for the attention layers (see here), whereas the original CompVis repo uses convolutional layers (see here). This should not make any difference in the actual output (I tested this extensively); see the small sketch after this list. We're using linear layers because attention is inherently a linear projection, and linear layers are faster and easier to work with than Conv layers.
  • We merged a PyTorch optimization PR that moved more code directly to PyTorch: Optimize Stable Diffusion #371. This gave us a 35% speed-up, but it might have been the cause of the regression mentioned in Potential regression in deterministic outputs #902.
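
To illustrate the first point, here is a small self-contained check (not code from the diffusers code base) showing that a 1x1 convolution and an nn.Linear layer compute the same projection once they share weights:

```python
import torch
import torch.nn as nn

channels, batch, height, width = 64, 2, 8, 8

conv = nn.Conv2d(channels, channels, kernel_size=1)
linear = nn.Linear(channels, channels)

# Share the parameters: a 1x1 conv kernel is just an (out, in) matrix.
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1).squeeze(-1))
    linear.bias.copy_(conv.bias)

x = torch.randn(batch, channels, height, width)

out_conv = conv(x)                          # (B, C, H, W)
out_linear = linear(x.permute(0, 2, 3, 1))  # move channels last for the linear layer
out_linear = out_linear.permute(0, 3, 1, 2) # back to (B, C, H, W)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```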

I'll try to solve the differences and keep you updated here :-)

Any additional code snippets I could test would be extremely useful!


tonetechnician commented Oct 20, 2022

No worries!

Thanks for the detailed response. I'd be super keen to help however I can to figure it out. Will give a bit more info and code to compare the two implementations.

Currently I've just been testing Automatic1111's and NMKD SD-GUI's implementations against diffusers, which I think could be a place to start if you want to confirm the differences in output.
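
Since the biggest gap I've seen is in img2img, the kind of snippet I'd run on the diffusers side looks roughly like this - just a sketch, with a placeholder input image, prompt and seed (and depending on the diffusers version the argument may be image instead of init_image):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))  # placeholder input

generator = torch.Generator(device=pipe.device.type).manual_seed(42)
result = pipe(
    prompt="a fantasy landscape, highly detailed",  # placeholder prompt
    init_image=init_image,   # newer diffusers versions call this argument `image`
    strength=0.75,
    guidance_scale=7.5,
    num_inference_steps=50,
    generator=generator,
).images[0]
result.save("diffusers_img2img_seed42.png")
```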

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label (Issues that haven't received updates) on Nov 18, 2022
PhaneeshB pushed a commit to nod-ai/diffusers that referenced this issue Mar 1, 2023