
Is there a reference for the model/architecture used by diffusers anywhere? It doesn't seem to match the original stable-diffusion repo #901

Closed
tonetechnician opened this issue Oct 19, 2022 · 4 comments

tonetechnician commented Oct 19, 2022

What API design would you like to have changed or added to the library? Why?

Hey there!

I'm not sure if this is the right section to post this, but I have a request/question for a write-up on the inference configuration used by diffusers, similar to the config.yaml found in other model repos.

Recently I have been digging into diffusers quite a bit and comparing its outputs with other Stable Diffusion implementations (see post here).

I've noticed that there are quite noticeable differences (both in output and in code) between diffusers and the standard stable-diffusion inference config https://github.com/CompVis/stable-diffusion/blob/main/configs/stable-diffusion/v1-inference.yaml as implemented in both Automatic1111 and SD-GUI, which give the same results as one another; diffusers is the outlier.

I dug deeper into the model architecture in diffusers and noticed a few differences in the default values set for each block, across just about all steps of the Stable Diffusion process. However, my knowledge of the architecture itself isn't as good as I'd like it to be, so I'm mostly comparing the original Stable Diffusion architecture and trying to match it with diffusers. That said, I did try to match the settings as best I could in order to get a one-to-one result. Modifying parameters in the VAE encoder seems to have quite an effect on what image gets output, which has led me to believe there must be a fundamental difference between the diffusers inference model and the base Stable Diffusion model.

I did find this script https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py and ran through the procedure, but I wasn't entirely sure what it actually does or how its variables map onto the models used in diffusers; I do see that some defaults differ. I figure @patil-suraj may have a bit more info on the architecture within diffusers and how it differs from the original stable-diffusion repo.
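
As a minimal sketch of what I mean (the v1-4 model id is only an example), the diffusers-side configuration can be dumped and compared against v1-inference.yaml like this:

```python
from diffusers import StableDiffusionPipeline

# Load the converted weights (v1-4 is used here purely as an example checkpoint).
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Each sub-model carries the config it was instantiated with, which is the
# closest diffusers equivalent of the original v1-inference.yaml.
print(pipe.unet.config)       # block types, channels, attention head dims, ...
print(pipe.vae.config)        # encoder/decoder channels, latent channels, ...
print(pipe.scheduler.config)  # beta schedule, number of train timesteps, ...
```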

I've noticed the largest differences seem to be in the img2img pipelines, where I believe the output is not as crisp and sharp as that of the base stable-diffusion library, and I feel this is something that should probably be resolved one way or another.

Would love to know if a config file, or a write-up on the usage of the conversion scripts in the /scripts directory, would be possible!

patrickvonplaten (Contributor) commented

Hey @tonetechnician,

Thanks a lot for the write-up! Could you by any chance add a reproducible code snippet that compares the two? It would be extremely useful to have some snippets with which I could compare diffusers to the original code base :-)
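
For example, something along the lines of the following would already help - just a minimal sketch, where the model id, prompt and seed are purely illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# A fixed generator makes the diffusers run deterministic, so the exact same
# settings (seed, steps, guidance scale) can be replayed in the CompVis code base.
generator = torch.Generator(device=pipe.device.type).manual_seed(42)
image = pipe(
    "a photograph of an astronaut riding a horse",  # illustrative prompt
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("diffusers_seed42.png")
```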

In our experiments two months ago, diffusers matched https://github.com/CompVis/stable-diffusion 1-to-1, as shown in this PR: #182.
It might be that we currently have a regression on master, as stated in #902.

Just to be clear - we want to ensure 100% 1-to-1 the same output as the original repo, and if that's not currently the case it's clearly a bug!
As noted in #914, our integration tests are not strong enough at the moment, which we're trying to fix asap.

Thanks for bringing this to our attention!

patrickvonplaten (Contributor) commented

Regarding the architecture - it's one-to-one the same as the original architecture. We just renamed some keys to make the naming clearer and the overall design more general, so that we're able to add more model architectures than just stable diffusion in the future :-)

Apart from this there are two main differences:

  • We use nn.Linear for the attention layers (see here), whereas the original CompVis repo uses convolutional layers (see here). This should not make any difference in the actual output (I tested this extensively); see the small sketch after this list. We're using linear layers because attention is inherently a linear projection, and linear layers are faster and easier to work with than Conv layers.
  • We merged a PyTorch optimization PR that moved more code directly to PyTorch: Optimize Stable Diffusion #371. This gave us a 35% speed-up, but it might have been the cause of the regression mentioned in Potential regression in deterministic outputs #902.
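
To illustrate the first point, here is a small self-contained check (not code from the diffusers code base) showing that a 1x1 convolution and an nn.Linear layer compute the same projection once they share weights:

```python
import torch
import torch.nn as nn

channels, batch, height, width = 64, 2, 8, 8

conv = nn.Conv2d(channels, channels, kernel_size=1)
linear = nn.Linear(channels, channels)

# Share the parameters: a 1x1 conv kernel is just an (out, in) matrix.
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1).squeeze(-1))
    linear.bias.copy_(conv.bias)

x = torch.randn(batch, channels, height, width)

out_conv = conv(x)                          # (B, C, H, W)
out_linear = linear(x.permute(0, 2, 3, 1))  # move channels last for the linear layer
out_linear = out_linear.permute(0, 3, 1, 2) # back to (B, C, H, W)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```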

I'll try to solve the differences and keep you updated here :-)

Any additional code snippets I could test would be extremely useful!


tonetechnician commented Oct 20, 2022

No worries!

Thanks for the detailed response. I'd be super keen to help however I can to figure it out. Will give a bit more info and code to compare the two implementations.

Currently I've just been testing Automatic1111's and NMKD SD-GUI's implementations against diffusers, which I think could be a place to start if you want to confirm the differences in output.
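
Since the biggest gap I've seen is in img2img, the kind of snippet I'd run on the diffusers side looks roughly like this - just a sketch, with a placeholder input image, prompt and seed (and depending on the diffusers version the argument may be image instead of init_image):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))  # placeholder input

generator = torch.Generator(device=pipe.device.type).manual_seed(42)
result = pipe(
    prompt="a fantasy landscape, highly detailed",  # placeholder prompt
    init_image=init_image,   # newer diffusers versions call this argument `image`
    strength=0.75,
    guidance_scale=7.5,
    num_inference_steps=50,
    generator=generator,
).images[0]
result.save("diffusers_img2img_seed42.png")
```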

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label (Issues that haven't received updates) on Nov 18, 2022
PhaneeshB pushed a commit to nod-ai/diffusers that referenced this issue Mar 1, 2023