Replies: 9 comments 18 replies
-
Yes, this is how it was trained. Upcoming models are purported to improve this deficiency, and v3 is likely a native 1024x1024 model, which sort of mitigates this issue by default.
-
The cutting or cropping of images is probably a major part of the reason we see so much incoherence, especially around the human form: legs without bodies, bodies without legs, etc., plus all the context that was lost from each image. But maybe that was all they could do at the time. You have to start somewhere.
-
I'm not quite sure what you mean here. Stable Diffusion was not trained on clean images with correct captions; it was trained on 2.1 billion images scraped from the internet, with the captions mostly being whatever alt text existed for each image. While I'm sure you can think of tons of smarter ways of processing the data, I'm also willing to bet those ways would consume more time than a simple crop, and when you are processing 2.1 billion images, every second of processing ends up mattering. Stable Diffusion was very much based on the idea that quantity of data mattered more than quality of data, at least for an initial proof of concept.
-
I believe the result would be better if, instead of cropping, you padded the borders of the image. If a knight does not fit the aspect ratio, then rather than cropping him, extend the width to the left and right so that the aspect ratio is 1:1, then scale down to 512x512. I do not think it would be much more CPU-expensive than the current crop, and even if it were, the result would outweigh the negatives.
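For illustration, a minimal sketch of that pad-then-downscale preprocessing with Pillow; the 512 target size and black fill color are just assumptions for the example:

```python
from PIL import Image

def pad_to_square(img: Image.Image, size: int = 512, fill=(0, 0, 0)) -> Image.Image:
    """Pad the shorter side with a solid border so the whole subject
    survives, then downscale to the training resolution."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # Center the original image on the square canvas.
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas.resize((size, size), Image.LANCZOS)

# e.g. square = pad_to_square(Image.open("knight.png"))
```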
-
IMO, inpainting to make a square based on the longer dimension would be a nice way to deal with this problem. Inpainting over any watermarks would be great too. It might burn up a lot of compute until we can get image generation down to 3-4 steps, though. When I inpaint, it usually takes 12-15 steps to get a decent result.
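A rough sketch of that outpainting idea using the diffusers inpainting pipeline; the checkpoint name, 512 working resolution, and step count are assumptions for illustration, not a tested recipe:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def outpaint_to_square(img: Image.Image, prompt: str, steps: int = 15) -> Image.Image:
    """Pad an image to a square, then inpaint the padded border."""
    side = max(img.size)
    x, y = (side - img.width) // 2, (side - img.height) // 2
    canvas = Image.new("RGB", (side, side))
    canvas.paste(img, (x, y))
    # White = region the model may repaint (the border); black = keep original.
    mask = Image.new("L", (side, side), 255)
    mask.paste(Image.new("L", img.size, 0), (x, y))
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(
        prompt=prompt,
        image=canvas.resize((512, 512)),
        mask_image=mask.resize((512, 512)),
        num_inference_steps=steps,
    ).images[0]
```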
-
The cropped-looking generations are due to random cropping pushing subjects out of frame during training. The dodgy anatomy is due to the millions of different configurations of the human form all being described by very similar text. People don't write 'an image of a person gripping a spear in the left hand with their right hand in an open gesture'; it is instead 'a person holding a spear and gesturing' or similar. Think of all the things people would commonly describe as 'holding' vs. the things people would describe as 'gripping', for example. Using the word 'gripping' instead of 'holding' can help generate better-quality hands when paired with a congruent context. If you carefully inspect the dodgy anatomy, you'll see two left hands, or hands that are half facing forward and half facing backward, etc. This is because the model has ZERO understanding of three-dimensional objects and how a 3D object creates a 2D scene. It's all statistical pixel generation that gives us the illusion of some kind of understanding.
-
NovelAI have now released their code for training on non-square aspect ratios! https://github.com/NovelAI/novelai-aspect-ratio-bucketing
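The core idea of their bucketing, loosely sketched below: enumerate target resolutions with a roughly constant pixel budget, then assign each training image to the bucket with the closest aspect ratio, so no destructive square crop is needed. The 64-pixel step and 512x512 budget are illustrative; see their repo for the actual bucket generation:

```python
import math

def make_buckets(max_pixels: int = 512 * 512, step: int = 64,
                 min_dim: int = 256, max_dim: int = 1024):
    """List (width, height) pairs whose area fits the pixel budget,
    so batches at different aspect ratios cost about the same."""
    buckets = set()
    for w in range(min_dim, max_dim + 1, step):
        # Largest step-aligned height that keeps w * h within budget.
        h = min(max_dim, (max_pixels // w) // step * step)
        if h >= min_dim:
            buckets.add((w, h))
            buckets.add((h, w))  # mirrored portrait/landscape bucket
    return sorted(buckets)

def nearest_bucket(w: int, h: int, buckets):
    """Pick the bucket whose log aspect ratio is closest to the image's."""
    ar = math.log(w / h)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - ar))
```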
-
From https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac
They claim that the problem of frequently cut-off heads/feet originates from cropping all images square, which cuts off limbs and heads in the process. So the model is trained on the caption "a man" while seeing a male torso with hands and knees.
I have a hard time believing that; wouldn't cropping like that ruin all the work put into pairing clean images with correct captions?
Wouldn't you think it better to resize the image (creating borders of 'nothing') and then train on that? Even if it loses detail, it would at least show the actual object from the caption instead of a corpse?!