Replies: 9 comments 18 replies
-
Yes, this is how it was trained. Upcoming models are purported to improve this deficiency, and v3 is likely a native 1024x1024 model, which sort of mitigates this issue by default.
-
The cutting or cropping of images is probably a major part of the reason we see so much incoherence, especially around the human form: legs without bodies, bodies without legs, etc., plus all the context that was lost from each image. But maybe that was all they could do at the time. You have to start somewhere.
-
I'm not quite sure what you mean here. Stable Diffusion was not trained on clean images with correct captions; it was trained on 2.1 billion images scraped from the internet, with the captions mostly being whatever alt text existed for each image. While I'm sure you can think of tons of smarter ways of processing the data, I'm also willing to bet those ways would consume more time than a simple crop, and when you are processing 2.1 billion images, every second of processing ends up mattering. Stable Diffusion was very much based on the idea that quantity of data mattered more than quality of data, at least for an initial proof of concept.
-
I believe the result would be better if, instead of cropping, you padded the borders of the image. If a knight does not fit the aspect ratio, then rather than cropping him, extend the width to the left and right so that the aspect ratio is 1:1, then scale down to 512x512. I do not think it would be much more CPU-expensive than the current crop, and even if it were, the result would outweigh the negatives.
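For illustration, a minimal sketch of that pad-then-downscale preprocessing with Pillow; the 512 target size and black fill color are just assumptions for the example:

```python
from PIL import Image

def pad_to_square(img: Image.Image, size: int = 512, fill=(0, 0, 0)) -> Image.Image:
    """Pad the shorter side with a solid border so the whole subject
    survives, then downscale to the training resolution."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # Center the original image on the square canvas.
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas.resize((size, size), Image.LANCZOS)

# e.g. square = pad_to_square(Image.open("knight.png"))
```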
-
IMO, inpainting to make a square based on the longer dimension would be a nice way to deal with this problem. Inpainting over any watermarks would be great too. It might burn up a lot of compute until we can get image generation down to 3-4 steps, though. When I inpaint, it usually takes 12-15 steps to get a decent result.
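A rough sketch of that outpainting idea using the diffusers inpainting pipeline; the checkpoint name, 512 working resolution, and step count are assumptions for illustration, not a tested recipe:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def outpaint_to_square(img: Image.Image, prompt: str, steps: int = 15) -> Image.Image:
    """Pad an image to a square, then inpaint the padded border."""
    side = max(img.size)
    x, y = (side - img.width) // 2, (side - img.height) // 2
    canvas = Image.new("RGB", (side, side))
    canvas.paste(img, (x, y))
    # White = region the model may repaint (the border); black = keep original.
    mask = Image.new("L", (side, side), 255)
    mask.paste(Image.new("L", img.size, 0), (x, y))
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(
        prompt=prompt,
        image=canvas.resize((512, 512)),
        mask_image=mask.resize((512, 512)),
        num_inference_steps=steps,
    ).images[0]
```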
-
The cropped-looking generations are due to random cropping pushing subjects out of frame during training. The dodgy anatomy is due to the millions of different configurations of the human form all being described by very similar text. People don't write 'an image of a person gripping a spear in the left hand with their right hand in an open gesture'; it is instead 'a person holding a spear and gesturing' or similar. Think of all the things people would commonly describe as 'holding' vs. the things people would describe as 'gripping', for example. Using the word 'gripping' instead of 'holding' can help generate better-quality hands when paired with a congruent context. If you carefully inspect the dodgy anatomy, you'll see two left hands, or hands that are half facing forward and half facing backward, etc. This is because the model has ZERO understanding of three-dimensional objects and how a 3D object creates a 2D scene. It's all statistical pixel generation that gives us the illusion of some kind of understanding.
-
NovelAI have now released their code for training on non-square aspect ratios! https://github.com/NovelAI/novelai-aspect-ratio-bucketing
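The core idea of their bucketing, loosely sketched below: enumerate target resolutions with a roughly constant pixel budget, then assign each training image to the bucket with the closest aspect ratio, so no destructive square crop is needed. The 64-pixel step and 512x512 budget are illustrative; see their repo for the actual bucket generation:

```python
import math

def make_buckets(max_pixels: int = 512 * 512, step: int = 64,
                 min_dim: int = 256, max_dim: int = 1024):
    """List (width, height) pairs whose area fits the pixel budget,
    so batches at different aspect ratios cost about the same."""
    buckets = set()
    for w in range(min_dim, max_dim + 1, step):
        # Largest step-aligned height that keeps w * h within budget.
        h = min(max_dim, (max_pixels // w) // step * step)
        if h >= min_dim:
            buckets.add((w, h))
            buckets.add((h, w))  # mirrored portrait/landscape bucket
    return sorted(buckets)

def nearest_bucket(w: int, h: int, buckets):
    """Pick the bucket whose log aspect ratio is closest to the image's."""
    ar = math.log(w / h)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - ar))
```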
-
From https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac
They claim that the problem of frequently cut-off heads/feet originates from cropping all images square, which cuts off limbs and heads in the process. So the model is trained on the caption "a man" while seeing a male torso with hands and knees.
I have a hard time believing that; wouldn't cropping like that ruin all the work put into pairing clean images with correct captions?
Wouldn't you think it better to resize the image (creating borders of 'nothing') and then train on that? Even if it loses detail, it would at least show the actual object from the caption instead of a corpse?!