
About multi-GPU on a big dataset. #10

Open
zhaobingbingbing opened this issue Sep 9, 2022 · 12 comments

@zhaobingbingbing

Hi, thanks for your script.
When I train imagen with multiple GPUs on a subset of LAION (about 7M images), I found that using 'CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3 imagen.py --train...' is faster than 'accelerate launch imagen.py', even though GPU utilization is lower in the first case.
I also found a bottleneck in data processing (class ImageLabelDataset in data_generator.py): the GPU is always waiting for data.
Right now the training speed is too slow either way. Do you have any advice?
Thanks again.

@deepglugs
Owner

I'm also struggling to get multi-gpu to work with reasonable speeds. If there's a bottleneck in the dataloader, you can try to increase the number of workers with --workers or you can preprocess your data. For training the first unet, that would involve resizing and padding the images to 64x64.

If you find any other ways to improve multi-gpu, let me know.
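
For reference, here is a minimal sketch of that offline preprocessing step (fit the image inside 64x64 and pad the rest), assuming a flat folder of JPEGs; SRC and DST are placeholder paths, not part of this repo:

# Offline preprocessing sketch: pad-and-resize every image to a 64x64 square.
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("laion_subset")     # placeholder: folder with the original images
DST = Path("laion_subset_64")  # placeholder: folder for the 64x64 copies
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    # ImageOps.pad keeps the aspect ratio: it fits the image inside 64x64
    # and fills the leftover border with black instead of distorting or cropping.
    img = ImageOps.pad(img, (64, 64), color=(0, 0, 0))
    img.save(DST / path.name, quality=95)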

@zhaobingbingbing
Author

I found that most transform pipelines in DMs look something like this:

import torchvision.transforms as T

self.transform = T.Compose([
    T.Resize(image_size),
    T.RandomHorizontalFlip(),
    T.CenterCrop(image_size),
])

Is padding necessary?

@deepglugs
Owner

deepglugs commented Sep 9, 2022

Padding is necessary if the images aren't already square; otherwise they will be distorted. CenterCrop also produces a square image, but you lose data.
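
As an illustration (placeholder names, not code from this repo), a pad-to-square step can be dropped into that kind of transform so that Resize no longer distorts and no pixels are cropped away:

import torchvision.transforms as T
import torchvision.transforms.functional as TF

class PadToSquare:
    """Pad the shorter side so the image becomes square, without cropping."""
    def __call__(self, img):
        w, h = img.size
        side = max(w, h)
        left = (side - w) // 2
        top = (side - h) // 2
        # (left, top, right, bottom) padding, filled with black
        return TF.pad(img, (left, top, side - w - left, side - h - top), fill=0)

image_size = 64
transform = T.Compose([
    PadToSquare(),           # keep every pixel, no distortion
    T.Resize(image_size),    # the image is square now, so both sides shrink together
    T.RandomHorizontalFlip(),
])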

@deepglugs
Owner

Update: I am working on adding webdataset as an optional alternative to ImageLabelDataset. So far I have observed that it is much faster, but I haven't gotten it working with multi-GPU yet. Once I do, I'll push the change (or maybe sooner, since it's a non-default option).
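
For anyone curious, a webdataset pipeline typically looks roughly like the sketch below; the shard pattern and transform here are assumptions for illustration, not the repo's actual implementation:

import webdataset as wds
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Assumes the data is packed into .tar shards with matching "xxx.jpg"/"xxx.txt" entries.
urls = "shards/laion-{000000..000099}.tar"   # placeholder shard pattern

transform = T.Compose([T.Resize(64), T.CenterCrop(64), T.ToTensor()])

dataset = (
    wds.WebDataset(urls)
    .shuffle(1000)                # in-memory shuffle buffer
    .decode("pil")                # decode images to PIL, captions to str
    .to_tuple("jpg", "txt")       # yield (image, caption) pairs
    .map_tuple(transform, str)    # image -> 64x64 tensor, caption unchanged
)

loader = DataLoader(dataset, batch_size=32, num_workers=4)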

@deepglugs
Owner

Pushed webdataset. Multi-GPU now runs fast, although I'm not sure everything is working correctly. When training unet2, I see loss=0.0, which isn't right. Debugging continues...

@zhaobingbingbing
Author

When I reduced the dataset from 7M to 100k images, training became fast, about 0.5 h per epoch; with the 7M set it would take about 200 h.

@deepglugs
Owner

Is that with webdatasets or the default?

@zhaobingbingbing
Author

The default way.
The problem seems to be in the data processing: when the dataset is very large, most of the time is spent fetching the data for each batch rather than training. If I find a way to improve it, I will share it with you.
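
For reference, the usual knobs for this kind of bottleneck are on the DataLoader side; the sketch below uses a stand-in dataset, and the repo's --workers flag presumably maps to num_workers:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 64, 64))  # stand-in for ImageLabelDataset

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # more CPU workers decoding/resizing in parallel
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # batches each worker prepares in advance
)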

@zhaobingbingbing
Author

Hi, with 100k image-text pairs I find that the loss drops significantly during the first few epochs (from 0.6 to 0.02). After 5 epochs the loss barely declines, but the sampling quality still improves. So when should I stop training?

@deepglugs
Owner

Loss will go down slowly after a while. This is from one of my longer runs (loss curve below). You can even lower the learning rate, which might help drop the loss a bit more (but still very slowly). I usually train with --lr 1e-4 until the loss stops dropping and then switch to --lr 1e-5. Increasing the batch size has also been known to help drop the loss further.
[image: loss curve from one of my longer training runs]
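
A minimal sketch of that two-stage LR drop in plain PyTorch (the model and optimizer below are stand-ins, not the repo's training code):

import torch

model = torch.nn.Linear(4, 4)  # stand-in for the unet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # stage 1: --lr 1e-4

def drop_lr(optimizer, new_lr=1e-5):
    # Call this once the loss curve flattens out (stage 2: --lr 1e-5).
    for group in optimizer.param_groups:
        group["lr"] = new_lr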

@zhaobingbingbing
Author

My loss curve is similar to yours. With longer training the loss is still difficult to reduce, even after lowering the LR. But I found that sampling quality keeps improving with longer training. So apart from the loss, what can be used as a criterion for convergence?

@deepglugs
Owner

Image sample quality is the best method I know of.
