Abnormal GPU memory usage when training a Stable Diffusion v1 LoRA #465

Open
2 of 4 tasks
ultranationalism opened this issue Apr 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@ultranationalism
Contributor

Thanks for sending an issue! Here are some tips for you:

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-ai/mindspore/blob/master/CONTRIBUTING.md

Hardware Environment

  • please tell us what kind of hardware can reproduce your error?
    • Ascend
    • GPU: 3080 12G
    • CPU

Software Environment

  • MindSpore version: 2.2.3
  • Python version: 3.9.5
  • OS: Ubuntu 18.04.6 LTS (WSL2, Docker Desktop)
  • GCC/Compiler version: 9

Describe the current behavior

e.g. the current output is xxx / the error is xxx

MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 1
Distributed mode: False
Data path: datasets/chinese_art_blip/train
Num params: 1,067,032,491 (unet: 860,318,148, text encoder: 123,060,480, vae: 83,653,863)
Num trainable params: 797,184
Precision: Float16
Use LoRA: True
LoRA rank: 4
Learning rate: 0.0001
Batch size: 1
Weight decay: 0.01
Grad accumulation steps: 1
Num epochs: 200
Loss scaler: dynamic
Init loss scale: 65536.0
Grad clipping: True
Max grad norm: 1.0
EMA: False
Enable flash attention: False

[2024-04-26 00:50:13] INFO: Start training...
[WARNING] PRE_ACT(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.072 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] CalMemBlockAllocSize] Memory not enough: current free memory size[0] is smaller than required size[29491200].
[ERROR] DEVICE(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.578 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:383] MallocForKernelOutput] Allocate output memory failed, node:Default/Cast-op446
Traceback (most recent call last):
File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 463, in
main(args)
File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 452, in main
model.train(
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1068, in train
self._train(epoch,
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 114, in wrapper
func(self, *args, **kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 617, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 919, in _train_process
outputs = self._train_network(*next_element)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in call
raise err
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in call
output = self._run_construct(args, kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/train/trainer.py", line 95, in construct
loss = self.network(*inputs) # mini-batch loss
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in call
raise err
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in call
output = self._run_construct(args, kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 402, in construct
return self.p_losses(x, c, t)
File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 407, in p_losses
model_output = self.apply_model(
File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 382, in apply_model
x_recon = self.model(x_noisy, t, **cond, **kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in call
raise err
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in call
output = self._run_construct(args, kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 454, in construct
out = self.diffusion_model(x, t, context=context, **kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in call
raise err
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in call
output = self._run_construct(args, kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 711, in construct
h = cell(h, emb, context)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in call
raise err
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in call
output = self._run_construct(args, kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 197, in construct
h = self.in_layers_norm(x)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in call
raise err
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in call
output = self._run_construct(args, kwargs)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/util.py", line 115, in construct
return super().construct(x)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/layer/normalization.py", line 1188, in construct
self.check_input_dim(F.shape(x), self.cls_name)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/function/array_func.py", line 1510, in shape
return shape
(input_x)
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/operations/array_ops.py", line 701, in call
return x.shape
File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 85, in shape
self.stub_shape = self.stub.get_shape()
RuntimeError: Malloc for kernel output failed, Memory isn't enough, node:Default/Cast-op446

Describe the expected behavior

please describe expected outputs or functions you want to have:
Training a Stable Diffusion v1 LoRA with kohya-ss's sd-scripts at image_size=(512,512) and bs=1 uses no more than 8 GB of VRAM, so a 12 GB card should not run out of memory.

Steps to reproduce the issue

export DEVICE_ID=0

# for non-INFNAN, keep drop overflow update False
export MS_ASCEND_CHECK_OVERFLOW_MODE=1
# export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" # debugging

task_name=train_lora_sdv1 # rewrite
output_path=outputs
output_dir=$output_path/$task_name

rm -rf $output_dir
mkdir -p $output_dir
python train_text_to_image.py \
    --train_config "configs/train/train_config_lora_v1.yaml" \
    --data_path "datasets/chinese_art_blip/train" \
    --output_path $output_dir \
    --pretrained_model_path "models/AnythingV5.ckpt" \
    --loss_scaler_type "dynamic" \
    --init_loss_scale 65536 \
    --enable_flash_attention=False \
    --drop_overflow_update=True \
    --use_ema=False \
    --lora_rank=4 \
    --epochs=200 \
    --ckpt_save_interval=20 \
    --mode 1 \
    --train_batch_size=1 \

Related log / screenshot

Special notes for this issue

ultranationalism added the bug label Apr 26, 2024
@Songyuanwei
Collaborator

I suggest setting static graph mode (mode=0); it should then be able to run on a 12 GB card.
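
For reference, this corresponds to changing --mode 1 to --mode 0 in the launch command above. A minimal sketch of the equivalent MindSpore context call (the device_target and device_id values are assumptions matching the reported single-GPU setup, not taken from the script):

import mindspore as ms

# mode 0 = GRAPH_MODE (static graph), mode 1 = PYNATIVE_MODE, as printed in the training log.
# Assumption: single GPU, device 0, matching the reported RTX 3080 environment.
ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU", device_id=0)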

@ultranationalism
Contributor Author

I suggest setting static graph mode (mode=0); it should then be able to run on a 12 GB card.

After testing: in static graph mode, MindSpore used up all 56 GB of RAM I had reserved for the container and then crashed my Docker container outright.
