Hardware Environment
Please tell us what kind of hardware (backend type) can reproduce your error:
Ascend
GPU: 3080 12G
CPU
Software Environment
MindSpore version: 2.2.3
Python version: 3.9.5
OS: WSL2 (Docker Desktop), Ubuntu 18.04.6 LTS
GCC/Compiler version: 9
Describe the current behavior
e.g. the current output is xxx / the error is xxx
MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 1
Distributed mode: False
Data path: datasets/chinese_art_blip/train
Num params: 1,067,032,491 (unet: 860,318,148, text encoder: 123,060,480, vae: 83,653,863)
Num trainable params: 797,184
Precision: Float16
Use LoRA: True
LoRA rank: 4
Learning rate: 0.0001
Batch size: 1
Weight decay: 0.01
Grad accumulation steps: 1
Num epochs: 200
Loss scaler: dynamic
Init loss scale: 65536.0
Grad clipping: True
Max grad norm: 1.0
EMA: False
Enable flash attention: False
[2024-04-26 00:50:13] INFO: Start training...
[WARNING] PRE_ACT(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.072 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] CalMemBlockAllocSize] Memory not enough: current free memory size[0] is smaller than required size[29491200].
[ERROR] DEVICE(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.578 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:383] MallocForKernelOutput] Allocate output memory failed, node:Default/Cast-op446
Traceback (most recent call last):
  File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 463, in <module>
    main(args)
  File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 452, in main
    model.train(
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1068, in train
    self._train(epoch,
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 617, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 919, in _train_process
    outputs = self._train_network(*next_element)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/train/trainer.py", line 95, in construct
    loss = self.network(*inputs) # mini-batch loss
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 402, in construct
    return self.p_losses(x, c, t)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 407, in p_losses
    model_output = self.apply_model(
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 382, in apply_model
    x_recon = self.model(x_noisy, t, **cond, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 454, in construct
    out = self.diffusion_model(x, t, context=context, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 711, in construct
    h = cell(h, emb, context)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 197, in construct
    h = self.in_layers_norm(x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/util.py", line 115, in construct
    return super().construct(x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/layer/normalization.py", line 1188, in construct
    self.check_input_dim(F.shape(x), self.cls_name)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/function/array_func.py", line 1510, in shape
    return shape(input_x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/operations/array_ops.py", line 701, in __call__
    return x.shape
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 85, in shape
    self.stub_shape = self.stub.get_shape()
RuntimeError: Malloc for kernel output failed, Memory isn't enough, node:Default/Cast-op446
Describe the expected behavior
Please describe the expected outputs or functions you want to have:
Training a Stable Diffusion v1 LoRA with kohya-ss's sd-scripts at image_size=(512,512) and bs=1 uses no more than 8 GB of GPU memory, so a 12 GB card should not run out of memory here.
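For scale, the numbers in the log can be turned into a rough weight-memory estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes every parameter is held once in float16 (2 bytes), which the log's "Precision: Float16" suggests but does not guarantee, and it ignores activations, gradients, optimizer state and allocator overhead. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Parameter counts copied from the "Num params" line of the training log.
unet=860318148
text_encoder=123060480
vae=83653863
total=$((unet + text_encoder + vae))

# ASSUMPTION: one float16 copy of every weight, 2 bytes per parameter.
bytes_per_param=2
weights_bytes=$((total * bytes_per_param))
weights_mib=$((weights_bytes / 1024 / 1024))

# Size of the allocation that failed, from the "required size[...]" warning.
failed_alloc_bytes=29491200
failed_alloc_mib=$((failed_alloc_bytes / 1024 / 1024))

echo "total params:          $total"
echo "fp16 weights (approx): ${weights_mib} MiB"
echo "failed allocation:     ${failed_alloc_mib} MiB"
```

Under that assumption the weights alone account for roughly 2 GiB, and the request that failed was only about 28 MiB, so the remaining memory must have been consumed elsewhere before the Cast op ran.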
Steps to reproduce the issue
export DEVICE_ID=0
# for non-INFNAN, keep drop overflow update False
export MS_ASCEND_CHECK_OVERFLOW_MODE=1
#export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" # debugging
task_name=train_lora_sdv1 # rewrite
output_path=outputs
output_dir=$output_path/$task_name
rm -rf "$output_dir"
mkdir -p "$output_dir"
python train_text_to_image.py \
    --train_config "configs/train/train_config_lora_v1.yaml" \
    --data_path "datasets/chinese_art_blip/train" \
    --output_path "$output_dir" \
    --pretrained_model_path "models/AnythingV5.ckpt" \
    --loss_scaler_type "dynamic" \
    --init_loss_scale 65536 \
    --enable_flash_attention=False \
    --drop_overflow_update=True \
    --use_ema=False \
    --lora_rank=4 \
    --epochs=200 \
    --ckpt_save_interval=20 \
    --mode 1 \
    --train_batch_size=1
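Because the warning reports "current free memory size[0]", it may help to record how much GPU memory is actually free immediately before launching the script, for example to see whether another process (or the WSL2 host) already holds part of the 12 GB. The following is a minimal sketch, assuming `nvidia-smi` is available inside the container; `free_mib_from_csv` is a hypothetical helper introduced here and is not part of the mindone scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one "used, total" CSV line (values in MiB)
# from stdin and print the free amount in MiB.
free_mib_from_csv() {
  awk -F', *' '{ print $2 - $1 }'
}

# Query the GPU only if nvidia-smi exists on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --id="${DEVICE_ID:-0}" \
    --query-gpu=memory.used,memory.total \
    --format=csv,noheader,nounits | free_mib_from_csv
fi
```

Running this right before `python train_text_to_image.py ...` gives a baseline figure to attach to the report alongside the training log.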
Related log / screenshot
Special notes for this issue