
pangu draw 3.0: running ./run_sampling.sh reports an error #317

Open

wanghuan-kunpneg opened this issue Jan 29, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@wanghuan-kunpneg

Thanks for sending an issue! Here are some tips for you:

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-ai/mindspore/blob/master/CONTRIBUTING.md

Hardware Environment

  • Please tell us what kind of hardware can reproduce your error:
    • [ 910 ] Ascend

Software Environment

  • MindSpore version: 2.2.10

  • Python version: 3.8.8

  • OS: CentOS 8.2

  • GCC/Compiler version: 8.5.0

Describe the current behavior

[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/low_timestamp_model.ckpt']
[ERROR] ME(855962:281473315727456,MainProcess):2024-01-29-02:07:34.586.661 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
    checkpoint_list.ParseFromString(pb_content)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 703, in DecodeRepeatedField
    raise _DecodeError('Truncated message.')
google.protobuf.message.DecodeError: Truncated message.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 322, in sample
    model, filter = create_model(
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
    model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
    _sd_dict = ms.load_checkpoint(ckpt)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
    checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
    raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file

Describe the expected behavior

Please describe the expected outputs or functions you want to have:

  1. How can this error be resolved?

Steps to reproduce the issue

  1. code url:
  2. command that can reproduce your error
     e.g. cd xx -> bash scripts/xx.sh --config xx
  3. xx

Related log / screenshot

Special notes for this issue

@wanghuan-kunpneg wanghuan-kunpneg added the bug Something isn't working label Jan 29, 2024
@townwish4git
Contributor

Based on the errors "Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file." and "google.protobuf.message.DecodeError: Truncated message.", we suspect the file is corrupted. Please check whether the sha256 of your local ckpt file matches the hash suffix in the ckpt filename from the download link.
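That check can be scripted. A minimal Python sketch (the helper names are mine; it assumes, as with these releases, that the filename ends in the first 8 hex characters of the file's sha256, e.g. pangu_low_timestamp-127da122.ckpt):

```python
import hashlib
import os
import re

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so multi-GB checkpoints don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_name_suffix(path):
    """True if the 8-hex-char suffix in the filename matches the file's sha256.

    Returns None when the filename carries no hash suffix.
    """
    m = re.search(r"-([0-9a-f]{8})\.ckpt$", os.path.basename(path))
    if m is None:
        return None
    return sha256_of(path).startswith(m.group(1))
```

For an intact download, `matches_name_suffix("/wanghuan/ckpt/pangu_low_timestamp-127da122.ckpt")` should return True.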

@wanghuan-kunpneg
Author

The error persists after the fix. Two questions:
1. Is this still a problem with the ckpt files?
2. A memory error occurred; could it be related to the runtime environment?
   I'm currently using a single 910 card with 32 CPU cores and 64 GB of RAM. Does this fall short of the requirements?


I re-downloaded both files and verified the checksums:
[root@n1 ckpt]# ll
total 27032600
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:13 pangu_high_timestamp-c6344411.ckpt
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:24 pangu_low_timestamp-127da122.ckpt

[root@n1 ckpt]# pwd

/wanghuan/ckpt

[root@n1 ckpt]# sha256sum pangu_high_timestamp-c6344411.ckpt

c6344411e5f889941e6f6b9653499c476adb598b0a520877cf1a86d931e6e041 pangu_high_timestamp-c6344411.ckpt

[root@n1 ckpt]# sha256sum pangu_low_timestamp-127da122.ckpt

127da12275180c72e82e6173b8dd80d099507dcf2546fa139cdf4bde1d196965 pangu_low_timestamp-127da122.ckpt

Updated the script paths:
export MS_PYNATIVE_GE=1
export current_dir=/wanghuan/pangu_draw_v3
export PYTHONPATH=$current_dir:$PYTHONPATH
cd $current_dir

# run script
# When the device is running low on memory, the '--offload' parameter might be effective.
python pangu_sampling.py \
  --device_target "Ascend" \
  --ms_mode 1 \
  --ms_amp_level "O2" \
  --config "configs/inference/pangu_sd_xl_base.yaml" \
  --high_solution \
  --weight "/wanghuan/ckpt/pangu_low_timestamp-c6344411.ckpt" \
  --high_timestamp_weight "/wanghuan/ckpt/pangu_high_timestamp-127da122.ckpt" \
  --prompts_file "prompts.txt"

Error message:
[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/ckpt/pangu_low_timestamp-127da122.ckpt']
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.028 [mindspore/train/serialization.py:1378] For 'load_param_into_net', 2 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.204 [mindspore/train/serialization.py:1383] conditioner.embedders.0.transformer.text_model.embeddings.position_ids is not loaded.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.276 [mindspore/train/serialization.py:1383] conditioner.embedders.1.model.attn_mask is not loaded.
missing keys:
['conditioner.embedders.0.transformer.text_model.embeddings.position_ids', 'conditioner.embedders.1.model.attn_mask']
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Loading model from ['/wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt']
[ERROR] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:37.590.458 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
    checkpoint_list.ParseFromString(pb_content)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 632, in DecodeField
    field_dict[key] = buffer[pos:new_pos].tobytes()
MemoryError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 333, in sample
    high_timestamp_model, _ = create_model(
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
    model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
    _sd_dict = ms.load_checkpoint(ckpt)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
    checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
    raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.

@townwish4git
Contributor

These are your ckpt files:
[root@n1 ckpt]# ll
total 27032600
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:13 pangu_high_timestamp-c6344411.ckpt
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:24 pangu_low_timestamp-127da122.ckpt

And this is the script you ran:
python pangu_sampling.py \
  --device_target "Ascend" \
  --ms_mode 1 \
  --ms_amp_level "O2" \
  --config "configs/inference/pangu_sd_xl_base.yaml" \
  --high_solution \
  --weight "/wanghuan/ckpt/pangu_low_timestamp-c6344411.ckpt" \
  --high_timestamp_weight "/wanghuan/ckpt/pangu_high_timestamp-127da122.ckpt" \
  --prompts_file "prompts.txt"

The ckpt filenames don't seem to match. Please check that the two ckpt files are correctly named and passed in.

@wanghuan-kunpneg
Author

I pasted the wrong script above; I later found the filenames had been changed incorrectly and have now corrected them. The error message is slightly different:

[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/ckpt/pangu_low_timestamp-127da122.ckpt']
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.028 [mindspore/train/serialization.py:1378] For 'load_param_into_net', 2 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.204 [mindspore/train/serialization.py:1383] conditioner.embedders.0.transformer.text_model.embeddings.position_ids is not loaded.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.276 [mindspore/train/serialization.py:1383] conditioner.embedders.1.model.attn_mask is not loaded.
missing keys:
['conditioner.embedders.0.transformer.text_model.embeddings.position_ids', 'conditioner.embedders.1.model.attn_mask']
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Loading model from ['/wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt']
[ERROR] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:37.590.458 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
    checkpoint_list.ParseFromString(pb_content)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 632, in DecodeField
    field_dict[key] = buffer[pos:new_pos].tobytes()
MemoryError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 333, in sample
    high_timestamp_model, _ = create_model(
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
    model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
    _sd_dict = ms.load_checkpoint(ckpt)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
    checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
    raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.

@townwish4git
Contributor

With 64 GB of device memory, loading these two ckpts is not a problem. You could try writing a standalone script that does `from mindspore import load_checkpoint` and then `load_checkpoint(high_timestamp_model_file.ckpt)` to see whether the ckpt file loads correctly on its own.
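A minimal version of such a standalone check might look like the following. The file-level checks (`check_ckpt_file`) are my addition, to rule out permission and truncation problems before parsing; `load_checkpoint` is the MindSpore call from the traceback:

```python
import os

def check_ckpt_file(path):
    """Basic file-level facts to rule out permission/truncation issues first."""
    exists = os.path.isfile(path)
    return {
        "exists": exists,
        "readable": exists and os.access(path, os.R_OK),
        "size_bytes": os.path.getsize(path) if exists else 0,
    }

def try_load(path):
    """Attempt the real parse; a DecodeError/ValueError here means a bad file."""
    from mindspore import load_checkpoint  # deferred: file checks run without MindSpore
    param_dict = load_checkpoint(path)
    return len(param_dict)

if __name__ == "__main__":
    ckpt = "/wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt"
    print(check_ckpt_file(ckpt))
    # print(try_load(ckpt))  # uncomment on the Ascend host with MindSpore installed
```

If `check_ckpt_file` looks healthy but `try_load` still fails, the file content itself is suspect and re-downloading is the next step.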

@wanghuan-kunpneg
Author

Does the 910A need distributed inference? If so, what is the way to run it?

1. The 64 GB is host memory... On an Ascend 910A with 32 GB of device memory, the memory error still occurs.
2. After switching to an 8-card Atlas 800-9000 physical machine, the model runs, but reports the following error:

Sampling with PanGuEulerEDMSampler for 40 steps: 100%|███████████████████████████████████| 40/40 [08:37<00:00, 12.93s/it]
Sample latent Done.
Decode latent Starting...
Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 403, in sample
    amp_level=args.ms_amp_level,
  File "pangu_sampling.py", line 203, in run_txt2img
    amp_level=amp_level,
  File "/wanghuan/pangu_draw_v3/gm/models/diffusion.py", line 347, in pangu_do_sample
    samples_x = self.decode_first_stage(samples_z)
  File "/wanghuan/pangu_draw_v3/gm/models/diffusion.py", line 91, in decode_first_stage
    out = self.first_stage_model.decode(z)
  File "/root/miniconda3/envs/mindspore_py37/lib/python3.7/site-packages/mindspore/common/api.py", line 718, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore_py37/lib/python3.7/site-packages/mindspore/common/api.py", line 121, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore_py37/lib/python3.7/site-packages/mindspore/common/api.py", line 356, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError:

  • Memory not enough:

Device(id:0) memory isn't enough and alloc failed, kernel name: Default/decoder-Decoder/up-CellList/0-UpCell/block-CellList/0-ResnetBlock/norm1-_OutputTo16/_backbone-GroupNorm/Cast-op52827, alloc size: 1073741824B.


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:682 Run

@townwish4git
Contributor

Loading the high and low timestamp models at the same time with 32 GB of device memory will run into memory problems; you can add the --offload parameter to the inference script.

@ultranationalism
Contributor

Loading the high and low timestamp models at the same time with 32 GB of device memory will run into memory problems; you can add the --offload parameter to the inference script.

Running with fp16, it actually used only 14 GB of device memory.
