pangu draw 3.0: running ./run_sampling.sh fails with an error #317
Comments
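Based on the errors "Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file." and "google.protobuf.message.DecodeError: Truncated message.", the file is probably corrupted. Please check whether the sha256 of your local ckpt file matches the suffix in the ckpt file name given in the download link.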
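The check suggested above can be scripted. The following is a minimal sketch, assuming the download names follow the pattern seen later in this thread (for example pangu_low_timestamp-127da122.ckpt), where the hex suffix is the start of the file's SHA-256 digest. The helper names sha256_of and suffix_matches are hypothetical and not part of the project.

```python
import hashlib
import re
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def suffix_matches(path):
    """Compare the hex suffix in '<name>-<hex>.ckpt' with the file's SHA-256.

    Returns True or False when the name carries a suffix, and None when it
    does not (assumed naming pattern, taken from the file names in this thread).
    """
    match = re.search(r"-([0-9a-f]+)\.ckpt$", Path(path).name)
    if match is None:
        return None
    return sha256_of(path).startswith(match.group(1))
```

Calling suffix_matches on each downloaded file before renaming it gives a quick answer; a False result would point to a truncated or corrupted download, which fits the "Truncated message" error.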
It still fails after the change. I re-downloaded both files and checked the checksums:
[root@n1 ckpt]# pwd
/wanghuan/ckpt
[root@n1 ckpt]# sha256sum pangu_high_timestamp-c6344411.ckpt
c6344411e5f889941e6f6b9653499c476adb598b0a520877cf1a86d931e6e041 pangu_high_timestamp-c6344411.ckpt
[root@n1 ckpt]# sha256sum pangu_low_timestamp-127da122.ckpt
127da12275180c72e82e6173b8dd80d099507dcf2546fa139cdf4bde1d196965 pangu_low_timestamp-127da122.ckpt
Script with the updated paths:
run script
When the device is running low on memory, the '--offload' parameter might be effective.
python pangu_sampling.py
Error message:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):

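These are your ckpt files: This is the script you ran: The ckpt file names do not seem to match. Please check that both ckpt files are named and loaded correctly.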
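I had pasted the wrong script. I then found that I had renamed the files incorrectly; that is fixed now, and the error message is slightly different:
[root@n1 pangu_draw_v3]# ./run_sampling.sh
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
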
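With 64 GB of device memory, loading these two ckpt files normally works without problems. You could try writing a separate script that uses from mindspore import load_checkpoint and calls load_checkpoint(high_timestamp_model_file.ckpt) to see whether the ckpt file loads correctly.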
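A standalone check along those lines might look like the sketch below. It relies only on mindspore.load_checkpoint, which returns a dictionary of parameters; the script name check_ckpt.py and the helper try_load are hypothetical, and the checkpoint paths are passed on the command line rather than assumed.

```python
import sys


def try_load(path, loader):
    """Call loader(path); return (True, entry_count) or (False, error_text)."""
    try:
        params = loader(path)
    except Exception as exc:  # the reported failure surfaces as ValueError
        return False, f"{type(exc).__name__}: {exc}"
    return True, len(params)


def main(paths):
    if not paths:
        print("usage: python check_ckpt.py <file.ckpt> [<file.ckpt> ...]")
        return
    # Imported here so try_load can be exercised without MindSpore installed.
    from mindspore import load_checkpoint

    for path in paths:
        ok, detail = try_load(path, load_checkpoint)
        print(f"{path}: {'loaded' if ok else 'FAILED'} ({detail})")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Running it on each checkpoint in isolation separates a bad file from a problem in the sampling script: if the file loads here, the ckpt itself is readable and the path or name used by run_sampling.sh is the more likely culprit.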
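Does the 910A need distributed inference? How should it be run? 1. 64 GB of memory... On an Ascend 910A with 32 GB of device memory I still get a memory error.
Sampling with PanGuEulerEDMSampler for 40 steps: 100%|███████████████████████████████████| 40/40 [08:37<00:00, 12.93s/it]
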
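With 32 GB of device memory, loading the high and low timestamp models at the same time runs into memory limits. You can add the --offload parameter to the inference script.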
Running in fp16, it actually used only 14 GB of device memory.
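As a rough illustration of why precision matters here, the sketch below estimates weight storage only, using the two embedder parameter counts printed in the log of this issue. It is an assumption-laden estimate: the diffusion models' own parameter counts are not reported in the thread, and activations, optimizer-free runtime buffers, and framework overhead are ignored, so it does not reproduce the 14 GB figure.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Parameter counts as printed in the issue log ("Initialized embedder #0/#1").
EMBEDDER_PARAMS = {
    "FrozenCnCLIPEmbedder": 115_972_685,
    "FrozenOpenCLIPEmbedder2": 694_665_770,
}


def weight_gib(param_count, dtype):
    """Return the weight storage in GiB for param_count parameters."""
    return param_count * BYTES_PER_PARAM[dtype] / 2**30


for name, count in EMBEDDER_PARAMS.items():
    fp32 = weight_gib(count, "fp32")
    fp16 = weight_gib(count, "fp16")
    print(f"{name}: {fp32:.2f} GiB in fp32, {fp16:.2f} GiB in fp16")
```

Halving the bytes per parameter halves the weight footprint, which is the part of the saving that fp16 guarantees; the rest depends on the model and runtime.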
Hardware Environment
Please tell us the backend type that reported the error:
Ascend
Software Environment
MindSpore version:
Please tell us the MindSpore version you are using:
Python version: 3.8.8
OS: CentOS 8.2
GCC/Compiler version: 8.5.0
Describe the current behavior
[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/low_timestamp_model.ckpt']
[ERROR] ME(855962:281473315727456,MainProcess):2024-01-29-02:07:34.586.661 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
checkpoint_list.ParseFromString(pb_content)
File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
return self.MergeFromString(serialized)
File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
if self._InternalParse(serialized, 0, length) != length:
File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in InternalParse
pos = field_decoder(buffer, new_pos, end, self, field_dict)
File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 703, in DecodeRepeatedField
raise _DecodeError('Truncated message.')
google.protobuf.message.DecodeError: Truncated message.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pangu_sampling.py", line 454, in <module>
sample(args)
File "pangu_sampling.py", line 322, in sample
model, filter = create_model(
File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
_sd_dict = ms.load_checkpoint(ckpt)
File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file
Describe the expected behavior
Please describe the expected outputs or the functions you want to have.
Steps to reproduce the issue
e.g. cd xx -> bash scripts/xx.sh --config xx
Related log / screenshot
Special notes for this issue