[BUG] eval_descriptor have bug while used for MultiSystems #4533

Closed
QuantumMisaka opened this issue Jan 6, 2025 · 6 comments · Fixed by #4534

@QuantumMisaka

Bug summary

While using this script to generate descriptors with DeepPot.eval_descriptor:

import dpdata
from deepmd.infer.deep_pot import DeepPot
import numpy as np
import os
import gc
import glob
import logging

datadir = "./data-clean-v2-7-20873-npy"
modelpath = "./FeCHO-dpa231-v2-7-3heads-150w.pt"
savedir = "descriptors"

omp = 16
proc = 4
os.environ['OMP_NUM_THREADS'] = f'{omp}'

def descriptor_from_model(sys: dpdata.LabeledSystem, model:DeepPot):
    coords = sys.data["coords"]
    cells = sys.data["cells"]
    model_type_map = model.get_type_map()
    type_trans = np.array([model_type_map.index(i) for i in sys.data['atom_names']])
    atypes = list(type_trans[sys.data['atom_types']])
    predict = model.eval_descriptor(coords, cells, atypes)
    return predict
#alldata = dpdata.MultiSystems.from_dir(datadir,datakey,fmt="deepmd/npy")
all_set_directories = glob.glob(os.path.join(
    datadir, '**', 'set.*'), recursive=True)
all_directories = set()
for directory in all_set_directories:
    coord_path = os.path.join(directory, 'coord.npy')
    if os.path.exists(coord_path):
        all_directories.add(os.path.dirname(directory))
all_directories = list(all_directories)

model = DeepPot(modelpath, head="Target_FTS")

logging.basicConfig(
    level=logging.INFO, 
    format='%(asctime)s - %(levelname)s - %(message)s',  
    datefmt='%Y-%m-%d %H:%M:%S'  
)

logging.info("Start Generating Descriptors")

if not os.path.exists(savedir):
    os.mkdir(savedir)

with open("running", "w") as fo:
    for onedir in all_directories:
        onedata = dpdata.LabeledSystem(onedir, fmt="deepmd/npy")
        key = onedata.short_name
        save_key = f"{savedir}/{key}"
        logging.info(f"Generating descriptors for {key}")
        if os.path.exists(save_key):
            if os.path.exists(f"{save_key}/desc.npy"):
                logging.info(f"Descriptors for {key} already exist, skip")
                continue
        else:
            os.mkdir(save_key)
        desc = descriptor_from_model(onedata, model)
        logging.info(f"Descriptors for {key} generated")
        
        np.save(f"{savedir}/{key}/desc.npy", desc)
        logging.info(f"Descriptors for {key} saved")

logging.info("All Done !!!")
os.system("mv running done")

A RuntimeError is raised on the second eval_descriptor call, after the first LabeledSystem has been processed successfully:

2025-01-06 15:58:57 - INFO - Start Generating Descriptors
2025-01-06 15:58:57 - INFO - Generating descriptors for O0H6Fe48C8
2025-01-06 15:59:00 - INFO - Descriptors for O0H6Fe48C8 generated
2025-01-06 15:59:00 - INFO - Descriptors for O0H6Fe48C8 saved
2025-01-06 15:59:00 - INFO - Generating descriptors for O3H4Fe0C6
Traceback (most recent call last):
  File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-3h-100w/desc-gen/gen_desc.py", line 64, in <module>
    desc = descriptor_from_model(onedata, model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-3h-100w/desc-gen/gen_desc.py", line 25, in descriptor_from_model
    predict = model.eval_descriptor(coords, cells, atypes)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/infer/deep_eval.py", line 445, in eval_descriptor
    descriptor = self.deep_eval.eval_descriptor(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 658, in eval_descriptor
    descriptor = model.eval_descriptor()
                 ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/model/model/dp_model.py", line 66, in eval_descriptor
    def eval_descriptor(self) -> torch.Tensor:
        """Evaluate the descriptor."""
        return self.atomic_model.eval_descriptor()
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 76, in eval_descriptor
    def eval_descriptor(self) -> torch.Tensor:
        """Evaluate the descriptor."""
        return torch.concat(self.eval_descriptor_list)
               ~~~~~~~~~~~~ <--- HERE
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 62 but got size 13 for tensor number 1 in the list.
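
The traceback suggests that results from the first call were still in eval_descriptor_list when the second system (a different atom count) was evaluated, and the fix summary later in this thread mentions list clearing. The following is only a toy sketch of that failure mode, not DeePMD-kit code: ToyAtomicModel, its clear_between_calls flag, and the 4-dimensional fake descriptor are all hypothetical, and NumPy stands in for PyTorch.

```python
import numpy as np


class ToyAtomicModel:
    """Hypothetical stand-in for a model that caches descriptors per call."""

    def __init__(self, clear_between_calls: bool):
        self.clear_between_calls = clear_between_calls
        self.eval_descriptor_list = []

    def forward(self, coords: np.ndarray) -> None:
        # coords has shape (nframes, natoms, 3); the fake descriptor is
        # (nframes, natoms, 4), so its second dimension follows natoms.
        nframes, natoms, _ = coords.shape
        self.eval_descriptor_list.append(np.zeros((nframes, natoms, 4)))

    def eval_descriptor(self, coords: np.ndarray) -> np.ndarray:
        if self.clear_between_calls:
            # Drop results left over from an earlier system.
            self.eval_descriptor_list.clear()
        self.forward(coords)
        return np.concatenate(self.eval_descriptor_list)


system_a = np.zeros((2, 62, 3))  # 62 atoms per frame
system_b = np.zeros((2, 13, 3))  # 13 atoms per frame

stale = ToyAtomicModel(clear_between_calls=False)
stale.eval_descriptor(system_a)
try:
    stale.eval_descriptor(system_b)
    stale_failed = False
except ValueError:
    # NumPy raises ValueError here; PyTorch reports a RuntimeError instead.
    stale_failed = True

fixed = ToyAtomicModel(clear_between_calls=True)
fixed.eval_descriptor(system_a)
fixed_shape = fixed.eval_descriptor(system_b).shape
```

In the toy model, the stale list makes the second call concatenate a 62-atom array with a 13-atom array, which mirrors the size mismatch reported above; clearing the list first avoids it.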

DeePMD-kit Version

DeePMD-kit v3.0.0rc0

Backend and its version

Pytorch 2.5.1

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

The model, dataset, and scripts used are available at
https://www.jianguoyun.com/p/DS0CkjUQrZ-XCRiyh-cFIAA (access code:4Te2ER)

Steps to Reproduce

  • Extract the tar.gz archive with tar -zxvf
  • Run gen_desc.py in a DeePMD-kit 3.0.0-rc0 environment with dpdata installed

Further Information, Files, and Links

No response

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 6, 2025
Fix deepmodeling#4533.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
@njzjz njzjz linked a pull request Jan 6, 2025 that will close this issue
@njzjz njzjz self-assigned this Jan 6, 2025
@QuantumMisaka
Author

@njzjz Thanks for your rapid reply!
Another related question: when calling DeepPot.eval_descriptor directly on a LabeledSystem with a large number of frames (> 2000) on a GPU, memory usage always exceeds 40 GB and leads to an OOM error. Do you have any advice for controlling the memory consumption?
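
One possible workaround, if memory remains a problem, is to split the frames manually and evaluate them chunk by chunk. This is a minimal sketch under assumptions: eval_descriptor_in_chunks and chunk_size are hypothetical names, the result is assumed to be an array whose first axis is the frame index, and fake_eval only stands in for model.eval_descriptor so the example runs without a model. Whether this actually lowers peak GPU memory depends on where the memory is spent.

```python
import numpy as np


def eval_descriptor_in_chunks(eval_fn, coords, cells, atypes, chunk_size=200):
    """Evaluate descriptors chunk by chunk along the frame axis.

    eval_fn is any callable with the (coords, cells, atypes) signature used
    by model.eval_descriptor in the script above.
    """
    nframes = coords.shape[0]
    parts = []
    for start in range(0, nframes, chunk_size):
        stop = min(start + chunk_size, nframes)
        # Only frames [start, stop) are passed to the model at once.
        part = eval_fn(coords[start:stop], cells[start:stop], atypes)
        parts.append(np.asarray(part))
    return np.concatenate(parts, axis=0)


def fake_eval(coords, cells, atypes):
    # Hypothetical stand-in: one 8-dimensional descriptor per atom per frame.
    return np.ones((coords.shape[0], coords.shape[1], 8))


coords = np.zeros((5, 3, 3))  # 5 frames, 3 atoms
cells = np.zeros((5, 3, 3))   # one 3x3 cell per frame
desc = eval_descriptor_in_chunks(fake_eval, coords, cells, [0, 0, 1], chunk_size=2)
```

With a real model, the call would presumably look like eval_descriptor_in_chunks(model.eval_descriptor, coords, cells, atypes), using the same arrays as in descriptor_from_model.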

@njzjz

This comment has been minimized.

@QuantumMisaka
Author

@njzjz When using dp test, GPU memory usage seems to adapt automatically. Would it be possible to expose eval_descriptor as a command, similar to how dp test works? A detailed suggestion is in #4503.

@njzjz
Member

njzjz commented Jan 6, 2025

Sorry, I just realized you mean a large number of frames, not atoms.

@njzjz
Member

njzjz commented Jan 6, 2025

The automatic batch size is used by eval. Both dp test and eval_descriptor call eval, so I believe the memory should be handled properly.
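
For readers unfamiliar with the idea, an automatic batch size can be pictured roughly as follows. This is a generic sketch, not DeePMD-kit's actual implementation: run_with_adaptive_batch, fake_evaluate, and the 300-frame limit are hypothetical, and a MemoryError stands in for a GPU out-of-memory error.

```python
def run_with_adaptive_batch(evaluate, nframes, initial_batch=1024):
    """Process nframes frames, halving the batch size whenever evaluate fails.

    evaluate(start, stop) is a hypothetical callable that handles frames
    [start, stop) and raises MemoryError when the batch is too large.
    """
    batch = initial_batch
    start = 0
    results = []
    while start < nframes:
        stop = min(start + batch, nframes)
        try:
            results.append(evaluate(start, stop))
        except MemoryError:
            if batch == 1:
                raise  # even a single frame does not fit
            batch = max(1, batch // 2)
            continue  # retry the same frames with a smaller batch
        start = stop
    return results, batch


def fake_evaluate(start, stop, limit=300):
    # Pretend the device can hold at most `limit` frames at once.
    if stop - start > limit:
        raise MemoryError("batch too large")
    return stop - start


sizes, final_batch = run_with_adaptive_batch(fake_evaluate, nframes=2000)
```

Here the batch shrinks from 1024 to 512 to 256 before the first success, and all 2000 frames are then processed in batches that fit the pretend limit.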

@QuantumMisaka
Author

@njzjz Thanks for your reply!
I'll test again after this bug is fixed and open another issue if the related OOM problem persists.

github-merge-queue bot pushed a commit that referenced this issue Jan 7, 2025
Fix #4533.


## Summary by CodeRabbit

- **Bug Fixes**
  - Improved list clearing mechanism in `DPAtomicModel` class
  - Enhanced test coverage for descriptor evaluation in `TestDeepPot`


Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
@njzjz njzjz closed this as completed Jan 7, 2025