
Process hangs in CPU detection step #6

Closed
sjaenick opened this issue Feb 24, 2022 · 23 comments
Labels
bug (Something isn't working) · onnxruntime (onnxruntime on SLURM hangs) · question (Further information is requested) · SLURM

Comments

@sjaenick commented Feb 24, 2022

Python 3.8.5, ribodetector 0.2.3 (installed via pip), on Ubuntu 20.04 LTS, invoked on a public dataset;
the process just hangs after a few seconds, with no CPU consumption at all, and can't be cancelled via Ctrl-C
(it needs to be killed instead).

ribodetector_cpu \
  -l 100 -t 10 \
  -e norrna \
  -i ../SRR3569371/SRR3569371_1.fastq ../SRR3569371/SRR3569371_2.fastq \
  -o read1.fq read2.fq

When invoked with python -m trace --trace, it seems to get stuck in the CPU detection step:

detect_cpu.py(71):             cd, self.config['state_file'][model_file_ext]).replace('.pth', '.onnx')
detect_cpu.py(70):         self.model_file = os.path.join(
detect_cpu.py(74):         so = onnxruntime.SessionOptions()
detect_cpu.py(77):         so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
detect_cpu.py(79):         self.model = onnxruntime.InferenceSession(self.model_file, so)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(315):         Session.__init__(self)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(104):         self._sess = None
onnxruntime_inference_collection.py(105):         self._enable_fallback = True
onnxruntime_inference_collection.py(317):         if isinstance(path_or_bytes, str):
onnxruntime_inference_collection.py(318):             self._model_path = path_or_bytes
onnxruntime_inference_collection.py(319):             self._model_bytes = None
onnxruntime_inference_collection.py(326):         self._sess_options = sess_options
onnxruntime_inference_collection.py(327):         self._sess_options_initial = sess_options
onnxruntime_inference_collection.py(328):         self._enable_fallback = True
onnxruntime_inference_collection.py(329):         self._read_config_from_model = os.environ.get('ORT_LOAD_CONFIG_FROM_MODEL') == '1'
 --- modulename: _collections_abc, funcname: get
_collections_abc.py(659):         try:
_collections_abc.py(660):             return self[key]
 --- modulename: os, funcname: __getitem__
os.py(671):         try:
os.py(672):             value = self._data[self.encodekey(key)]
 --- modulename: os, funcname: encode
os.py(749):             if not isinstance(value, str):
os.py(751):             return value.encode(encoding, 'surrogateescape')
os.py(673):         except KeyError:
os.py(675):             raise KeyError(key) from None
_collections_abc.py(661):         except KeyError:
_collections_abc.py(662):             return default
onnxruntime_inference_collection.py(332):         disabled_optimizers = kwargs['disabled_optimizers'] if 'disabled_optimizers' in kwargs else None
onnxruntime_inference_collection.py(334):         try:
onnxruntime_inference_collection.py(335):             self._create_inference_session(providers, provider_options, disabled_optimizers)
 --- modulename: onnxruntime_inference_collection, funcname: _create_inference_session
onnxruntime_inference_collection.py(347):         available_providers = C.get_available_providers()
onnxruntime_inference_collection.py(350):         if 'TensorrtExecutionProvider' in available_providers:
onnxruntime_inference_collection.py(353):             self._fallback_providers = ['CPUExecutionProvider']
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
onnxruntime_inference_collection.py(357):                                                                         provider_options,
onnxruntime_inference_collection.py(358):                                                                         available_providers)
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
 --- modulename: onnxruntime_inference_collection, funcname: check_and_normalize_provider_args
onnxruntime_inference_collection.py(48):     if providers is None:
onnxruntime_inference_collection.py(49):         return [], []
onnxruntime_inference_collection.py(359):         if providers == [] and len(available_providers) > 1:
onnxruntime_inference_collection.py(366):         session_options = self._sess_options if self._sess_options else C.get_default_session_options()
onnxruntime_inference_collection.py(367):         if self._model_path:
onnxruntime_inference_collection.py(368):             sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)

I added some debugging output to verify self.model_file, which correctly points to the ribodetector_600k_variable_len70_101_epoch47.onnx file.
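
For reference, the hang can be reproduced with just the session construction, independent of the rest of ribodetector; a minimal sketch of what detect_cpu.py does at that point (the model path is the file mentioned above, adjust as needed):

import onnxruntime

# Path as resolved by detect_cpu.py into self.model_file (adjust to your install)
model_file = "ribodetector_600k_variable_len70_101_epoch47.onnx"

so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

# The process hangs inside this constructor (the native C.InferenceSession call in the trace above)
model = onnxruntime.InferenceSession(model_file, so)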

Any ideas?

@sjaenick (Author)

It seems to be CPU-related somehow:

Hangs on:

model name      : AMD EPYC 7742 64-Core Processor

Seems to work:

model name      : Intel(R) Xeon(R) CPU E5-4627 v4 @ 2.60GHz

@dawnmy (Member) commented Feb 24, 2022

Thank you for reporting this issue. This is weird. I have tested it on my workstation with an AMD Ryzen CPU without any issue, and AMD EPYC and Ryzen both use the Zen microarchitecture. I will investigate whether this is a bug in onnxruntime.

@sjaenick (Author)

Thanks - let me know if I can do anything to narrow this down (but be aware I barely know any Python).

@dawnmy added the question and bug labels on Feb 24, 2022
@sjaenick (Author)

Ok, I reinstalled onnxruntime via pip (which also updated some other packages) and now it works. Feel free to close this issue.

@dawnmy (Member) commented Feb 26, 2022

That is great to hear. Could you run pip list in the environment where you installed RiboDetector? Then I can specify the versions of the required packages that worked for you when building the RiboDetector package for pip.

@sjaenick (Author)

Not a virtual environment, so the list is a little bit longer:

Package                          Version
-------------------------------- -------------------
absl-py                          0.13.0
antismash                        5.1.2
argcomplete                      1.12.3
argh                             0.26.2
astunparse                       1.6.3
bagit                            1.7.0
bcbio-gff                        0.6.6
bertax                           0.1
biom-format                      2.1.8
biopython                        1.78
BUSCO                            5.2.2
CacheControl                     0.11.7
cachetools                       4.2.2
certifi                          2020.12.5
chardet                          4.0.0
checkm-genome                    1.1.3
click                            7.1.2
CMSeq                            1.0.1
coloredlogs                      15.0
concoct                          1.1.0
cwltool                          3.0.20201203173111
cycler                           0.10.0
Cython                           0.29.21
decorator                        4.4.2
DendroPy                         4.4.0
flatbuffers                      2.0
flye                             2.8.3
future                           0.18.2
gast                             0.4.0
gffutils                         0.10.1
google-auth                      1.32.1
google-auth-oauthlib             0.4.4
google-pasta                     0.2.0
grpcio                           1.34.1
h5py                             3.1.0
helperlibs                       0.2.1
humanfriendly                    9.1
idna                             2.10
isodate                          0.6.0
Jinja2                           2.11.2
joblib                           0.16.0
Keras                            2.4.3
keras-bert                       0.88.0
keras-embed-sim                  0.9.0
keras-layer-normalization        0.15.0
keras-multi-head                 0.28.0
keras-nightly                    2.5.0.dev2021032900
keras-pos-embd                   0.12.0
keras-position-wise-feed-forward 0.7.0
Keras-Preprocessing              1.1.2
keras-self-attention             0.50.0
keras-transformer                0.39.0
kiwisolver                       1.2.0
lockfile                         0.12.2
lxml                             4.6.2
Markdown                         3.3.4
MarkupSafe                       2.0.1
matplotlib                       3.3.1
MetaPhlAn                        3.0
mistune                          0.8.4
mypy-extensions                  0.4.3
networkx                         2.5
nose                             1.3.7
numpy                            1.22.2
oauthlib                         3.1.1
onnxruntime                      1.10.0
opt-einsum                       3.3.0
pandas                           1.1.2
PhyloPhlAn                       3.0.0
Pillow                           7.2.0
pip                              22.0.3
protobuf                         3.19.4
prov                             1.5.1
psutil                           5.8.0
pyasn1                           0.4.8
pyasn1-modules                   0.2.8
pydot                            1.4.1
pyfaidx                          0.6.2
pyparsing                        2.4.7
pysam                            0.16.0.1
pyScss                           1.3.7
pysvg-py3                        0.2.2.post3
python-dateutil                  2.8.1
python-igraph                    0.9.7
pytz                             2020.1
PyYAML                           5.4.1
rdflib                           4.2.2
rdflib-jsonld                    0.5.0
requests                         2.25.1
requests-oauthlib                1.3.0
ribodetector                     0.2.3
rsa                              4.7.2
ruamel.yaml                      0.16.5
schema-salad                     7.0.20201119201711
scikit-learn                     0.23.2
scipy                            1.5.2
seaborn                          0.11.0
sepp                             4.5.1
setuptools                       51.1.1
shellescape                      3.4.1
simplejson                       3.17.5
six                              1.15.0
tensorboard                      2.5.0
tensorboard-data-server          0.6.1
tensorboard-plugin-wit           1.8.0
tensorflow                       2.5.0
tensorflow-estimator             2.5.0
termcolor                        1.1.0
texttable                        1.6.4
threadpoolctl                    2.1.0
torch                            1.7.1
tqdm                             4.62.3
typing-extensions                3.7.4.3
urllib3                          1.26.2
Werkzeug                         2.0.1
wheel                            0.35.1
wrapt                            1.12.1

@dawnmy (Member) commented Feb 27, 2022

Thank you for sharing the package version list. I will update the package soon.

@dawnmy (Member) commented Mar 1, 2022

Hi. I updated the dependency versions in the repo, but I haven't pushed the update to pip yet. You can install the repo version with:

conda create -n ribodetector_0.2.4 python=3.8
conda activate ribodetector_0.2.4
git clone https://github.com/hzi-bifo/RiboDetector.git
cd RiboDetector
pip install .

Hope this update will work without any issue.

@dawnmy (Member) commented Mar 2, 2022

I will close this issue as it seems to be solved.

@dawnmy closed this as completed on Mar 2, 2022
@dawnmy (Member) commented Mar 2, 2022

@sjaenick Could you provide more details about how you solved this issue? It seems the other open issue #9 is related to this one. The multiprocessing in CPU mode has a compatibility issue with SLURM that causes the process to hang/freeze.

@dawnmy reopened this on Mar 2, 2022
@sjaenick (Author) commented Mar 2, 2022

Nothing but pip3 install --force-reinstall onnxruntime

@dawnmy added the onnxruntime label on Mar 3, 2022
@karl-az commented Mar 4, 2022

Hi, I have a similar issue (also related to onnxruntime), but it results in a different error:

Traceback (most recent call last):
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/bin/ribodetector_cpu", line 10, in <module>
    sys.exit(main())
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 526, in main
    seq_pred.load_model()
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 79, in load_model
    self.model = onnxruntime.InferenceSession(self.model_file, so)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 368, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /home/conda/feedstock_root/build_artifacts/onnxruntime_1639384799973/work/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:

Let me know if this should go into a separate issue.

I installed version 0.2.3 through bioconda together with 1.10.0 of onnxruntime (build py39h15e0acf_2).

The command works when I run it locally, but fails when I submit it to SLURM. In my case the two machines also have different CPUs, but both are Intel.

I found microsoft/onnxruntime#8313, which hints that at least my error could have something to do with sleeping CPUs.
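
In case it helps with debugging: on Linux the allocation can be inspected from inside the job with a short standard-library snippet; the idea is to compare the machine's core count with the cores this process is actually allowed to run on (a sketch):

import os

# Total cores the machine reports vs. the cores in this process's affinity mask.
# Inside a SLURM allocation the affinity mask is typically much smaller,
# which is what an auto-detected thread pool might trip over.
print("os.cpu_count():           ", os.cpu_count())
print("len(sched_getaffinity(0)):", len(os.sched_getaffinity(0)))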

@dawnmy (Member) commented Mar 4, 2022

@karl-az I think these, including #9, are all the same issue related to onnxruntime's SLURM compatibility. RiboDetector works fine with other task management systems, e.g. PBS and SGE. I opened an issue in the onnxruntime repo a few days ago: microsoft/onnxruntime#10736. I hope to get some clues there. I can update the code to let RiboDetector run on SLURM, but then only 2 CPUs can be fully utilized (details can be seen in the onnxruntime issue).

@sjaenick (Author) commented Mar 4, 2022

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

@dawnmy (Member) commented Mar 4, 2022

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

Yes, I tried this as well with an srun interactive job and the job froze. If you run it independently of SLURM, i.e. directly over ssh, everything is fine. I assume this SLURM-related issue is only present in CPU mode; GPU mode should be fine.

@karl-az commented Mar 7, 2022

I'm able to resolve this by uncommenting this line:

# so.intra_op_num_threads = 2

I tried both 1 and 2 for this setting and both execute nicely through SLURM, utilizing 5 cores. (L76 is not needed.)

As I understand the definition of intra_op_num_threads (http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html), it defines the number of threads per worker. It defaults to 0, which lets onnxruntime auto-detect. My suspicion is that the auto-detection somehow steps outside the SLURM "sandbox", trying to use unreserved cores, and that comes across as trying to activate "sleeping" CPUs.
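
For clarity, the change amounts to pinning the per-session thread count when the session options are built; a minimal sketch (not the exact detect_cpu.py code, and the model path is a placeholder):

import onnxruntime

so = onnxruntime.SessionOptions()
# 0 (the default) lets onnxruntime size the intra-op thread pool itself,
# which seems to probe cores outside the SLURM allocation.
# Pinning it to 1 or 2 keeps the pool inside the reserved cores.
so.intra_op_num_threads = 1
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

model = onnxruntime.InferenceSession("model.onnx", so)  # placeholder path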

@dawnmy (Member) commented Mar 7, 2022

@karl-az Thank you for sharing your solution. Yes, you are right; your assumption that the auto-detection somehow steps outside the SLURM "sandbox" is very reasonable, and I agree with it. I have also tried setting a non-zero value for intra_op_num_threads; details can be found in microsoft/onnxruntime#10736. However, the total CPU load (sum of all processes) was only 200% no matter how many CPUs (-t) I specified. Could you check the total CPU load?

@karl-az commented Mar 7, 2022

I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300%. Could the process be bound by something else?

@dawnmy (Member) commented Mar 7, 2022

I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300%. Could the process be bound by something else?

Now I have figured out why it used only 200% or 300% CPU. If you run a SLURM task without setting --cpus-per-task, it will use the default number of CPUs preconfigured by the admin. If -t is set larger than the default SLURM --cpus-per-task, the CPU load will be lower than expected.

So if you run ribodetector with SLURM, you should set --cpus-per-task to what you want. For example, for interactive mode, start the session with
srun --qos interactive --cpus-per-task {number of CPUs you need} --threads-per-core 1 --pty /bin/bash
Then run ribodetector_cpu -t {number of CPUs you need} .... The current version of ribodetector needs to be updated, i.e. changed to intra_op_num_threads = 1. I will update the repo soon. If it is urgent, you can modify intra_op_num_threads yourself.
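
To illustrate how the numbers interact (a rough sketch, not the actual ribodetector code): each of the -t worker processes holds its own single-threaded session, so the total load is capped by -t and, under SLURM, additionally by --cpus-per-task:

import multiprocessing as mp
import onnxruntime

MODEL = "model.onnx"  # placeholder; point this at the bundled .onnx model

def predict(chunk):
    # One single-threaded session per worker process: roughly one core per worker,
    # so total CPU usage is about min(-t, --cpus-per-task) x 100%.
    so = onnxruntime.SessionOptions()
    so.intra_op_num_threads = 1
    sess = onnxruntime.InferenceSession(MODEL, so)
    # ... run inference on this chunk of reads ...
    return len(chunk)

if __name__ == "__main__":
    workers = 10  # the value passed via -t
    with mp.Pool(workers) as pool:
        pool.map(predict, [[] for _ in range(workers)])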

@dawnmy (Member) commented Mar 7, 2022

The latest release v0.2.4 solved this issue. Please update to v0.2.4 with:

pip install ribodetector -U

@karl-az commented Mar 7, 2022

Thank you very much! I will pick it up when the release reaches bioconda.

@dawnmy (Member) commented Mar 8, 2022

It is available on bioconda now.

@karl-az commented Mar 8, 2022

Can confirm that it works for me. Thank you!

@dawnmy closed this as completed on Mar 8, 2022