
Process hangs in CPU detection step #6

Closed
sjaenick opened this issue Feb 24, 2022 · 23 comments
Labels
bug (Something isn't working) · onnxruntime (onnxruntime on SLURM hangs) · question (Further information is requested) · SLURM

Comments

@sjaenick commented Feb 24, 2022

Python 3.8.5, ribodetector 0.2.3 (installed via pip), on Ubuntu 20.04 LTS, invoked on a public dataset;
the process just hangs after a few seconds, with no CPU consumption at all, and can't be cancelled via Ctrl-C
(it needs to be killed instead).

ribodetector_cpu \
  -l 100 -t 10 \
  -e norrna \
  -i ../SRR3569371/SRR3569371_1.fastq ../SRR3569371/SRR3569371_2.fastq \
  -o read1.fq read2.fq

When invoked with python -m trace --trace, it seems to get stuck in the CPU detection step:

detect_cpu.py(71):             cd, self.config['state_file'][model_file_ext]).replace('.pth', '.onnx')
detect_cpu.py(70):         self.model_file = os.path.join(
detect_cpu.py(74):         so = onnxruntime.SessionOptions()
detect_cpu.py(77):         so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
detect_cpu.py(79):         self.model = onnxruntime.InferenceSession(self.model_file, so)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(315):         Session.__init__(self)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(104):         self._sess = None
onnxruntime_inference_collection.py(105):         self._enable_fallback = True
onnxruntime_inference_collection.py(317):         if isinstance(path_or_bytes, str):
onnxruntime_inference_collection.py(318):             self._model_path = path_or_bytes
onnxruntime_inference_collection.py(319):             self._model_bytes = None
onnxruntime_inference_collection.py(326):         self._sess_options = sess_options
onnxruntime_inference_collection.py(327):         self._sess_options_initial = sess_options
onnxruntime_inference_collection.py(328):         self._enable_fallback = True
onnxruntime_inference_collection.py(329):         self._read_config_from_model = os.environ.get('ORT_LOAD_CONFIG_FROM_MODEL') == '1'
 --- modulename: _collections_abc, funcname: get
_collections_abc.py(659):         try:
_collections_abc.py(660):             return self[key]
 --- modulename: os, funcname: __getitem__
os.py(671):         try:
os.py(672):             value = self._data[self.encodekey(key)]
 --- modulename: os, funcname: encode
os.py(749):             if not isinstance(value, str):
os.py(751):             return value.encode(encoding, 'surrogateescape')
os.py(673):         except KeyError:
os.py(675):             raise KeyError(key) from None
_collections_abc.py(661):         except KeyError:
_collections_abc.py(662):             return default
onnxruntime_inference_collection.py(332):         disabled_optimizers = kwargs['disabled_optimizers'] if 'disabled_optimizers' in kwargs else None
onnxruntime_inference_collection.py(334):         try:
onnxruntime_inference_collection.py(335):             self._create_inference_session(providers, provider_options, disabled_optimizers)
 --- modulename: onnxruntime_inference_collection, funcname: _create_inference_session
onnxruntime_inference_collection.py(347):         available_providers = C.get_available_providers()
onnxruntime_inference_collection.py(350):         if 'TensorrtExecutionProvider' in available_providers:
onnxruntime_inference_collection.py(353):             self._fallback_providers = ['CPUExecutionProvider']
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
onnxruntime_inference_collection.py(357):                                                                         provider_options,
onnxruntime_inference_collection.py(358):                                                                         available_providers)
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
 --- modulename: onnxruntime_inference_collection, funcname: check_and_normalize_provider_args
onnxruntime_inference_collection.py(48):     if providers is None:
onnxruntime_inference_collection.py(49):         return [], []
onnxruntime_inference_collection.py(359):         if providers == [] and len(available_providers) > 1:
onnxruntime_inference_collection.py(366):         session_options = self._sess_options if self._sess_options else C.get_default_session_options()
onnxruntime_inference_collection.py(367):         if self._model_path:
onnxruntime_inference_collection.py(368):             sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)

I added some debugging output to verify self.model_file, which correctly points to the ribodetector_600k_variable_len70_101_epoch47.onnx file.
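
For reference, the hang can be reproduced with just the session construction, independent of the rest of ribodetector; a minimal sketch of what detect_cpu.py does at that point (the model path is the file mentioned above, adjust as needed):

import onnxruntime

# Path as resolved by detect_cpu.py into self.model_file (adjust to your install)
model_file = "ribodetector_600k_variable_len70_101_epoch47.onnx"

so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

# The process hangs inside this constructor (the native C.InferenceSession call in the trace above)
model = onnxruntime.InferenceSession(model_file, so)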

Any ideas?

@sjaenick (Author)

It seems to be CPU-related somehow:

Hangs on:

model name      : AMD EPYC 7742 64-Core Processor

Seems to work:

model name      : Intel(R) Xeon(R) CPU E5-4627 v4 @ 2.60GHz

@dawnmy (Member) commented Feb 24, 2022

Thank you for reporting this issue. This is weird. I have tested it on my workstation with an AMD Ryzen CPU without any issue, and AMD EPYC and Ryzen both use the Zen microarchitecture. I will investigate whether this is a bug in onnxruntime.

@sjaenick (Author)

Thanks - let me know if I can do anything to narrow this down (but be aware I barely know any Python).

@dawnmy added the question and bug labels on Feb 24, 2022
@sjaenick (Author)

Ok, I reinstalled onnxruntime via pip (which also updated some other packages) and now it works. Feel free to close this issue.

@dawnmy (Member) commented Feb 26, 2022

That is great to hear. Could you run pip list in the environment where you installed RiboDetector? Then I can specify the versions of the required packages that worked for you when building the RiboDetector package for pip.

@sjaenick (Author)

Not a virtual environment, so the list is a little bit longer:

Package                          Version
-------------------------------- -------------------
absl-py                          0.13.0
antismash                        5.1.2
argcomplete                      1.12.3
argh                             0.26.2
astunparse                       1.6.3
bagit                            1.7.0
bcbio-gff                        0.6.6
bertax                           0.1
biom-format                      2.1.8
biopython                        1.78
BUSCO                            5.2.2
CacheControl                     0.11.7
cachetools                       4.2.2
certifi                          2020.12.5
chardet                          4.0.0
checkm-genome                    1.1.3
click                            7.1.2
CMSeq                            1.0.1
coloredlogs                      15.0
concoct                          1.1.0
cwltool                          3.0.20201203173111
cycler                           0.10.0
Cython                           0.29.21
decorator                        4.4.2
DendroPy                         4.4.0
flatbuffers                      2.0
flye                             2.8.3
future                           0.18.2
gast                             0.4.0
gffutils                         0.10.1
google-auth                      1.32.1
google-auth-oauthlib             0.4.4
google-pasta                     0.2.0
grpcio                           1.34.1
h5py                             3.1.0
helperlibs                       0.2.1
humanfriendly                    9.1
idna                             2.10
isodate                          0.6.0
Jinja2                           2.11.2
joblib                           0.16.0
Keras                            2.4.3
keras-bert                       0.88.0
keras-embed-sim                  0.9.0
keras-layer-normalization        0.15.0
keras-multi-head                 0.28.0
keras-nightly                    2.5.0.dev2021032900
keras-pos-embd                   0.12.0
keras-position-wise-feed-forward 0.7.0
Keras-Preprocessing              1.1.2
keras-self-attention             0.50.0
keras-transformer                0.39.0
kiwisolver                       1.2.0
lockfile                         0.12.2
lxml                             4.6.2
Markdown                         3.3.4
MarkupSafe                       2.0.1
matplotlib                       3.3.1
MetaPhlAn                        3.0
mistune                          0.8.4
mypy-extensions                  0.4.3
networkx                         2.5
nose                             1.3.7
numpy                            1.22.2
oauthlib                         3.1.1
onnxruntime                      1.10.0
opt-einsum                       3.3.0
pandas                           1.1.2
PhyloPhlAn                       3.0.0
Pillow                           7.2.0
pip                              22.0.3
protobuf                         3.19.4
prov                             1.5.1
psutil                           5.8.0
pyasn1                           0.4.8
pyasn1-modules                   0.2.8
pydot                            1.4.1
pyfaidx                          0.6.2
pyparsing                        2.4.7
pysam                            0.16.0.1
pyScss                           1.3.7
pysvg-py3                        0.2.2.post3
python-dateutil                  2.8.1
python-igraph                    0.9.7
pytz                             2020.1
PyYAML                           5.4.1
rdflib                           4.2.2
rdflib-jsonld                    0.5.0
requests                         2.25.1
requests-oauthlib                1.3.0
ribodetector                     0.2.3
rsa                              4.7.2
ruamel.yaml                      0.16.5
schema-salad                     7.0.20201119201711
scikit-learn                     0.23.2
scipy                            1.5.2
seaborn                          0.11.0
sepp                             4.5.1
setuptools                       51.1.1
shellescape                      3.4.1
simplejson                       3.17.5
six                              1.15.0
tensorboard                      2.5.0
tensorboard-data-server          0.6.1
tensorboard-plugin-wit           1.8.0
tensorflow                       2.5.0
tensorflow-estimator             2.5.0
termcolor                        1.1.0
texttable                        1.6.4
threadpoolctl                    2.1.0
torch                            1.7.1
tqdm                             4.62.3
typing-extensions                3.7.4.3
urllib3                          1.26.2
Werkzeug                         2.0.1
wheel                            0.35.1
wrapt                            1.12.1

@dawnmy (Member) commented Feb 27, 2022

Thank you for sharing the package version list. I will update the package soon.

@dawnmy (Member) commented Mar 1, 2022

Hi. I updated the dependency versions in the repo, but I haven't pushed the update to pip yet. You can install the repo version with:

conda create -n ribodetector_0.2.4 python=3.8
conda activate ribodetector_0.2.4
git clone https://github.com/hzi-bifo/RiboDetector.git
cd RiboDetector
pip install .

Hope this update will work without any issue.

@dawnmy (Member) commented Mar 2, 2022

I will close this issue as it seems to be solved.

@dawnmy closed this as completed on Mar 2, 2022
@dawnmy (Member) commented Mar 2, 2022

@sjaenick Could you provide more details about how you solved this issue? It seems the other open issue #9 is related to this one. The multiprocessing in CPU mode has a compatibility issue with SLURM that causes the process to hang/freeze.

@dawnmy reopened this on Mar 2, 2022
@sjaenick (Author) commented Mar 2, 2022

Nothing but pip3 install --force-reinstall onnxruntime

@dawnmy added the onnxruntime label on Mar 3, 2022
@karl-az commented Mar 4, 2022

Hi, I have a similar issue (also related to onnxruntime), but it results in a different error:

Traceback (most recent call last):
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/bin/ribodetector_cpu", line 10, in <module>
    sys.exit(main())
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 526, in main
    seq_pred.load_model()
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 79, in load_model
    self.model = onnxruntime.InferenceSession(self.model_file, so)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 368, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /home/conda/feedstock_root/build_artifacts/onnxruntime_1639384799973/work/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:

Let me know if this should go into a separate issue.

I installed version 0.2.3 through bioconda together with 1.10.0 of onnxruntime (build py39h15e0acf_2).

The command works when I run it locally, but fails when I submit it to SLURM. In my case the two machines also have different CPUs, but both are Intel.

I found microsoft/onnxruntime#8313, which hints that at least my error could have something to do with sleeping CPUs.
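
In case it helps with debugging: on Linux the allocation can be inspected from inside the job with a short standard-library snippet; the idea is to compare the machine's core count with the cores this process is actually allowed to run on (a sketch):

import os

# Total cores the machine reports vs. the cores in this process's affinity mask.
# Inside a SLURM allocation the affinity mask is typically much smaller,
# which is what an auto-detected thread pool might trip over.
print("os.cpu_count():           ", os.cpu_count())
print("len(sched_getaffinity(0)):", len(os.sched_getaffinity(0)))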

@dawnmy (Member) commented Mar 4, 2022

@karl-az I think these, including #9, are all the same issue related to onnxruntime's SLURM compatibility. RiboDetector works fine with other task management systems, e.g. PBS and SGE. I opened an issue in the onnxruntime repo a few days ago: microsoft/onnxruntime#10736. I hope to get some clues there. I can update the code to let RiboDetector run on SLURM, but then only 2 CPUs can be fully utilized (details can be seen in the onnxruntime issue).

@sjaenick (Author) commented Mar 4, 2022

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

@dawnmy (Member) commented Mar 4, 2022

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

Yes, I tried this as well with an srun interactive job and the job froze. If you run it independently of SLURM, i.e. directly over ssh, everything is fine. I assume this SLURM-related issue is only present in CPU mode; GPU mode should be fine.

@karl-az commented Mar 7, 2022

I'm able to resolve this by uncommenting this line:

# so.intra_op_num_threads = 2

I tried both 1 and 2 for this setting and both execute nicely through SLURM, utilizing 5 cores. (L76 is not needed.)

As I understand the definition of intra_op_num_threads (http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html), it defines the number of threads per worker. It defaults to 0, which lets onnxruntime auto-detect. My suspicion is that the auto-detection somehow steps outside the SLURM "sandbox", trying to use unreserved cores, and that comes across as trying to activate "sleeping" CPUs.
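
For clarity, the change amounts to pinning the per-session thread count when the session options are built; a minimal sketch (not the exact detect_cpu.py code, and the model path is a placeholder):

import onnxruntime

so = onnxruntime.SessionOptions()
# 0 (the default) lets onnxruntime size the intra-op thread pool itself,
# which seems to probe cores outside the SLURM allocation.
# Pinning it to 1 or 2 keeps the pool inside the reserved cores.
so.intra_op_num_threads = 1
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

model = onnxruntime.InferenceSession("model.onnx", so)  # placeholder path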

@dawnmy (Member) commented Mar 7, 2022

@karl-az Thank you for sharing your solution. Yes, you are right; your assumption that the auto-detection somehow steps outside the SLURM "sandbox" is very reasonable, and I agree with it. I have also tried setting a non-zero value for intra_op_num_threads; details can be found in microsoft/onnxruntime#10736. However, the total CPU load (sum of all processes) was only 200% no matter how many CPUs (-t) I specified. Could you check the total CPU load?

@karl-az commented Mar 7, 2022

I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300%. Could the process be bound by something else?

@dawnmy (Member) commented Mar 7, 2022

I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300%. Could the process be bound by something else?

Now I have figured out why it used only 200% or 300% CPU. If you run a SLURM task without setting --cpus-per-task, it will use the default number of CPUs preconfigured by the admin. If -t is set larger than the default SLURM --cpus-per-task, the CPU load will be lower than expected.

So if you run ribodetector with SLURM, you should set --cpus-per-task to what you want. For example, for interactive mode, start the session with
srun --qos interactive --cpus-per-task {number of CPUs you need} --threads-per-core 1 --pty /bin/bash
Then run ribodetector_cpu -t {number of CPUs you need} .... The current version of ribodetector needs to be updated, i.e. changed to intra_op_num_threads = 1. I will update the repo soon. If it is urgent, you can modify intra_op_num_threads yourself.
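
To illustrate how the numbers interact (a rough sketch, not the actual ribodetector code): each of the -t worker processes holds its own single-threaded session, so the total load is capped by -t and, under SLURM, additionally by --cpus-per-task:

import multiprocessing as mp
import onnxruntime

MODEL = "model.onnx"  # placeholder; point this at the bundled .onnx model

def predict(chunk):
    # One single-threaded session per worker process: roughly one core per worker,
    # so total CPU usage is about min(-t, --cpus-per-task) x 100%.
    so = onnxruntime.SessionOptions()
    so.intra_op_num_threads = 1
    sess = onnxruntime.InferenceSession(MODEL, so)
    # ... run inference on this chunk of reads ...
    return len(chunk)

if __name__ == "__main__":
    workers = 10  # the value passed via -t
    with mp.Pool(workers) as pool:
        pool.map(predict, [[] for _ in range(workers)])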

@dawnmy (Member) commented Mar 7, 2022

The latest release v0.2.4 solved this issue. Please update to v0.2.4 with:

pip install ribodetector -U

@karl-az commented Mar 7, 2022

Thank you very much! I will pick it up when the release reaches bioconda.

@dawnmy (Member) commented Mar 8, 2022

It is available on bioconda now.

@karl-az commented Mar 8, 2022

Can confirm that it works for me. Thank you!

@dawnmy closed this as completed on Mar 8, 2022