Process hangs in CPU detection step #6
Comments
It seems to be CPU-related somehow: Hangs on:
Seems to work:
|
Thank you for reporting this issue. This is weird. I have tested it on my workstation with AMD Ryzen without any issue. AMD EPYC and Ryzen both use the Zen microarchitecture. Will investigate whether this is a bug in onnxruntime. |
Thanks - let me know if I can do anything to narrow this down (but be aware I barely know any Python). |
Ok, I reinstalled onnxruntime via pip (which also updated some other packages) and now it works. Feel free to close this issue. |
That is great to hear. Could you run |
Not a virtual environment, so the list is a little bit longer..
|
Thank you for sharing the package version list. I will update the package soon. |
Hi. I updated the dependency versions in the repo but I haven't updated it in pip. You can install it with:
Hope this update will work without any issue. |
I will close this issue as it seems to be solved. |
Nothing but |
Hi, I have a similar issue (also related to onnxruntime), but it results in a different error:
Let me know if this should go into a separate issue. I installed version 0.2.3 through bioconda together with onnxruntime 1.10.0 (build py39h15e0acf_2). The command works when I run it locally, but fails when I submit it to SLURM. In my case the two machines also have different CPUs, but both are from Intel. I found microsoft/onnxruntime#8313, which hints that at least my error could have something to do with sleeping CPUs. |
@karl-az I think these, including #9, are all the same issue related to onnxruntime's SLURM compatibility. RiboDetector works fine with other task management systems, e.g. PBS and SGE. I opened an issue in the onnxruntime repo a few days ago: microsoft/onnxruntime#10736. I hope I can get some clues there. I can update the code to let RiboDetector run on SLURM, but only 2 CPUs can be fully utilized (details can be seen in the issue in the onnxruntime repo). |
For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job. |
Yes, I tried this as well with |
I'm able to resolve this by uncommenting this row: RiboDetector/ribodetector/detect_cpu.py Line 75 in a49a054
I tried both 1 and 2 for this setting, and both execute nicely through SLURM, utilizing 5 cores. (L76 is not needed.) As I understand the definition of intra_op_num_threads (http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html), it defines the number of threads per worker. It defaults to 0, which allows onnxruntime to auto-detect. My suspicion is that the auto-detect somehow steps outside the SLURM "sandbox", trying to use unreserved cores, and that this comes across as trying to activate "sleeping" CPUs. |
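A minimal sketch of that idea, assuming Linux and Python 3: rather than letting onnxruntime auto-detect the thread count, work out how many CPUs the job is actually allowed to use and set intra_op_num_threads explicitly. The helper names allowed_cpu_count and make_session_options are hypothetical and are not part of RiboDetector. SLURM_CPUS_PER_TASK is only set by SLURM when --cpus-per-task is requested, so the sketch falls back to the process affinity mask.

```python
import os


def allowed_cpu_count(environ=None):
    """Return the number of CPUs this process may use (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # Prefer the scheduler's explicit reservation when it is present.
    value = environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # On Linux, the affinity mask reflects taskset/cpuset restrictions.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: all CPUs on the machine (may exceed what the job reserved).
    return os.cpu_count() or 1


def make_session_options(threads_per_worker=1):
    """Build onnxruntime options with an explicit, non-zero thread count."""
    import onnxruntime  # imported lazily so the helper above has no dependency

    options = onnxruntime.SessionOptions()
    # 0 would let onnxruntime auto-detect; a positive value pins the count.
    options.intra_op_num_threads = threads_per_worker
    return options


if __name__ == "__main__":
    print("CPUs available to this process:", allowed_cpu_count())
```

Comparing this number with os.cpu_count() inside a SLURM job is a quick way to see whether the job's reservation is smaller than the machine.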
@karl-az Thank you for sharing your solution. Yes, you are right: your assumption that "auto-detect somehow steps outside the SLURM "sandbox"" is very reasonable, and I agree with it. I have also tried setting intra_op_num_threads to a non-zero value; details can be found in microsoft/onnxruntime#10736. However, the total CPU load (sum of all processes) was only 200% no matter how many CPUs ( |
I see... Running it locally and checking with htop, I see lower CPU utilization as the number of workers grows: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1, and the total lands at roughly 300%. Could the process be bound by something else? |
Now, I figured out why it used only 200% or 300% CPU. If you run a slurm task without set So if you run |
The latest release
|
Thank you very much! I will pick it up when the release reaches bioconda. |
It is available on bioconda now. |
Can confirm that it works for me. Thank you! |
python 3.8.5, ribodetector 0.2.3 (installed via pip), on Ubuntu 20.04 LTS, invoked on a public dataset;
process just hangs after a few seconds, no CPU consumption at all, and can't be cancelled via Ctrl-C
(needs to be killed instead).
When invoked with python -m trace --trace, it seems to get stuck in the CPU detection step:
I added some debugging output to verify self.model_file, which correctly points to the ribodetector_600k_variable_len70_101_epoch47.onnx file.
Any ideas?
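For a hang like this, where the process cannot be interrupted with Ctrl-C, the standard library's faulthandler module may give a much shorter report than python -m trace. The following is only a sketch: watch_for_hang and stop_watching are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import faulthandler
import sys


def watch_for_hang(timeout=30.0, stream=None):
    """Dump the tracebacks of all threads if still running after `timeout` seconds."""
    stream = sys.stderr if stream is None else stream
    # repeat=True writes a new dump every `timeout` seconds until cancelled.
    faulthandler.dump_traceback_later(timeout, repeat=True, file=stream)


def stop_watching():
    """Cancel the pending dump once the slow step has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling watch_for_hang() before the step that loads the model and stop_watching() afterwards would print the Python stack of every thread if that step stalls, which shows where the process is waiting.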