System seemed stopped during refine #35
Comments
It seems like I have the same problem. I wonder if you have solved it? |
I found this similar issue: keras-team/keras#11603, which is related to the cuDNN version. The dependencies should match what is listed here: https://www.tensorflow.org/install/source#gpu (a quick way to check the versions is sketched below). |
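A minimal sketch (assuming TensorFlow 2.x; not from the original comment) for printing the CUDA/cuDNN versions the installed TensorFlow build expects, so they can be compared against that table and against what is installed on the machine:

# Print the CUDA/cuDNN versions this TensorFlow build was compiled against,
# plus the GPUs it can actually see at runtime.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
build = tf.sysconfig.get_build_info()
print("Built for CUDA:", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

If the cuDNN version reported by the build does not match the one installed on the system, training can fail or stall in the first epoch, which is the kind of behaviour discussed in this thread.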
Dear Dr. Liu,
Thank you very much! I have found one slower machine in the laboratory that can run IsoNet properly. Meanwhile, I will compare its driver versions with those of the cluster (which has the problem) to see if we can fix it.
Best regards,
Chris
|
I finally found use |
Hi all, a Linux novice here. I have the same issue as OP on a standalone workstation: the "refine" job gets stuck in the first iteration at Epoch 1/10. Following @LianghaoZhao's comment, I tried "conda install cudatoolkit", but that did not solve the problem. Changing the log level to "debug", though, I could at least identify the issue from the log:
which I assume means GPU 3 is trying to access something on GPU 0. Using only one GPU (one way to restrict it is sketched below), I was able to get the refinement to progress (it hasn't finished yet at the time of writing this post), but I am unsure what might be causing this issue when using multiple GPUs. OS: Ubuntu 20.04.5. I would be more than happy to provide more info/logs for debugging, if needed. I've been having issues with TensorFlow/Keras with DeePict as well, and wonder if the two issues are somehow related. Best, |
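A minimal sketch of the single-GPU workaround described above (the index "0" is just an example): restrict the process to one GPU before TensorFlow is imported, using the standard CUDA_VISIBLE_DEVICES environment variable.

# Expose only one GPU to this Python process; this must be set before
# TensorFlow initialises CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should now list a single GPU

The same restriction can also be applied from the shell by prefixing the refine command with CUDA_VISIBLE_DEVICES=0.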
Oh, I finally solved this. I found exactly the same error at last. |
Hi @LianghaoZhao, Thank you for reporting your bug fix. Would you like to review your code in your fork and create a pull request so that it can be merged into the master branch? |
Thank you very much for the bug-fix, @LianghaoZhao. I just tried out the newest commit, and it works just fine with multiple GPUs. |
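For readers who hit a similar multi-GPU hang elsewhere: a hedged sketch of one commonly used mitigation (not necessarily the fix that was merged here) is to make tf.distribute.MirroredStrategy use a non-NCCL all-reduce, which avoids some peer-to-peer GPU communication paths:

# A sketch only, assuming TensorFlow 2.x and a Keras model; not the actual
# patch from the linked fork. HierarchicalCopyAllReduce uses copy-based
# gradient reduction instead of NCCL all-reduce.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
with strategy.scope():
    # Build and compile the Keras model inside the strategy scope as usual.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")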
Hi there,
I am a student new to cryo-EM. I am now trying to apply IsoNet to the analysis of my data and I have encountered a problem.
I found that the refine step takes an extremely long time without any response or error messages.
The slurm log was still at the Epoch 1/10 stage after 4 hours of running [as in (1) below]. I repeated the run with the official tutorial HIV dataset, using exactly the same commands and parameters as the tutorial, and got the same problem. The job still appeared to be at Epoch 1/10 even after 15 hours.
Then I checked the GPUs [nvidia-smi output in (2) below]. The GPUs do not seem to be working(?), while memory is being used. No new files were written during the waiting hours. (A small standalone GPU check is sketched after the logs below.)
Would anyone give me some advice? Thank you very much!
Chris
(1) Slurm log-----------------------------------------------------------------------------------
11-25 10:58:34, INFO
######Isonet starts refining######
11-25 10:58:38, INFO Start Iteration1!
11-25 10:58:38, WARNING The results folder already exists
The old results folder will be renamed (to results~)
11-25 11:00:31, INFO Noise Level:0.0
11-25 11:01:08, INFO Done preparing subtomograms!
11-25 11:01:08, INFO Start training!
11-25 11:01:10, INFO Loaded model from disk
11-25 11:01:10, INFO begin fitting
Epoch 1/10
slurm-37178.out (END)
(2) nvidia-smi --------------------------------------------------------
Fri Nov 25 14:38:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:04:00.0 Off | N/A |
| 30% 33C P8 19W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:43:00.0 Off | N/A |
| 30% 32C P8 20W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 Off | 00000000:89:00.0 Off | N/A |
| 30% 30C P8 32W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 3090 Off | 00000000:C4:00.0 Off | N/A |
| 30% 30C P8 25W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1212666 C python3 17747MiB |
| 0 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1212666 C python3 17747MiB |
| 1 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1212666 C python3 17747MiB |
| 2 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1212666 C python3 17747MiB |
| 3 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
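A minimal standalone sanity check (assuming TensorFlow 2.x with GPU support; not part of the original report) that exercises a cuDNN convolution outside IsoNet. If this small snippet also stalls or raises a cuDNN error, the TensorFlow/CUDA/cuDNN installation is the likely culprit rather than IsoNet itself:

# Run one small convolution on GPU 0 to confirm that cuDNN kernels execute.
import numpy as np
import tensorflow as tf

x = np.random.rand(4, 32, 32, 3).astype("float32")
with tf.device("/GPU:0"):
    y = tf.keras.layers.Conv2D(8, 3, padding="same")(x)
print("Conv output shape:", y.shape)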