Task is always in a waiting state in the remote machine #3905
Comments
Could you try to add "nniManagerIp" in your config file and try another run?
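For reference, a rough sketch of where nniManagerIp sits in a v2-style experiment config (every host, credential, and command below is a placeholder, not this experiment's actual setup):

```yaml
# Illustrative sketch only -- all values are placeholders.
experimentName: remote_example
searchSpaceFile: search_space.json
trialCommand: python3 main.py
trialCodeDirectory: .
trialConcurrency: 1
trialGpuNumber: 1
# IP of the machine running nnictl, as seen from the remote worker.
nniManagerIp: 192.168.1.10
tuner:
  name: TPE
  classArgs:
    optimize_mode: minimize
trainingService:
  platform: remote
  machineList:
    - host: 192.168.1.20
      user: worker
      password: <password>
```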
Thanks for your reply. I have added "nniManagerIp" to my config file, but it doesn't work. My config file is as follows:
Is your nniManager machine reachable from your remote machine? You can run a quick test from the remote machine. This problem seems to be caused by the remote machine not being able to reach the local machine.
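For instance, a quick check from the remote machine might look like this (192.168.1.10 stands in for the nniManager machine's IP, and 8080 is only NNI's default web UI/REST port, so adjust both to your setup):

```bash
# Run these on the remote (trial) machine.
ping -c 3 192.168.1.10        # basic reachability of the nniManager machine
nc -zv 192.168.1.10 8080      # check that the experiment's port is open as well
```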
I'm sure that the machines can reach each other, and the firewall is turned off on both machines.
Thanks for your test. Could you run it again in debug mode and paste "nnimanager.log" here? To enable debug mode: nnictl create --config config.yml --debug
Please check the log from nnimanager when I add the --debug option.
@acured @guoxiaojie-schinper: I can confirm the issue on remote machines with nni 2.2 and 2.3 (and nni built from latest master, 5b99b59), but it works fine on nni 2.1. In particular, for 2.2 and 2.3, in my case the first trial runs fine, but if I user-cancel it, then the second trial just keeps waiting, and in the NNIManager log I see
Is it possible to re-open this issue so that the problem can be addressed? Thank you very much.
I closed the issue by mistake; I have re-opened it, thanks.
Hi @albertogilramos, thanks for your feedback, and thanks to @guoxiaojie-schinper for the debug log. @albertogilramos, could you give me more information about how you user-cancel the trial? I cannot reproduce it at the moment.
BTW, there is a related fix for the GPU release issue in #3941. It may solve our problem in the next nni version.
@acured: In NNI 2.3 remote mode, I'm using the following minimal PyTorch linear-regression example on just one machine (with one GPU) that acts as both the master and the slave for testing; in particular, note the settings in config.yml. (Note I've not yet had the chance to build from master after #3941 to see if that fixes the issue; I'll try to do so this evening.) config.yml:
search_space.json:
main.py:
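(For context, a minimal NNI trial script of this kind could look roughly like the sketch below. This is only an illustration built on the public nni trial API, not the actual main.py from this comment; the hyperparameter name "lr" and the synthetic data are assumptions.)

```python
# Illustrative sketch of a minimal PyTorch linear-regression NNI trial.
import nni
import torch

def main():
    params = {"lr": 0.01}
    params.update(nni.get_next_parameter())   # tuned values from the NNI tuner

    # Synthetic y = 2x + 1 data for a toy linear regression.
    x = torch.linspace(-1, 1, 100).unsqueeze(1)
    y = 2 * x + 1 + 0.1 * torch.randn_like(x)

    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=params["lr"])
    loss_fn = torch.nn.MSELoss()

    for epoch in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            nni.report_intermediate_result(loss.item())

    nni.report_final_result(loss.item())

if __name__ == "__main__":
    main()
```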
@acured: I confirm my issue is solved in the latest master (442342c). Specifically, I installed nni nightly via
after which the trials don't stay waiting forever even if I user-cancel them, as can be seen in the picture. @guoxiaojie-schinper: perhaps you also want to try this version of nni from master and see if that solves your problem as well? Thank you very much.
I have updated nni and installed the latest code via
@guoxiaojie-schinper: in case it helps, your command (python setup.py develop) installs it in dev mode, whereas mine installed it in persistent mode via a wheel (see above). See https://nni.readthedocs.io/en/stable/Tutorial/InstallationLinux.html#installation. Also, this needs to be done on each of the master and slave machines.
Thanks very much for your quick reply. I have used the following command to update NNI on both the master and slave machines, just as you recommended.
But it still doesn't work for me. Also, the latest wheel on pypi.org was released on June 15, 2021, so I think it is not the newest version. Is there something wrong with my understanding?
@guoxiaojie-schinper: if you install from pypi, you'll get the latest wheel version on pypi (https://pypi.org/project/nni/), which is 2.3 (https://pypi.org/project/nni/#history). What you want instead is, rather than downloading from pypi, to build the wheel yourself from the latest master, for which there is no pypi package (nni doesn't release nightly versions on pypi). So if you want to reproduce what worked for me, you need to do the following on your master and slaves:
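Something along these lines (a sketch only: the exact build commands are an assumption based on NNI's install-from-source guide, so follow the documentation linked elsewhere in this thread if they differ):

```bash
# Sketch of the steps described below -- the build commands are assumptions,
# not the commenter's exact invocation.
git clone https://github.com/microsoft/nni.git
cd nni
git checkout 442342c                  # the master commit mentioned above

# Build a wheel locally (NNI_RELEASE is the version string stamped into it).
export NNI_RELEASE=2.4
python3 setup.py build_ts             # compile the TypeScript part (nni_manager / web UI)
python3 setup.py bdist_wheel

# Install the resulting wheel on the master and on every slave machine.
python3 -m pip install dist/*.whl
```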
This will clone the repo, check out a commit that is ahead of 2.3 and worked for me, build the wheel package yourself, and finally install it. Hope this helps.
Thanks for your reply, but it still doesn't work for me. I think this version just solves the "Bug in IP detection" problem, not my issue. The new version can correctly detect the IP address, whereas in the previous version, if I don't set nniManagerIp in the config file, it throws "Job management error: getIPV4Address() failed because os.networkInterfaces().eth0 is undefined".
Thanks @albertogilramos, I'm glad this solves your problem. Hi @guoxiaojie-schinper, this fix is not in the release build. If you want to try the latest code, you can install NNI from source code; see here: https://nni.readthedocs.io/en/stable/Tutorial/InstallationLinux.html#install-nni-through-source-code. Or you can wait for the next version of NNI to be released.
I am very happy to see someone has the same problem. In brief, what happened to me is the same as @guoxiaojie-schinper. I am running the demo from the nni repo, /example/trial/mnist-pytorch. If config_remote.yml is run locally on the remote machine (with the trainingService changed to local, of course), everything is normal. But if the same config_remote.yml is run from my local machine (a MacBook Pro), with the slave worker being the workstation with an Nvidia GeForce 2080 GPU, it doesn't work, exactly the same as @guoxiaojie-schinper. In detail:
Environment: NNI on both the local and remote machines is installed by
config_remote.yml (if used on remote):
config_remote.yml (if used locally):
Description:
-> 2.1 If I set the
-> 2.2 If I set the
-> 2.3 If I set the
-> 2.4 If I set the
-> 2.5 If I set the
Closing as fixed by #4035.
Describe the issue:
I use nnictl to create a task scheduled to a remote machine, and the GPUs on the remote machine are sufficient, but it keeps printing the following message.
[2021-07-06 17:09:47] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
GPU information in the remote machine:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:1A:00.0 Off | N/A |
| 0% 47C P8 36W / 370W | 5MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:68:00.0 Off | N/A |
| 0% 46C P8 26W / 370W | 19MiB / 24265MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1085 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1085 G /usr/lib/xorg/Xorg 9MiB |
| 1 N/A N/A 1326 G /usr/bin/gnome-shell 8MiB |
+-----------------------------------------------------------------------------+
Environment:
Configuration:
Log message: