This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

It cannot work with nni-0.9 version but work with nni-0.7 version #1375

Closed
peipei-pig opened this issue Jul 26, 2019 · 10 comments

Comments

@peipei-pig

nni-0.7 worked very well for me, but today I updated to the new version:
(1) pip uninstall nni-0.7-py3-none-manylinux1_x86_64.whl
(2) pip install nni-0.9-py3-none-manylinux1_x86_64.whl
Then I ran: nnictl create --config config.xml

I found the following errors in nnimanager.log:
[2019-7-26 15:01:00] INFO [ 'Starting experiment: UCMnCLlE' ]
[2019-7-26 15:01:00] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[2019-7-26 15:01:00] INFO [ 'Add event listeners' ]
[2019-7-26 15:01:00] INFO [ 'Run local machine training service.' ]
[2019-7-26 15:01:00] WARNING [ 'gpu_metrics file does not exist!' ]
[2019-7-26 15:01:01] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[2019-7-26 15:01:05] ERROR [ 'Read GPU summary failed with error: ',
SyntaxError: Unexpected end of JSON input
at JSON.parse ()
at GPUScheduler.updateGPUSummary (/usr/local/miniconda3/nni/training_service/local/gpuScheduler.js:69:40) ]
[2019-7-26 15:01:10] ERROR [ 'Error: This socket has been ended by the other party\n at Socket.writeAfterFIN [as write] (net.js:402:12)\n at IpcInterface.sendCommand (/usr/local/miniconda3/nni/core/ipcInterface.js:47:38)\n at NNIManager.pingDispatcher (/usr/local/miniconda3/nni/core/nnimanager.js:282:29)' ]
[2019-7-26 15:01:10] INFO [ 'Change NNIManager status from: RUNNING to: ERROR' ]

If I reinstall nni-0.7, it works again. How can I solve this problem?

nni Environment:

  • nni version:nni-0.9-py3-none-manylinux1_x86_64.whl
  • nni mode(local|pai|remote):local
  • OS:ubuntu16.04
  • python version:Python 3.6.5 :: Anaconda, Inc.
  • is conda or virtualenv used?:conda
  • is running in docker?:yes
@liuzhe-lz
Contributor

liuzhe-lz commented Jul 29, 2019

Thanks for your feedback.
Please provide your dispatcher's log (dispatcher.log, in the same folder as nnimanager.log) and the GPU metrics file (/tmp/nni/script/gpu_metrics).
If the gpu_metrics file is too large, please provide the last few lines and its size.
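
One way to collect what is asked for here is sketched below. It is only an illustration: `tail_with_size` is a hypothetical helper name, and the only detail taken from the thread is the path /tmp/nni/script/gpu_metrics.

```python
import os
from collections import deque


def tail_with_size(path, n=5):
    """Return (size in bytes, last n lines) of a text file."""
    size = os.path.getsize(path)
    with open(path, "r", errors="replace") as f:
        # deque with maxlen keeps only the last n lines while streaming the file
        last_lines = list(deque(f, maxlen=n))
    return size, [line.rstrip("\n") for line in last_lines]


path = "/tmp/nni/script/gpu_metrics"  # path quoted in the comment above
if os.path.exists(path):
    size, lines = tail_with_size(path)
    print(f"{path}: {size} bytes")
    for line in lines:
        print(line)
else:
    print(f"{path} does not exist")
```

The file is streamed line by line, so this should also be usable on a large metrics file.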

@LikunYDev

Hello,
I experienced a similar issue after updating:

ERROR [ 'Read GPU summary failed with error: ',
SyntaxError: Unexpected end of JSON input
at JSON.parse ()
at GPUScheduler.updateGPUSummary (/home/user/anaconda3/nni/training_service/local/gpuScheduler.js:69:40)
at process._tickCallback (internal/process/next_tick.js:68:7) ]

FYI, my dispatcher's log is
[08/05/2019, 07:00:19 AM] INFO (nni.msg_dispatcher_base/MainThread) Start dispatcher
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002740 seconds
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002775 seconds
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002518 seconds
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002631 seconds
[08/05/2019, 07:00:19 AM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[08/05/2019, 07:02:12 AM] INFO (nni.msg_dispatcher_base/MainThread) Dispatcher exiting...
[08/05/2019, 07:02:13 AM] INFO (nni.msg_dispatcher_base/MainThread) Terminated by NNI manager

Also, there is no directory named 'script' under '/tmp/nni'.

Thank you for your help!

@liuzhe-lz
Contributor

Hi,
In theory, this error only disables the GPU scheduler. If NNI fails to run, there should be another critical error.
Please tell us your environment setup and the NNI manager log after that line so we can analyze the real problem.
You can also try out this branch, which fixes the GPU metrics issue.
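
For readers wondering why a missing or half-written metrics file produces "Unexpected end of JSON input": parsing empty or truncated text as JSON fails. The sketch below shows a tolerant reader in Python. It is not NNI's implementation (the scheduler in the log is JavaScript); `parse_gpu_summary` and the `gpuCount` field are hypothetical, and it assumes the file holds one JSON object per line.

```python
import json


def parse_gpu_summary(text):
    """Parse the last non-empty line as JSON; return None if it is absent or truncated."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None  # empty file: nothing to parse yet
    try:
        return json.loads(lines[-1])
    except json.JSONDecodeError:
        return None  # last line was cut off mid-write


print(parse_gpu_summary(""))                            # empty input
print(parse_gpu_summary('{"gpuCount": 1}\n{"gpuCo'))    # truncated last line
print(parse_gpu_summary('{"gpuCount": 1}\n'))           # complete last line
```

Returning None instead of raising lets the caller skip GPU scheduling for that round and retry later.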

@LikunYDev

Hey,
Thank you for your reply. After fixing the GPU metrics issue (by installing NVIDIA CUDA), the problem I mentioned was solved. I think my problem was probably caused by not having a proper NVIDIA driver. I don't know if that's peipei-pig's case.

Thank you for your help! Have a nice day!

@peipei-pig
Author

@liuzhe-lz
(1)nni Environment:
nni version:nni-0.9-py3-none-manylinux1_x86_64.whl
nni mode(local|pai|remote):local
OS:ubuntu16.04
python version:Python 3.6.5 :: Anaconda, Inc.
is conda or virtualenv used?:conda
is running in docker?:yes
Driver Version: 418.67 CUDA Version: 10.1

My logs are as follows:
(2)dispatcher.log:
[08/07/2019, 11:18:53 AM] INFO (nni.msg_dispatcher_base/MainThread) Start dispatcher
[08/07/2019, 11:18:53 AM] ERROR (nni.msg_dispatcher_base/Thread-1) ap_quniform_sampler() missing 1 required positional argument: 'q'
Traceback (most recent call last):
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nni/msg_dispatcher_base.py", line 102, in command_queue_worker
    self.process_command(command, data)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nni/msg_dispatcher_base.py", line 160, in process_command
    command_handlers[command](data)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nni/msg_dispatcher.py", line 106, in handle_request_trial_jobs
    params_list = self.tuner.generate_multiple_parameters(ids)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nni/tuner.py", line 52, in generate_multiple_parameters
    res = self.generate_parameters(parameter_id, **kwargs)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nni/hyperopt_tuner/hyperopt_tuner.py", line 263, in generate_parameters
    total_params = self.get_suggestion(random_search=False)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nni/hyperopt_tuner/hyperopt_tuner.py", line 389, in get_suggestion
    new_trials = algorithm(new_ids, rval.domain, trials, random_state)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/hyperopt/tpe.py", line 835, in suggest
    = tpe_transform(domain, prior_weight, gamma)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/hyperopt/tpe.py", line 816, in tpe_transform
    s_prior_weight
  File "/usr/local/miniconda3/lib/python3.6/site-packages/hyperopt/tpe.py", line 690, in build_posterior
    b_post = fn(*b_args, **dict(named_args))
TypeError: ap_quniform_sampler() missing 1 required positional argument: 'q'
[08/07/2019, 11:18:58 AM] INFO (nni.msg_dispatcher_base/MainThread) Dispatcher exiting...
[08/07/2019, 11:18:59 AM] INFO (nni.msg_dispatcher_base/MainThread) Terminated by NNI manager

(3)nnimanager.log :
[2019-8-7 11:18:50] INFO [ 'Datastore initialization done' ]
[2019-8-7 11:18:50] INFO [ 'Rest server listening on: http://0.0.0.0:8181' ]
[2019-8-7 11:18:50] INFO [ 'RestServer start' ]
[2019-8-7 11:18:50] INFO [ 'Construct local machine training service.' ]
[2019-8-7 11:18:50] INFO [ 'RestServer base port is 8181' ]
[2019-8-7 11:18:53] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"bash xx.sh","codeDir":"/data1//.","gpuNum":1}' ]
[2019-8-7 11:18:53] INFO [ 'required GPU number is 1' ]
[2019-8-7 11:18:53] INFO [ 'Starting experiment: zpWk2oJC' ]
[2019-8-7 11:18:53] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[2019-8-7 11:18:53] INFO [ 'Add event listeners' ]
[2019-8-7 11:18:53] INFO [ 'Run local machine training service.' ]
[2019-8-7 11:18:53] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[2019-8-7 11:19:03] ERROR [ 'Error: This socket has been ended by the other party\n at Socket.writeAfterFIN [as write] (net.js:402:12)\n at IpcInterface.sendCommand (/usr/local/miniconda3/nni/core/ipcInterface.js:47:38)\n at NNIManager.pingDispatcher (/usr/local/miniconda3/nni/core/nnimanager.js:282:29)' ]
[2019-8-7 11:19:03] INFO [ 'Change NNIManager status from: RUNNING to: ERROR' ]

@liuzhe-lz
Contributor

liuzhe-lz commented Aug 7, 2019

It seems to be related to the TPE tuner's quniform signature.
I'm not very familiar with that part.
@QuanluZhang @suiguoxin Can you help?
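
As a rough illustration of how such a TypeError can arise, the sketch below assumes the `_value` list from the search space is unpacked into positional arguments. `quniform_sampler` is a hypothetical stand-in, not hyperopt's `ap_quniform_sampler`, and the rounding formula is only there to make the example runnable.

```python
def quniform_sampler(low, high, q):
    """Hypothetical stand-in for a sampler that needs three positional arguments."""
    midpoint = (low + high) / 2
    return round(midpoint / q) * q  # snap the midpoint to a multiple of q


def call_with_value(value):
    # The list is unpacked, so a two-element list leaves `q` unfilled.
    return quniform_sampler(*value)


try:
    call_with_value([1, 10])  # q is missing
except TypeError as err:
    message = str(err)
    print(message)

print(call_with_value([1, 10, 2]))  # all three arguments supplied
```

With only two values the call fails with the same kind of "missing 1 required positional argument: 'q'" message seen in the dispatcher log.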

@QuanluZhang
Contributor

@suiguoxin will fix it soon

@suiguoxin
Member

@peipei-pig Please check your quniform usage. The format is {"_type":"quniform","_value":[low, high, q]}
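
A quick way to check a search space against this format is sketched below. `find_bad_quniform` and the parameter names `batch_size` and `hidden_size` are hypothetical; only the `_type`/`_value` layout comes from the comment above.

```python
import json


def find_bad_quniform(search_space):
    """Return names of quniform parameters whose _value is not [low, high, q]."""
    bad = []
    for name, spec in search_space.items():
        if spec.get("_type") == "quniform" and len(spec.get("_value", [])) != 3:
            bad.append(name)
    return bad


# Example search space: the second entry is missing q.
search_space = json.loads("""
{
    "batch_size": {"_type": "quniform", "_value": [16, 128, 16]},
    "hidden_size": {"_type": "quniform", "_value": [64, 512]}
}
""")

bad_params = find_bad_quniform(search_space)
print(bad_params)
```

Running a check like this before `nnictl create` would flag the entry that lacks `q` without starting an experiment.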

suiguoxin self-assigned this Aug 12, 2019
@scarlett2018
Member

@peipei-pig Any feedback? Are you still having the issue on your side? Thanks.

@peipei-pig
Author

@scarlett2018 It has been solved.
