Problem about the nni k8s service #361
Comments
You have to make the port (default 8080) of the NNI restserver/WebUI available outside k8s.
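A minimal sketch, assuming you only want to confirm from a machine outside the cluster that the exposed port accepts TCP connections. The function name `port_is_reachable` and the host/port in the usage comment are hypothetical; this uses only the Python standard library and is not part of NNI or nnictl.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage: replace the host with a node address and the port with
# whatever your k8s Service actually exposes for the NNI restserver/WebUI.
# port_is_reachable("my-k8s-node.example", 8080)
```

If this returns False from outside the cluster but True from inside the pod, the Service or port mapping is the likely place to look, rather than NNI itself.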
@QuanluZhang Can I change the default port of the restserver/WebUI? If so, where is the config? Thanks a lot.
please run command
@QuanluZhang I got it. Thanks for your reply.
@QuanluZhang I set the nnictl port and the service port to the same value, but the problem is still there.
What command did you use to create the experiment? And can you describe in detail how you use k8s?
Hi @xieydd, do you use a ReplicaSet to launch the NNI pod? If yes, could you please also paste your resource definition file for that ReplicaSet? Thanks.
@yds05 I use a Deployment. I deleted the volume and volumeMount, which probably doesn't matter.
Got it, thanks. The deployment looks good. Another question: which trainingServicePlatform did you set in your config.yml? If it's set to local, this may be a known issue which we already fixed in NNI v0.3.4. You can try the msranni/nni:v0.3.4 docker image.
I set it to local. Is something wrong?
Refer to this PR for more detail:
Because ts.tail has a bug when monitoring file changes in a Docker container, metrics may not be collected correctly when you run NNI in a Docker container without that PR's change. You can rebuild your docker image based on our latest Dockerfile https://github.com/Microsoft/nni/blob/master/deployment/docker/Dockerfile, or use msranni/nni:v0.3.4 directly.
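To illustrate the kind of problem described here: file-change notifications can be unreliable inside some containers, while re-reading the file from a remembered offset does not depend on them. The following is only a sketch of that general polling idea, not the change made in the PR and not NNI's code; the function name `read_new_lines` and the file path in the usage comment are hypothetical.

```python
import os
from typing import List, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Read complete lines appended to `path` since byte `offset`.

    Returns the new lines and the offset to pass on the next call.
    A trailing partial line (no newline yet) is left for the next poll.
    """
    if not os.path.exists(path):
        return [], offset
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n")
    if end == -1:
        # Nothing new, or only an unfinished line so far.
        return [], offset
    complete = data[: end + 1]
    return complete.decode("utf-8").splitlines(), offset + len(complete)


# Hypothetical usage: poll once per second instead of waiting for a
# file-change event.
# import time
# offset = 0
# while True:
#     lines, offset = read_new_lines("/path/to/trial/metrics", offset)
#     for line in lines:
#         print(line)
#     time.sleep(1)
```

Polling trades a little latency for not depending on change notifications from the container's filesystem.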
Thanks a lot @yds05, I will test the new image.
@yds05 I tested it, and it does not work; the accuracy is still
I think it may be caused by the URL.
It looks weird. Could you please share your experiment's log file and sqlite file so we can diagnose it? The files' location is:
Thanks.
@yds05 I am sorry, I am on holiday these days. I will paste the log tomorrow, thanks a lot.
If you need nni.sqlite, I can send it to your email. @yds05
@xieydd, that's fine, and thanks for providing these two log files.
Can you make sure the examples folder is built into your docker image? Could you also provide your NNI experiment config file? Besides that, I didn't find any error directly related to your issue in these log files. I think I can try to reproduce your issue by starting a K8S service to run NNI.
The folder is in the container.
I see. I will use your config to try to reproduce the issue.
Thanks a lot @yds05
@yds05 Have you reproduced the issue?
Hi @xieydd, sorry for the late response. I tried to run nnictl as a K8S deployment on my cluster, and I can get intermediate results successfully.
And, this is my service yaml:
Could you please try my resource definition?
This really confuses me. I used your resource definition and the same image, but I still get the error, @yds05
@xieydd It's weird. I will find another K8S cluster to check.
@yds05 Another question: can you redirect to tensorboard?
En, I can't... but I think these are two separate issues. By the way, tensorboard will be disabled in the NNI WebUI in the next release (v0.4), because we think the WebUI should provide general functions and tensorboard is a function specific to TF. However, you can still launch tensorboard through the nnictl CLI from the next release (v0.4).
@xieydd I tried on another K8S cluster. My resource definition works well there, too. Here is the version info of these two K8S clusters, for your reference.
K8S Cluster 2:
Also, could you please provide nnimanager's log again for us to check?
@yds05 👍 , right route.
@yds05 Thanks a lot, I will check my k8s cluster carefully.
@yds05 This is my nnimanager's log
@xieydd Thanks. I checked the log file and indeed found that no metrics are recorded in it. Actually, for a local mode experiment, we write metrics data into a 'metrics' file.
to enter your pod. Then go to your trial's working directory, like
list (like
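The manual check described here (enter the pod, go to the trial's working directory, list its files) can also be sketched in Python, assuming the script runs inside the pod. The file name `metrics` comes from the comment above; the function name `find_metrics_files` and the directory in the usage comment are hypothetical, and because the exact location of the file within the trial directory is not known here, the sketch simply searches the whole tree.

```python
import os
from typing import List, Tuple


def find_metrics_files(trial_dir: str, metrics_name: str = "metrics") -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every metrics file under trial_dir."""
    found = []
    # os.walk also descends into hidden directories such as ".nni".
    for root, _dirs, files in os.walk(trial_dir):
        if metrics_name in files:
            full = os.path.join(root, metrics_name)
            found.append((os.path.relpath(full, trial_dir), os.path.getsize(full)))
    return sorted(found)


# Hypothetical usage: pass your trial's actual working directory.
# print(find_metrics_files("/path/to/trial/working/directory"))
```

An empty list, or a file of size 0, would suggest the trial is not writing metrics at all; a non-empty file would suggest the metrics are written but not picked up afterwards.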
Yep, I can see
Oh, my fault. Local mode doesn't have a metrics_offset file under the .nni directory, so your container's file tree:
is correct. May I know your docker version?
my docker
my k8s cluster version is
@xieydd, I think I found the root cause. It's our fault: PR #273 targeted the v0.2 branch, but we forgot to merge it into the v0.3 and master branches. So the issue (metrics lost in some docker containers) still exists in the NNI v0.3 release. I have sent PR #400 to fix the issue on the master branch, and built a docker image at fishyds/nni:master-github. I think it should work on your machine now. Could you please reuse my previous deployment yaml file to restart the deployment and have a try? Btw: my docker image is just a devel-preview, so please expect some issues (but not this one), since this version is still under development.
@yds05 Did you just update your image?
Yes, I just updated my image. Please have a try.
@xieydd Congrats! We will release NNI v0.4 next week, including the pypi package and docker image.
Thanks very much.
@xieydd, we released NNI v0.4 yesterday and your issue is fixed in this release. You can use our latest docker image msranni/nni:v0.4 to verify. Thanks.
I deployed NNI as a service in a k8s cluster. The experiment runs successfully and I can see the trial log, which contains the accuracy, but the dashboard shows the accuracy of every trial as 0, and the metric in the dashboard is NaN.
@QuanluZhang