This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Problem about the nni k8s service #361

Closed
xieydd opened this issue Nov 13, 2018 · 43 comments

@xieydd

xieydd commented Nov 13, 2018

I deployed NNI as a service in a k8s cluster. The experiment runs successfully and I can see the accuracy in the trial log, but the dashboard shows an accuracy of 0 for every trial, and the metric in the dashboard is NaN.
@QuanluZhang

@QuanluZhang
Contributor

You have to make the port (default 8080) of the NNI restserver/webui reachable from outside k8s.
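
One quick way to check whether that port is reachable from outside the cluster is a plain TCP probe. This is only a minimal sketch: `port_is_open` is a hypothetical helper, not part of NNI, and the host in the usage note is a placeholder. A successful connection only shows that something is listening on the port; it says nothing about whether metrics are collected.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("<node-ip>", 8080)` run from a machine outside the cluster, with `<node-ip>` replaced by the address of a cluster node.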

@xieydd
Author

xieydd commented Nov 13, 2018

@QuanluZhang Can I change the default port of the restserver/webui? If so, where is it configured? Thanks a lot.

@QuanluZhang
Contributor

Please run nnictl create --help; it shows how to set the port.

@xieydd
Author

xieydd commented Nov 13, 2018

@QuanluZhang I got it. Thanks for your reply.

@xieydd xieydd closed this as completed Nov 13, 2018
@xieydd xieydd reopened this Nov 14, 2018
@xieydd
Author

xieydd commented Nov 14, 2018

@QuanluZhang I set the nnictl port and the service port to the same value, but the problem is still there.

@QuanluZhang
Contributor

What command did you use to create the experiment? Also, can you describe in detail how you use k8s?

@xieydd
Author

xieydd commented Nov 14, 2018

apiVersion: v1
kind: Service
metadata:
  name: nni-service
  labels:
    app: nni
spec: 
  selector:
    app: nni
  ports:
  - port: 30018
    protocol: TCP
    nodePort: 30018
  type: NodePort

nnictl create --config config.yml --port 30018

@QuanluZhang

@yds05
Contributor

yds05 commented Nov 14, 2018

Hi @xieydd, do you use a ReplicaSet to launch the NNI pod? If so, could you please also paste your resource definition file for that ReplicaSet? Thanks.

@xieydd
Author

xieydd commented Nov 14, 2018

@yds05 I use a Deployment. I deleted the volume and volumeMount; that probably doesn't matter.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nni
  labels:
    app: nni
spec:
  replicas: 1   
  template:
    metadata:
      labels:
        app: nni
    spec:
      containers:
      - image: nni:latest
        name : nni
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        command: 
          - sleep
          - 360d
        ports:
          - containerPort: 30018

@yds05
Contributor

yds05 commented Nov 14, 2018

Got it, thanks. The deployment looks good. Another question: which trainingServicePlatform did you set in your config.yml?

If it's set to local, this may be a known issue that we already fixed in NNI v0.3.4. You can try the msranni/nni:v0.3.4 docker image.

@xieydd
Author

xieydd commented Nov 14, 2018

I set it to local. Is something wrong with that?

@yds05
Contributor

yds05 commented Nov 14, 2018

Refer to this PR for more detail:
#273

ts.tail has a bug when monitoring file changes inside a Docker container, so when you run NNI in a Docker container, metrics may not be collected correctly without the change in that PR.

You can rebuild your docker image based on our latest Dockerfile https://github.com/Microsoft/nni/blob/master/deployment/docker/Dockerfile, or use msranni/nni:v0.3.4 directly.
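
To illustrate the failure mode (this is not the actual change in PR #273): a reader that waits for file-change notifications sees nothing if those notifications never fire, while a reader that polls the file for newly appended bytes does not depend on them. A minimal Python sketch of the polling approach; `read_new_bytes` and its offset bookkeeping are assumptions made for illustration only.

```python
import os
from typing import Tuple


def read_new_bytes(path: str, offset: int) -> Tuple[bytes, int]:
    """Return the bytes appended to path since offset, plus the new offset."""
    size = os.path.getsize(path)
    if size < offset:
        # The file was truncated or replaced, so start again from the top.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data, offset + len(data)
```

A caller would invoke this on a timer, keep the returned offset between calls, and treat an empty result as "nothing new yet".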

@xieydd
Author

xieydd commented Nov 14, 2018

Thanks a lot @yds05, I will test the new image.

@xieydd
Author

xieydd commented Nov 14, 2018

@yds05 I tested it and it doesn't work: the accuracy is still 0.000000, Default Metric is NaN, and Status is SUCCEEDED, yet in /root/nni/experiments/b64nBFl4/trials/wcjq2/trial.log I can see a test accuracy of 0.9599.

I think it may result from the URL file://localhost:/root/...

@yds05
Contributor

yds05 commented Nov 15, 2018

That looks weird. Could you please share your experiment's log files and sqlite file so we can diagnose it?

The files are located at:
~/nni/experiments/{your_exp_id}/log
~/nni/experiments/{your_exp_id}/db

Thanks.

@xieydd
Author

xieydd commented Nov 18, 2018

@yds05 Sorry, I am on holiday these days. I will paste the logs tomorrow. Thanks a lot.

@xieydd
Author

xieydd commented Nov 19, 2018

dispatcher.log

nnimanager.log

If you need nni.sqlite, I can send it to you by email. @yds05

@yds05
Contributor

yds05 commented Nov 20, 2018

@xieydd, that's fine, and thanks for providing these two log files.
I noticed that your trial config is:

"trial_config": {
    "gpuNum": 0,
    "codeDir": "/tmp/nni/examples/trials/mnist/.",
    "command": "python3 mnist.py"
}

Can you make sure the examples folder is built into your docker image? Also, could you please provide your NNI experiment config file?

Besides that, I didn't find any error directly related to your issue in these log files. I will try to reproduce your issue by starting a K8S service to run NNI.

@xieydd
Author

xieydd commented Nov 20, 2018

The folder is in the container.
@yds05 I use your example in /tmp/nni/examples/trials/mnist; I only changed the data directory in mnist.py.

@yds05
Contributor

yds05 commented Nov 20, 2018

I see. I will use your config to try to reproduce the issue.

@xieydd
Author

xieydd commented Nov 20, 2018

Thanks a lot @yds05

@xieydd
Author

xieydd commented Nov 22, 2018

@yds05 Have you reproduced the issue?

@yds05
Contributor

yds05 commented Nov 26, 2018

Hi @xieydd, sorry for the late response.

I ran nnictl as a K8S deployment on my cluster, and I get intermediate results successfully.
Here is my deployment.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nni
  labels:
    app: nni
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nni
    spec:
      containers:
      - image: fishyds/nni:master-github
        name : nni
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        command:
          - /bin/sh
          - -c
        args:
          - nnictl create --port '30018' -c /tmp/nni/examples/trials/mnist/config.yml;
            sleep 1000;
        ports:
          - containerPort: 30018

And, this is my service yaml:

apiVersion: v1
kind: Service
metadata:
  name: nni-service
  labels:
    app: nni
spec:
  selector:
    app: nni
  ports:
  - port: 30018
    protocol: TCP
    nodePort: 30018
  type: NodePort

Could you please try my resource definitions?

@xieydd
Author

xieydd commented Nov 26, 2018

This really confuses me. I used your resource definitions and the same image, but I still get the same error. @yds05

@yds05
Contributor

yds05 commented Nov 27, 2018

@xieydd It's weird. I will find another K8S cluster to check.

@xieydd
Author

xieydd commented Nov 27, 2018

@yds05 Another question: can you redirect to the tensorboard?

@yds05
Contributor

yds05 commented Nov 27, 2018

Hmm, I can't... but I think these are two separate issues. By the way, tensorboard will be removed from the NNI WebUI in the next release (v0.4), because we think the WebUI should provide general functions and tensorboard is specific to TF. However, you will still be able to launch tensorboard through the nnictl CLI from the next release (v0.4) onward.

@yds05
Contributor

yds05 commented Nov 27, 2018

@xieydd I tried on another K8S cluster. My resource definitions work well there, too. Here is the version info of these two K8S clusters, for your reference.
K8S Cluster 1:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

K8S Cluster 2:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:43:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Also, could you please provide the nnimanager log again for us to check?

@xieydd
Author

xieydd commented Nov 27, 2018

@yds05 👍 , right route.

@xieydd
Author

xieydd commented Nov 27, 2018

@yds05 Thanks a lot, I will check my k8s cluster carefully.

@xieydd
Author

xieydd commented Nov 27, 2018

nnimanager.log

@yds05 This is my nnimanager log.

@yds05
Contributor

yds05 commented Nov 27, 2018

@xieydd Thanks.

I checked the log file and found that indeed no metric is recorded in the log.

Actually, for local mode experiments, we write metrics data into a 'metrics' file.
You can run

kubectl exec -it {your_pod_name} -- /bin/bash

to enter your pod. Then go to your trial's working directory, for example

cd /root/nni/experiments/{your_experiment_id}/trials/{your_trial_id}

List the files and directories in the trial's working directory (for example with ls -al). Check whether there is a hidden directory called .nni. Then go to the .nni directory and check whether there are files named 'metrics' and 'metrics_offset'.

@xieydd
Author

xieydd commented Nov 27, 2018

Yep, I can see metrics, state, and sequence_id, but I can't see metrics_offset. @yds05

#metrics
ME000112{"sequence": 0, "value": 0.95, "trial_job_id......"}

#state
0 1543299009687

#sequence_id
1
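
For reference, the metrics line pasted above looks like a two-character ME tag, a six-digit length, and then a JSON payload. That reading is an assumption inferred from the sample line, not a documented format, and `parse_metrics` below is a hypothetical helper. Under that assumption, a short Python sketch can confirm that the file really holds the reported values:

```python
import json
from typing import Any, Dict, List


def parse_metrics(raw: bytes) -> List[Dict[str, Any]]:
    """Parse records laid out as b'ME' + 6-digit length + JSON payload."""
    records = []
    pos = 0
    while pos < len(raw):
        if raw[pos:pos + 2] != b"ME":
            raise ValueError(f"unexpected tag at byte {pos}")
        length = int(raw[pos + 2:pos + 8])
        payload = raw[pos + 8:pos + 8 + length]
        # json.loads tolerates trailing whitespace inside the payload.
        records.append(json.loads(payload))
        pos += 8 + length
        # Skip newline separators in case they are not counted in the length.
        while pos < len(raw) and raw[pos:pos + 1] in (b"\n", b"\r"):
            pos += 1
    return records
```

Reading the 'metrics' file in binary mode and passing its contents to `parse_metrics` should then yield one dict per reported metric.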

@xieydd
Author

xieydd commented Nov 27, 2018

[Attached image: 271543299619_ pic]
@yds05

@yds05
Contributor

yds05 commented Nov 27, 2018

Oh, my fault. Local mode doesn't have a metrics_offset file under the .nni directory, so your container's file tree:
.nni

  • metrics
  • sequence_id
  • state

is correct.

May I know your docker version?

@xieydd
Author

xieydd commented Nov 27, 2018

my docker

version 17.09.1-ce
api version: 1.32
go version: go 1.8.3

my k8s cluster version is

v1.11.1-beta

@yds05
Contributor

yds05 commented Nov 27, 2018

@xieydd, I think I found the root cause. It's our fault: PR #273 targeted the v0.2 branch, but we forgot to merge it into the v0.3 and master branches. So the issue (metrics lost in some docker containers) still exists in the NNI v0.3 release.

I have already sent out PR #400 to fix that issue on the master branch, and built a docker image at fishyds/nni:master-github. I think it should work now on your machine.

So could you please reuse my previous deployment yaml file to restart the deployment and try again?

Btw, my docker image is just a development preview, so please expect some issues (but not this one), since this version is still under development.

@xieydd
Author

xieydd commented Nov 27, 2018

@yds05 Did you just update your image?
I will try it. Thanks a lot 👍

@yds05
Contributor

yds05 commented Nov 27, 2018

Yes, I just updated my image. Please give it a try.

@xieydd
Author

xieydd commented Nov 27, 2018

@yds05 I can see the download button for the json file, and I can see the data.
Haha, can you give me a timeline for a stable version? @yds05

@yds05
Contributor

yds05 commented Nov 27, 2018

@xieydd Congrats!

We will release NNI v0.4 next week, including the pypi package and the docker image.

@xieydd
Author

xieydd commented Nov 27, 2018

Thanks very much.

@xieydd xieydd closed this as completed Nov 27, 2018
@yds05
Contributor

yds05 commented Dec 6, 2018

@xieydd, we released NNI v0.4 yesterday and your issue is fixed in this release. You can use our latest docker image, msranni/nni:v0.4, to verify. Thanks.

@scarlett2018 scarlett2018 added bug Something isn't working kubeflow Training Service and removed investigation question Further information is requested labels Apr 16, 2020