Problem about the nni k8s service #361

xieydd · 2018-11-13T06:44:32Z

I deployed NNI as a service in a k8s cluster. The experiment runs successfully and I can see the accuracy in the trial log, but the dashboard shows every trial's accuracy as 0 and the default metric as NaN.
@QuanluZhang

QuanluZhang · 2018-11-13T08:03:27Z

You have to make the port (default 8080) of the NNI REST server/WebUI available outside k8s.
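A quick way to confirm that the port is reachable is a plain TCP connection test run from a machine outside the cluster. The sketch below is a minimal example using only the Python standard library; the helper name `port_is_reachable` is an illustrative assumption and not part of NNI.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling it with a node address and the exposed port (8080 by default, per the comment above) should return True once the port is published. A True result only shows that something accepts connections on that port, not that the WebUI itself is healthy.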

xieydd · 2018-11-13T08:07:12Z

@QuanluZhang Can I change the default port of the REST server/WebUI? If so, where is the config? Thanks a lot.

QuanluZhang · 2018-11-13T08:09:38Z

Please run the command nnictl create --help; it shows how to set the port.

xieydd · 2018-11-13T08:54:26Z

@QuanluZhang I got it. Thanks for your reply.

xieydd · 2018-11-14T01:56:26Z

@QuanluZhang I set the nnictl port and the service port to the same value, but the problem is still there.

QuanluZhang · 2018-11-14T02:21:30Z

What command did you use to create the experiment? Also, can you describe in detail how you use k8s?

xieydd · 2018-11-14T02:25:41Z

apiVersion: v1
kind: Service
metadata:
  name: nni-service
  labels:
    app: nni
spec: 
  selector:
    app: nni
  ports:
  - port: 30018
    protocol: TCP
    nodePort: 30018
  type: NodePort

nnictl create --config config.yml --port 30018

@QuanluZhang

yds05 · 2018-11-14T03:47:02Z

Hi @xieydd, do you use a ReplicaSet to launch the NNI pod? If yes, could you please also paste your resource definition file for that ReplicaSet? Thanks.

xieydd · 2018-11-14T03:52:50Z

@yds05 I use a Deployment. I deleted the volume and volumeMount entries here; they probably don't matter.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nni
  labels:
    app: nni
spec:
  replicas: 1   
  template:
    metadata:
      labels:
        app: nni
    spec:
      containers:
      - image: nni:latest
        name : nni
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        command: 
          - sleep
          - 360d
        ports:
          - containerPort: 30018

yds05 · 2018-11-14T03:57:50Z

Got it, thanks. The deployment looks good. Another question: which trainingServicePlatform do you set in your config.yml?

If it's set to local, this may be a known issue that we already fixed in NNI v0.3.4. You can try the msranni/nni:v0.3.4 docker image.

xieydd · 2018-11-14T04:01:18Z

I set it to local. Is something wrong with that?

yds05 · 2018-11-14T04:09:59Z

Refer to this PR for more detail:
#273

ts.tail has a bug in monitoring file changes inside a Docker container, so when you run NNI in a Docker container, metrics may not be collected correctly without that PR's change.

You can rebuild your docker image based on our latest Dockerfile https://github.com/Microsoft/nni/blob/master/deployment/docker/Dockerfile, or use msranni/nni:v0.3.4 directly.
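For readers wondering why file-change monitoring matters here: a reader that depends on filesystem change notifications can miss appended data in some environments, whereas re-reading from a remembered byte offset on a timer does not depend on notifications at all. The sketch below illustrates that polling idea only. It is not the code from PR #273, and `read_new_lines` is a hypothetical helper name.

```python
from typing import List, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return complete lines appended after byte `offset`, plus the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Consume only up to the last newline so a partially written line
    # is picked up again on the next poll.
    end = chunk.rfind(b"\n")
    if end == -1:
        return [], offset
    complete = chunk[: end + 1]
    lines = complete.decode("utf-8").split("\n")[:-1]
    return lines, offset + len(complete)
```

A caller would invoke this in a loop with a short sleep, carrying the returned offset into the next call.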

xieydd · 2018-11-14T05:10:24Z

Thanks a lot @yds05, I will test the new image.

xieydd · 2018-11-14T07:27:36Z

@yds05 I tested it and it does not work. The accuracy is still 0.000000, Default Metric is NaN, and Status is SUCCEEDED, yet /root/nni/experiments/b64nBFl4/trials/wcjq2/trial.log shows a test accuracy of 0.9599.

I think it may result from the URL file://localhost:/root/...

yds05 · 2018-11-15T02:18:55Z

That looks weird. Could you please share your experiment's log files and sqlite file so we can diagnose the problem?

The files are located at:
~/nni/experiments/{your_exp_id}/log
~/nni/experiments/{your_exp_id}/db

Thanks.

xieydd · 2018-11-18T08:35:49Z

@yds05 I am sorry, I have been on holiday these days. I will paste the logs tomorrow. Thanks a lot.

xieydd · 2018-11-19T07:24:09Z

dispatcher.log

nnimanager.log

If you need nni.sqlite, I can send it to you by email. @yds05

yds05 · 2018-11-20T02:08:48Z

@xieydd, that's fine, and thanks for providing these two log files.
I noticed that your trial config is:

"trial_config": {
    "gpuNum": 0,
    "codeDir": "/tmp/nni/examples/trials/mnist/.",
    "command": "python3 mnist.py"
}

Can you make sure the examples folder is built into your docker image? Also, could you please provide your NNI experiment config file?

Besides that, I didn't find any error directly related to your issue in these log files. I think I can try to reproduce your issue by starting a K8S service to run NNI.

xieydd · 2018-11-20T03:23:13Z

The folder is in the container.
@yds05 I use your example at /tmp/nni/examples/trials/mnist; I only changed the datadir in mnist.py.

yds05 · 2018-11-20T03:37:34Z

I see. I will use your config to try to reproduce the issue.

xieydd · 2018-11-20T06:17:40Z

Thanks a lot @yds05

xieydd · 2018-11-22T01:20:31Z

@yds05 Have you reproduced the issue?

yds05 · 2018-11-26T10:27:50Z

Hi @xieydd, sorry for the late response.

I ran nnictl as a K8S deployment on my cluster, and I can get intermediate results successfully.
Here is my deployment.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nni
  labels:
    app: nni
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nni
    spec:
      containers:
      - image: fishyds/nni:master-github
        name : nni
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        command:
          - /bin/sh
          - -c
        args:
          - nnictl create --port '30018' -c /tmp/nni/examples/trials/mnist/config.yml;
            sleep 1000;
        ports:
          - containerPort: 30018

And this is my service yaml:

apiVersion: v1
kind: Service
metadata:
  name: nni-service
  labels:
    app: nni
spec:
  selector:
    app: nni
  ports:
  - port: 30018
    protocol: TCP
    nodePort: 30018
  type: NodePort

Could you please try my resource definitions?

xieydd · 2018-11-26T12:28:40Z

This really confuses me. I used your resource definitions and the same image, but I still get the same error. @yds05

yds05 · 2018-11-27T02:08:43Z

@xieydd It's weird. I will find another K8S cluster to check.

xieydd · 2018-11-27T02:20:27Z

@yds05 Another question, can you redirect to the tensorboard?

yds05 · 2018-11-27T02:26:13Z

Hmm, I can't... but I think these are two separate issues. By the way, tensorboard will be removed from the NNI WebUI in the next release (v0.4), because we think the WebUI should provide general functions and tensorboard is specific to TF. However, you will still be able to launch tensorboard through the nnictl CLI from the next release (v0.4).

yds05 · 2018-11-27T02:29:13Z

@xieydd I tried on another K8S cluster, and my resource definitions work well there, too. Here is the version info of these two K8S clusters, for your reference.
K8S Cluster 1:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

K8S Cluster 2:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:43:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Also, could you please provide nnimanager's log again for us to check?

xieydd · 2018-11-27T02:29:20Z

@yds05 👍 , right route.

xieydd · 2018-11-27T02:30:37Z

@yds05 Thanks a lot, I will check my k8s cluster carefully.

xieydd · 2018-11-27T03:16:56Z

nnimanager.log

@yds05 This is my nnimanager's log.

yds05 · 2018-11-27T06:04:21Z

@xieydd Thanks.

I checked the log file and found that indeed no metric is recorded in the log.

Actually, for a local mode experiment, we write metrics data into a 'metrics' file.
You can run

kubectl exec -it {your_pod_name} -- /bin/bash

to enter your pod. Then go to your trial's working directory, for example

cd /root/nni/experiments/{your_experiment_id}/trials/{your_trial_id}

List the files and directories in the trial's working directory (for example with ls -al) and check whether there is a hidden directory called .nni. Then go into the .nni directory and check whether there are files named 'metrics' and 'metrics_offset'.
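The manual check above can also be scripted and run inside the pod. This is a minimal sketch: the function name `list_nni_files` is hypothetical, and the only layout it assumes is the hidden .nni directory inside a trial's working directory, as described above.

```python
from pathlib import Path
from typing import Dict


def list_nni_files(trial_dir: str) -> Dict[str, int]:
    """Map each regular file under <trial_dir>/.nni to its size in bytes."""
    nni_dir = Path(trial_dir) / ".nni"
    if not nni_dir.is_dir():
        # No hidden .nni directory in this trial's working directory.
        return {}
    return {
        p.name: p.stat().st_size
        for p in sorted(nni_dir.iterdir())
        if p.is_file()
    }
```

Pass it the trial's working directory. A 'metrics' entry with a non-zero size would suggest the trial did write its results, which narrows the problem down to how those results are collected.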

xieydd · 2018-11-27T06:16:32Z

Yep, I can see metrics, state, and sequence_id, but I can't see metrics_offset. @yds05

#metrics
ME000112{"sequence": 0, "value": 0.95, "trial_job_id......"}

#state
0 1543299009687

#sequence_id
1
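The metrics line above appears to be a framed record: the literal prefix ME, a six-digit number, then a JSON payload. The sketch below parses one such record under the assumption that the six digits give the payload length in characters. That framing is inferred from the sample line only, and `parse_metric_record` is a hypothetical helper, not an NNI function.

```python
import json
from typing import Any, Dict, Tuple


def parse_metric_record(buf: str) -> Tuple[Dict[str, Any], str]:
    """Parse one 'ME' + 6-digit length + JSON record; return (payload, rest)."""
    if not buf.startswith("ME"):
        raise ValueError("record does not start with 'ME'")
    # Assumption: the six digits are the decimal length of the JSON payload
    # (characters here, which equals bytes for ASCII-only payloads).
    length = int(buf[2:8])
    body = buf[8 : 8 + length]
    if len(body) < length:
        raise ValueError("record is truncated")
    return json.loads(body), buf[8 + length :]
```

Running this over the contents of the metrics file and reading the "value" field is one way to confirm which numbers the trial actually reported.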

xieydd · 2018-11-27T06:21:06Z

@yds05

yds05 · 2018-11-27T06:24:46Z

Oh, my fault. Local mode doesn't have a metrics_offset file under the .nni directory, so your container's file tree:

.nni
    metrics
    sequence_id
    state

is correct.

May I know your docker version?

xieydd · 2018-11-27T06:28:27Z

my docker

version 17.09.1-ce
api version: 1.32
go version: go 1.8.3

my k8s cluster version is

v1.11.1-beta

yds05 · 2018-11-27T10:03:01Z

@xieydd, I think I found the root cause. It's our fault: PR #273 targeted the v0.2 branch, but we forgot to merge it into the v0.3 and master branches. So the issue (metrics lost in some docker containers) still exists in the NNI v0.3 release.

I already sent out PR #400 to fix that issue on the master branch, and built a docker image at fishyds/nni:master-github. I think it should work now on your machine.

So could you please reuse my previous deployment yaml file to restart the deployment and try again?

Btw: my docker image is just a devel-preview, so please expect some issues (but not this one), since this version is still under development.

xieydd · 2018-11-27T10:07:07Z

@yds05 Did you just update your image?
I will try it. Thanks a lot 👍

yds05 · 2018-11-27T10:10:07Z

Yes, I just updated my image. Please give it a try.

xieydd · 2018-11-27T11:54:59Z

@yds05 I can see the download button for the json file, and I can see the data.
Haha, can you give me a timeline for a stable version? @yds05

yds05 · 2018-11-27T12:10:11Z

@xieydd Congrats!

We will release NNI v0.4 next week, including the PyPI package and docker image.

xieydd · 2018-11-27T12:16:20Z

Thanks very much.

yds05 · 2018-12-06T10:07:49Z

@xieydd, we released NNI v0.4 yesterday and your issue is fixed in this release. You can use our latest docker image msranni/nni:v0.4 to verify. Thanks.

xieydd closed this as completed Nov 13, 2018

xieydd reopened this Nov 14, 2018

scarlett2018 added question investigation user raised labels Nov 14, 2018

scarlett2018 assigned yds05 Nov 14, 2018

xieydd closed this as completed Nov 27, 2018

scarlett2018 added bug kubeflow Training Service and removed investigation question labels Apr 16, 2020
