Katib experiments don't work with control-plane label #4730

Closed
andreyvelich opened this issue Feb 4, 2020 · 23 comments

Comments

@andreyvelich
Member

/kind bug

What steps did you take and what happened:

I used the https://github.com/kubeflow/manifests/blob/master/kfdef/kfctl_k8s_istio.yaml config to deploy Kubeflow on my GCP cluster. The kubeflow namespace had two labels:

control-plane=kubeflow
katib-metricscollector-injection=enabled

When I tried to run the Katib Random example, it didn't work correctly: the metrics collector container was not added to the training job. I think the Katib validating webhooks were not working correctly.
After I deleted the control-plane=kubeflow label from the kubeflow namespace, the experiment ran correctly.

Do you know how we use the control-plane=kubeflow label and why it affects the Katib webhooks?
/cc @jlewi @richardsliu @johnugeorge @hougangliu @gaocegege

Anything else you would like to add:
Related kubeflow/katib#1033.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.96


@jlewi
Contributor

jlewi commented Feb 5, 2020

I think this is working as intended.

Katib jobs should run in profile (i.e. user) namespaces, not in the system "kubeflow" namespace.

The control-plane label is used to prevent admission hooks from being applied to the namespace where the admission controller is running. This is to prevent deadlocks.

For example, suppose you have an admission webhook in the kubeflow namespace that is configured to reject a pod on error. Now imagine that the webhook is configured but its pods aren't running (e.g. they got preempted). K8s will try to create those pods, which triggers the webhook, which returns an error because the pods aren't running.

So to prevent this, webhooks should never apply to the namespace where the controller itself is running.
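
The exclusion described above can be sketched as a namespace selector check. This is only an illustration: the thread does not show the selector the Kubeflow manifests actually use, so the `DoesNotExist` expression on `control-plane` is an assumption, and `matches_expression` and `webhook_applies` are hypothetical helper names. The operators follow the standard Kubernetes label selector semantics.

```python
from typing import Dict, List


def matches_expression(labels: Dict[str, str], expr: dict) -> bool:
    """Evaluate one Kubernetes-style matchExpressions entry against labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        # Also matches when the key is absent, as in Kubernetes.
        return labels.get(key) not in values
    raise ValueError(f"unsupported operator: {op}")


def webhook_applies(namespace_labels: Dict[str, str],
                    selector_expressions: List[dict]) -> bool:
    """A webhook applies only if every selector expression matches."""
    return all(matches_expression(namespace_labels, e)
               for e in selector_expressions)


# Assumed selector: skip any namespace that carries a control-plane label.
SKIP_CONTROL_PLANE = [{"key": "control-plane", "operator": "DoesNotExist"}]

kubeflow_ns = {"control-plane": "kubeflow",
               "katib-metricscollector-injection": "enabled"}
profile_ns = {"katib-metricscollector-injection": "enabled"}

print(webhook_applies(kubeflow_ns, SKIP_CONTROL_PLANE))
print(webhook_applies(profile_ns, SKIP_CONTROL_PLANE))
```

Under this assumed selector, the webhook is skipped for the labelled kubeflow namespace and applied to a profile namespace, which matches the behavior reported in this issue.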

@andreyvelich
Member Author

Thank you for your answer @jlewi.

Actually, when I create a profile and submit a Katib job in the created namespace, the webhook works and adds the metrics collector spec to the Experiment.
Should we handle the situation where a user tries to submit an Experiment in the namespace where the admission controller is running?

Also, unfortunately, I saw these errors in the training container pod:

I0205 22:38:51.633430      80 main.go:79] 2020-02-05T22:38:51Z DEBUG    Starting new HTTP connection (1): yann.lecun.com:80
I0205 22:38:51.672433      80 main.go:79] Traceback (most recent call last):
I0205 22:38:51.672697      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 157, in _new_conn
I0205 22:38:51.672861      80 main.go:79]     (self._dns_host, self.port), self.timeout, **extra_kw
I0205 22:38:51.673001      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/util/connection.py", line 84, in create_connection
I0205 22:38:51.673047      80 main.go:79]     raise err
I0205 22:38:51.673060      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/util/connection.py", line 74, in create_connection
I0205 22:38:51.673065      80 main.go:79]     sock.connect(sa)
I0205 22:38:51.673080      80 main.go:79] ConnectionRefusedError: [Errno 111] Connection refused
I0205 22:38:51.673085      80 main.go:79] 
I0205 22:38:51.673102      80 main.go:79] During handling of the above exception, another exception occurred:
I0205 22:38:51.673125      80 main.go:79] 
I0205 22:38:51.673147      80 main.go:79] Traceback (most recent call last):
I0205 22:38:51.673152      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 672, in urlopen
I0205 22:38:51.673162      80 main.go:79]     chunked=chunked,
I0205 22:38:51.673167      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
I0205 22:38:51.673177      80 main.go:79]     conn.request(method, url, **httplib_request_kw)
I0205 22:38:51.673182      80 main.go:79]   File "/usr/lib/python3.5/http/client.py", line 1122, in request
I0205 22:38:51.673210      80 main.go:79]     self._send_request(method, url, body, headers)
I0205 22:38:51.673216      80 main.go:79]   File "/usr/lib/python3.5/http/client.py", line 1167, in _send_request
I0205 22:38:51.673226      80 main.go:79]     self.endheaders(body)
I0205 22:38:51.673231      80 main.go:79]   File "/usr/lib/python3.5/http/client.py", line 1118, in endheaders
I0205 22:38:51.673240      80 main.go:79]     self._send_output(message_body)
I0205 22:38:51.673246      80 main.go:79]   File "/usr/lib/python3.5/http/client.py", line 944, in _send_output
I0205 22:38:51.673256      80 main.go:79]     self.send(msg)
I0205 22:38:51.673261      80 main.go:79]   File "/usr/lib/python3.5/http/client.py", line 887, in send
I0205 22:38:51.673271      80 main.go:79]     self.connect()
I0205 22:38:51.673293      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 184, in connect
I0205 22:38:51.673304      80 main.go:79]     conn = self._new_conn()
I0205 22:38:51.673309      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 169, in _new_conn
I0205 22:38:51.673324      80 main.go:79]     self, "Failed to establish a new connection: %s" % e
I0205 22:38:51.673330      80 main.go:79] urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe0142b0ef0>: Failed to establish a new connection: [Errno 111] Connection refused
I0205 22:38:51.673349      80 main.go:79] 
I0205 22:38:51.673355      80 main.go:79] During handling of the above exception, another exception occurred:
I0205 22:38:51.673390      80 main.go:79] 
I0205 22:38:51.673396      80 main.go:79] Traceback (most recent call last):
I0205 22:38:51.673405      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 449, in send
I0205 22:38:51.673411      80 main.go:79]     timeout=timeout
I0205 22:38:51.673419      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 720, in urlopen
I0205 22:38:51.673424      80 main.go:79]     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
I0205 22:38:51.673438      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 436, in increment
I0205 22:38:51.673444      80 main.go:79]     raise MaxRetryError(_pool, url, error or ResponseError(cause))
I0205 22:38:51.673475      80 main.go:79] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='yann.lecun.com', port=80): Max retries exceeded with url: /exdb/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe0142b0ef0>: Failed to establish a new connection: [Errno 111] Connection refused',))
I0205 22:38:51.673485      80 main.go:79] 
I0205 22:38:51.673500      80 main.go:79] During handling of the above exception, another exception occurred:
I0205 22:38:51.673510      80 main.go:79] 
I0205 22:38:51.673519      80 main.go:79] Traceback (most recent call last):
I0205 22:38:51.673524      80 main.go:79]   File "/opt/mxnet-mnist/mnist.py", line 102, in <module>
I0205 22:38:51.673548      80 main.go:79]     fit.fit(args, sym, get_mnist_iter)
I0205 22:38:51.673554      80 main.go:79]   File "/opt/mxnet-mnist/common/fit.py", line 182, in fit
I0205 22:38:51.673564      80 main.go:79]     (train, val) = data_loader(args, kv)
I0205 22:38:51.673569      80 main.go:79]   File "/opt/mxnet-mnist/mnist.py", line 63, in get_mnist_iter
I0205 22:38:51.673578      80 main.go:79]     'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
I0205 22:38:51.673583      80 main.go:79]   File "/opt/mxnet-mnist/mnist.py", line 43, in read_data
I0205 22:38:51.673659      80 main.go:79]     with gzip.open(utils.download_file(base_url+label, os.path.join('data',label))) as flbl:
I0205 22:38:51.673669      80 main.go:79]   File "/opt/mxnet-mnist/common/utils.py", line 41, in download_file
I0205 22:38:51.673680      80 main.go:79]     r = requests.get(url, stream=True)
I0205 22:38:51.673685      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 75, in get
I0205 22:38:51.673694      80 main.go:79]     return request('get', url, params=params, **kwargs)
I0205 22:38:51.673699      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 60, in request
I0205 22:38:51.673711      80 main.go:79]     return session.request(method=method, url=url, **kwargs)
I0205 22:38:51.673716      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 533, in request
I0205 22:38:51.673746      80 main.go:79]     resp = self.send(prep, **send_kwargs)
I0205 22:38:51.673752      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 646, in send
I0205 22:38:51.673761      80 main.go:79]     r = adapter.send(request, **kwargs)
I0205 22:38:51.673766      80 main.go:79]   File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 516, in send
I0205 22:38:51.673791      80 main.go:79]     raise ConnectionError(e, request=request)
I0205 22:38:51.673797      80 main.go:79] requests.exceptions.ConnectionError: HTTPConnectionPool(host='yann.lecun.com', port=80): Max retries exceeded with url: /exdb/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe0142b0ef0>: Failed to establish a new connection: [Errno 111] Connection refused',))
F0205 22:38:51.847515      80 main.go:95] Failed to wait for worker container: Process 46 hadn't completed: open /var/log/katib/46.pid: no such file or directory
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc00018a100, 0xc000298000, 0xa2, 0xf7)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb8
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0x129da40, 0xc000000003, 0xc00028e070, 0x12378ae, 0x7, 0x5f, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2d0
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0x129da40, 0x3, 0xc78f77, 0x27, 0xc000089ed8, 0x1, 0x1)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14b
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
	/go/src/github.com/kubeflow/katib/cmd/metricscollector/v1alpha3/file-metricscollector/main.go:95 +0x279

Any thoughts @hougangliu @johnugeorge ?

@gaocegege
Member

Is the istio sidecar injected into the training pod?

@andreyvelich
Member Author

andreyvelich commented Feb 6, 2020

Is the istio sidecar injected into the training pod?

Yes.
Also, I exec'd into the training container and ran requests.get("http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", stream=True); it worked fine and returned a 200 status code. The problem only occurs when the request is made from the training code.

@andreyvelich
Member Author

With @krishnadurai's help we figured out the problem.
I submitted another issue: #4742

@jlewi
Contributor

jlewi commented Feb 7, 2020

Yeah, waiting until the Istio sidecars are ready seems like a problem for a lot of workloads. I'm not sure what a good solution is. Adding retry/wait logic to every program seems bad.
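
For reference, the kind of retry/wait logic mentioned above might look like the following minimal sketch. It is not what Katib or the MNIST example does; `call_with_retries` and its parameters are hypothetical names. It retries on `OSError` by default, which covers the built-in `ConnectionRefusedError` and, since `requests.exceptions.RequestException` derives from `IOError`, the `requests` connection errors seen in the log.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or attempts run out, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)
            delay_seconds *= 2
    raise AssertionError("unreachable")
```

A training script could wrap its download call, for example `call_with_retries(lambda: requests.get(url, stream=True))`, so that a connection refused while the sidecar is still starting is retried instead of failing the trial. The `sleep` parameter exists only so the wait can be replaced in tests.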

@andreyvelich It would be nice to prevent people from trying to submit jobs to the kubeflow namespace. Any thoughts on what the best way to do that would be?

Perhaps we should add a validating webhook?

@andreyvelich
Member Author

@jlewi What do you think about adding the annotation sidecar.istio.io/inject: "false" to all Katib components?
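
As a rough illustration of that proposal, the annotation would go on the pod template metadata of each component. The sketch below operates on a plain dict shaped like a Kubernetes pod template; `disable_istio_injection` is a hypothetical helper, not part of Katib, and the container name in the usage lines is made up.

```python
import copy

ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(pod_template: dict) -> dict:
    """Return a copy of a pod template with Istio sidecar injection turned off."""
    patched = copy.deepcopy(pod_template)
    metadata = patched.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    # Annotation values are strings, so this is "false", not the boolean False.
    annotations[ISTIO_INJECT_ANNOTATION] = "false"
    return patched


# Hypothetical pod template used only for illustration.
template = {"metadata": {"labels": {"app": "example"}},
            "spec": {"containers": [{"name": "example"}]}}
patched = disable_istio_injection(template)
print(patched["metadata"]["annotations"])
```

The input is copied rather than modified in place, so the original template can still be used where injection is wanted.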

Yes, validating webhook sounds good to me.

What do you think @johnugeorge @gaocegege ?

@johnugeorge
Member

Sounds good to me

@stale

stale bot commented May 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label May 7, 2020
@jlewi
Contributor

jlewi commented May 16, 2020

@andreyvelich Is this issue still relevant?

@stale stale bot removed the lifecycle/stale label May 16, 2020
@andreyvelich
Member Author

andreyvelich commented May 17, 2020

@jlewi We fixed the problem with the Istio sidecar on training containers (kubeflow/katib#1050), but we didn't add a validation webhook to prevent users from submitting Experiments in the kubeflow namespace.

Some of the users use Katib without other Kubeflow components, so they can submit Experiment in Kubeflow namespace.
In that situation, what do you think is the best way to handle it?
/cc @johnugeorge @gaocegege

@jlewi
Contributor

jlewi commented May 29, 2020

Some of the users use Katib without other Kubeflow components, so they can submit Experiment in Kubeflow namespace.

The intent is for the kubeflow namespace to be the control plane, and by this definition, not being able to run workloads in that namespace seems like it's working as intended.

If what you really mean is a "standalone" deployment of Katib as opposed to an install of Kubeflow, then it's kind of up to you. I guess my question would be: why would the standalone, single deployment of Katib install and configure Katib in a namespace configured to be the control plane for Kubeflow?

If you are trying to figure out some backwards-compatible way to support users who want to continue running in the kubeflow namespace, then I would probably look at various options for manual customization, e.g. manually changing the kubeflow namespace labels or the selector on the Katib admission hook.

@gaocegege
Member

If you are trying to figure out some backwards-compatible way to support users who want to continue running in the kubeflow namespace, then I would probably look at various options for manual customization, e.g. manually changing the kubeflow namespace labels or the selector on the Katib admission hook.

SGTM

@andreyvelich
Member Author

If you are trying to figure out some backwards-compatible way to support users who want to continue running in the kubeflow namespace, then I would probably look at various options for manual customization, e.g. manually changing the kubeflow namespace labels or the selector on the Katib admission hook.

I think users want a standalone Katib deployment only if they installed it without other Kubeflow components, so it is not necessary to change the kubeflow namespace labels.

In that situation our validation webhook can work this way:

  1. If KATIB_CORE_NAMESPACE has the label control-plane=kubeflow, users can't submit Experiments in KATIB_CORE_NAMESPACE, which I believe is kubeflow in most cases (I am not sure whether a user can change the namespace during Kubeflow installation).

  2. If KATIB_CORE_NAMESPACE doesn't have the label control-plane=kubeflow, users can submit Experiments in any namespace.
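
The two rules above can be sketched as a small decision function. This only illustrates the proposed logic and is not code from the Katib validator; `validate_experiment_namespace` and its arguments are hypothetical, and the labels of the core namespace are assumed to be passed in as a plain dict.

```python
from typing import Dict, Optional

CONTROL_PLANE_LABEL = "control-plane"
CONTROL_PLANE_VALUE = "kubeflow"


def validate_experiment_namespace(
    experiment_namespace: str,
    core_namespace: str,
    core_namespace_labels: Dict[str, str],
) -> Optional[str]:
    """Return an error message if the Experiment must be rejected, else None."""
    is_control_plane = (
        core_namespace_labels.get(CONTROL_PLANE_LABEL) == CONTROL_PLANE_VALUE
    )
    # Rule 1: reject Experiments in the core namespace when it is a control plane.
    if is_control_plane and experiment_namespace == core_namespace:
        return (
            f"Experiments cannot be submitted in namespace "
            f"{core_namespace!r} because it is labelled "
            f"{CONTROL_PLANE_LABEL}={CONTROL_PLANE_VALUE}"
        )
    # Rule 2 (and all other namespaces under rule 1): allow the Experiment.
    return None


labelled = {"control-plane": "kubeflow"}
print(validate_experiment_namespace("kubeflow", "kubeflow", labelled))
print(validate_experiment_namespace("my-profile", "kubeflow", labelled))
print(validate_experiment_namespace("kubeflow", "kubeflow", {}))
```

With a standalone install, where the core namespace has no control-plane label, every namespace is accepted, including the core namespace itself.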

What do you think @jlewi @gaocegege @johnugeorge ?

@jlewi
Contributor

jlewi commented Jun 2, 2020

@andreyvelich What is KATIB_CORE_NAMESPACE? Where is your webhook being configured? I don't see it in:
https://github.com/kubeflow/manifests/tree/master/katib

Are webhooks being created by your controller as opposed to being defined declaratively?

@andreyvelich
Member Author

@andreyvelich What is KATIB_CORE_NAMESPACE?

It is the core namespace where the Katib components are deployed (https://www.kubeflow.org/docs/components/hyperparameter-tuning/env-variables/#katib-controller).

Where is your webhook being configured? I don't see it in:
https://github.com/kubeflow/manifests/tree/master/katib
Are webhooks being created by your controller as opposed to being defined declaratively?

Yes, the validation webhook is created by the Katib controller. It runs ValidateExperiment after a user submits an Experiment (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L45).

@stale

stale bot commented Aug 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member Author

/lifecycle frozen

@davidspek
Contributor

@andreyvelich What still needs to be done to be able to close this issue?

@andreyvelich
Member Author

@davidspek Users can still submit Experiments in the kubeflow namespace even if Katib is installed as part of Kubeflow, so some of them still hit the webhook problem.

We can follow the mechanism that I proposed here: #4730 (comment), or think of a better way.

@juliusvonkohout
Member

/close

There has been no activity for a long time. Please reopen if necessary.

@google-oss-prow

@juliusvonkohout: Closing this issue.

In response to this:

/close

There has been no activity for a long time. Please reopen if necessary.

