Katib experiments don't work with control-plane label #4730
Comments
I think this is working as intended. Katib jobs should run in profile (i.e. user) namespaces, not in the system "kubeflow" namespace. The control-plane label prevents admission hooks from being applied to the namespace where the admission controller itself runs, which avoids deadlocks. For example, suppose an admission webhook in the kubeflow namespace is configured to reject a pod on error. Now imagine the webhook is configured but its pods aren't running (e.g. they were preempted). K8s will try to create those pods, which triggers the webhook, which returns an error because the pods aren't running. To prevent this, webhooks should never apply to the namespace where the controller itself is running. |
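To make the exclusion mechanism concrete, here is a minimal sketch of how a Kubernetes-style namespace selector decides whether a webhook applies to a namespace. The selector shown (`control-plane` with operator `DoesNotExist`) is an assumption for illustration; the thread does not show the selector Katib actually configures. The function and variable names are hypothetical.

```python
def selector_matches(match_expressions, labels):
    """Return True if every matchExpressions entry holds for the given labels.

    Implements the four standard label selector operators:
    In, NotIn, Exists, DoesNotExist.
    """
    for expr in match_expressions:
        key = expr["key"]
        op = expr["operator"]
        values = expr.get("values", [])
        if op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        elif op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            # NotIn also matches namespaces that do not carry the key at all.
            ok = key not in labels or labels[key] not in values
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# Assumed selector: apply the webhook only where the label is absent.
ASSUMED_WEBHOOK_SELECTOR = [{"key": "control-plane", "operator": "DoesNotExist"}]

kubeflow_ns_labels = {"control-plane": "kubeflow"}  # label reported in this issue
profile_ns_labels = {}                              # hypothetical user namespace

print(selector_matches(ASSUMED_WEBHOOK_SELECTOR, kubeflow_ns_labels))
print(selector_matches(ASSUMED_WEBHOOK_SELECTOR, profile_ns_labels))
```

Under this assumed selector the webhook is skipped for the kubeflow namespace and applied to the profile namespace, which would be consistent with the metrics collector not being injected for Experiments submitted in kubeflow.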
Thank you for your answer @jlewi. Actually, when I create a profile and submit a Katib job in the created namespace, the webhook works and adds the Metrics Collector spec to the Experiment. Unfortunately, in the Training Container pod I saw these errors:
Any thoughts @hougangliu @johnugeorge ? |
Is the istio sidecar injected into the training pod? |
Yes. |
With @krishnadurai's help we figured out the problem. |
Yeah, waiting until Istio sidecars are ready seems like a problem for a lot of workloads. I'm not sure what a good solution is. Adding retry/wait logic to every program seems bad. @andreyvelich It would be nice to prevent people from submitting jobs to the kubeflow namespace. Any thoughts on the best way to do that? Perhaps we should add a validating webhook? |
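For readers unfamiliar with the retry/wait logic mentioned above, this is roughly what each workload would have to carry. It is a generic polling sketch, not anything Katib or Istio ships; `wait_until_ready` and the `probe` callable are hypothetical names, and how a program would actually detect sidecar readiness is left to the caller.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or timeout_s elapses.

    probe is a caller-supplied callable (hypothetical here) that reports
    whether the dependency, such as a network sidecar, is usable yet.
    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    attempts = {"count": 0}

    def fake_probe():
        # Stand-in probe that succeeds on the third attempt.
        attempts["count"] += 1
        return attempts["count"] >= 3

    ready = wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.01)
    print(ready, attempts["count"])
```

Having to copy a loop like this into every training program is the cost being objected to in the comment above.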
@jlewi What do you think about adding annotation Yes, validating webhook sounds good to me. What do you think @johnugeorge @gaocegege ? |
Sounds good to me |
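As a rough illustration of the validating webhook idea discussed above, the sketch below builds an `admission.k8s.io/v1` AdmissionReview response that rejects objects created in a blocked namespace. It is not Katib's implementation; the function name `review_experiment` and the default blocked namespace are assumptions chosen for this example.

```python
def review_experiment(admission_review, blocked_namespace="kubeflow"):
    """Build an AdmissionReview response for a create request.

    Rejects the request when it targets blocked_namespace and allows it
    otherwise. Only the fields needed for the decision are read.
    """
    request = admission_review["request"]
    response = {"uid": request["uid"], "allowed": True}
    if request.get("namespace") == blocked_namespace:
        response["allowed"] = False
        response["status"] = {
            "message": (
                f"Experiments may not be submitted to the "
                f"'{blocked_namespace}' namespace; use a profile namespace."
            )
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


if __name__ == "__main__":
    incoming = {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "request": {"uid": "example-uid", "namespace": "kubeflow"},
    }
    print(review_experiment(incoming)["response"]["allowed"])
```

Note that a webhook like this only helps if it is actually invoked for the namespace in question, so its namespace selector would have to differ from one that skips control-plane namespaces.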
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
@andreyvelich Is this issue still relevant? |
@jlewi We fixed the problem with the Istio sidecar on Training Containers (kubeflow/katib#1050), but we didn't add a validation webhook to prevent users from submitting Experiments in the kubeflow namespace. Some users run Katib without other Kubeflow components, so they may submit Experiments in the kubeflow namespace. |
The intent is for the If what you really mean is a "standalone" deployment of Katib as opposed to an install of Kubeflow, then it's kind of up to you. I guess my question would be: why would a standalone, single deployment of Katib install and configure Katib in a namespace configured to be the control plane for Kubeflow? If you are trying to figure out some backwards-compatible way to support users who want to continue running in the kubeflow namespace, then I would probably look at various options for manual customization, e.g. manually changing the kubeflow namespace labels or the selector on the Katib admission hook. |
SGTM |
I think users want a standalone Katib deployment only if they installed it without other Kubeflow components, so it is not necessary to change the kubeflow namespace labels. In that situation our validation webhook can work this way:
What do you think @jlewi @gaocegege @johnugeorge ? |
@andreyvelich What is KATIB_CORE_NAMESPACE? Where is your webhook being configured? I don't see it in: Are webhooks being created by your controller as opposed to being defined declaratively? |
It is the core namespace where Katib components are deployed (https://www.kubeflow.org/docs/components/hyperparameter-tuning/env-variables/#katib-controller).
Yes, the validation webhook is created by the Katib controller. It runs |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
/lifecycle frozen |
@andreyvelich What still needs to be done to be able to close this issue? |
@davidspek Users can still submit Experiments in the kubeflow namespace even if Katib is installed as part of Kubeflow, so some of them still hit the problem with webhooks. We can follow the mechanism that I proposed here: #4730 (comment), or think of a better way. |
/close There has been no activity for a long time. Please reopen if necessary. |
@juliusvonkohout: Closing this issue. In response to this:
/kind bug
What steps did you take and what happened:
I used the https://github.com/kubeflow/manifests/blob/master/kfdef/kfctl_k8s_istio.yaml config to deploy Kubeflow on my GCP cluster. In my kubeflow namespace I saw 2 labels:
When I tried to run the Katib Random example, it didn't work correctly. The metrics collector container was not added to the training job. I think the Katib validating webhooks were not working correctly.
After that I deleted the control-plane=kubeflow label from the kubeflow namespace, and the experiment ran correctly.
Do you know how we use the control-plane=kubeflow label and why it can affect Katib webhooks?
/cc @jlewi @richardsliu @johnugeorge @hougangliu @gaocegege
Anything else you would like to add:
Related kubeflow/katib#1033.