Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to determine that you authored the commits in this PR. Maybe you used a different email address in the git commits than was used to sign the CLA? If someone else authored these commits, then please add them to this pull request and have them confirm that they're okay with them being contributed to Google. If there are co-authors, make sure they're formatted properly. In order to pass this check, please resolve this problem and then comment. ℹ️ Googlers: Go here for more info.
Making an initial pass.
/label cla:yes
@animeshsingh: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@animeshsingh There's no way I'm aware of to have multiple authors on a commit. All the commits on the branch are squashed into a single commit when the PR is merged, so there would only be one author from a git history perspective. If you want this PR to pass the CLA check, you can just rebase your branch and squash the commits into a single commit, which would have a single author.
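The squash-into-one-commit suggestion above can be sketched as follows. This is a self-contained sandbox demonstration (branch and commit names are made up); in the real PR you would rebase onto origin/master and force-push with `--force-with-lease` instead of building a throwaway repo:

```shell
# Demonstrate squashing a multi-commit branch into one commit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com commit -q --allow-empty -m "base"
git tag base                       # stand-in for origin/master
git checkout -q -b feature
echo a > a.txt; git add a.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m "first"
echo b > b.txt; git add b.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m "second"
# Squash: move the branch pointer back to the base commit while keeping
# the combined changes staged, then create a single commit.
git reset -q --soft base
git -c user.name=dev -c user.email=dev@example.com commit -q -m "feature: squashed into one commit"
# The feature branch is now exactly one commit ahead of the base.
git rev-list --count base..feature
```

An interactive `git rebase -i` marking all but the first commit as `squash` achieves the same result while letting you edit the combined commit message.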
I'm super excited to see this coming along so quickly!
Please update the PR description to provide information that will help the reviewers.
nit: Please link to the issue tracking the operator. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
@jlewi quite a bit of the code is generated code, so it's fine keeping it in this single PR IMHO. We are using operator-sdk as mentioned in the design doc; the code structure is also described in the design doc and is generated by the sdk. /deploy contains all the k8s resources for deploying the operator image and CRD. Issue tracking this
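For context, a /deploy directory scaffolded by operator-sdk typically holds manifests along these lines. The file names, group, and image below are illustrative, not the exact contents of this PR:

```yaml
# deploy/crds/kfdef_crd.yaml -- hypothetical CRD registering the KfDef kind
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: kfdefs.kfdef.apps.kubeflow.org
spec:
  group: kfdef.apps.kubeflow.org
  names:
    kind: KfDef
    listKind: KfDefList
    plural: kfdefs
    singular: kfdef
  scope: Namespaced
  version: v1
---
# deploy/operator.yaml -- hypothetical Deployment running the operator image
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: kubeflow-operator
  template:
    metadata:
      labels:
        name: kubeflow-operator
    spec:
      serviceAccountName: kubeflow-operator
      containers:
        - name: kubeflow-operator
          image: example.com/kubeflow-operator:0.1.0  # version tag is a placeholder
```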
It's not very clear which files are generated by the operator-sdk and which files are hand-modified.
@nrchakradhar there are developer instructions towards the bottom of the README - this is no different from any Go project relying on kubebuilder and other SDKs.
Thanks @animeshsingh . After revisiting the PR description and reading operator.md the flow is clear. |
This is really great! Thank you for the PR :)
Can we add logic for setting
@kunmingg @Tomcli @animeshsingh What is the relationship between the cluster where the kfctl operator is running and the cluster where Kubeflow is deployed? My expectation would be that the cluster where Kubeflow is actually deployed depends on the KFDef. How does that match what's implemented in this behavior?
Reviewed 8 of 19 files at r1, 1 of 2 files at r2, 1 of 3 files at r3, 1 of 5 files at r5, 2 of 5 files at r6.
Reviewable status: 13 of 20 files reviewed, 18 unresolved discussions (waiting on @animeshsingh, @gabrielwen, @kkasravi, @krishnadurai, @Tomcli, and @vpavlin)
deploy/operator.yaml, line 20 at r3 (raw file):
Previously, animeshsingh (Animesh Singh) wrote…
Let's figure out a versioning scheme and final location for this image
Should we just version it as 0.1.0?
The only other scheme that I can think of would be to try to match Kubeflow versioning which would be 1.0. But this isn't 1.0 yet so I don't think we should do that.
No major comments on my end.
So with this version, 0.1, the expectation is that the operator is running on the same cluster, using in-cluster config to know about the cluster, and relying on cluster events for the watcher. This is the general pattern we have seen, and we can start looking at a KCC-like pattern for the next iteration. So overall this looks good to go.
For version 0.1 I would like the operator to be able to mimic … For all intents and purposes, the only difference between using the operator and running kfctl is that in the former kfctl is running in a pod on a cluster as opposed to your local machine. Unless I'm mistaken, it already looks like that is what the code is doing, because it is just invoking … Thus the logic for determining which cluster is used should be determined by the existing kfctl libraries. I'm pointing this out mainly to avoid confusion on subsequent PRs, i.e. the operator being able to deploy KF on other clusters should be considered working as intended and not a bug to be fixed in subsequent PRs.
/hold cancel
@jlewi Not necessarily. kfctl is fire-and-forget, whereas an operator has a much bigger responsibility: to continuously monitor, and to keep fixing the deployment in the background if things go wrong. That's where things differ. Please look at the code here. The concept of a watcher, if not relying on in-cluster config and events coming from the underlying cluster, then degrades into a continuous polling mode, which is not desirable. Hence it's not a bug, but a design-pattern decision to adopt something like KCC. In most operator deployments I am aware of, the operator runs on the same cluster. So IMHO this is a good first pass which covers 80% of deployment scenarios, and in subsequent iterations we can start looking at remote deployments.
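The distinction drawn above (fire-and-forget kfctl versus a continuously reconciling operator) can be sketched in miniature. This toy Go program stands in for the real controller-runtime watch loop: the KfDef spec and the live cluster state are reduced to maps, and every name here is illustrative rather than actual operator code:

```go
package main

import "fmt"

// state maps a component name to its desired or observed version.
// desired stands in for the KfDef spec; actual for the live cluster.
type state map[string]string

// reconcile drives actual toward desired. In a real operator this is
// re-run on every watch event, so out-of-band drift (a deleted or
// mutated resource) is repaired without continuous polling.
func reconcile(desired, actual state) []string {
	var actions []string
	// Create or update anything missing or out of spec.
	for name, want := range desired {
		if got, ok := actual[name]; !ok || got != want {
			actual[name] = want
			actions = append(actions, "apply "+name)
		}
	}
	// Prune anything not in the spec.
	for name := range actual {
		if _, ok := desired[name]; !ok {
			delete(actual, name)
			actions = append(actions, "delete "+name)
		}
	}
	return actions
}

func main() {
	desired := state{"istio": "v1", "pipelines": "v1"}
	actual := state{}

	// Initial apply: this much is equivalent to one-shot kfctl.
	fmt.Println(reconcile(desired, actual))

	// Someone deletes a component out-of-band; the next watch event
	// triggers another reconcile, which repairs the drift.
	delete(actual, "pipelines")
	fmt.Println(reconcile(desired, actual))
}
```

The second reconcile is exactly what fire-and-forget kfctl never does, which is the extra responsibility being described in this comment.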
/retest |
Can't the watchers be configured to point at a different cluster? Could you please open an issue to track having the operator create the cluster and deploy on that cluster? Otherwise the vast majority of KFDef specs won't be compatible with the operator.
Yes, and as I mentioned, this can be followed up in a separate PR. The first review comment was to break this into smaller PRs - so in the same spirit, it doesn't make sense to keep doubling down on this one.
Have created a roadmap issue |
/retest |
/retest |
/test kubeflow-kfctl-presubmit |
You might need to rebase on master |
Hi @kunmingg, I rebased my commits and the test still fails at kfupgrade. Looks like some of the pods are unavailable after the upgrade.
/retest |
/retest |
@animeshsingh @Tomcli Yes, the tests are flaky. That's because development has outpaced investment in test infrastructure. See for example: … I think my question to you would be: if we are triggering unnecessary tests, what improvements can we make to avoid triggering those tests?
It looks like this PR isn't modifying existing code, so by using the existing include features in the test infra, or by investing in an exclude mechanism like kubeflow/testing#585, we should be able to adjust the tests so they aren't triggered when only the files in this PR are modified.
remove old operator version package
Apply suggestions from code review
Thanks for the suggestions.
I rebased my commits and the presubmit test should be good to go now. |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jlewi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Woo Hoo! |
Initial code for Kubeflow Operator based on design discussions in the community.
Operator code structure:
Deployment and build instructions are located in operator.md.
Design Doc for this PR: https://docs.google.com/document/d/1vNBZOM-gDMpwTbhx0EDU6lDpyUjc7vhT3bdOWWCRjdk/edit
Co-authored-by: Animesh Singh singhan@us.ibm.com
Co-authored-by: Weiqiang Zhuang wzhuang@us.ibm.com
Issue tracking: kubeflow/kubeflow#4570