This repository has been archived by the owner on Aug 17, 2023. It is now read-only.

Initial operator code #189

Merged
4 commits merged on Jan 30, 2020

Conversation

Tomcli
Member

@Tomcli Tomcli commented Jan 22, 2020

Initial code for the Kubeflow Operator, based on design discussions in the community.

Operator code structure:

/deploy: Contains all the k8s resources for deploying the operator image and CRD.
/build: Operator image build script.
/pkg/controller: Main package for the operator controller logic.
/cmd/manager: main.go file for the operator Go program.
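
For readers unfamiliar with the operator-sdk layout, the sketch below illustrates the general shape of the controller logic that lives under /pkg/controller: a controller-runtime reconcile loop for the KfDef resource. It is illustrative only and is not the code in this PR; the ReconcileKfDef type, the applyKfDef helper, and the kfdefv1 import path are assumptions based on operator-sdk conventions.

```go
// A minimal, illustrative reconciler skeleton; not the code in this PR.
package kfdef

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Hypothetical import path standing in for the real KfDef API package.
	kfdefv1 "example.com/kfdef/pkg/apis/kfdef/v1"
)

// ReconcileKfDef drives observed KfDef objects toward their declared state.
type ReconcileKfDef struct {
	client client.Client
}

// Reconcile is invoked for every event on a watched KfDef resource
// (controller-runtime signature of the operator-sdk v0.x era).
func (r *ReconcileKfDef) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	ctx := context.Background()

	// Fetch the KfDef instance named in the request.
	instance := &kfdefv1.KfDef{}
	if err := r.client.Get(ctx, request.NamespacedName, instance); err != nil {
		if errors.IsNotFound(err) {
			// The resource was deleted; nothing left to reconcile.
			return reconcile.Result{}, nil
		}
		// Transient read error: returning it requeues the request.
		return reconcile.Result{}, err
	}

	// Hand the spec to the deployment logic. applyKfDef is a hypothetical
	// stand-in for whatever actually applies the Kubeflow deployment.
	if err := applyKfDef(instance); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}

// applyKfDef is a placeholder for the real apply logic.
func applyKfDef(kfdef *kfdefv1.KfDef) error {
	return nil
}
```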

Deployment and build instructions are located in operator.md.

Design Doc for this PR: https://docs.google.com/document/d/1vNBZOM-gDMpwTbhx0EDU6lDpyUjc7vhT3bdOWWCRjdk/edit

Co-authored-by: Animesh Singh singhan@us.ibm.com
Co-authored-by: Weiqiang Zhuang wzhuang@us.ibm.com

Issue tracking: kubeflow/kubeflow#4570



@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to determine that you authored the commits in this PR. Maybe you used a different email address in the git commits than was used to sign the CLA? If someone else authored these commits, then please add them to this pull request and have them confirm that they're okay with them being contributed to Google. If there are co-authors, make sure they're formatted properly.

In order to pass this check, please resolve this problem and then comment @googlebot I fixed it... If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@Tomcli
Member Author

Tomcli commented Jan 22, 2020

cc @animeshsingh @adrian555

Contributor

@animeshsingh animeshsingh left a comment

Making an initial pass.

@Tomcli Tomcli changed the title from "Initial operator code" to "[WIP] Initial operator code" on Jan 22, 2020
@animeshsingh
Contributor

/label cla:yes

@k8s-ci-robot
Contributor

@animeshsingh: The label(s) /label cla:yes cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/label cla:yes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi
Contributor

jlewi commented Jan 22, 2020

@animeshsingh There's no way I'm aware of to have multiple authors on a commit. All the commits on the branch are squashed into a single commit when the PR is merged, so there would only be one author from a git history perspective.

If you want to fix this PR to pass the CLA check, you can just rebase your branch and squash the commits into a single commit, which would have a single author.

@jlewi
Contributor

jlewi commented Jan 22, 2020

I'm super excited to see this coming along so quickly!

@jlewi
Contributor

jlewi commented Jan 22, 2020

Please update the PR description to provide information that will help the reviewers:

  • E.g. what framework, if any, is being used to generate the operator?

  • What is the proposed file structure?

    • e.g. why /build for the Docker image?
  • If possible it might be nice to break this up into smaller PRs to make this easier to review; one possible split would be:

    • Go code
    • Docker image
    • YAML manifest

nit: Please link to the issue tracking the operator.

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@animeshsingh
Contributor

animeshsingh commented Jan 22, 2020

Please update the PR description to provide information that will help the reviewers:

  • E.g. what framework, if any, is being used to generate the operator?

  • What is the proposed file structure?

    • e.g. why /build for the Docker image?
  • If possible it might be nice to break this up into smaller PRs to make this easier to review; one possible split would be:

    • Go code
    • Docker image
    • YAML manifest

nit: Please link to the issue tracking the operator.

@jlewi quite a bit of the code is generated code, so it's fine keeping it in this single PR IMHO. We are using operator-sdk as mentioned in the design doc, and the code structure, also described in the design doc, is generated by the SDK.

/deploy: Contains all the k8s resources for deploying the operator image and CRD.
/build: Operator image build script.
/pkg/controller: Main package for the operator controller logic.
/cmd/manager: main.go file for the operator Go program.

Issue tracking this:
kubeflow/kubeflow#4570

@Tomcli Tomcli changed the title from "[WIP] Initial operator code" to "Initial operator code" on Jan 22, 2020
@animeshsingh
Contributor

cc @pdmack @vpavlin

@nrchakradhar

Please update the PR description to provide information that will help the reviewers:

  • E.g. what framework, if any, is being used to generate the operator?

  • What is the proposed file structure?

    • e.g. why /build for the Docker image?
  • If possible it might be nice to break this up into smaller PRs to make this easier to review; one possible split would be:

    • Go code
    • Docker image
    • YAML manifest

nit: Please link to the issue tracking the operator.

@jlewi quite a bit of the code is generated code, so it's fine keeping it in this single PR IMHO. We are using operator-sdk as mentioned in the design doc, and the code structure, also described in the design doc, is generated by the SDK.

/deploy: Contains all the k8s resources for deploying the operator image and CRD.
/build: Operator image build script.
/pkg/controller: Main package for the operator controller logic.
/cmd/manager: main.go file for the operator Go program.

Issue tracking this:
kubeflow/kubeflow#4570

It's not very clear which files are generated by the operator-sdk and which files are modified.
It would be good if the steps for generating the code were included here, so that the procedure is clear when new updates have to be made.

@animeshsingh
Contributor

It's not very clear which files are generated by the operator-sdk and which files are modified.
It would be good if the steps for generating the code were included here, so that the procedure is clear when new updates have to be made.

@nrchakradhar there are developer instructions towards the bottom of the README - this is no different from any Go project relying on kubebuilder and other SDKs.

@nrchakradhar

@nrchakradhar there are developer instructions towards the bottom of the README - this is no different from any Go project relying on kubebuilder and other SDKs.

Thanks @animeshsingh. After revisiting the PR description and reading operator.md, the flow is clear.

@swiftdiaries
Member

This is really great! Thank you for the PR :)

@kunmingg
Contributor

Can we add logic for setting OwnerReference of the k8s resources owned by kfdef?
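
For reference, controller-runtime already ships a helper that sets such an owner reference; a minimal sketch is below (the kfdefv1 import path and the choice of a Deployment as the owned resource are assumptions, not code from this PR). With the owner reference in place, deleting the KfDef would garbage-collect the owned resources.

```go
// A sketch of setting an owner reference with controller-runtime's helper.
package kfdef

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	// Hypothetical import path standing in for the real KfDef API package.
	kfdefv1 "example.com/kfdef/pkg/apis/kfdef/v1"
)

// setOwner marks a Deployment as owned by the given KfDef, so the Deployment
// is garbage-collected when the KfDef is deleted and its events can be routed
// back to the KfDef controller via an owner-based watch.
func setOwner(kfDef *kfdefv1.KfDef, deploy *appsv1.Deployment, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(kfDef, deploy, scheme)
}
```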

@jlewi
Contributor

jlewi commented Jan 24, 2020

Can we add logic for setting OwnerReference of the k8s resources owned by kfdef?

@kunmingg @Tomcli @animeshsingh

What is the relationship between the cluster where the kfctl operator is running and the cluster where Kubeflow is deployed?

My expectation would be that the cluster where Kubeflow is actually deployed depends on the KFDef.
If the KFDef specifies creating a new cluster, then it would be a different cluster.
However, if the KFDef specifies an existing cluster which happens to be the same cluster where the operator is running, then it would be installed on that cluster.

How does that match what's implemented here?

Contributor

@jlewi jlewi left a comment

Reviewed 8 of 19 files at r1, 1 of 2 files at r2, 1 of 3 files at r3, 1 of 5 files at r5, 2 of 5 files at r6.
Reviewable status: 13 of 20 files reviewed, 18 unresolved discussions (waiting on @animeshsingh, @gabrielwen, @kkasravi, @krishnadurai, @Tomcli, and @vpavlin)


deploy/operator.yaml, line 20 at r3 (raw file):

Previously, animeshsingh (Animesh Singh) wrote…

Let's figure out a versioning scheme and final location for this image

Should we just version it as 0.1.0?
The only other scheme that I can think of would be to try to match Kubeflow versioning which would be 1.0. But this isn't 1.0 yet so I don't think we should do that.

@jlewi
Contributor

jlewi commented Jan 24, 2020

No major comments on my end.
Overall this looks pretty good.

@animeshsingh
Contributor

This is the pattern GCP is adopting with KCC
https://cloud.google.com/config-connector/docs/overview

What about the current implementation assumes that Kubeflow is being deployed onto the current cluster?

It looks like the operator is just calling kfctl Apply.

So with this version, 0.1, the expectation is that the operator is running on the same cluster, using in-cluster config to know about the cluster, and relies on cluster events for the watcher. This is the general pattern we have seen, and we can start looking at a KCC-like pattern for the next iteration.

So overall this looks good to go.
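
For context, the two client-go configuration modes being discussed look roughly like this; a sketch, not code from this PR. An empty kubeconfig path corresponds to the in-cluster, same-cluster mode described above, while a kubeconfig would let the operator target a different cluster.

```go
// A sketch contrasting in-cluster and kubeconfig-based client configuration.
package main

import (
	"flag"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// getConfig returns a REST config for the target cluster. With an empty
// kubeconfig path it uses the pod's service account and the in-cluster API
// endpoint; with a kubeconfig path it can point at a remote cluster instead.
func getConfig(kubeconfig string) (*rest.Config, error) {
	if kubeconfig == "" {
		return rest.InClusterConfig()
	}
	return clientcmd.BuildConfigFromFlags("", kubeconfig)
}

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig; empty means in-cluster")
	flag.Parse()

	if _, err := getConfig(*kubeconfig); err != nil {
		panic(err)
	}
}
```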

@jlewi
Contributor

jlewi commented Jan 28, 2020

So with this version, 0.1, the expectation is that the operator is running on the same cluster, using in-cluster config to know about the cluster, and relies on cluster events for the watcher. This is the general pattern we have seen, and we can start looking at a KCC-like pattern for the next iteration.

For version 0.1 I would like the operator to be able to mimic kfctl behavior, which would essentially mean being able to create and deploy on clusters different from the one the operator is running on.

For all intents and purposes the only difference between using the operator and running kfctl is that in the former kfctl is running in a pod on a cluster as opposed to your local machine.

Unless I'm mistaken it already looks like that is what the code is doing because it is just invoking
KFDef.Apply
https://github.com/kubeflow/kfctl/pull/189/files#diff-add365d9fedf7b3665f97780bce9f92fR228

Thus the logic for determining which cluster is used should be determined by the existing kfctl libraries.

I'm pointing this out mainly to avoid confusion on subsequent PRs, i.e. the operator being able to deploy KF on other clusters should be considered working as intended and not a bug to be fixed in subsequent PRs.

/hold cancel

@animeshsingh
Contributor

animeshsingh commented Jan 28, 2020

For all intents and purposes the only difference between using the operator and running kfctl is that in the former kfctl is running in a pod on a cluster as opposed to your local machine.

@jlewi Not necessarily. kfctl is fire-and-forget, whereas an operator has a much bigger responsibility to continuously monitor, and to keep fixing the deployment in the background if things go wrong. That's where things differ. Please look at the code here
https://github.com/kubeflow/kfctl/pull/189/files#diff-add365d9fedf7b3665f97780bce9f92fR82

The watcher concept, if it cannot rely on in-cluster config and events coming from the underlying cluster, falls back to a continuous polling mode, which is not conducive. Hence it's not a bug, but a design-pattern decision to adopt something like KCC. In most operator deployments I am aware of, the operators are running on the same cluster. So IMHO this is a good first pass which covers 80% of deployment scenarios, and in subsequent iterations we can start looking at remote deployments.
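
To illustrate the event-driven pattern being described (as opposed to polling), a controller-runtime watch registration looks roughly like the sketch below, using operator-sdk v0.x era APIs; the kfdefv1 import path is an assumption, and this is not the code in this PR.

```go
// A sketch of registering an event-driven watch on KfDef objects.
package kfdef

import (
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	// Hypothetical import path standing in for the real KfDef API package.
	kfdefv1 "example.com/kfdef/pkg/apis/kfdef/v1"
)

// addWatch wires a reconciler to KfDef events: the API server pushes
// create/update/delete notifications to the controller's work queue, so the
// operator reacts to changes without a polling loop, as long as it can watch
// the cluster where the KfDef lives.
func addWatch(mgr manager.Manager, r reconcile.Reconciler) error {
	c, err := controller.New("kfdef-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	return c.Watch(&source.Kind{Type: &kfdefv1.KfDef{}}, &handler.EnqueueRequestForObject{})
}
```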

@animeshsingh
Contributor

/retest

@jlewi
Contributor

jlewi commented Jan 28, 2020

Can't the watchers be configured to point at a different cluster? Could you please open an issue to track having the operator create the cluster and deploy on that cluster? Otherwise the vast majority of KFDef specs won't be compatible with the operator.

@animeshsingh
Contributor

Can't the watchers be configured to point at a different cluster?

Yes, and as I mentioned, that can be followed up in a separate PR? The first review comment was to break this into smaller PRs - so in the same spirit it doesn't make sense to continue doubling down on this one?

Could you please open an issue to track having the operator create the cluster and deploy on that cluster? Otherwise the vast majority of KFDef specs won't be compatible with the operator.

I have created a roadmap issue:
#193

@Tomcli
Member Author

Tomcli commented Jan 28, 2020

/retest

1 similar comment
@animeshsingh
Contributor

/retest

@Tomcli
Member Author

Tomcli commented Jan 28, 2020

/test kubeflow-kfctl-presubmit

@kunmingg
Contributor

You might need to rebase on master

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 28, 2020
@Tomcli
Member Author

Tomcli commented Jan 28, 2020

Hi @kunmingg, I rebased my commits and the test still fails at kfupgrade. It looks like some of the pods are unavailable after the upgrade.

@kunmingg
Contributor

/retest

1 similar comment
@animeshsingh
Contributor

/retest

@animeshsingh
Contributor

@kunmingg @jlewi if it fails even this time, we would appreciate some help. The tests here have been very flaky.

@jlewi
Contributor

jlewi commented Jan 29, 2020

@animeshsingh @Tomcli Yes the tests are flaky. That's because development has outpaced investments in test infrastructure. See for example:

#57
kubeflow/testing#585

My question to you would be: if we are triggering unnecessary tests, what improvements can we make to avoid triggering those tests?

@jlewi
Contributor

jlewi commented Jan 29, 2020

It looks like this PR isn't modifying existing code, so by using the existing include features in the test infra, or by investing in an exclude mechanism like kubeflow/testing#585, we should be able to adjust the tests to avoid triggering them when only the files in this PR are modified.

@Tomcli
Member Author

Tomcli commented Jan 30, 2020

I rebased my commits and the presubmit test should be good to go now.

@jlewi
Contributor

jlewi commented Jan 30, 2020

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi
Contributor

jlewi commented Jan 30, 2020

Woo Hoo!
