This repository has been archived by the owner on Aug 17, 2023. It is now read-only.

Initial operator code #189

Merged
4 commits merged on Jan 30, 2020

Conversation

Tomcli
Member

@Tomcli Tomcli commented Jan 22, 2020

Initial code for the Kubeflow Operator, based on design discussions in the community.

Operator code structure:

/deploy: Contains all the k8s resources for deploying the operator image and CRD.
/build: Operator image build script.
/pkg/controller: Main package for the operator controller logic.
/cmd/manager: main.go file for the operator Go program.
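
For readers unfamiliar with the operator-sdk layout, the sketch below illustrates the general shape of the controller logic that lives under /pkg/controller: a controller-runtime reconcile loop for the KfDef resource. It is illustrative only and is not the code in this PR; the ReconcileKfDef type, the applyKfDef helper, and the kfdefv1 import path are assumptions based on operator-sdk conventions.

```go
// A minimal, illustrative reconciler skeleton; not the code in this PR.
package kfdef

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Hypothetical import path standing in for the real KfDef API package.
	kfdefv1 "example.com/kfdef/pkg/apis/kfdef/v1"
)

// ReconcileKfDef drives observed KfDef objects toward their declared state.
type ReconcileKfDef struct {
	client client.Client
}

// Reconcile is invoked for every event on a watched KfDef resource
// (controller-runtime signature of the operator-sdk v0.x era).
func (r *ReconcileKfDef) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	ctx := context.Background()

	// Fetch the KfDef instance named in the request.
	instance := &kfdefv1.KfDef{}
	if err := r.client.Get(ctx, request.NamespacedName, instance); err != nil {
		if errors.IsNotFound(err) {
			// The resource was deleted; nothing left to reconcile.
			return reconcile.Result{}, nil
		}
		// Transient read error: returning it requeues the request.
		return reconcile.Result{}, err
	}

	// Hand the spec to the deployment logic. applyKfDef is a hypothetical
	// stand-in for whatever actually applies the Kubeflow deployment.
	if err := applyKfDef(instance); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}

// applyKfDef is a placeholder for the real apply logic.
func applyKfDef(kfdef *kfdefv1.KfDef) error {
	return nil
}
```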

Deployment and build instructions are located in operator.md.

Design Doc for this PR: https://docs.google.com/document/d/1vNBZOM-gDMpwTbhx0EDU6lDpyUjc7vhT3bdOWWCRjdk/edit

Co-authored-by: Animesh Singh singhan@us.ibm.com
Co-authored-by: Weiqiang Zhuang wzhuang@us.ibm.com

Issue tracking: kubeflow/kubeflow#4570



@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to determine that you authored the commits in this PR. Maybe you used a different email address in the git commits than was used to sign the CLA? If someone else authored these commits, then please add them to this pull request and have them confirm that they're okay with them being contributed to Google. If there are co-authors, make sure they're formatted properly.

In order to pass this check, please resolve this problem and then comment @googlebot I fixed it... If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@Tomcli
Member Author

Tomcli commented Jan 22, 2020

cc @animeshsingh @adrian555

Contributor

@animeshsingh animeshsingh left a comment

Making an initial pass.

@Tomcli Tomcli changed the title from "Initial operator code" to "[WIP] Initial operator code" on Jan 22, 2020
@animeshsingh
Contributor

/label cla:yes

@k8s-ci-robot
Contributor

@animeshsingh: The label(s) /label cla:yes cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/label cla:yes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi
Contributor

jlewi commented Jan 22, 2020

@animeshsingh There's no way I'm aware of to have multiple authors on a commit. All the commits on the branch are squashed into a single commit when the PR is merged, so there would only be one author from a git history perspective.

If you want to fix this PR to pass the CLA check, you can just rebase your branch and squash the commits into a single commit, which would have a single author.

@jlewi
Contributor

jlewi commented Jan 22, 2020

I'm super excited to see this coming along so quickly!

@jlewi
Contributor

jlewi commented Jan 22, 2020

Please update the PR description to provide information that will help the reviewers:

  • E.g. what framework, if any, is being used to generate the operator?

  • What is the proposed file structure?

    • e.g. why /build for the Docker image?
  • If possible it might be nice to break this up into smaller PRs to make this easier to review; one possible split would be:

    • Go code
    • Docker image
    • YAML manifest

nit: Please link to the issue tracking the operator.

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@animeshsingh
Contributor

animeshsingh commented Jan 22, 2020

Please update the PR description to provide information that will help the reviewers:

  • E.g. what framework, if any, is being used to generate the operator?

  • What is the proposed file structure?

    • e.g. why /build for the Docker image?
  • If possible it might be nice to break this up into smaller PRs to make this easier to review; one possible split would be:

    • Go code
    • Docker image
    • YAML manifest

nit: Please link to the issue tracking the operator.

@jlewi quite a bit of the code is generated code, so it's fine keeping it in this single PR IMHO. We are using operator-sdk as mentioned in the design doc, and the code structure, also described in the design doc, is generated by the SDK.

/deploy: Contains all the k8s resources for deploying the operator image and CRD.
/build: Operator image build script.
/pkg/controller: Main package for the operator controller logic.
/cmd/manager: main.go file for the operator Go program.

Issue tracking this:
kubeflow/kubeflow#4570

@Tomcli Tomcli changed the title from "[WIP] Initial operator code" to "Initial operator code" on Jan 22, 2020
@animeshsingh
Contributor

cc @pdmack @vpavlin

@nrchakradhar

Please update the PR description to provide information that will help the reviewers:

  • E.g. what framework, if any, is being used to generate the operator?

  • What is the proposed file structure?

    • e.g. why /build for the Docker image?
  • If possible it might be nice to break this up into smaller PRs to make this easier to review; one possible split would be:

    • Go code
    • Docker image
    • YAML manifest

nit: Please link to the issue tracking the operator.

@jlewi quite a bit of the code is generated code, so it's fine keeping it in this single PR IMHO. We are using operator-sdk as mentioned in the design doc, and the code structure, also described in the design doc, is generated by the SDK.

/deploy: Contains all the k8s resources for deploying the operator image and CRD.
/build: Operator image build script.
/pkg/controller: Main package for the operator controller logic.
/cmd/manager: main.go file for the operator Go program.

Issue tracking this:
kubeflow/kubeflow#4570

It's not very clear which files are generated by the operator-sdk and which files are modified.
It would be good if the steps for generating the code were included here, so that the procedure is clear when new updates have to be made.

@animeshsingh
Contributor

It's not very clear which files are generated by the operator-sdk and which files are modified.
It would be good if the steps for generating the code were included here, so that the procedure is clear when new updates have to be made.

@nrchakradhar there are developer instructions towards the bottom of the README - this is no different from any Go project relying on kubebuilder and other SDKs.

@nrchakradhar

@nrchakradhar there are developer instructions towards the bottom of the README - this is no different from any Go project relying on kubebuilder and other SDKs.

Thanks @animeshsingh. After revisiting the PR description and reading operator.md, the flow is clear.

@swiftdiaries
Member

This is really great! Thank you for the PR :)

@kunmingg
Contributor

Can we add logic for setting OwnerReference of the k8s resources owned by kfdef?
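
For reference, controller-runtime already ships a helper that sets such an owner reference; a minimal sketch is below (the kfdefv1 import path and the choice of a Deployment as the owned resource are assumptions, not code from this PR). With the owner reference in place, deleting the KfDef would garbage-collect the owned resources.

```go
// A sketch of setting an owner reference with controller-runtime's helper.
package kfdef

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	// Hypothetical import path standing in for the real KfDef API package.
	kfdefv1 "example.com/kfdef/pkg/apis/kfdef/v1"
)

// setOwner marks a Deployment as owned by the given KfDef, so the Deployment
// is garbage-collected when the KfDef is deleted and its events can be routed
// back to the KfDef controller via an owner-based watch.
func setOwner(kfDef *kfdefv1.KfDef, deploy *appsv1.Deployment, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(kfDef, deploy, scheme)
}
```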

@jlewi
Contributor

jlewi commented Jan 24, 2020

Can we add logic for setting OwnerReference of the k8s resources owned by kfdef?

@kunmingg @Tomcli @animeshsingh

What is the relationship between the cluster where the kfctl operator is running and the cluster where Kubeflow is deployed?

My expectation would be that the cluster where Kubeflow is actually deployed depends on the KFDef.
If the KFDef specifies creating a new cluster, then it would be a different cluster.
However, if the KFDef specifies an existing cluster which happens to be the same cluster where the operator is running, then it would be installed on that cluster.

How does that match what's implemented here?

Contributor

@jlewi jlewi left a comment

Reviewed 8 of 19 files at r1, 1 of 2 files at r2, 1 of 3 files at r3, 1 of 5 files at r5, 2 of 5 files at r6.
Reviewable status: 13 of 20 files reviewed, 18 unresolved discussions (waiting on @animeshsingh, @gabrielwen, @kkasravi, @krishnadurai, @Tomcli, and @vpavlin)


deploy/operator.yaml, line 20 at r3 (raw file):

Previously, animeshsingh (Animesh Singh) wrote…

Let's figure out a versioning scheme and final location for this image

Should we just version it as 0.1.0?
The only other scheme that I can think of would be to try to match Kubeflow versioning which would be 1.0. But this isn't 1.0 yet so I don't think we should do that.

@jlewi
Contributor

jlewi commented Jan 24, 2020

No major comments on my end.
Overall this looks pretty good.

@animeshsingh
Contributor

This is the pattern GCP is adopting with KCC
https://cloud.google.com/config-connector/docs/overview

What about the current implementation assumes that Kubeflow is being deployed onto the current cluster?

It looks like the operator is just calling kfctl Apply.

So with this version, 0.1, the expectation is that the operator is running on the same cluster, using in-cluster config to know about the cluster, and relies on cluster events for the watcher. This is the general pattern we have seen, and we can start looking at a KCC-like pattern for the next iteration.

So overall this looks good to go.
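
For context, the two client-go configuration modes being discussed look roughly like this; a sketch, not code from this PR. An empty kubeconfig path corresponds to the in-cluster, same-cluster mode described above, while a kubeconfig would let the operator target a different cluster.

```go
// A sketch contrasting in-cluster and kubeconfig-based client configuration.
package main

import (
	"flag"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// getConfig returns a REST config for the target cluster. With an empty
// kubeconfig path it uses the pod's service account and the in-cluster API
// endpoint; with a kubeconfig path it can point at a remote cluster instead.
func getConfig(kubeconfig string) (*rest.Config, error) {
	if kubeconfig == "" {
		return rest.InClusterConfig()
	}
	return clientcmd.BuildConfigFromFlags("", kubeconfig)
}

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig; empty means in-cluster")
	flag.Parse()

	if _, err := getConfig(*kubeconfig); err != nil {
		panic(err)
	}
}
```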

@jlewi
Contributor

jlewi commented Jan 28, 2020

So with this version, 0.1, the expectation is that the operator is running on the same cluster, using in-cluster config to know about the cluster, and relies on cluster events for the watcher. This is the general pattern we have seen, and we can start looking at a KCC-like pattern for the next iteration.

For version 0.1 I would like the operator to be able to mimic kfctl behavior, which would essentially mean being able to create and deploy on clusters different from the one the operator is running on.

For all intents and purposes the only difference between using the operator and running kfctl is that in the former kfctl is running in a pod on a cluster as opposed to your local machine.

Unless I'm mistaken it already looks like that is what the code is doing because it is just invoking
KFDef.Apply
https://github.com/kubeflow/kfctl/pull/189/files#diff-add365d9fedf7b3665f97780bce9f92fR228

Thus the logic for determining which cluster is used should be determined by the existing kfctl libraries.

I'm pointing this out mainly to avoid confusion on subsequent PRs, i.e. the operator being able to deploy KF on other clusters should be considered working as intended and not a bug to be fixed in subsequent PRs.

/hold cancel

@animeshsingh
Contributor

animeshsingh commented Jan 28, 2020

For all intents and purposes the only difference between using the operator and running kfctl is that in the former kfctl is running in a pod on a cluster as opposed to your local machine.

@jlewi Not necessarily. kfctl is fire-and-forget, whereas an operator has a much bigger responsibility to continuously monitor, and to keep fixing the deployment in the background if things go wrong. That's where things differ. Please look at the code here
https://github.com/kubeflow/kfctl/pull/189/files#diff-add365d9fedf7b3665f97780bce9f92fR82

The watcher concept, if it cannot rely on in-cluster config and events coming from the underlying cluster, falls back to a continuous polling mode, which is not conducive. Hence it's not a bug, but a design-pattern decision to adopt something like KCC. In most operator deployments I am aware of, the operators are running on the same cluster. So IMHO this is a good first pass which covers 80% of deployment scenarios, and in subsequent iterations we can start looking at remote deployments.
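
To illustrate the event-driven pattern being described (as opposed to polling), a controller-runtime watch registration looks roughly like the sketch below, using operator-sdk v0.x era APIs; the kfdefv1 import path is an assumption, and this is not the code in this PR.

```go
// A sketch of registering an event-driven watch on KfDef objects.
package kfdef

import (
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	// Hypothetical import path standing in for the real KfDef API package.
	kfdefv1 "example.com/kfdef/pkg/apis/kfdef/v1"
)

// addWatch wires a reconciler to KfDef events: the API server pushes
// create/update/delete notifications to the controller's work queue, so the
// operator reacts to changes without a polling loop, as long as it can watch
// the cluster where the KfDef lives.
func addWatch(mgr manager.Manager, r reconcile.Reconciler) error {
	c, err := controller.New("kfdef-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	return c.Watch(&source.Kind{Type: &kfdefv1.KfDef{}}, &handler.EnqueueRequestForObject{})
}
```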

@animeshsingh
Contributor

/retest

@jlewi
Contributor

jlewi commented Jan 28, 2020

Can't the watchers be configured to point at a different cluster? Could you please open an issue to track having the operator create the cluster and deploy on that cluster? Otherwise the vast majority of KFDef specs won't be compatible with the operator.

@animeshsingh
Contributor

Can't the watchers be configured to point at a different cluster?

Yes, and as I mentioned, that can be followed up in a separate PR? The first review comment was to break this into smaller PRs - so in the same spirit it doesn't make sense to continue doubling down on this one?

Could you please open an issue to track having the operator create the cluster and deploy on that cluster? Otherwise the vast majority of KFDef specs won't be compatible with the operator.

I have created a roadmap issue:
#193

@Tomcli
Member Author

Tomcli commented Jan 28, 2020

/retest

1 similar comment
@animeshsingh
Contributor

/retest

@Tomcli
Member Author

Tomcli commented Jan 28, 2020

/test kubeflow-kfctl-presubmit

@kunmingg
Contributor

You might need to rebase on master

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 28, 2020
@Tomcli
Member Author

Tomcli commented Jan 28, 2020

Hi @kunmingg, I rebased my commits and the test still fails at kfupgrade. It looks like some of the pods are unavailable after the upgrade.

@kunmingg
Contributor

/retest

1 similar comment
@animeshsingh
Contributor

/retest

@animeshsingh
Contributor

@kunmingg @jlewi if it fails even this time, we would appreciate some help. The tests here have been very flaky.

@jlewi
Contributor

jlewi commented Jan 29, 2020

@animeshsingh @Tomcli Yes the tests are flaky. That's because development has outpaced investments in test infrastructure. See for example:

#57
kubeflow/testing#585

My question to you would be: if we are triggering unnecessary tests, what improvements can we make to avoid triggering those tests?

@jlewi
Contributor

jlewi commented Jan 29, 2020

It looks like this PR isn't modifying existing code, so by using the existing include features in the test infra, or by investing in an exclude mechanism like kubeflow/testing#585, we should be able to adjust the tests to avoid triggering them when only the files in this PR are modified.

@Tomcli
Member Author

Tomcli commented Jan 30, 2020

I rebased my commits and the presubmit test should be good to go now.

@jlewi
Contributor

jlewi commented Jan 30, 2020

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi
Contributor

jlewi commented Jan 30, 2020

Woo Hoo!
