Support kubeflow.org/pytorchjob #995

Merged

tenzen-y
Member

@tenzen-y tenzen-y commented Jul 18, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Support kubeflow.org/pytorchjob.

Which issue(s) this PR fixes:

Part-of #297

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Support kubeflow.org/pytorchjob

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 18, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 18, 2023
netlify bot commented Jul 18, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 775ca92
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/64db11c388c0cb00082002b3

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 18, 2023
@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch 7 times, most recently from 801b061 to bd3d770 Compare July 18, 2023 12:50
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 18, 2023
@tenzen-y
Member Author

Blocked on kubeflow/trainer#1859.

@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch 2 times, most recently from 61b2e25 to 92a16ca Compare July 18, 2023 16:08
@tenzen-y
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2023
@tenzen-y tenzen-y changed the title from "WIP: Support kubeflow.org/pytorchjob" to "Support kubeflow.org/pytorchjob" Jul 18, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 18, 2023
@tenzen-y
Member Author

/assign @trasc

Contributor

@trasc trasc left a comment

The generic approach around pkg/controller/jobs/kubeflow/kubeflowjob makes the implementation a bit hard to follow. Are we planning to support other similar Kubeflow jobs in the near future?

@@ -32,3 +32,4 @@ integrations:
   - "kubeflow.org/mpijob"
   - "ray.io/rayjob"
   - "jobset.x-k8s.io/jobset"
+  - "kubeflow.org/pytorchjob"
Contributor

The same should be done in charts/kueue/values.yaml

integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/mpijob"
  - "ray.io/rayjob"

Member Author

Done.

Comment on lines +125 to +134
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  verbs:
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs/status
  verbs:
  - get
  - update
Contributor

Two extra role files should be created: config/components/rbac/pytorchjob_editor_role.yaml and config/components/rbac/pytorchjob_viewer_role.yaml (see config/components/rbac/jobset_XXX_role.yaml for the content). Both should be referenced in config/components/rbac/kustomization.yaml.

Member Author

It's a good point. Thanks.

Member Author

Done.

go.mod Outdated
@@ -83,3 +85,5 @@ require (
sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect
sigs.k8s.io/yaml v1.3.0 // indirect
)

replace github.com/kubeflow/training-operator v1.6.0 => github.com/tenzen-y/training-operator v1.3.0-rc.1.0.20230717233919-1ed3e8e55322
Contributor

We should drop the replace directive and do one of the following:

a. Hold this PR until kubeflow/trainer#1859 is merged and a new kubeflow/training-operator release containing it is available.
b. Hold this PR until kubeflow/trainer#1859 is merged, use a non-release version of github.com/kubeflow/training-operator (v1.7.0-xxx-xxxx), and create a follow-up PR to switch to a release version when one is available.

Member Author

Yes, I will do b.

Member Author

Done.

Comment on lines 53 to 74
type PyTorchJob struct {
	*kftraining.PyTorchJob
	*kubeflowjob.KubeflowJob
}
Contributor

@trasc trasc Jul 19, 2023

If you make Object() part of the KFJobControl, you could remove the kftraining.PyTorchJob member and potentially change this to:

Suggested change
-type PyTorchJob struct {
-	*kftraining.PyTorchJob
-	*kubeflowjob.KubeflowJob
-}
+type PyTorchJob kubeflowjob.KubeflowJob

Member Author

It makes sense.

Member Author

For composition, we need to use an embedded struct, so I will change this to the following:

type PyTorchJob struct {
	*kubeflowjob.KubeflowJob
}

Contributor

I think it should work the same way; a type definition is already used like this to "make" the generic job in all the integrations we currently have.
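
For readers following the thread, here is a minimal sketch of the pattern the reviewer appears to be referring to: a defined type over an API object, with the job methods declared on the defined type and a pointer conversion used to wrap an existing object. The PyTorchJob struct and the cut-down GenericJob interface below are hypothetical stand-ins, not the real kftraining or jobframework types.

```go
package main

import "fmt"

// PyTorchJob is a simplified, hypothetical stand-in for kftraining.PyTorchJob.
type PyTorchJob struct {
	Name      string
	Suspended bool
}

// GenericJob is a hypothetical, cut-down version of a job interface.
type GenericJob interface {
	Suspend()
	IsSuspended() bool
}

// Job is a defined type whose underlying type is PyTorchJob.
// It shares the fields of PyTorchJob but has its own method set.
type Job PyTorchJob

func (j *Job) Suspend()          { j.Suspended = true }
func (j *Job) IsSuspended() bool { return j.Suspended }

// Compile-time check that *Job satisfies GenericJob.
var _ GenericJob = (*Job)(nil)

func main() {
	obj := &PyTorchJob{Name: "demo"}

	// Converting *PyTorchJob to *Job does not copy the object;
	// both pointers refer to the same value.
	var gj GenericJob = (*Job)(obj)
	gj.Suspend()

	fmt.Println(obj.Suspended) // true
}
```

The conversion is legal because *Job and *PyTorchJob have base types with identical underlying types, so no wrapper struct is needed in this simple case.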

Member Author

@trasc Does that mean we should use the type alias instead of the embedded struct?

If so, it means that we remove kubeflowjob.KubeflowJob, and then every Kubeflow integration (TFJob, PaddleJob, ...) has to declare the same methods again, such as RunWithPodSetsInfo, RestorePodSetsInfo, and so on.
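
The concern above rests on a Go language rule: a defined type (type A B) does not inherit the methods of B, whereas an embedded field promotes them. The sketch below illustrates only that rule; KubeflowJob, GenericJob, EmbeddedJob, and DefinedJob are hypothetical stand-ins and do not reproduce the code in this PR.

```go
package main

import "fmt"

// KubeflowJob is a hypothetical stand-in for the shared implementation.
type KubeflowJob struct{ suspended bool }

func (j *KubeflowJob) Suspend()          { j.suspended = true }
func (j *KubeflowJob) IsSuspended() bool { return j.suspended }

// GenericJob is a hypothetical, cut-down job interface.
type GenericJob interface {
	Suspend()
	IsSuspended() bool
}

// EmbeddedJob embeds *KubeflowJob, so its methods are promoted.
type EmbeddedJob struct {
	*KubeflowJob
}

// DefinedJob has the same underlying struct, but a defined type
// starts with an empty method set.
type DefinedJob KubeflowJob

// Compile-time check: the embedded variant satisfies the interface.
var _ GenericJob = &EmbeddedJob{}

// The line below would not compile, because *DefinedJob has no methods:
// var _ GenericJob = &DefinedJob{}

func main() {
	var e GenericJob = &EmbeddedJob{KubeflowJob: &KubeflowJob{}}
	e.Suspend()
	fmt.Println(e.IsSuspended()) // true

	_, ok := any(&DefinedJob{}).(GenericJob)
	fmt.Println(ok) // false
}
```

With the defined-type variant, each integration would need its own method declarations, which is the duplication the comment above is worried about.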

Contributor

@trasc trasc Jul 20, 2023

Thinking about it, you could do the same for GetGVK(), and you might be able to make KubeflowJob generic and:

var NewReconciler = jobframework.NewGenericReconciler(func() jobframework.GenericJob {
-	pytorchJobObj := &kftraining.PyTorchJob{}
-	return &PyTorchJob{PyTorchJob: pytorchJobObj, KubeflowJob: kubeflowjob.NewKubeflowJob((*JobControl)(pytorchJobObj))}
+	return &kftraining.PyTorchJob{}
}, nil)

...

-type PyTorchJob struct {
-	*kftraining.PyTorchJob
-	*kubeflowjob.KubeflowJob
-}
+type PyTorchJob KubeflowJob[JobControl]

Member Author

If Object is part of KFJobControl, then you can change

func (j *KubeflowJob) Object() client.Object {
-	return nil
+	return j.KFJobControl.Object()
}

Then you don't need the *kftraining.PyTorchJob member.

Then you can simplify the struct, and also make it a type "redefinition": type PyTorchJob kubeflowjob.KubeflowJob.

Thanks for the clarifications.
If we do that, I think type PyTorchJob kubeflowjob.KubeflowJob doesn't implement the GenericJob interface, right?

Although type JobControl kftraining.PyTorchJob implements GenericJob.

Contributor

@trasc trasc Jul 20, 2023

Too many interfaces :)

To keep it simple, I think kubeflowjob.KubeflowJob should hold all the common code for Kubeflow (kftraining) jobs and implement GenericJob, using KFJobControl to get the job-specific bits (e.g. for kftraining.PyTorchJob).

To avoid confusion, KFJobControl should be a named member; don't embed it.

type KubeflowJob struct {
	c KFJobControl
}

The reconciler setup could look like:

var NewReconciler = jobframework.NewGenericReconciler(func() jobframework.GenericJob {
	return &kubeflowjob.KubeflowJob{c: &JobControl{}}
}, nil)

type JobControl kftraining.PyTorchJob

var _ kubeflowjob.KFJobControl = (*JobControl)(nil)
....
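
As an illustration of the layout suggested above, the following single-file sketch puts the common code on KubeflowJob and reaches the job-specific bits through a named KFJobControl member. Everything here is an assumption made for the example: the method names on KFJobControl, the cut-down GenericJob interface, and the simplified PyTorchJob struct are hypothetical. Note also that because the field c is unexported, code in another package could not set it in a composite literal, so the sketch uses a constructor (a NewKubeflowJob function is mentioned earlier in this thread).

```go
package main

import "fmt"

// PyTorchJob is a simplified, hypothetical stand-in for kftraining.PyTorchJob.
type PyTorchJob struct {
	suspend bool
}

// KFJobControl lists the job-specific operations. The method names are
// hypothetical and chosen only for this sketch.
type KFJobControl interface {
	Kind() string
	SetSuspend(suspend bool)
	Suspended() bool
}

// GenericJob is a hypothetical, cut-down job interface.
type GenericJob interface {
	Suspend()
	IsSuspended() bool
}

// KubeflowJob holds the code common to all Kubeflow jobs and delegates
// job-specific work through a named (not embedded) member.
type KubeflowJob struct {
	c KFJobControl
}

// NewKubeflowJob builds a KubeflowJob around a job-specific control.
func NewKubeflowJob(c KFJobControl) *KubeflowJob { return &KubeflowJob{c: c} }

func (j *KubeflowJob) Suspend()          { j.c.SetSuspend(true) }
func (j *KubeflowJob) IsSuspended() bool { return j.c.Suspended() }

// JobControl is a defined type over PyTorchJob that supplies the
// PyTorchJob-specific behaviour.
type JobControl PyTorchJob

func (j *JobControl) Kind() string            { return "PyTorchJob" }
func (j *JobControl) SetSuspend(suspend bool) { j.suspend = suspend }
func (j *JobControl) Suspended() bool         { return j.suspend }

// Compile-time interface checks.
var (
	_ KFJobControl = (*JobControl)(nil)
	_ GenericJob   = (*KubeflowJob)(nil)
)

func main() {
	ctrl := &JobControl{}
	var job GenericJob = NewKubeflowJob(ctrl)
	job.Suspend()
	fmt.Println(ctrl.Kind(), job.IsSuspended()) // PyTorchJob true
}
```

Keeping the control as a named field means the method set of KubeflowJob contains only what is declared on it explicitly, which may be what makes this variant easier to follow than the embedded one.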

Member Author

Hmm. Actually, I used the pattern you suggest before submitting this PR, then switched to the current implementation because it is more reusable.

However, I will roll back to your suggested implementation: if @trasc, who understands kueue well, is confused by this implementation, then almost everyone will be.

Member Author

Done.

Contributor

Missing header

also I think we could just import
"sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/pytorchjob"
in
pkg/controller/jobs/jobs.go

Member Author

In the future, we will add support for more frameworks, so I would like to keep the linking for Kubeflow here.

// Reference the job framework integration packages to ensure linking.
import (
	_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/pytorchjob"
	_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/tfjob"
	_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/paddlejob"
	_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/xgboostjob"
	_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/mxjob"
)
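
Blank imports like the ones above are useful only because importing a package runs its init functions. The sketch below simulates that in a single file: each init stands in for one integration package registering itself. The registry map and the register helper are hypothetical and are not the registration API used by kueue.

```go
package main

import (
	"fmt"
	"sort"
)

// registry maps a framework name to a setup function.
// It is a hypothetical stand-in for an integration registry.
var registry = map[string]func() string{}

// register adds an integration and rejects duplicate names.
func register(name string, setup func() string) {
	if _, exists := registry[name]; exists {
		panic("integration registered twice: " + name)
	}
	registry[name] = setup
}

// In a real layout each init below would live in its own package and
// would run only if that package is linked in, for example through a
// blank import.
func init() {
	register("kubeflow.org/pytorchjob", func() string { return "pytorchjob ready" })
}

func init() {
	register("kubeflow.org/tfjob", func() string { return "tfjob ready" })
}

func main() {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, "->", registry[name]())
	}
}
```

Collecting the blank imports for one family of integrations in a single file keeps the list of linked frameworks in one place, which appears to be the motivation given in the comment above.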

Member Author

Missing header

Done.

@tenzen-y
Member Author

The generic approach around pkg/controller/jobs/kubeflow/kubeflowjob makes the implementation a bit hard to follow. Are we planning to support other similar Kubeflow jobs in the near future?

@trasc Yes. We will support TFJob, MXJob, XGBoostJob, and PaddleJob.
cc: @alculquicondor

@kerthcet
Contributor

/cc

@k8s-ci-robot k8s-ci-robot requested a review from kerthcet July 19, 2023 06:53
@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch from 3589969 to fc23a88 Compare August 9, 2023 07:02
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2023
@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch from 66222ab to 8fe23aa Compare August 9, 2023 07:12
@tenzen-y tenzen-y changed the title from "WIP: Support kubeflow.org/pytorchjob" to "Support kubeflow.org/pytorchjob" Aug 9, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 9, 2023
@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch from 8fe23aa to b384a36 Compare August 9, 2023 07:14
@tenzen-y
Member Author

tenzen-y commented Aug 9, 2023

@trasc This PR is ready for review. PTAL.

Contributor

@trasc trasc left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2023
@tenzen-y
Member Author

/assign @alculquicondor

@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch from b384a36 to 37201a3 Compare August 14, 2023 17:58
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 14, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2023
@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch from 86c5b32 to 5fe074b Compare August 14, 2023 18:21
@tenzen-y
Member Author

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the support-kubeflow-pytorchjob branch from 5fe074b to 775ca92 Compare August 15, 2023 05:48
@tenzen-y
Member Author

Squashed.

@trasc
Contributor

trasc commented Aug 16, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 16, 2023
@alculquicondor
Contributor

/release-note-edit

Support kubeflow.org/pytorchjob

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 16, 2023
@k8s-ci-robot k8s-ci-robot merged commit 04a1421 into kubernetes-sigs:main Aug 16, 2023
@k8s-ci-robot k8s-ci-robot added this to the v0.5 milestone Aug 16, 2023
@tenzen-y tenzen-y deleted the support-kubeflow-pytorchjob branch August 16, 2023 12:31
@tenzen-y tenzen-y mentioned this pull request Aug 16, 2023
2 tasks
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.