-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support kubeflow.org/pytorchjob #995
Support kubeflow.org/pytorchjob #995
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
801b061
to
bd3d770
Compare
Blocked on kubeflow/trainer#1859. |
61b2e25
to
92a16ca
Compare
/hold |
/assign @trasc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generic approach around pkg/controller/jobs/kubeflow/kubeflowjob
makes the implementation a bit hard to follow. Are we planing to support other similar kubeflow jobs in the near future?
@@ -32,3 +32,4 @@ integrations: | |||
- "kubeflow.org/mpijob" | |||
- "ray.io/rayjob" | |||
- "jobset.x-k8s.io/jobset" | |||
- "kubeflow.org/pytorchjob" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same should be done in charts/kueue/values.yaml
kueue/charts/kueue/values.yaml
Lines 84 to 88 in f215a43
integrations: | |
frameworks: | |
- "batch/job" | |
- "kubeflow.org/mpijob" | |
- "ray.io/rayjob" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
- apiGroups: | ||
- kubeflow.org | ||
resources: | ||
- pytorchjobs | ||
verbs: | ||
- get | ||
- list | ||
- patch | ||
- update | ||
- watch | ||
- apiGroups: | ||
- kubeflow.org | ||
resources: | ||
- pytorchjobs/status | ||
verbs: | ||
- get | ||
- update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two extra role files should be created config/components/rbac/pytorchjob_editor_role.yaml
and config/components/rbac/pytorchjob_viewer_role.yaml
(check the content of config/components/rbac/jobset_XXX_role.yaml
) and referenced in config/components/rbac/kustomization.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go.mod
Outdated
@@ -83,3 +85,5 @@ require ( | |||
sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect | |||
sigs.k8s.io/yaml v1.3.0 // indirect | |||
) | |||
|
|||
replace github.com/kubeflow/training-operator v1.6.0 => github.com/tenzen-y/training-operator v1.3.0-rc.1.0.20230717233919-1ed3e8e55322 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should drop the replace and either:
a. Hold this PR until kubeflow/trainer#1859 is merged and a new kubeflow/training-operator is available containing it.
b. Hold this PR until kubeflow/trainer#1859 is merged, use a non-release version of github.com/kubeflow/training-operator v1.7.0-xxx-xxxx
and create a Followup pr to switch to a release version when available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will do b
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
type PyTorchJob struct { | ||
*kftraining.PyTorchJob | ||
*kubeflowjob.KubeflowJob | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make Object()
part of the KFJobControl
you could remove the kftraining.PyTorchJob
member, and potentially change this to :
type PyTorchJob struct { | |
*kftraining.PyTorchJob | |
*kubeflowjob.KubeflowJob | |
} | |
type PyTorchJob kubeflowjob.KubeflowJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Composition, we need to use an embedded struct. So I will modify this in the following:
type PyTorchJob struct {
*kubeflowjob.KubeflowJob
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should work the same way, it is used like this to "make" the generic job out of all integrations we currently have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trasc Does that mean we should use the type alias instead of the embedded struct?
If doing so, it means that we remove kubeflowjob.KubeflowJob
, and then in all kubeflow integrations (TFJob, PaddleJob, ...), we declare the same processes such as RunWithPodSetsInfo
, RestorePodSetsInfo
and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about it , you could do the same for GetGVK()
and might be able to make KubeflowJob
generic and
var NewReconciler = jobframework.NewGenericReconciler(func() jobframework.GenericJob {
- pytorchJobObj := &kftraining.PyTorchJob{}
- return &PyTorchJob{PyTorchJob: pytorchJobObj, KubeflowJob: kubeflowjob.NewKubeflowJob((*JobControl)(pytorchJobObj))}
+ return &kftraining.PyTorchJob{}
}, nil)
...
-type PyTorchJob struct {
- *kftraining.PyTorchJob
- *kubeflowjob.KubeflowJob
-}
+ type PyTorchJob KubeflowJob[JobControl]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Object is part of KFJobControl, then you can change
func (j *KubeflowJob) Object() client.Object {
- return nil
- return j.KFJobControl.Object()
}
Then you don't need the *kftraining.PyTorchJob memeber.Then you can simplify the struct, and also make it a type "redefinition" type PyTorchJob kubeflowjob.KubeflowJob.
Thanks for the clarifications.
If doing so, I think type PyTorchJob kubeflowjob.KubeflowJob
doesn't implement the GenericJob
interface, right?
Although type JobControl kftraining.PyTorchJob
implement the GenericJob
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tooo many interfaces :)... ,
To keep it simple I think kubeflowjob.KubeflowJob
should have all the common code for kubeflow(kftraining)) Jobs implementing GenericJob
. and use KFJobControl
to get it's job (eg. kftraining.PyTorchJob
) specific bits.
To avoid confusion, KFJobControl
should be a named member, don't embed.
type KubeflowJob struct {
c KFJobControl
}
The reconciler setup could look like:
var NewReconciler = jobframework.NewGenericReconciler(func() jobframework.GenericJob {
return &kubeflowjob.KubeflowJob{c: &JobControl{}}
}, nil)
type JobControl kftraining.PyTorchJob
var _ kubeflowjob.KFJobControl = (*JobControl)(nil)
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm. Actually, I adopted the same pattern you suggested before submitting this PR. Then, I switched implementation to now one since this one is more reusable.
However, I will roll back my implementation to your suggested one since if @trasc, who has much understanding of the kueue, is confused by this implementation, then almost everyone will be confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/controller/jobs/kubeflow/jobs/pytorchjob/pytorchjob_controller.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing header
also I think we could just import
"sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/pytorchjob"
in
pkg/controller/jobs/jobs.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, we will add more framework support. So, I would like to put the linking for kubeflow here.
// Reference the job framework integration packages to ensure linking.
import (
_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/pytorchjob"
_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/tfjob"
_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/paddlejob"
_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/xgboostjob"
_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs/mxjob"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing header
Done.
@trasc Yes. We will support TFJob, MXJob, XGboostJob, and PaddleJob. |
/cc |
3589969
to
fc23a88
Compare
66222ab
to
8fe23aa
Compare
8fe23aa
to
b384a36
Compare
@trasc This PR is ready for review. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/assign @alculquicondor |
pkg/controller/jobs/kubeflow/kubeflowjob/kubeflowjob_controller.go
Outdated
Show resolved
Hide resolved
b384a36
to
37201a3
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
86c5b32
to
5fe074b
Compare
The kueue-controller-manager seems suddenly can not access the kube-apiserver: |
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
5fe074b
to
775ca92
Compare
Squashed. |
/lgtm |
/release-note-edit
/hold cancel |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Support kubeflow.org/pytorchjob.
Which issue(s) this PR fixes:
Part-of #297
Special notes for your reviewer:
Does this PR introduce a user-facing change?