[Discussion] Do we need the kubeflow dependency #341
For MPI operator specifically, there isn't much dependency on Kubeflow except that it's based on the common interface defined in https://github.com/kubeflow/common. Implementation is relatively independent. @kubeflow/wg-training-leads |
In our opinion, the kubeflow dependency is confusing and the shared API elements seem unnecessary. Removing the dependency would allow us to simplify the API and evolve it independently of kubeflow, to better tailor it to MPI jobs. I would like to suggest moving the repo to a kubernetes-sigs repo under SIG Apps; the benefits are:
|
That definitely sounds like a good/interesting option to me. It's more ambitious to make MPIJob more general and useful to a wider range of users. Could you point me to a specific location where this might fit? A couple of concerns/questions from my side:
|
I think it works. But we should discuss how to deal with some Horovod specific features in the operator. |
That would be my preference: we can work on a new API and also avoid any potential legal issues. But if we decide to start fresh, I would suggest that we freeze this repo to avoid fragmentation and focus the community on one effort. |
Should we discuss it in the Training WG community meeting? /cc @andreyvelich @zw0610 |
Definitely. I would also like to hear more use cases and requests about running generic MPI jobs on Kubernetes. |
Sounds good, let's discuss it in the next meeting. Did yesterday's meeting happen?
You mean outside ML training? |
We can discuss this in the next AutoML and Training WG meeting. Also, I think it would be great if we could highlight this at the Kubeflow community meeting, since we have more Kubeflow members there. |
Thanks @andreyvelich, that sounds great! The wider the buy in the better. |
Would anyone like to provide a summary of the discussion during community meeting here? |
On the call we decided that @ahg-g will present this proposal on the Kubeflow community call. Thank you for driving this @ahg-g! |
Thanks @ahg-g and @andreyvelich. I will certainly try to attend, but it would be great if we could have a public proposal so that others in different time zones can also participate and discuss asynchronously. At this point I am interested in learning more about the use cases behind the proposal before exploring this further. We also need to think this through carefully, since MPI Operator already has many adopters/users and we want to minimize the impact on them. |
I agree with @terrytangyuan. |
Thanks all, do you have a specific format and repo for such proposals, like k8s KEP? |
You can submit it to: https://github.com/kubeflow/mpi-operator/tree/master/proposals Our current community proposal template is pretty simple but feel free to use other format/template if preferred. |
To give an update, I'm going to do a deep dive into the current state of the operator and present a complete proposal later. |
Just wanted to weigh in with some thoughts:
|
/cc @zw0610 |
Here is our proposal: #360 We decided not to move forward with the proposal of migrating mpi-operator to kubernetes-sigs. |
Thanks for the proposal.
Just curious, who are "we" and how was the decision made? |
Just the people who initially proposed to move it out (myself and Aldo).
We sensed some resistance, so we felt it would probably be more productive to focus our efforts on collaborating and contributing to improve the operator where it exists. |
There is some renewed interest in moving the mpi-operator to (possibly) kubernetes-sigs, assuming that SIG Apps is willing to sponsor the repo. The motivation is to encourage non-training users (like HPC) to use and contribute to it, without having to install or learn about kubeflow's training-operator. Let us know what your thoughts are in favor or against this. |
FWIW I think it's a good idea. The current solution has always been a "temporary" one, eventually it'd be nice to implement a "native" solution (e.g. through PMIx #12). |
I agree. Using |
PMIx should continue to be the north star for the operator. In the meantime, moving it to k8s-sigs is a good move for the HPC (non-AI/ML) community that also wants to move some workflows to cloud environments. |
I have been discussing with PMIx developers like @jjhursey possible ways to develop a k8s PLM for prrte, so we can start building a real solution moving forward. But we are still at the brainstorming stage, with nothing to show yet. |
OK, I made the mistake of continuing to mention PMIx on this thread. We should probably leave that discussion for #12. |
TLDR: I agree with and support donating the MPI-Operator to k8s-sigs. Happy to lead the effort with whoever wants to help. |
+1 to donating to a more generic/neutral place (e.g. K8s SIG Apps) where this project could benefit more people and not be limited to ML-specific workloads. Especially given that kubeflow/training-operator is getting heavier, people should be given the option to install only MPI Operator for their use cases. I hope this direction helps attract more users and contributors to this project going forward. |
In today's (Mar 7, 2022) SIG Apps call we discussed this (https://github.com/kubernetes/community/tree/master/sig-apps#meetings). TLDR: we have a go to start the process of donating the MPI-Operator to kubernetes-sigs under the sponsorship of SIG Apps; @soltysh will provide more details. |
I agree with @rongou that an implementation with PMIx would benefit generic batch jobs. Since the project originated as an operator for data-parallel distributed training, we would keep the existing implementation in training-operator, which helps deep learning researchers launch Horovod jobs. @kubeflow/wg-training-leads @ArangoGutierrez Just out of curiosity, will the donation process be like |
I would say the latter
|
On the PMIx front, a PoC is already under development :)
[mpi@hello-world-launcher ~]$ prun --pid 196
hello-world-worker-0
hello-world-worker-1 |
Correct, SIG-Apps will be happy to sponsor mpi-operator as a subproject. For this to happen I'll work with @ArangoGutierrez and @alculquicondor to follow the process as described in https://github.com/kubernetes/community/blob/master/github-management/kubernetes-repositories.md#rules-for-donated-repositories |
@soltysh let's make sure the OWNERS are migrated as well https://github.com/kubeflow/mpi-operator/blob/master/OWNERS
I suppose, long-term, you would use the kubernetes-sig implementation as a dependency, right? |
question for @zw0610 |
👍 |
/retitle move mpi-operator to kubernetes-sigs |
I can't give a definitive answer to that right now. The main concern is the elastic training feature, which will be the major point for the training-operator. (horovod-elastic requires We can verify the compatibility between horovod-elastic and the PMIx-based mpi-operator once the latter is implemented. If things turn out to work, I suppose we can use the PMIx-based solution as the dependency. |
Great! @alculquicondor @soltysh Let’s kick off the process, and let me know if there’s anything we can help with. |
/cc @richardsliu |
Hi all, is there a formal proposal somewhere? I'd like to see the overall proposal, reasoning, implications, and plan. |
Is there a template we can use? Otherwise I could summarize here. Or maybe a new issue is preferred? |
I think a new issue with checklist/milestones is preferred |
I am wondering if mpi-operator should be decoupled from kubeflow since kubeflow is not really a library? Is there any fundamental dependency on kubeflow?
cc @terrytangyuan @alculquicondor @kubeflow/wg-training-leads