[Discussion] Do we need the kubeflow dependency #341

ahg-g · 2021-03-17T20:30:14Z

I am wondering if mpi-operator should be decoupled from kubeflow since kubeflow is not really a library? Is there any fundamental dependency on kubeflow?

cc @terrytangyuan @alculquicondor @kubeflow/wg-training-leads

terrytangyuan · 2021-03-17T20:37:08Z

For MPI operator specifically, there isn't much dependency on Kubeflow except that it's based on the common interface defined in https://github.com/kubeflow/common. Implementation is relatively independent.

@kubeflow/wg-training-leads

ahg-g · 2021-03-17T21:05:36Z

The kubeflow dependency is confusing, the shared API elements seem unnecessary, in our opinion. Removing the dependency would allow us to simplify the API and evolve it differently from kubeflow, to better tailor MPI jobs.

I would like to suggest moving the repo to a sig-kubernetes repo under sig apps, the benefits are:

Potentially increase adoption and perhaps unify the community around a standard way for deploying MPI on k8s. Getting sig apps expertise and support will help a lot in that regard.
Align with core k8s project standards in terms of api reviews, release and testing. We will not get that by default, but being in the core k8s repo will get us access to the tools that allows us to make the operator k8s level production ready.

terrytangyuan · 2021-03-17T23:08:56Z

That definitely sounds like a good/interesting option to me. It's more ambitious to make MPIJob more general and useful to a wider range of users. Could you point me to a specific location where this might fit? A couple of concerns/questions from my side:

Are there potential issues/blockers to migrate the repo from legal perspective?
There are currently users who consume this repo as a library or develop based on a fork of this repo. Is migrating the repo going to affect that and break their existing usages? Can we perhaps start with something minimal and fresh in a sig-k8s apps repo and then iterate from there?
Will we reach agreement among @kubeflow/wg-training-leads on this? I agree that MPI operator can be more generic and can benefit non-ML users and currently MPI operator is the only one that's less framework-specific and rather lightweight within Kubeflow.

gaocegege · 2021-03-18T02:04:14Z

Will we reach agreement among @kubeflow/wg-training-leads on this? I agree that MPI operator can be more generic and can benefit non-ML users and currently MPI operator is the only one that's less framework-specific and rather lightweight within Kubeflow.

I think it works. But we should discuss how to deal with some Horovod specific features in the operator.

ahg-g · 2021-03-18T03:30:44Z

There are currently users who consume this repo as a library or develop based on a fork of this repo. Is migrating the repo going to affect that and break their existing usages? Can we perhaps start with something minimal and fresh in a sig-k8s apps repo and then iterate from there?

That would be my preference, we can work on a new API and also avoid any potential legal issues. But if we decide to start fresh, I would suggest that we freeze this repo to avoid fragmentation and focus the community on one effort.

gaocegege · 2021-03-18T11:58:36Z

Should we discuss it in the Training WG community meeting?

/cc @andreyvelich @zw0610

zw0610 · 2021-03-18T12:13:23Z

Should we discuss it in the Training WG community meeting?

/cc @andreyvelich @zw0610

Definitely. I would like also hear more user cases and requests about running generic MPI jobs on Kubernetes.

ahg-g · 2021-03-18T13:50:55Z

Sounds good, lets discuss it in the next meeting, did yesterday meeting happen?

I would like also hear more user cases and requests about running generic MPI jobs on Kubernetes.

You mean outside ML training?

andreyvelich · 2021-03-18T14:07:58Z

We can discuss this in the next AutoML and Training WG meeting.
Check the calendar: https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ.

Also, I think it would be great if we can high light this on the Kubeflow community meeting, since we have more Kubeflow members there.
What do you think @gaocegege @terrytangyuan @ahg-g ?

ahg-g · 2021-03-18T14:17:22Z

Thanks @andreyvelich, that sounds great! The wider the buy in the better.

terrytangyuan · 2021-03-28T03:10:01Z

Would anyone like to provide a summary of the discussion during community meeting here?

andreyvelich · 2021-03-29T13:23:55Z

Would anyone like to provide a summary of the discussion during community meeting here?

On the call we decided that @ahg-g will present this proposal on the Kubeflow community call.
Then, we can discuss with Kubeflow steering group how we can move forward.
@terrytangyuan Since you are one of the owners for the MPI Operator, it would be great if you could also join the discussion in the Kubeflow community call.

Thank you for driving this @ahg-g!

terrytangyuan · 2021-03-29T13:35:17Z

Thanks @ahg-g and @andreyvelich.

I will certainly retry to attend but it would be great if we can have a proposal publicly so others from different time zones can also participate to discuss asynchronously. At this point I think I am interested to know more details about the use cases of the proposal before exploring this further. We also need to think thoroughly on this since there are already many adopters/users of MPI Operator already so we want to reduce the impact if possible.

andreyvelich · 2021-03-29T13:40:24Z

I agree with it @terrytangyuan.
@ahg-g Please, can you prepare the proposal about this migration ?
You can explain in the proposal how that will help MPI Operator project to be successful.
As @terrytangyuan said, we can share this proposal across Kubeflow group and get the feedback.

ahg-g · 2021-03-29T13:41:40Z

Thanks all, do you have a specific format and repo for such proposals, like k8s KEP?

terrytangyuan · 2021-03-29T13:45:28Z

You can submit it to: https://github.com/kubeflow/mpi-operator/tree/master/proposals

Our current community proposal template is pretty simple but feel free to use other format/template if preferred.

alculquicondor · 2021-04-20T14:38:56Z

To give an update, I'm going to do a deep dive into the current state of the operator and present a complete proposal later.

thesuperzapper · 2021-04-29T01:56:59Z

Just wanted to weigh in with some thoughts:

Most Kubeflow components are "loosely coupled", so in my mind, the definition of "being part of kubeflow" is mostly about if a Kubeflow Working Group is responsible for it
The committers of the mpi-operator can decide where the best home for it is, but I think being under the Kubeflow brand certainly will bring more attention to the tool

gaocegege · 2021-04-29T02:41:01Z

/cc @zw0610

alculquicondor · 2021-05-17T20:37:08Z

Here is our proposal: #360

We decided not to move forward with the proposal of migrating mpi-operator to kubernetes-sigs.

terrytangyuan · 2021-05-18T18:59:18Z

Thanks for the proposal.

We decided not to move forward with the proposal of migrating mpi-operator to kubernetes-sigs.

Just curious, who are "we" and how was the decision made?

ahg-g · 2021-05-18T19:05:28Z

Thanks for the proposal.

We decided not to move forward with the proposal of migrating mpi-operator to kubernetes-sigs.

Just curious, who are "we"

Just the people who initially proposed to move it out (myself and Aldo)

and how was the decision made?

We sensed some resistance, and so we felt that it is probably more productive if we focus our efforts on collaborating and contributing to improve the operator where it exists.

alculquicondor · 2022-02-23T19:09:42Z

There is some renewed interest in moving the mpi-operator to (possibly) kubernetes-sigs, assuming that SIG Apps is willing to sponsor the repo.

The motivation is to encourage non-training users (like HPC) to use and contribute to it, without having to install or learn about kubeflow's training-operator.

Let us know what your thoughts are in favor or against this.

cc @ArangoGutierrez

rongou · 2022-02-23T19:30:12Z

FWIW I think it's a good idea. The current solution has always been a "temporary" one, eventually it'd be nice to implement a "native" solution (e.g. through PMIx #12).

alculquicondor · 2022-02-23T19:50:00Z

I agree. Using kubectl exec brings scalability problems for the control plane and ssh causes a lot of problems because of permissions, plus it's added communication overhead for the workload. It would be great to have PMIx support.

ArangoGutierrez · 2022-02-23T19:53:22Z

PMIx should continue to be the north start for the operator. in the mean time, moving it to k8s-sigs is a good move for the HPC non AI/ML community that also want to move some workflows to cloud environments.

ArangoGutierrez · 2022-02-23T19:55:15Z

I have been discussing with PMIx developers like @jjhursey on possible ways to develop a k8s PLM for prrte, so we can start building a real solution moving forward. but we are still on brainstorming conversations, noting to show yet

alculquicondor · 2022-02-23T20:01:45Z

Ok, I made the mistake of continue to mention PMIx on this thread. We should probably leave that discussion for #12.

ArangoGutierrez · 2022-02-23T20:06:21Z

TLDR: I agree and support donating the MPI-Operator to k8s-sigs. happy to lead the effort with whom ever want's to help

terrytangyuan · 2022-02-23T20:41:01Z

+1 to donate to a more generic/neutral place (e.g. K8s SIG Apps) where this project could be beneficial to more people, not limiting to ML-specific workloads. Especially given that now kubeflow/training-operator gets heavier, people should be given the option to only install MPI Operator for their use cases. I hope this direction could help attract more users and contributors to this project going forward.

ArangoGutierrez · 2022-03-07T21:54:28Z

In today's Mar-7 2022 SIG-APP call we discussed about this (https://github.com/kubernetes/community/tree/master/sig-apps#meetings)

TLDR: we have a go to start the process to donate the MPI-Operator to kubernetes-sigs under the sponsorship of sig-app, @soltysh will provide more details

zw0610 · 2022-03-08T00:58:10Z

I agree with @rongou that an implementation with PMIx would benefit generic batch jobs.

Originated as an operator for data-parallel distributed training, we would keep the existing implementation in training-operator, which helps deep learning researchers to launch horovod job. @kubeflow/wg-training-leads

@ArangoGutierrez Just for curiosity, will the donation process be like implement PMIx scheme in kubeflow/mpi-operator first and then transfer to kubernetes-sigs or transfer repo to kubernetes-sigs first and then implement PMIx scheme?

ArangoGutierrez · 2022-03-08T01:04:56Z

I would say the later

first donate
Stabilize the current code
cut a first release under k8s-sigs 0.4 or something (current is 0.3) or we can even start back at 0.1
have PMIx as a KEP
PoC PMIx KEP
Implement KEP

ArangoGutierrez · 2022-03-08T01:06:10Z

on the PMIx front

[mpi@hello-world-launcher ~]$ prun --pid 196 
hello-world-worker-0
hello-world-worker-1

PoC already under development :)

soltysh · 2022-03-08T12:41:36Z

In today's Mar-7 2022 SIG-APP call we discussed about this (https://github.com/kubernetes/community/tree/master/sig-apps#meetings)

TLDR: we have a go to start the process to donate the MPI-Operator to kubernetes-sigs under the sponsorship of sig-app, @soltysh will provide more details

Correct, SIG-Apps will be happy to sponsor mpi-operator as a subproject. For this to happen I'll work with @ArangoGutierrez and @alculquicondor to follow the process as described in https://github.com/kubernetes/community/blob/master/github-management/kubernetes-repositories.md#rules-for-donated-repositories

alculquicondor · 2022-03-08T15:09:55Z

@soltysh let's make sure the OWNERS are migrated as well https://github.com/kubeflow/mpi-operator/blob/master/OWNERS

we would keep the existing implementation in training-operator, which helps deep learning researchers to launch horovod job.

I suppose, long-term, you would use the kubernetes-sig implementation as a dependency, right?

ArangoGutierrez · 2022-03-08T15:29:17Z

@soltysh let's make sure the OWNERS are migrated as well https://github.com/kubeflow/mpi-operator/blob/master/OWNERS

we would keep the existing implementation in training-operator, which helps deep learning researchers to launch horovod job.

I suppose, long-term, you would use the kubernetes-sig implementation as a dependency, right?

question for @zw0610

soltysh · 2022-03-08T15:37:53Z

@soltysh let's make sure the OWNERS are migrated as well https://github.com/kubeflow/mpi-operator/blob/master/OWNERS

👍

alculquicondor · 2022-03-08T17:44:26Z

/retitle move mpi-operator to kubernetes-sigs

zw0610 · 2022-03-09T01:26:46Z

@soltysh let's make sure the OWNERS are migrated as well https://github.com/kubeflow/mpi-operator/blob/master/OWNERS

we would keep the existing implementation in training-operator, which helps deep learning researchers to launch horovod job.

I suppose, long-term, you would use the kubernetes-sig implementation as a dependency, right?

question for @zw0610

Seems not a question I can give a confirmative answer right now. The main concern comes from the elastic training feature, which will be the major point for the training-operator. (horovod-elastic requires horovodrun instead of mpirun to launch new training process on new workers.)

We can verify the compatibility between horovod-elastic and the PMIx-based mpi-operator after the later one is implemented. If things turn out to be working, I suppose we can use the PMIx-based solution as the dependency.

terrytangyuan · 2022-03-09T01:54:12Z

Great! @alculquicondor @soltysh Let’s kick off the process and let me know if there’s anything we can help.

gaocegege · 2022-03-09T03:00:20Z

/cc @richardsliu

theadactyl · 2022-03-09T18:35:15Z

Hi all, is there a formal proposal somewhere? I'd like to see the overall proposal, reasoning, implications, and plan.

alculquicondor · 2022-03-09T18:57:37Z

Is there a template we can use? Otherwise I could summarize here. Or maybe a new issue is preferred?

ArangoGutierrez · 2022-03-09T19:02:10Z

I think a new issue with checklist/milestones is preferred

ArangoGutierrez · 2022-03-09T19:04:32Z

#459

ArangoGutierrez · 2022-03-11T22:58:26Z

kubeflow/community#557

ArangoGutierrez mentioned this issue Mar 9, 2022

Donate repo to kubernetes-sigs #459

Closed

10 tasks

terrytangyuan mentioned this issue Apr 25, 2024

REQUEST: New membership for @terrytangyuan kubernetes/org#4904

Closed

11 tasks

[Discussion] Do we need the kubeflow dependency #341

[Discussion] Do we need the kubeflow dependency #341

Comments

ahg-g commented Mar 17, 2021

terrytangyuan commented Mar 17, 2021

ahg-g commented Mar 17, 2021

terrytangyuan commented Mar 17, 2021 • edited Loading

gaocegege commented Mar 18, 2021

ahg-g commented Mar 18, 2021

gaocegege commented Mar 18, 2021

zw0610 commented Mar 18, 2021

ahg-g commented Mar 18, 2021

andreyvelich commented Mar 18, 2021

ahg-g commented Mar 18, 2021

terrytangyuan commented Mar 28, 2021

andreyvelich commented Mar 29, 2021

terrytangyuan commented Mar 29, 2021

andreyvelich commented Mar 29, 2021

ahg-g commented Mar 29, 2021

terrytangyuan commented Mar 29, 2021

alculquicondor commented Apr 20, 2021

thesuperzapper commented Apr 29, 2021 • edited Loading

gaocegege commented Apr 29, 2021

alculquicondor commented May 17, 2021

terrytangyuan commented May 18, 2021

ahg-g commented May 18, 2021

alculquicondor commented Feb 23, 2022 • edited Loading

rongou commented Feb 23, 2022

alculquicondor commented Feb 23, 2022

ArangoGutierrez commented Feb 23, 2022

ArangoGutierrez commented Feb 23, 2022

alculquicondor commented Feb 23, 2022

ArangoGutierrez commented Feb 23, 2022

terrytangyuan commented Feb 23, 2022 • edited Loading

ArangoGutierrez commented Mar 7, 2022

zw0610 commented Mar 8, 2022

ArangoGutierrez commented Mar 8, 2022 • edited Loading

ArangoGutierrez commented Mar 8, 2022

soltysh commented Mar 8, 2022

alculquicondor commented Mar 8, 2022

ArangoGutierrez commented Mar 8, 2022

soltysh commented Mar 8, 2022

alculquicondor commented Mar 8, 2022

zw0610 commented Mar 9, 2022

terrytangyuan commented Mar 9, 2022

gaocegege commented Mar 9, 2022

theadactyl commented Mar 9, 2022

alculquicondor commented Mar 9, 2022

ArangoGutierrez commented Mar 9, 2022

ArangoGutierrez commented Mar 9, 2022

ArangoGutierrez commented Mar 11, 2022

terrytangyuan commented Mar 17, 2021 •

edited

Loading

thesuperzapper commented Apr 29, 2021 •

edited

Loading

alculquicondor commented Feb 23, 2022 •

edited

Loading

terrytangyuan commented Feb 23, 2022 •

edited

Loading

ArangoGutierrez commented Mar 8, 2022 •

edited

Loading