
[ray] launch multiple GPU with ray #396

Merged: 10 commits merged into xdit-project:main on Dec 20, 2024
Conversation

lihuahua123 (Contributor):

Support Ray to start the pipeline

@lihuahua123 lihuahua123 force-pushed the main branch 2 times, most recently from 475668a to 61101c3 Compare December 17, 2024 12:31
xfuser/worker/worker.py — 2 review threads, resolved
@feifeibear (Collaborator):

This PR implements multi-process launching via Ray. Following vLLM, it uses a RayGPUExecutor to manage multiple workers, and each worker runs the diffusers pipeline logic.

At the moment, this launch method differs too much in usage from the torchrun-launched program (example.py).

I suggest designing a Ray-distributed version of DiffusionPipeline, a RayDiffusionPipeline, that provides from_pretrained, forward, and similar interfaces.

The PR also hardcodes a few things, such as the handling of text_encoder during model initialization. Since text_encoder is currently not sharded across GPUs, each worker can simply load text_encoder redundantly. Please keep the interface as consistent with torchrun as possible.

Resolved review threads on: setup.py, xfuser/config/args.py, examples/run.sh, tests/executor/test_ray.py, xfuser/executor/gpu_executor.py, xfuser/worker/worker.py, tests/executor/ray_run.sh
@feifeibear feifeibear changed the title [WIP] Ray Support [WIP] launch multiple GPU with ray Dec 20, 2024
# output is a list of results from each worker; we take the last one
for i, image in enumerate(output[-1].images):
    image.save(
        f"/data/results/{model_name}_result_{i}.png"
    )
Collaborator:
save to a relative path ./results/xxx
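A relative-path version of the save loop could look like the sketch below. The `_FakeImage` stand-in and the `model_name` value are assumptions so the example runs without a real pipeline output; the real code would iterate over `output[-1].images`.

```python
import os

class _FakeImage:
    """Stand-in for a PIL image returned by the pipeline (hypothetical)."""
    def save(self, path):
        with open(path, "w") as f:
            f.write("png")

model_name = "flux"            # hypothetical model name
images = [_FakeImage(), _FakeImage()]

# Save under a relative ./results directory instead of a hardcoded /data path.
os.makedirs("results", exist_ok=True)
paths = []
for i, image in enumerate(images):
    path = os.path.join("results", f"{model_name}_result_{i}.png")
    image.save(path)
    paths.append(path)
```

Using a relative directory (created with `exist_ok=True`) keeps the example portable across machines, which is what the review comment asks for.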

@@ -188,6 +192,9 @@ class ParallelConfig:
sp_config: SequenceParallelConfig
pp_config: PipeFusionParallelConfig
tp_config: TensorParallelConfig
distributed_executor_backend: Optional[str] = None
world_size: int = 1 # FIXME: remove this
worker_cls: str = "xfuser.ray.worker.worker.Worker"
Collaborator:

Do we need distributed_executor_backend and worker_cls?

Contributor (Author):

We don't need distributed_executor_backend, but we do need worker_cls so that Ray can instantiate the worker from its class name:

def init_worker(self, *args, **kwargs):
    worker_class = resolve_obj_by_qualname(self.worker_cls)
    self.worker = worker_class(*args, **kwargs)
    assert self.worker is not None
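The helper resolve_obj_by_qualname is not shown in this thread; a minimal sketch of such a helper, assuming it takes a dotted `module.ClassName` path like `"xfuser.ray.worker.worker.Worker"`, might be:

```python
import importlib

def resolve_obj_by_qualname(qualname: str):
    """Resolve a dotted path like 'pkg.module.ClassName' to the object it
    names: import the module portion, then look up the final attribute."""
    module_name, _, obj_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, obj_name)

# Usage: resolve a stdlib class by its qualified name.
cls = resolve_obj_by_qualname("collections.OrderedDict")
print(cls.__name__)  # → OrderedDict
```

This is why worker_cls is kept as a string in the config: Ray actors are created remotely, so passing a qualified name and importing it inside each worker process avoids pickling the class itself.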

@feifeibear (Collaborator) left a review:

LGTM

@feifeibear feifeibear changed the title [WIP] launch multiple GPU with ray [ray] launch multiple GPU with ray Dec 20, 2024
@feifeibear feifeibear marked this pull request as ready for review December 20, 2024 07:26
@feifeibear feifeibear merged commit f58302a into xdit-project:main Dec 20, 2024
3 checks passed
3 participants