[deepspeed pipe] expand the partitioning method to support weights #186

Open
stas00 opened this issue Nov 9, 2021 · 2 comments
Labels
Good First Issue Good for newcomers

Comments

@stas00
Contributor

stas00 commented Nov 9, 2021

We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a partition_method like type:embed:2|transformer:1, or something like that. The embed layers would then get 2x the partitioning weight of a transformer layer, so each embedding gets its own stage and all stages end up more balanced.

For context please see: #166 (comment)

It's actually not complicated at all. It's just a simple weighting scheme.

Let's look at the partitioning weights produced by the code I linked in the first paragraph:

with 4 layers and 4 gpus

  1. type:transformer: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0] gets partitioned as [0, 0, 0, 1], [1], [1], [1, 0, 0, 0, 0]
  2. type:embed|transformer: [0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0] gets partitioned as [0, 1, 0, 1], [1], [1], [1, 0, 0, 1, 0] (or something similar, I haven't validated),

but what we want is this:

the initial weights should be: [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0], which should now get partitioned as [0, 2], [0, 1, 1], [1, 1], [0, 0, 2, 0]

(note: I'm not exactly sure where the 0's belong, it should be easy to see with print debug or debugger)
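To make the weighting idea concrete, here is a minimal sketch of a contiguous partitioner that keeps the heaviest stage as light as possible. It is not DeepSpeed's implementation: `partition_weighted` is a hypothetical stand-in for whatever balanced partitioner the linked code calls, and where it places the zero-weight layers may differ from the grouping written above. The second list re-costs the binary-weight split from item 2 under the assumption that each embedding really costs 2.

```python
from typing import List


def partition_weighted(weights: List[int], num_parts: int) -> List[List[int]]:
    """Split weights into num_parts contiguous, non-empty groups,
    keeping the heaviest group as light as possible (hypothetical helper)."""
    if not 0 < num_parts <= len(weights):
        raise ValueError("need between 1 and len(weights) parts")

    def groups_needed(limit: int) -> int:
        # Greedy count of contiguous groups when no group may exceed `limit`.
        count, running = 1, 0
        for w in weights:
            if running + w > limit:
                count, running = count + 1, w
            else:
                running += w
        return count

    # Binary search for the smallest per-group limit that fits in num_parts.
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(mid) <= num_parts:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the groups greedily under that limit, forcing a split near the
    # end if that is the only way to still open every remaining group.
    groups: List[List[int]] = [[]]
    for i, w in enumerate(weights):
        items_left = len(weights) - i
        groups_to_open = num_parts - len(groups)
        over_limit = sum(groups[-1]) + w > lo
        must_split = items_left <= groups_to_open
        if groups[-1] and (over_limit or must_split):
            groups.append([])
        groups[-1].append(w)
    return groups


# Weighted scheme from the issue: embed = 2, transformer = 1, everything else = 0.
weighted = [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0]
stages = partition_weighted(weighted, num_parts=4)
print([sum(s) for s in stages])  # [2, 2, 2, 2]

# The binary-weight split quoted in item 2, re-costed with embed = 2.
binary_split_recosted = [[0, 2, 0, 1], [1], [1], [1, 0, 0, 2, 0]]
print([sum(s) for s in binary_split_recosted])  # [3, 1, 1, 3]
```

With the weights of 2 in place every stage carries the same load, whereas the binary-weight split leaves the first and last stages three times heavier than the middle ones.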

For context: the 250k dict for mt5 has a huge embedding. It's 2x bigger than a single layer (in 104B), which is why we want an embedding to have its own stage and every 2 layers to share another stage.

This works out exactly in the case of 60 layers, 2 embeddings and 32 pipe stages: the total weight is 60*1 + 2*2 = 64, i.e. 2 per stage.

and once we are happy we can contribute this to deepspeed.

p.s. need to think about the best syntax to use, probably weighted_type:embed:2|transformer:1
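As a rough sketch of how that syntax could be turned into per-layer weights: the snippet below parses the spec and matches each pattern against layer class names with a case-insensitive regex search. Both functions and the layer names are hypothetical and only illustrate the idea; nothing here is existing DeepSpeed API.

```python
import re
from typing import Dict, List


def parse_weighted_type(method: str) -> Dict[str, int]:
    """Parse 'weighted_type:embed:2|transformer:1' into {'embed': 2, 'transformer': 1}."""
    prefix = "weighted_type:"
    if not method.startswith(prefix):
        raise ValueError(f"expected a method starting with {prefix!r}")
    spec: Dict[str, int] = {}
    for item in method[len(prefix):].split("|"):
        # Split on the last colon so the pattern itself may contain colons.
        pattern, _, weight = item.rpartition(":")
        if not pattern:
            raise ValueError(f"missing weight in {item!r}")
        spec[pattern] = int(weight)
    return spec


def layer_weights(layer_names: List[str], method: str) -> List[int]:
    """Give each layer the weight of the first pattern matching its name, else 0."""
    spec = parse_weighted_type(method)
    weights = []
    for name in layer_names:
        weight = 0
        for pattern, value in spec.items():
            if re.search(pattern, name, re.IGNORECASE):
                weight = value
                break
        weights.append(weight)
    return weights


# Hypothetical class names, chosen only to reproduce the weights in this issue.
names = (
    ["Lambda", "EmbeddingPipe", "Lambda"]
    + ["TransformerLayer"] * 4
    + ["Lambda", "LayerNorm", "EmbeddingPipe", "Lambda"]
)
print(layer_weights(names, "weighted_type:embed:2|transformer:1"))
# [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0]
```

Using a separate weighted_type: prefix keeps the existing type: behaviour untouched, so a plain type:transformer would still mean binary weights.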

@jaketae
Member

jaketae commented Nov 20, 2021

Would this involve creating a PR on the upstream?

@stas00
Contributor Author

stas00 commented Nov 21, 2021

This could be done with monkey patching first and then later added upstream.

I'm just not sure we should start working on it until this issue is fixed: microsoft/DeepSpeed#1522.

As I commented in #166 (comment) we could use BNB to compensate for ZeRO1. But BNB has issues as well at the moment.

Meanwhile it was proposed to use a 150k vocab instead of 250k. I am going to see how it scales in the next few days and we will know if this is required or not. So I will update this Issue once I have more information.

thank you.
