Enable 2D sharding #17
Conversation
LGTM, nice one Jiewen!
data_model_mesh = xs.Mesh(device_ids, (data, mod))
model_data_mesh = xs.Mesh(device_ids, (mod, data))
Can you try with HybridMesh? It should provide some performance gain, but I haven't actually benchmarked the difference. This applies both here and in modeling_llama.py. @khatwanimohit may have some benchmarked differences from the simple shardings.py script.
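For reference, a minimal sketch of what the suggested switch might look like, assuming the HybridMesh class from torch_xla's SPMD module; the import path and keyword names vary across torch_xla releases, so treat this as illustrative rather than the PR's implementation:

```python
import torch_xla.distributed.spmd as xs  # older releases: torch_xla.experimental.xla_sharding

data, mod = 4, 2  # hypothetical split of 8 devices into (data, model)

# HybridMesh arranges the mesh axes along the physical ICI topology, which is
# where the expected speedup over a plain logical xs.Mesh would come from.
data_model_mesh = xs.HybridMesh(ici_mesh_shape=(data, mod))
model_data_mesh = xs.HybridMesh(ici_mesh_shape=(mod, data))
```

Since HybridMesh subclasses Mesh, the existing xs.mark_sharding calls should work unchanged.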
Let me do that. I always forget.
Fixed that too.
elif 'gate_proj' in name or 'up_proj' in name:
    xs.mark_sharding(param, data_model_mesh, range(len(param.shape)))
elif 'down_proj' in name:
    xs.mark_sharding(param, model_data_mesh, range(len(param.shape)))
Just for my understanding: I noticed that HF shards gate_proj and up_proj on the 0th dim and down_proj on the 1st dim, but here you're sharding gate and up on the data_model mesh, which places the model axis on dim 1. Is this just a difference in 1D and 2D sharding?
That's a good catch. I don't know; let me dig into it. I'm following the slides attached at the top of the spreadsheet.
No worries! I was just curious; using the sharding from the slides makes sense.
Yea, you are right. I have corrected the error.
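For readers following along, here is a small illustrative sketch (not code from this PR) of the weight-layout convention the question above refers to. The hidden and intermediate sizes are assumed, LLaMA-7B-like values; the point is only that torch.nn.Linear stores its weight as (out_features, in_features), so the column-parallel split HF uses for gate_proj/up_proj shards dim 0, while the row-parallel split for down_proj shards dim 1:

```python
import torch.nn as nn

hidden, intermediate = 4096, 11008  # assumed sizes, for illustration only

gate_proj = nn.Linear(hidden, intermediate, bias=False)
down_proj = nn.Linear(intermediate, hidden, bias=False)

# nn.Linear weights are stored as (out_features, in_features):
print(gate_proj.weight.shape)  # torch.Size([11008, 4096]) -> model axis on dim 0
print(down_proj.weight.shape)  # torch.Size([4096, 11008]) -> model axis on dim 1

# So with a 2D (data, model) device mesh, putting the model axis on dim 0 of
# gate_proj/up_proj and on dim 1 of down_proj mirrors HF's 1D column-/row-
# parallel convention, which is what the thread above converged on.
```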
Thanks, Jon, for approving the pull request.
Summary: This pull request fixes a bug in #17 where it forgot to guard 2D sharding for activations and inputs. Test Plan: N/A.
Summary:
This pull request adds 2D SPMD sharding to the table. It will shard both weights and activations. Here is the sharding strategy.
Let's say we have a 2D mesh (data, model) and data x model == num_devices:
1. input (data, None, model)
2. embedding (model, data)
3. attn QKV (data, model)
4. attn O (model, data)
5. mlp gate, up (model, data)
6. mlp down (data, model)
7. activation (data, None, model)
Currently you can specify the model dimension using a new option, --spmd_2d_sharding; the data dimension will then be auto-calculated.
TODO: maybe we should add another option to specify whether we should shard the activations/inputs at all, or shard them differently.
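To make the strategy above concrete, here is a minimal sketch against torch_xla's SPMD API (xs.Mesh and xs.mark_sharding). The device count (8), the tensor shapes, and the --spmd_2d_sharding value of 2 are illustrative assumptions rather than values from this repository, and the import paths vary between torch_xla releases:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs  # older releases: torch_xla.experimental.xla_sharding

xr.use_spmd()                    # enable SPMD execution mode (recent torch_xla releases)

num_devices = 8                  # assumed; normally queried from the runtime
model = 2                        # value passed via --spmd_2d_sharding (assumed)
data = num_devices // model      # data dimension is auto-calculated

device_ids = np.arange(num_devices)
data_model_mesh = xs.Mesh(device_ids, (data, model))  # axes: (data, model)
model_data_mesh = xs.Mesh(device_ids, (model, data))  # axes: (model, data)

device = xm.xla_device()

# Weights: 2D tensors are sharded across both mesh axes.
embedding = torch.empty(32000, 4096, device=device)
xs.mark_sharding(embedding, model_data_mesh, (0, 1))         # embedding: (model, data)

qkv_weight = torch.empty(4096, 4096, device=device)
xs.mark_sharding(qkv_weight, data_model_mesh, (0, 1))        # attn QKV: (data, model)

down_weight = torch.empty(4096, 11008, device=device)
xs.mark_sharding(down_weight, data_model_mesh, (0, 1))       # mlp down: (data, model)

# Inputs/activations: 3D tensors keep the middle (sequence) dim replicated.
activation = torch.empty(16, 2048, 4096, device=device)
xs.mark_sharding(activation, data_model_mesh, (0, None, 1))  # (data, None, model)
```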