
fix: Allow for multi node training for accelerated moe #129

Merged
merged 1 commit into foundation-model-stack:main from mn-sharedmoe-final on Feb 27, 2025

Conversation

@kmehant (Collaborator) commented Feb 23, 2025

The current implementation uses the global rank of the process to build the device index, which does not work in a multi-node setting. We therefore need to use the local rank, since devices are not indexed continuously across nodes.
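
For context, a minimal sketch of the device-index issue (illustrative only; the helper name pick_device and the 2-node / 8-GPU figures are assumptions, not code from this repository). With torchrun, LOCAL_RANK is the rank of the process within its node, while torch.distributed.get_rank() is the global rank across all nodes, so only the former is a valid CUDA device index on every node:

    # Illustrative sketch: choosing a device index in a multi-node run.
    import os
    import torch
    import torch.distributed as dist

    def pick_device() -> torch.device:
        if not dist.is_initialized():
            return torch.device("cuda", 0)

        global_rank = dist.get_rank()               # e.g. 0..15 over 2 nodes x 8 GPUs
        local_rank = int(os.environ["LOCAL_RANK"])  # 0..7 on every node

        # Indexing devices by global_rank breaks on the second node, where
        # ranks 8..15 have no matching cuda:8..cuda:15 devices; local_rank
        # is always a valid device index on its node.
        return torch.device("cuda", local_rank)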

@kmehant changed the title from "Allow for multi node training for accelerated moe" to "fix: Allow for multi node training for accelerated moe" on Feb 23, 2025
@kmehant marked this pull request as ready for review on February 23, 2025 19:14
@kmehant requested a review from fabianlim as a code owner on February 23, 2025 19:14
@fabianlim (Contributor) commented:

@kmehant I understand the fix, but can you update the description for record-keeping purposes?

@fabianlim requested a review from willmj on February 24, 2025 02:55
@kmehant (Collaborator, Author) commented Feb 24, 2025

#129 (comment)

@fabianlim Apologies for missing that, I have added it.

@fabianlim (Contributor) left a review comment:

LGTM, but one suggestion.

@@ -65,7 +66,7 @@ def augmentation(
     rank, world_size = 0, 1
     if torch.distributed.is_initialized():
         world_size = torch.distributed.get_world_size()
-        rank = torch.distributed.get_rank()
+        rank = int(os.environ["LOCAL_RANK"])
@fabianlim (Contributor) commented on the change above:

Can we make it consistent and follow the new style?

Suggested change:

-        rank = int(os.environ["LOCAL_RANK"])
+        # we do not need to use the fallback as this is wrapped in an `is_initialized` block
+        rank = torch.distributed.get_node_local_rank()

@kmehant (Collaborator, Author) commented Feb 24, 2025:

@fabianlim I have included this suggestion, thanks.
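
For reference, a hedged sketch of how the block reads with the suggestion applied (paraphrased from the diff and suggestion above, not the exact file contents); torch.distributed.get_node_local_rank() is available in recent PyTorch releases and returns the rank of the process within its node:

    import torch

    rank, world_size = 0, 1
    if torch.distributed.is_initialized():
        world_size = torch.distributed.get_world_size()
        # per the review suggestion: no fallback argument is needed because
        # this branch only runs inside the `is_initialized` check
        rank = torch.distributed.get_node_local_rank()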

@kmehant force-pushed the mn-sharedmoe-final branch 3 times, most recently from 1bb2f8c to 548b710 on February 24, 2025 06:43

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
@kmehant (Collaborator, Author) commented Feb 24, 2025

@fabianlim requesting your merge.

@fabianlim (Contributor) commented:

@kmehant let's have @willmj look at it first.

@fabianlim merged commit 791bdd9 into foundation-model-stack:main on Feb 27, 2025
7 checks passed