update readme with note on mixed precision
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim committed Aug 27, 2024
1 parent 59907e3 commit a3c86ae
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions plugins/accelerated-moe/README.md
@@ -50,6 +50,7 @@ Not all of the features of `megablocks` are being incorporated; listing down some of these:
- only supports the *dropless sparse* MLPs in the megablocks package; the other variants, such as the non-dropless and grouped computes, are not currently integrated.
- `shard_moe` may not scale well to larger models, as the current implementation uses `torch.concat` to combine all the expert weights before passing them to `torch.distributed` for sharding. This is done redundantly on every device, so it is inefficient; see the first sketch after this list.
- currently only supports `StateDictType.SHARDED_STATE_DICT`, because the implementation uses `DTensors`, which have limited support for full state dicts. Sharded state dicts are also the most efficient option in any case; see the second sketch after this list.
- currently may not support *mixed precision* properly; it still needs to be ascertained how (and whether) the sharded `DTensors` are upcast in the optimizer.
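
For illustration, here is a minimal sketch of the concat-then-shard pattern described above, assuming one weight tensor per expert and a 1-D device mesh; the function name `shard_expert_weights` and the shapes are hypothetical, not the plugin's actual code.

```python
# Minimal sketch of the concat-then-shard pattern (assumed names and shapes;
# not the plugin's actual implementation).
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import Shard, distribute_tensor


def shard_expert_weights(expert_weights: list[torch.Tensor], world_size: int):
    """Concatenate per-expert weights, then shard along the expert dimension."""
    mesh = init_device_mesh("cuda", (world_size,))
    # Every rank materialises the full concatenated tensor here, which is the
    # redundancy called out above.
    full = torch.cat(expert_weights, dim=0)
    # Returns a DTensor sharded on dim 0 across the device mesh.
    return distribute_tensor(full, mesh, placements=[Shard(0)])
```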

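Likewise, here is a short sketch of checkpointing with `StateDictType.SHARDED_STATE_DICT`; this is standard FSDP usage (the `model` below is assumed to already be FSDP-wrapped), not code from this plugin.

```python
# Sketch: gather a *sharded* state dict from an FSDP-wrapped model.
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardedStateDictConfig,
    StateDictType,
)

with FSDP.state_dict_type(
    model,  # assumed to be FSDP-wrapped
    StateDictType.SHARDED_STATE_DICT,
    ShardedStateDictConfig(offload_to_cpu=True),
):
    # Each rank holds only its own shards, so the full model is never
    # materialised on a single device.
    sharded_sd = model.state_dict()
```
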
### Megablocks Dependencies

