update readme with note on mixed precision
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim committed Aug 27, 2024
1 parent 59907e3 commit a3c86ae
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions plugins/accelerated-moe/README.md
@@ -50,6 +50,7 @@ Not all of the features of `megablocks` are being incorporated; listing down some of these:
- only supports the *dropless sparse* MLPs in the megablocks package; the other variants, such as the non-dropless and grouped computes, are not currently integrated.
- `shard_moe` may not scale well to larger models, as the current implementation uses `torch.concat` to combine all the expert weights before passing them to `torch.distributed` for sharding. This is done redundantly on every device, so it is inefficient; see the first sketch after this list.
- currently only supports `StateDictType.SHARDED_STATE_DICT`, because the implementation uses `DTensors`, which have limited support for full state dicts. Sharded state dicts are also the most efficient option in any case; see the second sketch after this list.
- currently may not support *mixed precision* properly; it still needs to be ascertained how (and whether) the sharded `DTensors` are upcast in the optimizer.
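
For illustration, here is a minimal sketch of the concat-then-shard pattern described above, assuming one weight tensor per expert and a 1-D device mesh; the function name `shard_expert_weights` and the shapes are hypothetical, not the plugin's actual code.

```python
# Minimal sketch of the concat-then-shard pattern (assumed names and shapes;
# not the plugin's actual implementation).
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import Shard, distribute_tensor


def shard_expert_weights(expert_weights: list[torch.Tensor], world_size: int):
    """Concatenate per-expert weights, then shard along the expert dimension."""
    mesh = init_device_mesh("cuda", (world_size,))
    # Every rank materialises the full concatenated tensor here, which is the
    # redundancy called out above.
    full = torch.cat(expert_weights, dim=0)
    # Returns a DTensor sharded on dim 0 across the device mesh.
    return distribute_tensor(full, mesh, placements=[Shard(0)])
```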

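Likewise, here is a short sketch of checkpointing with `StateDictType.SHARDED_STATE_DICT`; this is standard FSDP usage (the `model` below is assumed to already be FSDP-wrapped), not code from this plugin.

```python
# Sketch: gather a *sharded* state dict from an FSDP-wrapped model.
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardedStateDictConfig,
    StateDictType,
)

with FSDP.state_dict_type(
    model,  # assumed to be FSDP-wrapped
    StateDictType.SHARDED_STATE_DICT,
    ShardedStateDictConfig(offload_to_cpu=True),
):
    # Each rank holds only its own shards, so the full model is never
    # materialised on a single device.
    sharded_sd = model.state_dict()
```
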
### Megablocks Dependencies

