
fixes to readme and tox
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim committed Aug 23, 2024
1 parent e00fcd0 commit cd9db22
Showing 4 changed files with 29 additions and 8 deletions.
29 changes: 26 additions & 3 deletions plugins/accelerated-moe/README.md
@@ -12,14 +12,37 @@ Plugin | Description | Depends | Loading | Augmentation | Callbacks

## Running Benchmarks

Run the command below in the top-level directory of this repo:
- the `megablocks` dependency is not included by default, so the `-x` switch installs it.

```
tox -e run-benches \
-x testenv:run-benches.deps+="-r plugins/accelerated-moe/requirements-mb.txt" \
-- \
8 8 benchmark_outputs scenarios.yaml accelerated-moe-megablocks
```

NOTE: if a `FileNotFoundError` on the *triton cache* is observed, similar to issues like these:
- https://github.com/triton-lang/triton/issues/2688

then `tox` is somehow causing problems with triton and multiprocessing (there appears to be a race condition).
The workaround is to first *activate the tox env* and then run the script manually in `bash`:
```
tox -e run-benches -- 8 8 benchmark_outputs scenarios.yaml accelerated-moe-megablocks
# if FileNotFoundError in the triton cache is observed
# - then activate the env and run the script manually
source .tox/run-benches/bin/activate
bash scripts/run_benchmarks.sh \
8 8 benchmark_outputs scenarios.yaml accelerated-moe-megablocks
```


## Expert-Parallel MoE with Megablocks

Not all of the features of `megablocks` are incorporated; the current integration has the following restrictions:
- the data parallel `dp_mesh` is currently not passed to the `FSDP` constructor, so `FSDP` will always shard over the default process group (i.e., over the full world size).
- currently only *sharded* `safetensor` non-GGUF MoE checkpoints are supported for loading. This is a reasonable assumption since MoE checkpoints are typically above the size limit that prevents them from being saved into a single checkpoint file.
- only supports the *dropless sparse* MLPs in the megablocks package; the other variations like non-dropless and grouped computes are not currently integrated.
- `shard_moe` may not scale well to larger models, as the current implementation uses `torch.concat` to combine all the expert weights before passing them to `torch.distributed` to be sharded. This is done redundantly on all devices, so it is inefficient (see the sketch after this list).
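
To make the last point concrete, below is a minimal, hypothetical sketch of the concat-then-shard pattern described above. This is not the actual `shard_moe` implementation; the function name, shapes, and usage are made up purely for illustration.

```
# Hypothetical sketch of the concat-then-shard pattern described above.
# NOT the actual `shard_moe` code; names and shapes are illustrative only.
import torch

def naive_shard_experts(expert_weights, rank, world_size):
    # every device first materializes the full concatenation of ALL expert
    # weights, even though it will only keep 1/world_size of it ...
    full = torch.cat([w.flatten() for w in expert_weights])
    # ... and then slices out its own shard, discarding the rest
    shards = torch.chunk(full, world_size)
    return shards[rank].clone()

# toy usage: 8 "experts" of 1024x1024 weights, sharded over 4 ranks
experts = [torch.randn(1024, 1024) for _ in range(8)]
local_shard = naive_shard_experts(experts, rank=0, world_size=4)
```

A more scalable approach would have each device load or receive only its own slice (e.g. via sharded checkpoint loading or a `torch.distributed` scatter), avoiding the redundant full concatenation on every rank.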
@@ -34,5 +57,5 @@ Currently databricks megablocks does not have a PyPi repository and no proper re
```
# this will install the megablocks from Github
# megablocks requires CUDA Toolkit to build.
pip install -r requirements-mb.txt
```
3 changes: 3 additions & 0 deletions plugins/accelerated-moe/requirements-mb.txt
@@ -0,0 +1,3 @@
megablocks @ git+https://github.com/databricks/megablocks.git@bce5d7b2aaf5038bc93b36f76c2baf51c2939bd2

# auto_gptq @ git+https://github.com/AutoGPTQ/AutoGPTQ.git@ea829c7bbe83561c2b1de26795b6592992373ef7

This file was deleted.

4 changes: 0 additions & 4 deletions tox.ini
@@ -41,10 +41,6 @@ commands =
python -m fms_acceleration.cli install -e {toxinidir}/plugins/attention_and_distributed_packing
python -m fms_acceleration.cli install -e {toxinidir}/plugins/accelerated-moe

# need to install some optional dependencies
# - the megablocks dependency
pip install -r {toxinidir}/plugins/accelerated-moe/requirements-mb.txt

# run the benchmark script
bash scripts/run_benchmarks.sh {posargs:"1 2" "4 8" benchmark_outputs}

