Commit

SeanNaren committed Dec 1, 2020
1 parent 58d7249 commit d2a78a2
Showing 2 changed files with 5 additions and 12 deletions.
6 changes: 4 additions & 2 deletions docs/source/multi_gpu.rst
@@ -600,6 +600,8 @@ If you also need to use your own DDP implementation, override: :meth:`pytorch_l

----------

+.. _model-parallelism:
+
Model Parallelism [BETA]
------------------------

@@ -621,10 +623,10 @@ however the implementation is built from the ground up to be pytorch compatible
Optimizer Sharded Training still utilizes Data Parallel Training under the hood, except that the optimizer state and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.

-The benefits are variable by model, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
+The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.

-It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial.
+It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial (rough minimum of 500+ million parameter models).
Optimizer Sharded Training is typically not suited for smaller models, or where large batch sizes are important.
This is primarily because, with larger batch sizes, storing activations for the backward pass becomes the bottleneck in training, so sharding the optimizer state has less impact.
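
For the larger-model case where it does help, a minimal sketch of enabling it from a script is shown below; the ``gpus=4`` value is illustrative, while ``accelerator='ddp'`` and ``plugins='ddp_sharded'`` come from the existing example in ``performance.rst``.

.. code-block:: python

    from pytorch_lightning import Trainer

    # optimizer state and gradients are sharded across the 4 data-parallel GPUs
    trainer = Trainer(gpus=4, accelerator='ddp', plugins='ddp_sharded')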

11 changes: 1 addition & 10 deletions docs/source/performance.rst
@@ -127,16 +127,7 @@ provided by `Fairscale <https://github.com/facebookresearch/fairscale>`_.
When training on multiple GPUs, sharded DDP can substantially increase memory efficiency, and in some cases multi-node performance is better than with traditional DDP.
This is due to efficient communication and parallelization under the hood.

-To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
-
-.. code-block:: bash
-
-    pip install https://github.com/facebookresearch/fairscale/archive/master.zip
-
-.. code-block:: python
-
-    # train using Sharded DDP
-    trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
+To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.

Sharded DDP can work across all DDP variants by adding the ``--plugins ddp_sharded`` flag.
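
As a rough sketch of how that flag reaches the Trainer, assuming a training script (here called ``train.py``) that builds its arguments with ``argparse`` via ``Trainer.add_argparse_args``; the script name and the ``--gpus 2`` value are illustrative:

.. code-block:: python

    from argparse import ArgumentParser

    from pytorch_lightning import Trainer

    parser = ArgumentParser()
    # adds the Trainer's flags, including --plugins, to the parser
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    # e.g. python train.py --gpus 2 --accelerator ddp --plugins ddp_sharded
    trainer = Trainer.from_argparse_args(args)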

