Commit

SeanNaren committed Dec 1, 2020
1 parent 58d7249 commit d2a78a2
Showing 2 changed files with 5 additions and 12 deletions.
6 changes: 4 additions & 2 deletions docs/source/multi_gpu.rst
@@ -600,6 +600,8 @@ If you also need to use your own DDP implementation, override: :meth:`pytorch_l

----------

+.. _model-parallelism:
+
Model Parallelism [BETA]
------------------------

@@ -621,10 +623,10 @@ however the implementation is built from the ground up to be pytorch compatible
Optimizer Sharded Training still utilizes Data Parallel Training under the hood, except that the optimizer state and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.

-The benefits are variable by model, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
+The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.

-It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial.
+It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial (rough minimum of 500+ million parameter models).
Optimizer Sharded Training is typically not suited for smaller models, or where large batch sizes are important.
This is primarily because, with larger batch sizes, storing activations for the backward pass becomes the bottleneck in training, so sharding the optimizer state has less impact.
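
For the larger-model case where it does help, a minimal sketch of enabling it from a script is shown below; the ``gpus=4`` value is illustrative, while ``accelerator='ddp'`` and ``plugins='ddp_sharded'`` come from the existing example in ``performance.rst``.

.. code-block:: python

    from pytorch_lightning import Trainer

    # optimizer state and gradients are sharded across the 4 data-parallel GPUs
    trainer = Trainer(gpus=4, accelerator='ddp', plugins='ddp_sharded')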

11 changes: 1 addition & 10 deletions docs/source/performance.rst
@@ -127,16 +127,7 @@ provided by `Fairscale <https://github.com/facebookresearch/fairscale>`_.
When training on multiple GPUs, sharded DDP can substantially increase memory efficiency, and in some cases multi-node performance is better than with traditional DDP.
This is due to efficient communication and parallelization under the hood.

-To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
-
-.. code-block:: bash
-
-    pip install https://github.com/facebookresearch/fairscale/archive/master.zip
-
-.. code-block:: python
-
-    # train using Sharded DDP
-    trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
+To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.

Sharded DDP can work across all DDP variants by adding the ``--plugins ddp_sharded`` flag.
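
As a rough sketch of how that flag reaches the Trainer, assuming a training script (here called ``train.py``) that builds its arguments with ``argparse`` via ``Trainer.add_argparse_args``; the script name and the ``--gpus 2`` value are illustrative:

.. code-block:: python

    from argparse import ArgumentParser

    from pytorch_lightning import Trainer

    parser = ArgumentParser()
    # adds the Trainer's flags, including --plugins, to the parser
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    # e.g. python train.py --gpus 2 --accelerator ddp --plugins ddp_sharded
    trainer = Trainer.from_argparse_args(args)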

