Sharded DDP Docs #4920

Merged · 24 commits · Dec 2, 2020
55 changes: 51 additions & 4 deletions docs/source/multi_gpu.rst
@@ -598,6 +598,53 @@ If you need your own way to init PyTorch DDP you can override :meth:`pytorch_lig
If you also need to use your own DDP implementation, override: :meth:`pytorch_lightning.core.LightningModule.configure_ddp`.


----------

Model Parallelism [BETA]
------------------------

Model Parallelism tackles training large models on distributed systems by modifying the model's distributed communication and memory management.
Unlike data parallelism, the model is partitioned in various ways across the GPUs, in most cases to reduce the memory overhead when training large models.
This is useful when dealing with large Transformer-based models, or in environments where GPU memory is limited.

Lightning currently offers the following methods to leverage model parallelism:

- Optimizer Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead)

Optimizer Sharded Training
^^^^^^^^^^^^^^^^^^^^^^^^^^
Lightning's integration of optimizer sharded training is provided by `Fairscale <https://github.com/facebookresearch/fairscale>`_.
The technique can be found within `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_;
however, the implementation is built from the ground up to be PyTorch-compatible and standalone.

Optimizer Sharded Training still utilizes Data Parallel Training under the hood, except optimizer state and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a section of your optimizer state and gradients.
SeanNaren marked this conversation as resolved.
Show resolved Hide resolved

The benefits are variable by model, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.

It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial.
Optimizer Sharded Training is typically not suited for smaller models, or for cases where large batch sizes are important.
This is primarily because with larger batch sizes, storing activations for the backward pass becomes the bottleneck in training. As a result, sharding the optimizer state becomes less impactful.

To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.

.. code-block:: bash

pip install https://github.com/facebookresearch/fairscale/archive/master.zip


.. code-block:: python

# train using Sharded DDP
trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')

Optimizer Sharded Training can work across all DDP variants by adding the ``plugins='ddp_sharded'`` argument to the Trainer.
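
For illustration, here is a minimal sketch of the same plugin combined with other DDP variants (the ``gpus`` and ``num_nodes`` values are placeholders, and support for each combination may depend on your Lightning and Fairscale versions):

.. code-block:: python

    # a sketch only: the same plugin string with other DDP-style accelerators
    # (gpus/num_nodes values are illustrative placeholders)
    trainer = Trainer(accelerator='ddp_spawn', gpus=2, plugins='ddp_sharded')
    trainer = Trainer(accelerator='ddp', gpus=8, num_nodes=2, plugins='ddp_sharded')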

Internally we re-initialize your optimizers, sharding the optimizer state across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
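
For intuition only, the rough equivalent using Fairscale directly might look like the sketch below; Lightning performs these steps for you. The snippet assumes a distributed process group is already initialized on each rank, and the exact Fairscale signatures may vary between versions.

.. code-block:: python

    # rough, non-Lightning sketch of optimizer state sharding with Fairscale
    # (assumes torch.distributed is already initialized on each rank)
    import torch
    from fairscale.optim.oss import OSS
    from fairscale.nn.data_parallel import ShardedDataParallel

    model = torch.nn.Linear(32, 2)
    # each rank keeps only its own shard of the optimizer state
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)
    # gradients are reduced to the rank that owns the corresponding shard
    model = ShardedDataParallel(model, optimizer)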


Batch size
----------
When using distributed training make sure to modify your learning rate according to your effective
@@ -640,16 +687,16 @@ The reason is that the full batch is visible to all GPUs on the node when using

----------

PytorchElastic
TorchElastic
--------------
Lightning supports the use of PytorchElastic to enable fault-tolerent and elastic distributed job scheduling. To use it, specify the 'ddp' or 'ddp2' backend and the number of gpus you want to use in the trainer.
Lightning supports the use of TorchElastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' or 'ddp2' backend and the number of gpus you want to use in the trainer.

.. code-block:: python

Trainer(gpus=8, accelerator='ddp')


Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:
Following the `TorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

.. code-block:: bash

@@ -671,7 +718,7 @@ And then launch the elastic job with:
YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)


See the official `PytorchElastic documentation <https://pytorch.org/elastic>`_ for details
See the official `TorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.

----------
27 changes: 27 additions & 0 deletions docs/source/performance.rst
@@ -114,3 +114,30 @@ However, know that 16-bit and multi-processing (any DDP) can have issues. Here a
CUDA_LAUNCH_BLOCKING=1 python main.py

.. tip:: We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
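
For example, requesting 16-bit precision is a single Trainer argument (the ``gpus`` value is a placeholder); with PyTorch 1.6+ installed, Lightning uses the native implementation automatically:

.. code-block:: python

    # 16-bit precision; with PyTorch 1.6+ Lightning uses native AMP under the hood
    trainer = Trainer(gpus=1, precision=16)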

----------

Use Sharded DDP for GPU memory and scaling optimization
-------------------------------------------------------

Sharded DDP is a Lightning integration of `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_
provided by `Fairscale <https://github.com/facebookresearch/fairscale>`_.

When training on multiple GPUs, sharded DDP can substantially increase memory efficiency, and in some cases multi-node performance is better than with traditional DDP.
This is due to efficient communication and parallelization under the hood.

To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.

.. code-block:: bash

pip install https://github.com/facebookresearch/fairscale/archive/master.zip

.. code-block:: python

# train using Sharded DDP
trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')

Sharded DDP can work across all DDP variants by adding the ``plugins='ddp_sharded'`` argument to the Trainer.
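
As one possible configuration (a sketch; the ``gpus`` value is a placeholder), sharded training can be combined with the 16-bit precision recommendation above for further memory savings, assuming your PyTorch and Fairscale versions support mixed precision together with sharding:

.. code-block:: python

    # a sketch: sharded training combined with 16-bit precision
    # (gpus value is an illustrative placeholder)
    trainer = Trainer(accelerator='ddp', gpus=4, precision=16, plugins='ddp_sharded')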

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.