diff --git a/docs/source/multi_gpu.rst b/docs/source/multi_gpu.rst
index 28c64b3ee8d9e..5dda8192dcb8f 100644
--- a/docs/source/multi_gpu.rst
+++ b/docs/source/multi_gpu.rst
@@ -600,6 +600,8 @@ If you also need to use your own DDP implementation, override: :meth:`pytorch_l
 
 ----------
 
+.. _model-parallelism:
+
 Model Parallelism [BETA]
 ------------------------
 
@@ -621,10 +623,10 @@ however the implementation is built from the ground up to be pytorch compatible
 Optimizer Sharded Training still utilizes Data Parallel Training under the hood, except the optimizer state and gradients which are sharded across GPUs.
 This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.
 
-The benefits are variable by model, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
+The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
 these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.
 
-It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial.
+It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial (rough minimum of 500+ million parameter models).
 Optimizer Sharded Training is typically not suited for smaller models, or where large batch sizes are important.
 This is primarily because with larger batch sizes, storing activations for the backwards pass becomes the bottleneck in training. Sharding optimizer state as a result becomes less impactful.
 
diff --git a/docs/source/performance.rst b/docs/source/performance.rst
index 78db0cfd78150..0f97942128cda 100644
--- a/docs/source/performance.rst
+++ b/docs/source/performance.rst
@@ -127,16 +127,7 @@ provided by `Fairscale <https://github.com/facebookresearch/fairscale>`_.
 When training on multiple GPUs sharded DDP can assist to increase memory efficiency substantially, and in some cases performance on multi-node is better than traditional DDP.
 This is due to efficient communication and parallelization under the hood.
 
-To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
-
-.. code-block:: bash
-
-    pip install https://github.com/facebookresearch/fairscale/archive/master.zip
-
-.. code-block:: python
-
-    # train using Sharded DDP
-    trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
+To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.
 
 Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
 
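For context, the usage that ``performance.rst`` now defers to :ref:`model-parallelism` follows the pattern of the removed snippet above. A minimal sketch, assuming Fairscale is installed (e.g. via ``pip install pytorch-lightning["extra"]``); the ``gpus=4`` value is purely illustrative:

.. code-block:: python

    from pytorch_lightning import Trainer

    # 'ddp_sharded' shards optimizer state and gradients across the DDP processes,
    # lowering per-GPU memory while keeping data-parallel training semantics.
    trainer = Trainer(gpus=4, accelerator='ddp', plugins='ddp_sharded')

From the command line, the equivalent is the ``--plugins ddp_sharded`` flag mentioned above, which works across all of the DDP accelerator variants.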