Document sequential CPU offload method on Stable Diffusion pipeline (#1024)

* document cpu offloading method

* address review comments

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

piEsposito and patrickvonplaten authored Oct 27, 2022
1 parent a6314a8 commit de00c63
Showing 2 changed files with 62 additions and 7 deletions.
64 changes: 57 additions & 7 deletions docs/source/optimization/fp16.mdx
@@ -14,17 +14,20 @@ specific language governing permissions and limitations under the License.

We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.


|                  | Latency | Speedup |
| ---------------- | ------- | ------- |
| original         | 9.50s   | x1      |
| cuDNN auto-tuner | 9.37s   | x1.01   |
| autocast (fp16)  | 5.47s   | x1.74   |
| fp16             | 3.61s   | x2.63   |
| channels last    | 3.30s   | x2.88   |
| traced UNet      | 3.21s   | x2.96   |

<em>
obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from
the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM
steps.
</em>
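
The Speedup column can be sanity-checked from the latencies. The sketch below assumes that speedup means the original latency divided by each technique's latency; the helper name `speedup` is ours and is not part of Diffusers.

```python
# Latencies (seconds) copied from the table above.
latencies = {
    "original": 9.50,
    "cuDNN auto-tuner": 9.37,
    "autocast (fp16)": 5.47,
    "fp16": 3.61,
    "channels last": 3.30,
    "traced UNet": 3.21,
}


def speedup(baseline_s, latency_s):
    """Assumed definition: how many times faster than the baseline."""
    return baseline_s / latency_s


speedups = {name: speedup(latencies["original"], s) for name, s in latencies.items()}
for name, ratio in speedups.items():
    print(f"{name:<17} x{ratio:.2f}")
```

Under this definition, a lower latency always yields a higher speedup, which is a quick way to spot an inconsistent row.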

## Enable cuDNN auto-tuner

@@ -61,7 +64,7 @@ pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
```

Despite the precision loss, in our experience the final image results look the same as the `float32` versions. Feel free to experiment and report back!
@@ -79,15 +82,18 @@ pipe = StableDiffusionPipeline.from_pretrained(
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```

## Sliced attention for additional memory savings

For even greater memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.

<Tip>
Attention slicing is useful even with a batch size of just 1, as long as the
model uses more than one attention head. With more than one attention head,
the *QK^T* attention matrix can be computed sequentially for each head, which
can save a significant amount of memory.
</Tip>

To perform the attention computation sequentially over each head, you only need to invoke [`~StableDiffusionPipeline.enable_attention_slicing`] in your pipeline before inference, like here:
@@ -105,11 +111,55 @@ pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```

There's a small performance penalty (inference is about 10% slower), but this method allows you to use Stable Diffusion with as little as 3.2 GB of VRAM!
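
To see why computing one head at a time helps, here is a back-of-the-envelope estimate of the memory held by the *QK^T* score matrices. The numbers used (8 heads, one token per position of a 64x64 latent, 2 bytes per fp16 element, batch size 1) are illustrative assumptions rather than values measured from the pipeline, and `attention_scores_bytes` is a hypothetical helper.

```python
def attention_scores_bytes(num_tokens, heads_in_flight, bytes_per_element=2):
    """Bytes needed to hold the QK^T score matrices computed at the same time.

    Each head produces one (num_tokens x num_tokens) matrix.
    """
    return heads_in_flight * num_tokens * num_tokens * bytes_per_element


NUM_HEADS = 8          # assumption, for illustration only
NUM_TOKENS = 64 * 64   # assumption: one token per position of a 64x64 latent

all_heads = attention_scores_bytes(NUM_TOKENS, heads_in_flight=NUM_HEADS)
one_head = attention_scores_bytes(NUM_TOKENS, heads_in_flight=1)

print(f"all heads at once:  {all_heads / 2**20:.0f} MiB")
print(f"one head at a time: {one_head / 2**20:.0f} MiB")
```

The estimate covers only the score matrices of a single attention layer, so it should be read as an order-of-magnitude argument, not as a prediction of total VRAM use.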

## Offloading to CPU with accelerate for memory savings

For additional memory savings, you can offload the weights to CPU and load them to GPU when performing the forward pass.

To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
```

With this, you can get memory consumption below 2 GB.

It is also possible to chain it with attention slicing for minimal memory consumption, running in less than 800 MB of GPU VRAM:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)

image = pipe(prompt).images[0]
```
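
The idea behind sequential offloading can be illustrated with a toy simulation. This is not how accelerate implements `cpu_offload`; `ToyModule`, `ToyGPU`, and the sizes in MB are all hypothetical. They are chosen only to show that peak GPU usage is bounded by the largest piece loaded at one time rather than by the sum of all weights.

```python
class ToyModule:
    def __init__(self, name, size_mb):
        self.name = name
        self.size_mb = size_mb


class ToyGPU:
    """Tracks how many MB of weights are resident and the peak seen so far."""

    def __init__(self):
        self.resident_mb = 0
        self.peak_mb = 0

    def load(self, module):
        self.resident_mb += module.size_mb
        self.peak_mb = max(self.peak_mb, self.resident_mb)

    def unload(self, module):
        self.resident_mb -= module.size_mb


def peak_fully_resident(modules):
    """All weights are moved to the GPU up front and stay there."""
    gpu = ToyGPU()
    for module in modules:
        gpu.load(module)
    return gpu.peak_mb


def peak_sequential_offload(modules):
    """Each module is on the GPU only for the duration of its forward pass."""
    gpu = ToyGPU()
    for module in modules:
        gpu.load(module)    # weights copied to the GPU right before forward
        # ... the forward pass would run here ...
        gpu.unload(module)  # weights released again afterwards
    return gpu.peak_mb


# Made-up sizes, not the real model sizes.
modules = [ToyModule("text_encoder", 250), ToyModule("unet", 1700), ToyModule("vae", 170)]
print(peak_fully_resident(modules))
print(peak_sequential_offload(modules))
```

The trade-off is additional host-to-device transfers on every forward pass, so the memory savings come at the cost of slower inference.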

## Using Channels Last memory format

Channels last memory format is an alternative way of ordering NCHW tensors in memory that preserves the dimension ordering. Channels last tensors are ordered in such a way that channels become the densest dimension (i.e., images are stored pixel-per-pixel). Since not all operators currently support the channels last format, it may result in worse performance, so it's better to try it and see if it works for your model.
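
To make the layout difference concrete, the sketch below computes element strides for a tensor of logical shape (N, C, H, W) under both layouts. The helper names are hypothetical, and the stride formulas are written out by hand here rather than queried from PyTorch.

```python
def contiguous_strides(n, c, h, w):
    """Strides for the default NCHW layout: W is the densest dimension."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Strides for channels last: C is the densest dimension, shape stays NCHW."""
    return (h * w * c, 1, w * c, c)


shape = (1, 4, 64, 64)  # example shape, chosen for illustration
print(contiguous_strides(*shape))
print(channels_last_strides(*shape))
```

In both cases the logical shape is still (N, C, H, W). Only the strides change: with channels last, the channel stride drops to 1, meaning the channel values of one pixel sit next to each other in memory.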
@@ -120,6 +120,11 @@ def disable_attention_slicing(self):
        self.enable_attention_slicing(None)

    def enable_sequential_cpu_offload(self):
        r"""
        Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, unet,
        text_encoder, vae and safety checker have their state dicts saved to CPU and then are moved to a
        `torch.device('meta')` and loaded to GPU only when their specific submodule has its `forward` method called.
        """
        if is_accelerate_available():
            from accelerate import cpu_offload
        else:

1 comment on commit de00c63

dblunk88 (Contributor) commented on de00c63 Oct 27, 2022

When I try that, I get the error: `AttributeError: 'NoneType' object has no attribute 'state_dict'`

This happens specifically when `safety_checker = None`. The stable_diffusion pipeline needs a check for whether safety_checker exists.

Additionally, it defaults to CUDA (I think). It might be a good idea to also support multi-GPU setups by passing the CUDA device explicitly.
