For our use case of fine-tuning LMs on sequences of up to 2048 tokens, flash attention could get us a ~2-4x speedup and a VRAM usage reduction of up to 10x. That sounds pretty amazing, so I'd like to give it a shot. Some code inspiration:
- diffusers PR 532: shows how to use flash attention through xformers, probably the least painful way to go about it (rough sketch below)
- GPT-NeoX PR 725: flash attention implementation in Eleuther's fork of Megatron
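For reference, here's a minimal sketch of what the xformers route looks like, along the lines of the diffusers PR. The `attention_forward` wrapper, tensor shapes, and the point where this would plug into NeoX are my own placeholders; the real API here is `xformers.ops.memory_efficient_attention` (the causal-mask helper name may differ across xformers versions).

```python
# Minimal sketch (not our training code) of routing attention through
# xFormers' memory-efficient / flash attention kernel.
import torch
import xformers.ops as xops


def attention_forward(q, k, v, dropout_p=0.0):
    # q, k, v: [batch, seq_len, num_heads, head_dim] -- the layout
    # memory_efficient_attention expects. The kernel dispatches to a
    # flash-attention-style implementation when hardware/dtype allow it
    # and never materializes the full [seq_len, seq_len] attention matrix.
    return xops.memory_efficient_attention(
        q, k, v,
        attn_bias=xops.LowerTriangularMask(),  # causal masking for LM training
        p=dropout_p,
    )


if __name__ == "__main__":
    b, s, h, d = 2, 2048, 16, 64
    q = torch.randn(b, s, h, d, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = attention_forward(q, k, v)  # [b, s, h, d]
    print(out.shape)
```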
@TearGosling and I experimented with optimizing NeoX using components from xFormers after I profiled the training code. Results:

- Using the xFormers MLP component causes a performance drop and lots of warnings, so we're ignoring it.
- Using flash attention results in:
  - 👍 A decent speedup (~17% IIRC)
  - 👎 A significant increase in training and evaluation loss
  - 😐 No noticeable VRAM savings
- Using the xFormers rotary embedding implementation (rough wiring sketch below) results in:
  - 👍 A decent speedup (when used together with flash attention, it gets us over a 20% throughput increase IIRC)
  - 😐 A very minor increase in training loss
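For the rotary embedding swap, this is roughly what the wiring looks like. The import path and call signature are from memory, so double-check them against the xformers version we pin; shapes here are illustrative, not lifted from NeoX.

```python
# Rough sketch of swapping NeoX's rotary embedding for the xFormers one.
import torch
from xformers.components.positional_embedding import RotaryEmbedding

head_dim = 64
rotary = RotaryEmbedding(dim_model=head_dim)

# q, k: [batch, num_heads, seq_len, head_dim]
q = torch.randn(2, 16, 2048, head_dim)
k = torch.randn_like(q)

# Replaces NeoX's own apply_rotary_pos_emb; output shapes are unchanged.
q_rot, k_rot = rotary(q, k)
print(q_rot.shape, k_rot.shape)
```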
Next steps:
- Figure out whether the higher loss values with flash attention actually translate into a worse model, or whether they can be explained by some other factor we're not aware of.
- The lack of noticeable VRAM savings with flash attention doesn't make much sense. Investigate whether it's due to e.g. memory fragmentation (since we're not properly pre-allocating memory for dynamic tensor sizes); see the sketch below for a quick way to check.
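For the VRAM question, a cheap first check is to compare how much memory live tensors actually occupy against how much the caching allocator has reserved from the driver; a large gap, especially at the peaks, points at fragmentation/caching rather than the attention kernel itself. Something like the helper below, logged every N steps (the `log_vram` name and call sites are just illustrative):

```python
import torch


def log_vram(tag: str) -> None:
    # Memory occupied by live tensors vs. memory the caching allocator has
    # reserved from CUDA. A big reserved-vs-allocated gap suggests
    # fragmentation from dynamic tensor sizes rather than genuine usage.
    allocated = torch.cuda.memory_allocated() / 2**30
    peak_alloc = torch.cuda.max_memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak_res = torch.cuda.max_memory_reserved() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB (peak {peak_alloc:.2f}), "
          f"reserved={reserved:.2f} GiB (peak {peak_res:.2f})")


# Usage: call torch.cuda.reset_peak_memory_stats() at the start of each
# measurement window, then log_vram("after step") once every N steps,
# with and without flash attention enabled.
```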