Custom attention bias #617
base: main
Conversation
An error occurred during an experiment with this code.
@khj94 Thanks for reporting this. As I said, testing has been very limited so far. Correct me if I'm wrong, but I think the error occurs when the sequence length is not a multiple of 8. For q, k, v it looks like the tensors are padded (here) to avoid this problem; I guess I need to do the same for the bias tensor. I'll look into it.
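A minimal sketch of the kind of padding described above, assuming the bias has shape (batch, nheads, seqlen_q, seqlen_k); the helper name is illustrative and this is not the PR's actual code:

```python
import torch
import torch.nn.functional as F

def pad_bias_to_multiple_of_8(bias: torch.Tensor) -> torch.Tensor:
    # bias: (batch, nheads, seqlen_q, seqlen_k) additive attention bias.
    seqlen_q, seqlen_k = bias.shape[-2], bias.shape[-1]
    pad_q = (8 - seqlen_q % 8) % 8
    pad_k = (8 - seqlen_k % 8) % 8
    # F.pad pads from the last dimension backwards: (left_k, right_k, left_q, right_q).
    # Whether zero is the right padding value depends on how the padded
    # rows/columns are masked out downstream.
    return F.pad(bias, (0, pad_k, 0, pad_q), value=0.0)

bias = torch.randn(2, 4, 125, 130, dtype=torch.float16)
padded = pad_bias_to_multiple_of_8(bias)  # shape (2, 4, 128, 136)
```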
@b-albar Thank you for responding quickly.
Actually, the way it's implemented for q, k, v is that the padded tensors are returned in the forward pass and reinjected through the Python interface in the backward pass, so there is no need to pad twice. I added this in the last commit. I also encountered some NaNs in dq, dk, dv and had to add boundary conditions on rows and columns, similar to what was implemented in the alibi patch. But it looks like I ran into some race conditions (due to branching?), and I had to add a __syncthreads(). It looks ok now (hopefully), but I'll have to investigate more on that later.
I added the gradients for the bias. The gradient is computed only when the bias has requires_grad set. This avoids copying a large matrix when it's not needed (e.g. for alibi-style biases or masking), since one of the main features of FA is not having to materialize this matrix.
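As a rough illustration of the behavior described above (the function and tensor names are hypothetical, not the PR's interface): the backward pass only materializes a bias gradient when the bias actually requires one, and reduces over any broadcast dimensions.

```python
import torch

def bias_grad(ds: torch.Tensor, bias: torch.Tensor):
    # ds: gradient w.r.t. the pre-softmax scores, shape (batch, nheads, seqlen_q, seqlen_k).
    # Skip the large copy entirely for fixed biases (alibi-style or plain masks).
    if not bias.requires_grad:
        return None
    dbias = ds
    # If the bias was broadcast (e.g. shape (1, 1, seqlen_q, seqlen_k)),
    # sum the gradient back down to the bias's shape.
    if bias.shape[0] == 1:
        dbias = dbias.sum(dim=0, keepdim=True)
    if bias.shape[1] == 1:
        dbias = dbias.sum(dim=1, keepdim=True)
    return dbias
```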
@b-albar I want to check the last commit, but an error occurs during the build process.
@khj94 My bad, should be good now!
Just dropping in as an interested observer, @b-albar, to say: thank you for your continued work on this. This PR is going to unlock some serious efficiency gains for T5 models, and I think it will lead to some great stuff. ❤️
* add support for shape (1,1,q,k)
* move bias and ds on the same smem
Just echoing what others have said - extremely excited for this to land - let me know if there is anything I can do to help!
I'll make a note here for anyone wanting to give this a try right now that all tests are currently failing.
@c0nn3r Thanks for noticing this. There is clearly a memory error somewhere. It is very odd that it didn't show up in my local version: I usually compile with alibi and some other options disabled as it is faster for debugging, but this error shows up only when all options are enabled! I'll investigate what's going on. In the meantime, specifying "-DFLASHATTENTION_DISABLE_ALIBI" at compilation seems to do the trick (at least for me) as a temporary workaround.
Thanks for looking into it. Hmm, I just compiled it (7fe76b6) and I'm still seeing the misaligned address error when I run the tests. I also attempted to bisect the issue, but I've found it appears even before the refactor (and it takes a while to build even on 40 CPU cores).
I definitely didn't have a misaligned address error previously. Could you post the pytest logs here? I'd like to see which seqlens are failing. Also, which GPU and CUDA version do you use?
Ok, after some experiments, I think we are dealing with a compiler issue here. That would explain why it works after disabling some options and why I didn't find it before. It looks like the kernel is quite big and the compiler doesn't like it too much (just compiling the kernel with debug info crashes the compilation). I was using CUDA 12.3 update 1 previously and could reproduce the memory error even with alibi disabled, but disabling alibi, uneven_k and local worked fine without any change in the code! Updating CUDA to 12.3 update 2 works better (the changelog mentions a register spilling issue and potential miscompilation; I don't know if it could be this), and now disabling only alibi works again. I'll try to recompile the code with all the options to see. For now, I'll disable all options on this branch until it is fully clear what happens; at least it will speed up the compilation. @c0nn3r did you use CUDA 12.3.1 to compile?
@b-albar I used CUDA 12.1 to compile. Let me update to CUDA 12.3 and try again.
I get the same result as before using:
I don't think it is the compiler: both the flash attention and attention bias tests fail. My GPUs are all RTX A5000s.
Hi there! I am willing to help with the testing.

RTX 4090: when I tried it with my app, it continues to throw an error. More specifically:

Here are some warnings that appeared during compilation which might be relevant (sorry, I have no idea whether they are):

A100 PCIE: however, when I deployed the model in my application (with the latest branch), it fails with an error.
Finally someone did it! Does it also support negative biases (i.e. masks)?
I am currently testing this. I have implemented it for SwitchTransformers on an NVIDIA A100 40GB, training on the MLM task with LR=1.5e-4. Loss scaling is very unstable. Lowering the LR and increasing the warmup_ratio seems to help, but at some point something explodes and destroys the loss scaling.
Thank you all for testing. @wehos @sentialx Yes, it should work for masking by using a large negative bias.
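A small sketch of what "masking via a large negative bias" can look like on the user side; the shapes are illustrative and the exact way the bias is passed depends on this PR's Python interface:

```python
import torch

batch, nheads, seqlen = 2, 8, 128

# Boolean mask: True where a query must NOT attend to a key.
disallowed = torch.rand(batch, 1, seqlen, seqlen) < 0.1

# Express the mask as an additive bias: 0 where attention is allowed, a large
# negative value where it is not, so softmax drives those weights to ~0.
bias = torch.zeros(batch, 1, seqlen, seqlen, dtype=torch.float16)
bias.masked_fill_(disallowed, -1e4)  # large but finite, to avoid fp16 overflow / all-NaN rows

# bias would then be passed to the attention call alongside q, k, v.
```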
@b-albar Sorry, I didn't clarify: I'm using bf16, but the Switch architecture is tricky with mixed precision too (they use selective mixed precision, so some layers are bf16 and others are fp32). I tried it again with the new commit on a subset of StarCoder data (7B tokens) and reached 0.17 loss at epoch 0.32, which is a bit ridiculous for an MLM task.
Ok, I think I see: you mean the loss going (too) quickly to zero? In that case, maybe it's a causality issue; it's better to force causality in the decoder using causal=True rather than encoding it in the bias matrix. It looks like the SwitchTransformers architecture is similar to T5. I'm working on a T5 with flash-attention, maybe it can help; the code is available here.
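A rough sketch of this suggestion, assuming an interface where flash_attn_func accepts an additive bias (the bias keyword comes from this PR and its exact name is assumed, not confirmed): causality is handled by causal=True, while the bias carries only the relative-position terms.

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Relative-position bias only (e.g. T5-style buckets), broadcastable to
# (batch, nheads, seqlen_q, seqlen_k); no -inf entries for causality here.
rel_bias = torch.randn(1, nheads, seqlen, seqlen, device="cuda", dtype=torch.bfloat16)

# Causality is enforced by the kernel, not by the bias matrix:
# out = flash_attn_func(q, k, v, bias=rel_bias, causal=True)  # bias kwarg assumed from this PR
```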
Thanks a lot, I was actually looking for a T5-like implementation for Flash; Switch is basically T5-MoE. I've also replaced the T5 RPE with FIRE for the bias, which gave me better performance across different lengths.
@b-albar Could this work with infinite negatives (a custom attention mask)? It seems like I have to reshape the array to (batch_size, num_heads, seq_len, seq_len). It would be good if broadcasting were performed along the batch and head dimensions.
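Following the commit note above about shape (1,1,q,k) support, a small illustration of the difference between a broadcastable bias and an explicitly expanded one (purely a usage sketch, not the PR's code):

```python
import torch

batch_size, num_heads, seqlen = 4, 8, 256

# A bias shared across batch and heads, kept at shape (1, 1, q, k) so the kernel
# can broadcast it instead of reading a separate copy per batch element and head.
bias = torch.randn(1, 1, seqlen, seqlen, dtype=torch.bfloat16)

# Fallback if only (batch_size, num_heads, seqlen, seqlen) were accepted:
# .expand creates a broadcast view without allocating new memory.
bias_full = bias.expand(batch_size, num_heads, seqlen, seqlen)
```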
This PR is an attempt to add custom (additive) attention biases. This is still very much a work in progress, to say the least, but I thought I'd make my code available as there may be a lot of interest in this feature (#332, #342, #534, #606, #661).
For now, there are no gradients for the bias matrix (I'm working on this) and it has been through very limited testing, but the forward pass as well as dq, dv, dk look ok (tested only on an A100). Also, no time has been spent on optimization. I'm not familiar with working on kernels, so any comments or help with this PR would be very much welcome!
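For readers new to the feature, here is a naive PyTorch reference of what an additive attention bias means. This only describes the semantics the kernel is meant to reproduce, written with a (batch, nheads, seqlen, headdim) layout for clarity; it is not the kernel itself:

```python
import math
import torch

def attention_with_bias_ref(q, k, v, bias, causal=False):
    # q, k, v: (batch, nheads, seqlen, headdim); bias broadcastable to (batch, nheads, seqlen_q, seqlen_k).
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / math.sqrt(q.shape[-1])
    scores = scores + bias  # the custom additive bias this PR adds to flash attention
    if causal:
        seqlen_q, seqlen_k = scores.shape[-2], scores.shape[-1]
        mask = torch.triu(torch.ones(seqlen_q, seqlen_k, dtype=torch.bool, device=scores.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)
```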
Todo