T5 #149
base: master
Conversation
Looks great. Thank you.
I left a few comments, but all of them are code-organization suggestions.
One question: how do you make sure your implementation is correct? The way I do this is to check that the output of LED-T5 perfectly matches that of T5 for a random short input, say of size 4x256.
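For reference, a minimal parity check along those lines might look like the sketch below. The checkpoint path is a placeholder, and I am assuming the converted class `LongformerEncoderDecoderForConditionalGenerationT5` (the name used by this PR's conversion script) is importable from `longformer.longformer_encoder_decoder`:

```python
import torch
from transformers import T5ForConditionalGeneration

# Assumed import location; adjust to wherever this PR defines the class.
from longformer.longformer_encoder_decoder import LongformerEncoderDecoderForConditionalGenerationT5

torch.manual_seed(0)

t5 = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
led_t5 = LongformerEncoderDecoderForConditionalGenerationT5.from_pretrained(
    "/path/to/converted-t5-small"  # placeholder: output of convert_t5_to_longformerencoderdecoder.py
).eval()

# Random short input (4 x 256) so the whole sequence fits inside one attention
# window and LED-T5 should reduce to vanilla T5 attention.
input_ids = torch.randint(0, t5.config.vocab_size, (4, 256))
attention_mask = torch.ones_like(input_ids)
decoder_input_ids = torch.zeros((4, 1), dtype=torch.long)

with torch.no_grad():
    logits_t5 = t5(input_ids, attention_mask=attention_mask,
                   decoder_input_ids=decoder_input_ids, use_cache=False)[0]
    logits_led = led_t5(input_ids, attention_mask=attention_mask,
                        decoder_input_ids=decoder_input_ids, use_cache=False)[0]

# If the port is correct, the two outputs should match up to numerical noise.
assert torch.allclose(logits_t5, logits_led, atol=1e-5), "LED-T5 output diverges from T5"
```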
```python
# this is for the T5 setting
if "has_relative_attention_bias" in config.to_dict():
    self.is_decoder = config.is_decoder
    self.relative_attention_num_buckets = config.relative_attention_num_buckets
    self.has_relative_attention_bias = config.has_relative_attention_bias
    if self.has_relative_attention_bias:
        self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.num_heads)
    self.is_t5 = True
else:
    self.is_t5 = False
```
I would suggest moving all the T5-specific code from here to a longformer_encoder_decoder.LongformerSelfAttentionForT5 and having it inherit from LongformerSelfAttention.
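A rough sketch of that reorganization, just to illustrate the shape of it (the constructor signature is assumed to match LongformerSelfAttention's `(config, layer_id)`; the body is simply the T5-specific block from this diff moved into the subclass):

```python
from torch import nn
from longformer.longformer import LongformerSelfAttention  # existing base class in this repo


# Would live in longformer_encoder_decoder.py.
class LongformerSelfAttentionForT5(LongformerSelfAttention):
    """LongformerSelfAttention plus T5's relative attention bias."""

    def __init__(self, config, layer_id):
        super().__init__(config, layer_id)
        self.is_decoder = config.is_decoder
        self.relative_attention_num_buckets = config.relative_attention_num_buckets
        self.has_relative_attention_bias = config.has_relative_attention_bias
        if self.has_relative_attention_bias:
            self.relative_attention_bias = nn.Embedding(
                self.relative_attention_num_buckets, self.num_heads
            )
```

This would also remove the need for the is_t5 flag: the base class handles the non-T5 case and the subclass handles T5.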
```python
if self.is_t5:
    if position_bias is None:
        if not self.has_relative_attention_bias:
            raise ValueError("No position_bias provided and no weights to compute position_bias")
        position_bias = self.compute_bias(seq_len, seq_len)

        # if key and values are already calculated
        # we want only the last query position bias
        if past_key_value_state is not None:
            position_bias = position_bias[:, :, -1:, :]

        # TODO: attention_mask should also be the same shape as position_bias.
        # Sliding attention window??
        # if attention_mask is not None:
        #     position_bias = position_bias + attention_mask  # (1, num_heads, seq_len, 2*window+1)
    attn_weights += position_bias
```
As above, move this to LongformerSelfAttentionForT5. Here you can keep only one line, something like `attn_weights = self.process_relative_positions(attn_weights)`. This function is empty in LongformerSelfAttention but has more details in LongformerSelfAttentionForT5.
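Something along these lines, continuing the subclass sketched above (the extra keyword arguments to the hook are my assumption; the T5 version needs position_bias and the cached state from the caller one way or another):

```python
from longformer.longformer import LongformerSelfAttention


class LongformerSelfAttentionForT5(LongformerSelfAttention):
    # In LongformerSelfAttention the hook would simply return attn_weights unchanged:
    #     def process_relative_positions(self, attn_weights, **kwargs):
    #         return attn_weights
    def process_relative_positions(self, attn_weights, position_bias=None, past_key_value_state=None):
        if position_bias is None:
            if not self.has_relative_attention_bias:
                raise ValueError("No position_bias provided and no weights to compute position_bias")
            # qlen/klen arguments dropped here since, as noted below, they are unused.
            position_bias = self.compute_bias()
            # If keys and values are already cached, we only score the last
            # query position, so keep only that slice of the bias.
            if past_key_value_state is not None:
                position_bias = position_bias[:, :, -1:, :]
        return attn_weights + position_bias
```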
```python
    relative_buckets += torch.where(is_small, relative_position, relative_postion_if_large)
    return relative_buckets

def compute_bias(self, qlen, klen):
```
qlen, klen are not used
```python
def compute_bias(self, qlen, klen):
    """ Compute binned relative position bias """
    relative_position = torch.tensor([[i-self.attention_window for i in range(2*self.attention_window+1)]])
```
Please add a comment to explain the change.
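For example, something like the following. This is only my reading of the change, reconstructed from the first line of the hunk and the original T5 compute_bias (with the unused qlen/klen parameters dropped), so the actual body in the PR may differ:

```python
def compute_bias(self):
    """Compute binned relative position bias for sliding-window attention.

    T5's compute_bias(qlen, klen) builds a full (qlen, klen) grid of relative
    positions. With a sliding attention window, every query sees the same set
    of relative offsets [-attention_window, ..., +attention_window], so a
    single row of 2 * attention_window + 1 offsets is enough; the resulting
    bias of shape (1, num_heads, 1, 2 * window + 1) broadcasts over all query
    positions.
    """
    relative_position = torch.tensor([[i - self.attention_window for i in range(2 * self.attention_window + 1)]])
    relative_position_bucket = self._relative_position_bucket(
        relative_position,
        bidirectional=not self.is_decoder,
        num_buckets=self.relative_attention_num_buckets,
    )
    values = self.relative_attention_bias(relative_position_bucket)  # (1, 2*window+1, num_heads)
    values = values.permute([2, 0, 1]).unsqueeze(0)  # (1, num_heads, 1, 2*window+1)
    return values
```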
```python
@staticmethod
def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    """
    Adapted from Mesh Tensorflow:
```
This function is copied with no change, right? Please mention that.
Hello @AkshitaB, I am keeping my fingers crossed for you in porting T5 to use LongformerSelfAttention. I've tried to run the code you uploaded and it didn't work for me.
I received this error:

```
$ CUDA_VISIBLE_DEVICES=6 python3 convert_t5_to_longformerencoderdecoder.py --save_model_to ./
INFO:__main__:saving model to ./
Some weights of LongformerEncoderDecoderForConditionalGenerationT5 were not initialized from the model checkpoint at ./ and are newly initialized: ['encoder.block.0.layer.0.SelfAttention.longformer_self_attn.query.bias', (...)]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
File "convert_t5_to_longformerencoderdecoder.py", line 148, in <module>
main()
File "convert_t5_to_longformerencoderdecoder.py", line 140, in main
logits = model(input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids, use_cache=False)[0]
File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/transformers/modeling_t5.py", line 1151, in forward
encoder_outputs = self.encoder(
File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/transformers/modeling_t5.py", line 775, in forward
position_bias = layer_outputs[3 if output_attentions else 2]
IndexError: tuple index out of range
```
@adamwawrzynski, thanks for your interest in this work. Can you debug this a bit and see why it is breaking? Looks like a misconfiguration for …
I've checked the dimensions of ...

```python
for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_outputs = layer_module(
hidden_states,
attention_mask=extended_attention_mask,
position_bias=position_bias,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_extended_attention_mask,
encoder_decoder_position_bias=encoder_decoder_position_bias,
head_mask=head_mask[i],
past_key_value_state=past_key_value_state,
use_cache=use_cache,
output_attentions=output_attentions,
)
print(type(layer_outputs))
print(len(layer_outputs))
print(type(layer_outputs[0]))
```

... results in:

```
<class 'tuple'>
2
<class 'torch.Tensor'>
```

And later in the code at line …
Hello. Model: https://huggingface.co/cointegrated/rut5-small

```
python3 convert_t5_to_longformerencoderdecoder.py --base_model /mnt/1tb/ML/models/ruT5/small/ --save_model_to .
```