Issue in reformer: Reformer doesn't depend on its key feature -- LSHSelfAttention #16972

Closed
leo-liuzy opened this issue Apr 27, 2022 · 4 comments
Labels: bug
leo-liuzy commented Apr 27, 2022

System Info

- `transformers` version: 4.19.0.dev0
- Platform: Linux-5.4.0-81-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.4.0
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

conda create -n reformer-issue python=3.8 -y

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

pip install -e .  # install from source

python check_reformer.py

Make the (very minimal) file changes from my PR: leo-liuzy#2. Changes are located here.
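For reference, the change amounts to scaling the attention branch of the reversible residual per layer type; a hypothetical sketch of the edit inside `ReformerLayer.forward` in `modeling_reformer.py` (the exact diff is in the PR above, and variable names approximate the HF implementation):

```python
# Hypothetical sketch of the ablation edit inside ReformerLayer.forward (modeling_reformer.py).
# The reversible residual normally computes Y_1 = X_1 + f(X_2); scaling the attention
# branch lets one attention type be switched off at evaluation time.
weight = 0 if isinstance(self.attention.self_attention, LSHSelfAttention) else 1
print(f"{type(self.attention.self_attention)}: Y_1 = X_1 + f(X_2) * {weight}")
attn_output = prev_attn_output + attn_output * weight  # was: prev_attn_output + attn_output
```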

I made my fork from huggingface main two days ago.
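For context, the evaluation is a per-example perplexity computation on the released Crime and Punishment model. A minimal sketch of that kind of loop (the actual script is `check_reformer.py` from the PR; this assumes the released `google/reformer-crime-and-punishment` checkpoint and uses made-up sample sentences):

```python
# Minimal sketch of a per-example bpd/ppl evaluation (assumes the released
# google/reformer-crime-and-punishment checkpoint; the real script is check_reformer.py).
import math
import torch
from transformers import ReformerModelWithLMHead, ReformerTokenizer

model_id = "google/reformer-crime-and-punishment"
tokenizer = ReformerTokenizer.from_pretrained(model_id)
model = ReformerModelWithLMHead.from_pretrained(model_id).eval()

def score(text):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean per-token cross-entropy (nats)
    bpd = loss.item() / math.log(2)                     # bits per token ("bpd" in the output below)
    return input_ids.shape[-1], bpd, math.exp(loss.item())

# Placeholder inputs; the real run uses passages sampled from Crime and Punishment.
for text in ["He stepped cautiously down the stairs.", "Raskolnikov looked at her in silence."]:
    seq_len, bpd, ppl = score(text)
    print(f"Seq_len({seq_len})\nbpd: {bpd:.3f}\nppl: {ppl:.3f}\n")
```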

I also experimented with removing LocalSelfAttention instead, and the perplexity gets much worse, especially with long_inputs_lst (in the file). When using only LSHSelfAttention, increasing num_hash doesn't help.

My question is: could this be caused by a subtle bug introduced when porting Reformer's official code, or is it intrinsic to the Reformer architecture?
I know the Reformer paper reports a 20-layer transformer trained with 20 LSHSelfAttention layers that performs well, which is why this confuses me further.

Expected behavior

With `weight = 0 if isinstance(self.attention.self_attention, LSHSelfAttention) else 1` (LSH layers zeroed out), No. hash: 1, every example prints the same per-layer trace:

Using LSHAttn:
<class 'transformers.models.reformer.modeling_reformer.LocalSelfAttention'>: Y_1 = X_1 + f(X_2) * 1
<class 'transformers.models.reformer.modeling_reformer.LSHSelfAttention'>: Y_1 = X_1 + f(X_2) * 0
(the local/LSH pair repeats three times, covering all six layers)

With `weight = 1 if isinstance(self.attention.self_attention, LSHSelfAttention) else 1` (unmodified model), No. hash: 1, the trace is identical except that every layer is applied with weight 1.

Per-example results for both settings:

| Seq_len | bpd (LSH zeroed) | ppl (LSH zeroed) | bpd (unmodified) | ppl (unmodified) |
|--------:|-----------------:|-----------------:|-----------------:|-----------------:|
| 43      | 2.614            | 6.123            | 2.614            | 6.123            |
| 85      | 3.808            | 14.006           | 3.808            | 14.006           |
| 135     | 2.230            | 4.693            | 2.218            | 4.651            |
| 53      | 2.261            | 4.792            | 2.261            | 4.792            |
| 47      | 2.646            | 6.258            | 2.646            | 6.258            |
| 78      | 2.347            | 5.087            | 2.347            | 5.087            |
| 26      | 2.712            | 6.553            | 2.712            | 6.553            |
| 63      | 3.568            | 11.858           | 3.568            | 11.858           |
| 147     | 2.983            | 7.907            | 2.973            | 7.850            |

The numbers are identical in both settings for seven of the nine examples; only the two longest inputs (Seq_len 135 and 147) change, and only marginally.
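(For reference, the two metrics are related by `ppl = 2 ** bpd`, i.e. bpd is the mean per-token cross-entropy in bits: e.g. `2 ** 2.614 ≈ 6.12` and `2 ** 3.808 ≈ 14.0`, matching the table above.)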
patrickvonplaten (Contributor) commented:

Hey @leo-liuzy,

Sorry, what exactly is the issue here with Reformer? Is the training not working?


leo-liuzy commented Apr 28, 2022

Hi @patrickvonplaten, I am evaluating the released model trained on Crime and Punishment (with examples randomly grabbed from Crime and Punishment). I found that if I exclude the LSHSelfAttention outputs when computing perplexity, the perplexity doesn't change much. But if I exclude LocalSelfAttention instead, the PPL goes up by a lot. So I wonder whether this is caused by a bug in the codebase (possibly even during training), or whether it's intrinsic to this specific Reformer model structure -- (attn_layers = ["local", "lsh", "local", "lsh", "local", "lsh"]).
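For reference, the attention layout and hashing setup can be read directly off the released checkpoint's config; a quick check (assuming the checkpoint is `google/reformer-crime-and-punishment`):

```python
# Inspect the attention layout of the released checkpoint
# (assumes google/reformer-crime-and-punishment).
from transformers import ReformerConfig

config = ReformerConfig.from_pretrained("google/reformer-crime-and-punishment")
print(config.attn_layers)  # list of "local" / "lsh" entries, one per layer
print(config.num_hashes)   # number of LSH hashing rounds used by the LSH layers
```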

patrickvonplaten (Contributor) commented:

I'm not really sure @leo-liuzy sadly - I've never removed the local layers when training the model. Maybe you can also try asking on https://discuss.huggingface.co/ :-)

github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Jun 5, 2022