XLNet-large-cased on Squad 2.0: can't replicate results #822

Closed
avisil opened this issue Jul 18, 2019 · 11 comments

@avisil

avisil commented Jul 18, 2019

I've been trying to replicate the SQuAD 2.0 dev-set numbers (F1 = 86) with this script and the XLNet embeddings. So far the results are really far off. (Opening a new issue, as the previous one seems dedicated to SST-2.)

python run_squad.py --do_lower_case --do_train --do_eval \
  --train_file $SQUAD_DIR/train-v2.0.json --predict_file $SQUAD_DIR/dev-v2.0.json \
  --output_dir $SQUAD_DIR/output --version_2_with_negative \
  --model_name xlnet-large-cased --save_steps 5000 --num_train_epochs 3 \
  --overwrite_output_dir --model_type xlnet --per_gpu_train_batch_size 4 \
  --gradient_accumulation_steps 1 --learning_rate 3e-5

gives:

07/18/2019 08:43:36 - INFO - __main__ - Results: {'exact': 3.217383980459867, 'f1': 7.001376535240158, 'total': 11873, 'HasAns_exact': 6.359649122807017, 'HasAns_f1': 13.938485762973412, 'HasAns_total': 5928, 'NoAns_exact': 0.08410428931875526, 'NoAns_f1': 0.08410428931875526, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

@avisil
Author

avisil commented Jul 18, 2019

This is similar to the configuration the authors ran in the paper (except this is all I could fit on 3 V100 GPUs):

python run_squad.py --do_lower_case --do_train --do_eval \
  --train_file $SQUAD_DIR/train-v2.0.json --predict_file $SQUAD_DIR/dev-v2.0.json \
  --output_dir $SQUAD_DIR/output --version_2_with_negative \
  --model_name xlnet-large-cased --save_steps 5000 --num_train_epochs 3 \
  --overwrite_output_dir --model_type xlnet --per_gpu_train_batch_size 2 \
  --gradient_accumulation_steps 1 --max_seq_length 512 --max_answer_length 64 \
  --adam_epsilon 1e-6 --learning_rate 3e-5 --num_train_epochs 2

gives:

07/18/2019 06:20:54 - INFO - __main__ - Results: {'exact': 2.0382380190347846, 'f1': 6.232918462554391, 'total': 11873, 'HasAns_exact': 3.9979757085020244, 'HasAns_f1': 12.399365874815837, 'HasAns_total': 5928, 'NoAns_exact': 0.08410428931875526, 'NoAns_f1': 0.08410428931875526, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

@avisil
Author

avisil commented Jul 22, 2019

@thomwolf are you already working on this? I can work with you to try to solve it :)

@Zhiyu-Chen

Same question here... I also got weird results on other QA datasets like BoolQ and MultiRC.

@thomwolf
Member

thomwolf commented Jul 23, 2019

@avisil not yet, I won't have time to work on this before ACL, but you can start to have a look if you want. Such discrepancies pretty much always come not from the model itself but from different settings for pre/post-processing the dataset or for the optimizer/optimization process.

If you want to start giving it a look, the way I usually check exact reproducibility on downstream tasks like GLUE/SQuAD is to directly import the pytorch-transformers model in the TensorFlow code (that's the main reason the library is Python 2 compatible), load the PyTorch model with the initialized TF model, and run the models side by side on the same inputs (on separate GPUs) to check the inputs/outputs/hidden-states and so on in detail. It's better to do it on a GPU version of the TF code so you can set up the optimizer yourself. I think somebody did a GPU version of the official SQuAD example, but you can also take inspiration from the multi-GPU adaptation I did of the TensorFlow code for GLUE, which is here: https://github.com/thomwolf/xlnet/blob/master/run_classifier_gpu.py.
In this fork, you can see how I import and run the PyTorch model side by side with the TensorFlow one.
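
For anyone who wants to try this, here is a minimal sketch of the PyTorch half of such a comparison, assuming the pytorch-transformers package is installed and that the corresponding TensorFlow activations have been dumped to a NumPy file from the XLNet repo (the file name and the input sentence are placeholders, not part of the linked fork):

import numpy as np
import torch
from pytorch_transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetModel.from_pretrained("xlnet-large-cased")
model.eval()

# Any fixed sentence works, as long as the TF side is fed the exact same token ids.
text = "SQuAD is a reading comprehension dataset."
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))])

with torch.no_grad():
    pt_hidden = model(input_ids)[0].numpy()  # last-layer hidden states

# "tf_hidden.npy" is a placeholder: dump the matching activations from the
# TensorFlow XLNet code (e.g. with np.save) on the same token ids, then compare.
tf_hidden = np.load("tf_hidden.npy")
print("max abs diff:", np.abs(pt_hidden - tf_hidden).max())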

In the case of SQuAD, I already know that there are a few differences which should be fixed:

  • the pre-processing of the dataset is not exactly the same (parsing and tokenization logic is a lot more complex in the XLNet repo),
  • XLNet was fine-tuned using discriminative learning rates (the learning rate decreases progressively across the layers of the model); a rough sketch follows this list.
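
On the second point, here is a self-contained sketch of what layer-wise (discriminative) learning-rate decay looks like with the pytorch-transformers XLNet model. The decay factor, the parameter-name matching, and the treatment of embeddings and heads are illustrative assumptions, not the exact recipe from the official XLNet SQuAD script:

import torch
from pytorch_transformers import AdamW, XLNetModel

def layerwise_lr_groups(model, base_lr=3e-5, decay=0.75, n_layers=24):
    """One parameter group per tensor; earlier layers get smaller learning rates."""
    groups = []
    for name, param in model.named_parameters():
        if "word_embedding" in name or "mask_emb" in name:
            depth = 0                    # embeddings: smallest learning rate
        else:
            depth = n_layers             # task head / everything else: base learning rate
        for i in range(n_layers):
            if ("layer.%d." % i) in name:
                depth = i + 1            # transformer blocks: scale with their depth
                break
        groups.append({"params": [param], "lr": base_lr * (decay ** (n_layers - depth))})
    return groups

model = XLNetModel.from_pretrained("xlnet-large-cased")
optimizer = AdamW(layerwise_lr_groups(model), lr=3e-5, eps=1e-6)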

@ntubertchen

I found a similar problem on the GLUE benchmark.

With the command:
python run_glue.py --data_dir=./glue_data/SST-2 --model_type=xlnet --task_name=sst-2 --output_dir=./xlnet_glue --model_name_or_path=xlnet-base-cased --do_train --evaluate_during_training

The final result of SST-2 is only 0.836, which is way lower than the current SoTA.

Does anyone have a clue how to solve it?

@thomwolf
Member

thomwolf commented Aug 5, 2019

@ntubertchen good parameters for SST-2 are in the (adequately titled) issue #795

@ghost

ghost commented Aug 21, 2019

I encountered a similar problem with the bert-large models. No luck yet.

@stale

stale bot commented Oct 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 20, 2019
@stale stale bot closed this as completed Oct 27, 2019
@panl2015

Looks like XLNet for SQuAD 2.0 is broken:

python run_squad.py --version_2_with_negative --cache_dir ${CACHE_DIR} \
--model_type xlnet --model_name_or_path xlnet-large-cased \
--do_train --train_file $SQUAD_DIR/train-v2.0.json \
--do_eval  --predict_file $SQUAD_DIR/dev-v2.0.json \
--gradient_accumulation_steps 4 --overwrite_output_dir \
--learning_rate "3e-5" --num_train_epochs 2 --max_seq_length 512 --doc_stride 128 \
--output_dir $SQUAD_DIR/output/ \
--fp16 --fp16_opt_level "O2" --per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 8 --weight_decay=0.00 --save_steps 20000 --adam_epsilon 1e-6

gives:

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16343 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "examples/run_squad.py", line 830, in <module>
    main()
  File "examples/run_squad.py", line 769, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "examples/run_squad.py", line 221, in train
    inputs.update({"is_impossible": batch[7]})
IndexError: tuple index out of range

I added is_impossible to the features and the dataloader (roughly as sketched below), but the result was very low:

{'exact': 44.5717173418681, 'f1': 44.82239308319654, 'total': 11873, 'HasAns_exact': 0.0, 'HasAns_f1': 0.5020703570837503, 'HasAns_total': 5928, 'NoAns_exact': 89.01597981497056, 'NoAns_f1': 89.01597981497056, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}
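
Roughly, the local workaround looked like this: the XLNet/XLM branch of train() reads batch[7], so the cached TensorDataset needs an extra is_impossible column. The feature field names below follow the run_squad.py of that era and are assumptions; this is a sketch of the workaround, not the committed fix (073219b):

import torch
from torch.utils.data import TensorDataset

def build_train_dataset(features):
    # features: the cached SQuAD feature objects, each carrying input_ids,
    # input_mask, segment_ids, start_position, end_position, cls_index,
    # p_mask and is_impossible
    return TensorDataset(
        torch.tensor([f.input_ids for f in features], dtype=torch.long),
        torch.tensor([f.input_mask for f in features], dtype=torch.long),
        torch.tensor([f.segment_ids for f in features], dtype=torch.long),
        torch.tensor([f.start_position for f in features], dtype=torch.long),
        torch.tensor([f.end_position for f in features], dtype=torch.long),
        torch.tensor([f.cls_index for f in features], dtype=torch.long),
        torch.tensor([f.p_mask for f in features], dtype=torch.float),
        torch.tensor([float(f.is_impossible) for f in features], dtype=torch.float),  # becomes batch[7]
    )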

@LysandreJik
Member

Thanks for reporting the bug, @panl2015; it should have been fixed with 073219b.

@panl2015

panl2015 commented Jan 21, 2020

Thanks @LysandreJik! I think that's how I fixed it locally to make it run, but I got the low result above. Maybe I should try with your version to make sure I don't have other changes.
