XLNet-large-cased on Squad 2.0: can't replicate results #822

Closed
avisil opened this issue Jul 18, 2019 · 11 comments

@avisil

avisil commented Jul 18, 2019

I've been trying to replicate the SQuAD 2.0 dev-set numbers (F1 = 86) with this script and the XLNet embeddings. So far the results are really far off. (Opening a new issue, as the previous one seems dedicated to SST-2.)

python run_squad.py --do_lower_case --do_train --do_eval \
  --train_file $SQUAD_DIR/train-v2.0.json --predict_file $SQUAD_DIR/dev-v2.0.json \
  --output_dir $SQUAD_DIR/output --version_2_with_negative \
  --model_name xlnet-large-cased --save_steps 5000 --num_train_epochs 3 \
  --overwrite_output_dir --model_type xlnet --per_gpu_train_batch_size 4 \
  --gradient_accumulation_steps 1 --learning_rate 3e-5

gives:

07/18/2019 08:43:36 - INFO - __main__ - Results: {'exact': 3.217383980459867, 'f1': 7.001376535240158, 'total': 11873, 'HasAns_exact': 6.359649122807017, 'HasAns_f1': 13.938485762973412, 'HasAns_total': 5928, 'NoAns_exact': 0.08410428931875526, 'NoAns_f1': 0.08410428931875526, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

@avisil
Author

avisil commented Jul 18, 2019

This is similar to the configuration the authors ran in the paper (except this is all I could fit on 3 V100 GPUs):

python run_squad.py --do_lower_case --do_train --do_eval \
  --train_file $SQUAD_DIR/train-v2.0.json --predict_file $SQUAD_DIR/dev-v2.0.json \
  --output_dir $SQUAD_DIR/output --version_2_with_negative \
  --model_name xlnet-large-cased --save_steps 5000 --num_train_epochs 3 \
  --overwrite_output_dir --model_type xlnet --per_gpu_train_batch_size 2 \
  --gradient_accumulation_steps 1 --max_seq_length 512 --max_answer_length 64 \
  --adam_epsilon 1e-6 --learning_rate 3e-5 --num_train_epochs 2

gives:

07/18/2019 06:20:54 - INFO - __main__ - Results: {'exact': 2.0382380190347846, 'f1': 6.232918462554391, 'total': 11873, 'HasAns_exact': 3.9979757085020244, 'HasAns_f1': 12.399365874815837, 'HasAns_total': 5928, 'NoAns_exact': 0.08410428931875526, 'NoAns_f1': 0.08410428931875526, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

@avisil
Author

avisil commented Jul 22, 2019

@thomwolf are you already working on this? I can work with you to try to solve it :)

@Zhiyu-Chen

Same question here... I also got weird results on other QA datasets like BoolQ and MultiRC.

@thomwolf
Member

thomwolf commented Jul 23, 2019

@avisil not yet, I won't have time to work on this before ACL, but you can start to have a look if you want. Such discrepancies pretty much always come not from the model itself but from different settings for pre/post-processing the dataset or for the optimizer/optimization process.

If you want to start giving it a look, the way I usually check exact reproducibility on downstream tasks like GLUE/SQuAD is to directly import the pytorch-transformers model in the TensorFlow code (that's the main reason the library is Python 2 compatible), load the PyTorch model with the initialized TF model, and run the models side by side on the same inputs (on separate GPUs) to check the inputs/outputs/hidden-states and so on in detail. It's better to do it on a GPU version of the TF code so you can set up the optimizer yourself. I think somebody did a GPU version of the official SQuAD example, but you can also take inspiration from the multi-GPU adaptation I did of the TensorFlow code for GLUE, which is here: https://github.com/thomwolf/xlnet/blob/master/run_classifier_gpu.py.
In this fork, you can see how I import and run the PyTorch model side by side with the TensorFlow one.
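
For anyone who wants to try this, here is a minimal sketch of the PyTorch half of such a comparison, assuming the pytorch-transformers package is installed and that the corresponding TensorFlow activations have been dumped to a NumPy file from the XLNet repo (the file name and the input sentence are placeholders, not part of the linked fork):

import numpy as np
import torch
from pytorch_transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetModel.from_pretrained("xlnet-large-cased")
model.eval()

# Any fixed sentence works, as long as the TF side is fed the exact same token ids.
text = "SQuAD is a reading comprehension dataset."
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))])

with torch.no_grad():
    pt_hidden = model(input_ids)[0].numpy()  # last-layer hidden states

# "tf_hidden.npy" is a placeholder: dump the matching activations from the
# TensorFlow XLNet code (e.g. with np.save) on the same token ids, then compare.
tf_hidden = np.load("tf_hidden.npy")
print("max abs diff:", np.abs(pt_hidden - tf_hidden).max())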

In the case of SQuAD, I already know that there are a few differences which should be fixed:

  • the pre-processing of the dataset is not exactly the same (parsing and tokenization logic is a lot more complex in the XLNet repo),
  • XLNet was fine-tuned using discriminative learning rates (the learning rate decreases progressively across the layers of the model); a rough sketch follows this list.
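
On the second point, here is a self-contained sketch of what layer-wise (discriminative) learning-rate decay looks like with the pytorch-transformers XLNet model. The decay factor, the parameter-name matching, and the treatment of embeddings and heads are illustrative assumptions, not the exact recipe from the official XLNet SQuAD script:

import torch
from pytorch_transformers import AdamW, XLNetModel

def layerwise_lr_groups(model, base_lr=3e-5, decay=0.75, n_layers=24):
    """One parameter group per tensor; earlier layers get smaller learning rates."""
    groups = []
    for name, param in model.named_parameters():
        if "word_embedding" in name or "mask_emb" in name:
            depth = 0                    # embeddings: smallest learning rate
        else:
            depth = n_layers             # task head / everything else: base learning rate
        for i in range(n_layers):
            if ("layer.%d." % i) in name:
                depth = i + 1            # transformer blocks: scale with their depth
                break
        groups.append({"params": [param], "lr": base_lr * (decay ** (n_layers - depth))})
    return groups

model = XLNetModel.from_pretrained("xlnet-large-cased")
optimizer = AdamW(layerwise_lr_groups(model), lr=3e-5, eps=1e-6)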

@ntubertchen

I found a similar problem on the GLUE benchmark.

With the command:
python run_glue.py --data_dir=./glue_data/SST-2 --model_type=xlnet --task_name=sst-2 --output_dir=./xlnet_glue --model_name_or_path=xlnet-base-cased --do_train --evaluate_during_training

The final result of SST-2 is only 0.836, which is way lower than the current SoTA.

Does anyone have a clue how to solve it?

@thomwolf
Member

thomwolf commented Aug 5, 2019

@ntubertchen good parameters for SST-2 are in the (adequately titled) issue #795

@ghost

ghost commented Aug 21, 2019

I encountered a similar problem with the bert-large models. No luck yet.

@stale

stale bot commented Oct 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 20, 2019
@stale stale bot closed this as completed Oct 27, 2019
@panl2015

Looks like XLNet for SQuAD 2.0 is broken:

python run_squad.py --version_2_with_negative --cache_dir ${CACHE_DIR} \
--model_type xlnet --model_name_or_path xlnet-large-cased \
--do_train --train_file $SQUAD_DIR/train-v2.0.json \
--do_eval  --predict_file $SQUAD_DIR/dev-v2.0.json \
--gradient_accumulation_steps 4 --overwrite_output_dir \
--learning_rate "3e-5" --num_train_epochs 2 --max_seq_length 512 --doc_stride 128 \
--output_dir $SQUAD_DIR/output/ \
--fp16 --fp16_opt_level "O2" --per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 8 --weight_decay=0.00 --save_steps 20000 --adam_epsilon 1e-6

gives:

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16343 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "examples/run_squad.py", line 830, in <module>
    main()
  File "examples/run_squad.py", line 769, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "examples/run_squad.py", line 221, in train
    inputs.update({"is_impossible": batch[7]})
IndexError: tuple index out of range

I added is_impossible to the features and the dataloader (roughly as sketched below), but the result was very low:

{'exact': 44.5717173418681, 'f1': 44.82239308319654, 'total': 11873, 'HasAns_exact': 0.0, 'HasAns_f1': 0.5020703570837503, 'HasAns_total': 5928, 'NoAns_exact': 89.01597981497056, 'NoAns_f1': 89.01597981497056, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}
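
Roughly, the local workaround looked like this: the XLNet/XLM branch of train() reads batch[7], so the cached TensorDataset needs an extra is_impossible column. The feature field names below follow the run_squad.py of that era and are assumptions; this is a sketch of the workaround, not the committed fix (073219b):

import torch
from torch.utils.data import TensorDataset

def build_train_dataset(features):
    # features: the cached SQuAD feature objects, each carrying input_ids,
    # input_mask, segment_ids, start_position, end_position, cls_index,
    # p_mask and is_impossible
    return TensorDataset(
        torch.tensor([f.input_ids for f in features], dtype=torch.long),
        torch.tensor([f.input_mask for f in features], dtype=torch.long),
        torch.tensor([f.segment_ids for f in features], dtype=torch.long),
        torch.tensor([f.start_position for f in features], dtype=torch.long),
        torch.tensor([f.end_position for f in features], dtype=torch.long),
        torch.tensor([f.cls_index for f in features], dtype=torch.long),
        torch.tensor([f.p_mask for f in features], dtype=torch.float),
        torch.tensor([float(f.is_impossible) for f in features], dtype=torch.float),  # becomes batch[7]
    )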

@LysandreJik
Member

Thanks for reporting the bug, @panl2015; it should have been fixed with 073219b.

@panl2015

panl2015 commented Jan 21, 2020

Thanks @LysandreJik! I think that's how I fixed it locally to make it run, but I got the low result above. Maybe I should try with your version to make sure I don't have other changes.
