XLNet-large-cased on Squad 2.0: can't replicate results #822
I've been trying to replicate the numbers on the SQuAD 2.0 dev set (F1 = 86) with this script and the XLNet embeddings. So far the results are really off. (Opening a new issue as the previous one seems dedicated to SST-2.)

python run_squad.py --do_lower_case --do_train --do_eval --train_file $SQUAD_DIR/train-v2.0.json --predict_file $SQUAD_DIR/dev-v2.0.json --output_dir $SQUAD_DIR/output --version_2_with_negative --model_name xlnet-large-cased --save_steps 5000 --num_train_epochs 3 --overwrite_output_dir --model_type xlnet --per_gpu_train_batch_size 4 --gradient_accumulation_steps 1 --learning_rate 3e-5

This is similar to what the authors ran in the paper (except I could fit only this on 3 V100 GPUs), and it gives:

07/18/2019 08:43:36 - INFO - __main__ - Results: {'exact': 3.217383980459867, 'f1': 7.001376535240158, 'total': 11873, 'HasAns_exact': 6.359649122807017, 'HasAns_f1': 13.938485762973412, 'HasAns_total': 5928, 'NoAns_exact': 0.08410428931875526, 'NoAns_f1': 0.08410428931875526, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}
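For what it's worth, the best_exact of ~50.07 at a threshold of 0.0 looks like nothing more than the constant "no answer" baseline: 5945 of the 11873 dev questions are unanswerable, so always predicting no answer already scores about 50%, which suggests the answerability scores in this run carry essentially no signal. A quick check of the arithmetic using the counts reported above:

# Score of a constant "no answer" predictor on this dev set, from the counts above
no_ans_total = 5945
total = 11873
print(100.0 * no_ans_total / total)  # ~50.0716, matching best_exact / best_f1 above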
@thomwolf are you already working on this? I can work with you to try to solve it :)
Same question here... Also got weird results on other QA datasets like BoolQ and MultiRC.
@avisil not yet, I won't have time to work on this before ACL, but you can start to have a look if you want. Such discrepancies pretty much always come not from the model itself but from different settings for pre-/post-processing the dataset or for the optimizer/optimization process.

If you want to start giving it a look, the way I usually check exact reproducibility on downstream tasks like GLUE/SQuAD is to directly import the pytorch-transformers model into the TensorFlow code (that's the main reason the library is Python 2 compatible), load the PyTorch model next to the initialized TF model, and run the two models side by side on the same inputs (on separate GPUs) to check the inputs/outputs/hidden states and so on in detail. It's better to do it on a GPU version of the TF code so you can set up the optimizer yourself. I think somebody did a GPU version of the official SQuAD example, but you can also take inspiration from the multi-GPU adaptation I did of the TensorFlow code for GLUE, which is here: https://github.com/thomwolf/xlnet/blob/master/run_classifier_gpu.py.

In the case of SQuAD, I already know that there are a few differences which should be fixed:
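To make that side-by-side check a bit more concrete, here is a rough sketch of the PyTorch half, assuming you have separately dumped the matching activations from the TensorFlow XLNet code for the same token ids (the tf_activations.npz file and its last_hidden key are hypothetical placeholders, not something either codebase writes out for you):

# Rough sketch: compare pytorch-transformers XLNet activations against a dump
# produced by the TensorFlow code run on the exact same token ids.
# Assumes you saved the TF outputs yourself, e.g. np.savez("tf_activations.npz", last_hidden=...).
import numpy as np
import torch
from pytorch_transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetModel.from_pretrained("xlnet-large-cased")
model.eval()

text = "In 1991, the World Wide Web was released to the public."
input_ids = torch.tensor([tokenizer.encode(text)])

with torch.no_grad():
    outputs = model(input_ids)
pt_last_hidden = outputs[0].numpy()  # (batch, seq_len, hidden_size)

# Hypothetical dump written from the TF side on the same input ids
tf_dump = np.load("tf_activations.npz")
tf_last_hidden = tf_dump["last_hidden"]

# If tokenization, masking, etc. match, the difference should be on the order of 1e-5
print("max abs diff:", np.max(np.abs(pt_last_hidden - tf_last_hidden)))

Once the final hidden states line up, the same comparison can be repeated layer by layer and then on the task-specific heads to find where the two pipelines diverge.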
I found a similar problem on GLUE. With the command: the final result on SST-2 is only 0.836, which is way lower than the current SoTA. Does anyone have a clue how to solve it?
@ntubertchen good parameters for SST-2 are in the (adequately titled) issue #795
I encountered a similar problem with the bert-large models. No luck yet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Looks like XLNet for SQuAD 2.0 is broken:
gives:
I added
Thanks @LysandreJik! I think that's how I fixed it locally to make it run, but I still got the low results. Maybe I should try with your version to make sure I don't have other changes.