XLNet-large-cased: hyper-parameters for fine-tuning on SST-2 #795
Comments
I also tried to fine-tune XLNet-base on SQuAD 2.0, but the numbers on dev are pretty bad.
I suspect something is wrong with the evaluation code. Looking into it now.
@tbright17 Nothing wrong with the evaluation. Accuracy and evaluation loss don't change during training. I used my own evaluation script, and I tried the old BertAdam and OpenAIAdam optimizers without success.
I'll take a look; I've only tested XLNet on STS-B for the moment. You should check the hyper-parameters as well, they probably won't be the same as the ones for STS-B (some are mentioned in the XLNet paper).
First thing that comes to mind is that SST-2 is ~10 times bigger than STS-B (see the GLUE paper), so you need to increase the number of training steps a lot if you want to do at least one full epoch on the SST-2 training set (here you use the value for STS-B). You should probably also do several epochs, e.g. we do 6-7 epochs on STS-B. Check the recommended hyper-parameters in Table 8 of the XLNet paper. You can also directly specify the number of epochs instead of the maximum number of steps in the script, and you can list all of the script's hyper-parameters with --help.
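A minimal sketch of that suggestion (not from the thread itself): it reuses the flags from the command at the bottom of this issue and assumes the script also accepts a --num_train_epochs flag in place of --max_steps.
# One epoch over SST-2 (~67k training examples) at batch size 8 is roughly 8.4k
# optimizer steps (the 999/8419 progress bar in the log below shows this), so
# --max_steps=1200 stops well before a single epoch. Requesting whole epochs instead:
python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --task_name=sst-2 \
    --data_dir=${GLUE_DIR}/SST-2 \
    --output_dir=./proc_data/sst-2 \
    --max_seq_length=128 \
    --per_gpu_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --num_train_epochs=3 \
    --warmup_steps=120 \
    --overwrite_output_dir \
    --overwrite_cache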
I trained on the STS-B task and hit the same problem. Here is what I see when evaluating every 100 steps (I added train and evaluation loss to the output):
As you can see, the training loss is increasing, the eval loss is almost unchanged, and the other metrics fluctuate around 0.
@thomwolf So it looks like training is happening, but in the opposite direction for some reason.
Maybe you haven't fully read the explanation accompanying the STS-B example in the readme? It says "On this machine we thus have a batch size of 32, please increase ..."
@avostryakov Did you try reducing the learning rate? I had a similar issue training the TensorFlow version of XLNet on a single GPU. Reducing the learning rate from 5e-5 to 1e-5 made it work. Hope this helps.
@thomwolf @tbright17 I got numbers similar to yours on SQuAD 2.0. It seems the model isn't learning much. I'll print out the losses to explore. Should we change the LR as well?
It may also be a problem of batch size; the authors use a batch size between 32 and 128 in the paper. What effective batch size do you have (printed during training)? While we reproduce the official XLNet numbers on STS-B, I still have to work a bit on the SQuAD example for XLNet; the XLNet authors used complex pre- and post-processing of the data (smarter than BERT's) that I haven't fully integrated into our SQuAD example yet.
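As a rough sanity check, the effective batch size is the per-GPU batch size times the number of GPUs times the gradient accumulation steps. A small shell sketch with illustrative variable names (not script flags), using the values from the command at the bottom of this issue:
# Effective batch size = n_gpu * per-GPU batch size * gradient accumulation steps.
N_GPU=1
PER_GPU_TRAIN_BATCH_SIZE=8
GRADIENT_ACCUMULATION_STEPS=1
echo $(( N_GPU * PER_GPU_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS ))
# Prints 8, well below the 32-128 range used in the XLNet paper. Raising the
# accumulation steps to 4 would reach 32 without needing more GPU memory.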
@thomwolf You are right, STS-B started training with a batch size of 32 and gradient_accumulation_steps = 2. Now I'm wondering why it depends so heavily on batch size. But it doesn't help for SST-2: I set max_steps=5000 (that's 5 epochs) and the training and evaluation loss didn't change at all during training. I'm now trying to train with learning rate 1e-5, as recommended by @alexpython1988.
@thomwolf maybe. Also my sequence length is ...
I saw in renatoviolin's repo that they use the following settings, which give them their reported result. Also, their lr is different from ours (...).
A learning rate of 1e-5 helps SST-2 train, together with batch size 32 and accumulation steps = 2. I need more experiments, but it works. Thanks, @thomwolf and @alexpython1988!
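For anyone who wants to try these settings, here is a sketch that only combines the values discussed in this thread (a single GPU is assumed, the exact configuration behind the accuracy numbers reported below is not shown in the thread, and --learning_rate is assumed to be the script's flag for the learning rate):
python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --task_name=sst-2 \
    --data_dir=${GLUE_DIR}/SST-2 \
    --output_dir=./proc_data/sst-2 \
    --max_seq_length=128 \
    --per_gpu_train_batch_size=16 \
    --gradient_accumulation_steps=2 \
    --learning_rate=1e-5 \
    --max_steps=5000 \
    --warmup_steps=120 \
    --overwrite_output_dir \
    --overwrite_cache
# 16 per GPU with 2 accumulation steps gives the effective batch size of 32
# mentioned above; on a smaller GPU, 8 with 4 accumulation steps is equivalent.
# The step and warmup counts mirror values tried earlier in the thread and may
# need further tuning.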
Great to hear, good job and good luck @avostryakov! Feel free to share good hyper-parameters if you find a nice set and I can add them to the documentation (with credits). |
I was using per_gpu_train_batch_size=8 for SQuAD 2.0. Maybe a powerful model is just hard to tune.
@thomwolf My best result for SST-2 so far is 94.15 accuracy (the XLNet paper reports 95.6). It's better than BERT-large. I trained with the following parameters:
@thomwolf OK, the latest result for SST-2 almost matches the XLNet paper: accuracy 95.4, with the following parameters:
Thank you for your work!
This is great @avostryakov! Thanks for sharing the results!
Hi, how could I fine-tune the model for text generation? Is it possible with just raw text for the fine-tuning?
I tried to fine-tune XLNet on one of the classification tasks from GLUE (Ubuntu, GPU Titan RTX, CUDA 10.0, PyTorch 1.1):
export GLUE_DIR=/path/to/glue
python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --task_name=sst-2 \
    --data_dir=${GLUE_DIR}/SST-2 \
    --output_dir=./proc_data/sst-2 \
    --max_seq_length=128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --max_steps=1200 \
    --model_name=xlnet-large-cased \
    --overwrite_output_dir \
    --overwrite_cache \
    --warmup_steps=120
Training and evaluation run without errors, but it looks like accuracy doesn't increase during training. I evaluated every 500 steps:
07/16/2019 22:29:30 - INFO - main - ***** Eval results *****
07/16/2019 22:29:30 - INFO - main - acc = 0.5091743119266054
07/16/2019 22:32:16 - INFO - main - Loading features from cached file glue_data/SST-2/cached_dev_xlnet-large-cased_128_sst-2 | 999/8419 [05:37<41:47, 2.96it/s]
07/16/2019 22:32:17 - INFO - main - ***** Running evaluation *****
07/16/2019 22:32:17 - INFO - main - Num examples = 872
07/16/2019 22:32:17 - INFO - main - Batch size = 8
07/16/2019 22:32:25 - INFO - main - ***** Eval results *****
07/16/2019 22:32:25 - INFO - main - acc = 0.5091743119266054
Finally, the evaluation ends with the same accuracy:
07/16/2019 22:33:59 - INFO - main - ***** Eval results *****
07/16/2019 22:33:59 - INFO - main - acc = 0.5091743119266054
The same thing happens with my own classification dataset: accuracy doesn't change during training. Something seems to be wrong with the fine-tuning of XLNet.