XLNet-large-cased: hyper-parameters for fine-tuning on SST-2 #795

Closed · avostryakov opened this issue Jul 16, 2019 · 21 comments

@avostryakov

I tried to fine-tune XLNet on one of the classification tasks from GLUE (Ubuntu, Titan RTX GPU, CUDA 10.0, PyTorch 1.1):

export GLUE_DIR=/path/to/glue

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --task_name=sst-2 \
    --data_dir=${GLUE_DIR}/SST-2 \
    --output_dir=./proc_data/sst-2 \
    --max_seq_length=128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --max_steps=1200 \
    --model_name=xlnet-large-cased \
    --overwrite_output_dir \
    --overwrite_cache \
    --warmup_steps=120

Training and evaluation run without errors, but accuracy doesn't seem to increase during training. I evaluated every 500 steps:

07/16/2019 22:29:30 - INFO - main - ***** Eval results *****
07/16/2019 22:29:30 - INFO - main - acc = 0.5091743119266054

07/16/2019 22:32:16 - INFO - main - Loading features from cached file glue_data/SST-2/cached_dev_xlnet-large-cased_128_sst-2
07/16/2019 22:32:17 - INFO - main - ***** Running evaluation *****
07/16/2019 22:32:17 - INFO - main - Num examples = 872
07/16/2019 22:32:17 - INFO - main - Batch size = 8

07/16/2019 22:32:25 - INFO - main - ***** Eval results *****
07/16/2019 22:32:25 - INFO - main - acc = 0.5091743119266054

Finally, the same accuracy:

07/16/2019 22:33:59 - INFO - main - ***** Eval results *****
07/16/2019 22:33:59 - INFO - main - acc = 0.5091743119266054

The same happens with my own classification dataset: accuracy doesn't change during training. Something is wrong with the fine-tuning of XLNet.

@tbright17

I also tried to fine-tune XLNet-base on SQuAD 2.0, but the numbers on dev are pretty bad:
Results: {'exact': 3.0405120862461046, 'f1': 6.947601433150003, 'total': 11873, 'HasAns_exact': 6.056005398110662, 'HasAns_f1': 13.881388632893048, 'HasAns_total': 5928, 'NoAns_exact': 0.0336417157275021, 'NoAns_f1': 0.0336417157275021, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

@tbright17

I suspect something is wrong with the evaluation code. Looking into it now.

@avostryakov
Author

@tbright17 Nothing is wrong with the evaluation. Accuracy and evaluation loss don't change during training. I used my own evaluation script, and I tried the old BertAdam and OpenAIAdam optimizers without success.
@thomwolf Can you help?

@thomwolf
Member

I'll take a look; I've only tested XLNet on STS-B for the moment. You should check the hyper-parameters as well; they probably won't be the same as the ones for STS-B (some are mentioned in the XLNet paper).

@thomwolf
Member

The first thing that comes to mind is that SST-2 is ~10 times bigger than STS-B (see the GLUE paper), so you need to increase the number of training steps a lot if you want to do at least one full epoch on the SST-2 training set (here you use the value for STS-B). And you should probably do several epochs (e.g., we do 6-7 epochs on STS-B). Check the examples of recommended hyper-parameters in Table 8 of the XLNet paper.

You can also directly specify the number of epochs instead of the maximum number of steps in the script. You can see all the hyper-parameters of the script with `python ./run_glue.py --help`.
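For instance, an epoch-driven run might look like this (a sketch with illustrative values; note that `--max_steps`, when set, takes precedence over `--num_train_epochs`):

```bash
python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --task_name=sst-2 \
    --data_dir=${GLUE_DIR}/SST-2 \
    --output_dir=./proc_data/sst-2 \
    --max_seq_length=128 \
    --num_train_epochs=3
```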

@avostryakov
Author

> The first thing that comes to mind is that SST-2 is ~10 times bigger than STS-B (see the GLUE paper), so you need to increase the number of training steps a lot if you want to do at least one full epoch on the SST-2 training set (here you use the value for STS-B). And you should probably do several epochs (e.g., we do 6-7 epochs on STS-B). Check the examples of recommended hyper-parameters in Table 8 of the XLNet paper.
>
> You can also directly specify the number of epochs instead of the maximum number of steps in the script. You can see all the hyper-parameters of the script with `python ./run_glue.py --help`.

I trained the STS-B task and hit the same problem. Here is the output, evaluating every 100 steps (I added training and evaluation loss to the output):

07/17/2019 13:09:55 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:09:55 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:09:55 - INFO - __main__ -     Batch size = 8
07/17/2019 13:10:09 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:10:09 - INFO - __main__ -     corr = -0.05367882385720809
07/17/2019 13:10:09 - INFO - __main__ -     eval_loss = 2.8412214481133096
07/17/2019 13:10:09 - INFO - __main__ -     pearson = -0.041275192
07/17/2019 13:10:09 - INFO - __main__ -     spearmanr = -0.06608245566229025
07/17/2019 13:10:09 - INFO - __main__ -   Training loss: 307.258519500494

07/17/2019 13:10:41 - INFO - __main__ -   Loading features from cached file ...glue_data/STS-B/cached_dev_xlnet-large-cased_128_sts-b
07/17/2019 13:10:41 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:10:41 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:10:41 - INFO - __main__ -     Batch size = 8
07/17/2019 13:10:56 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:10:56 - INFO - __main__ -     corr = 0.13943037650184956
07/17/2019 13:10:56 - INFO - __main__ -     eval_loss = 2.3762524007482733
07/17/2019 13:10:56 - INFO - __main__ -     pearson = 0.13502572
07/17/2019 13:10:56 - INFO - __main__ -     spearmanr = 0.1438350282350605
07/17/2019 13:10:56 - INFO - __main__ -   Training loss: 533.9101385176182

07/17/2019 13:11:28 - INFO - __main__ -   Loading features from cached file .../glue_data/STS-B/cached_dev_xlnet-large-cased_128_sts-b
07/17/2019 13:11:28 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:11:28 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:11:28 - INFO - __main__ -     Batch size = 8
07/17/2019 13:11:42 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:11:42 - INFO - __main__ -     corr = -0.0830871973267994
07/17/2019 13:11:42 - INFO - __main__ -     eval_loss = 2.5565993221516305
07/17/2019 13:11:42 - INFO - __main__ -     pearson = -0.08915693
07/17/2019 13:11:42 - INFO - __main__ -     spearmanr = -0.077017461524765
07/17/2019 13:11:42 - INFO - __main__ -   Training loss: 761.6802722513676

07/17/2019 13:12:15 - INFO - __main__ -   Loading features from cached file .../glue_data/STS-B/cached_dev_xlnet-large-cased_128_sts-b
07/17/2019 13:12:15 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:12:15 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:12:15 - INFO - __main__ -     Batch size = 8
07/17/2019 13:12:29 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:12:29 - INFO - __main__ -     corr = -0.08715267932681456
07/17/2019 13:12:29 - INFO - __main__ -     eval_loss = 2.398741365113157
07/17/2019 13:12:29 - INFO - __main__ -     pearson = -0.08428703
07/17/2019 13:12:29 - INFO - __main__ -     spearmanr = -0.09001832616862088
07/17/2019 13:12:29 - INFO - __main__ -   Training loss: 974.8287971913815

As you can see, the training loss is increasing, the eval loss stays almost the same, and the other metrics fluctuate around 0.

@avostryakov
Author

avostryakov commented Jul 17, 2019

@thomwolf So it looks like training is happening, but in the opposite direction for some reason.

@thomwolf
Member

thomwolf commented Jul 17, 2019

Maybe you haven't fully read the explanation accompanying the STS-B example in the readme?

It says "On this machine we thus have a batch size of 32, please increase gradient_accumulation_steps to reach the same batch size if you have a smaller machine."

@bugface
Contributor

bugface commented Jul 17, 2019

@avostryakov Did you try reducing the learning rate? I had a similar issue training the TensorFlow version of XLNet on a single GPU. Reducing the learning rate from 5e-5 to 1e-5 worked. Hope this helps.

@avisil

avisil commented Jul 17, 2019

@thomwolf @tbright17 I got numbers similar to yours on SQuAD 2.0. It seems the model isn't learning much; I'll print out the losses to explore. Should we change the LR as well?
The best I got fine-tuning on SQuAD 2.0 with train_batch_size=8 and gradient_accumulation_steps=1 (all other settings default) on a single V100 GPU was:
07/16/2019 16:21:43 - INFO - __main__ - Results: {'exact': 26.438136949380947, 'f1': 28.470459931964722, 'total': 11873, 'HasAns_exact': 0.08434547908232119, 'HasAns_f1': 4.154819630940996, 'HasAns_total': 5928, 'NoAns_exact': 52.716568544995795, 'NoAns_f1': 52.716568544995795, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

@thomwolf
Member

thomwolf commented Jul 17, 2019

It may also be a problem of batch size; the authors use a batch size between 32 and 128 in the paper.

What effective batch size do you have (printed during training)?

While we reproduce the official XLNet numbers on STS-B, I still have to work a bit on the SQuAD example for XLNet; the XLNet authors used complex pre- and post-processing of the data (smarter than BERT's) that I haven't fully integrated into our run_squad example yet.
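For reference, the effective batch size printed by the script on a single machine should work out as follows (a sketch inferred from the "Total train batch size" log line, not copied from the code; the values are hypothetical):

```python
per_gpu_train_batch_size = 8
n_gpu = 1
gradient_accumulation_steps = 4

effective_batch = per_gpu_train_batch_size * max(1, n_gpu) * gradient_accumulation_steps
print(effective_batch)  # 32 -- the low end of the paper's 32-128 range
```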

@avostryakov
Author

> Maybe you haven't fully read the explanation accompanying the STS-B example in the readme?
>
> It says "On this machine we thus have a batch size of 32, please increase gradient_accumulation_steps to reach the same batch size if you have a smaller machine."

@thomwolf You are right: STS-B started to train with batch size 32 and gradient_accumulation_steps=2. Now I'm wondering why it depends so heavily on batch size. But it doesn't help for SST-2: I set max_steps=5000 (about 5 epochs) and the training and evaluation loss didn't change at all during training. I'm now trying a learning rate of 1e-5, as recommended by @alexpython1988.
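As a rough sanity check on the step count (a sketch, assuming SST-2's 67,349 training examples and that the batch size of 32 is the per-step size before accumulation):

```python
import math

train_examples = 67349           # SST-2 training set size
effective_batch = 32 * 2         # batch size 32 x gradient_accumulation_steps 2
steps_per_epoch = math.ceil(train_examples / effective_batch)
print(steps_per_epoch)           # 1053
print(5000 / steps_per_epoch)    # ~4.75, so max_steps=5000 is roughly 5 epochs
```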

@avisil

avisil commented Jul 17, 2019

@thomwolf Maybe. Also, my sequence length is 384; the authors mention they probably used 512. Here's my batch-size-related printout. The number of examples seems a little low, no? I think SQuAD has about 150K examples (answerable and unanswerable questions), and with the doc_stride it should produce more than 150K features.

07/15/2019 13:23:32 - INFO - __main__ - ***** Running training *****
07/15/2019 13:23:32 - INFO - __main__ - Num examples = 133947
07/15/2019 13:23:32 - INFO - __main__ - Num Epochs = 3
07/15/2019 13:23:32 - INFO - __main__ - Instantaneous batch size per GPU = 4
07/15/2019 13:23:32 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
07/15/2019 13:23:32 - INFO - __main__ - Gradient Accumulation steps = 1
07/15/2019 13:23:32 - INFO - __main__ - Total optimization steps = 100461

I saw in renatoviolin's repo that they use the following flags, which give them 86 F1 on an RTX 2080:

flags.DEFINE_integer("max_seq_length", default=512, help="Max sequence length")
flags.DEFINE_integer("max_query_length", default=64, help="Max query length")
flags.DEFINE_integer("doc_stride", default=128, help="Doc stride")
flags.DEFINE_integer("max_answer_length", default=64, help="Max answer length")

Also, the learning rate is different from ours (5e-5 in this repo):

flags.DEFINE_float("learning_rate", default=3e-5, help="initial learning rate")
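For intuition on why doc_stride inflates the feature count, here is a hypothetical illustration of the sliding-window split (not the actual run_squad.py logic):

```python
import math

def count_spans(doc_len: int, max_seq_length: int = 384, doc_stride: int = 128) -> int:
    """Number of overlapping windows a doc_len-token context is split into."""
    if doc_len <= max_seq_length:
        return 1
    # each additional window advances by doc_stride tokens
    return 1 + math.ceil((doc_len - max_seq_length) / doc_stride)

print(count_spans(600))  # a 600-token context yields 3 overlapping spans
```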

@avostryakov
Author

A learning rate of 1e-5, together with batch size 32 and gradient_accumulation_steps=2, makes SST-2 train. I need more experiments, but it works. Thanks, @thomwolf and @alexpython1988!

@thomwolf
Member

Great to hear, good job and good luck @avostryakov! Feel free to share good hyper-parameters if you find a nice set and I can add them to the documentation (with credits).

@tbright17

> It may also be a problem of batch size; the authors use a batch size between 32 and 128 in the paper.
>
> What effective batch size do you have (printed during training)?
>
> While we reproduce the official XLNet numbers on STS-B, I still have to work a bit on the SQuAD example for XLNet; the XLNet authors used complex pre- and post-processing of the data (smarter than BERT's) that I haven't fully integrated into our run_squad example yet.

I was using per_gpu_train_batch_size=8 for SQuAD 2.0. A powerful model is hard to tune, maybe.

@avostryakov
Author

avostryakov commented Jul 17, 2019

> Great to hear, good job and good luck @avostryakov! Feel free to share good hyper-parameters if you find a nice set and I can add them to the documentation (with credits).

@thomwolf My best result for SST-2 so far is 94.15 accuracy (the XLNet paper reports 95.6). It's better than BERT-large. I trained with the following parameters:

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train  \
    --evaluate_during_training \
    --do_eval   \
    --logging_steps 500 \
    --save_steps 3000 \
    --task_name=sst-2     \
    --data_dir=${GLUE_DIR}/SST-2  \
    --output_dir=./proc_data/sst-2   \
    --max_seq_length=128   \
    --learning_rate 1e-5 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --gradient_accumulation_steps=1 \
    --max_steps=16000  \
    --model_name=xlnet-large-cased   \
    --overwrite_output_dir   \
    --overwrite_cache \
    --warmup_steps=120 \
    --fp16
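A note on `--fp16`: in this version of the examples it relies on NVIDIA Apex for mixed precision (an assumption based on the era's examples; check the Apex README for current instructions). A typical install sketch:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```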

@avostryakov
Author

avostryakov commented Jul 18, 2019

@thomwolf OK, the latest result for SST-2 almost matches the XLNet paper: accuracy 95.4:

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train  \
    --evaluate_during_training \
    --do_eval   \
    --logging_steps 400 \
    --save_steps 3000 \
    --task_name=sst-2     \
    --data_dir=${GLUE_DIR}/SST-2  \
    --output_dir=./proc_data/sst-2   \
    --max_seq_length=128   \
    --learning_rate 1e-5 \
    --per_gpu_eval_batch_size=16   \
    --per_gpu_train_batch_size=16   \
    --gradient_accumulation_steps=1 \
    --max_steps=8000  \
    --model_name=xlnet-large-cased   \
    --overwrite_output_dir   \
    --overwrite_cache \
    --warmup_steps=120 \
    --fp16

Thank you for your work!
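Interestingly, both successful runs process roughly the same amount of data (a quick check, assuming SST-2's 67,349 training examples):

```python
examples = 67349                 # SST-2 training set size
print(16000 * 8 / examples)      # first run:  ~1.90 epochs of data
print(8000 * 16 / examples)      # second run: ~1.90 epochs of data
```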

@thomwolf
Member

This is great @avostryakov! Thanks for sharing the results!
I'm editing the issue title until I have time to add the hyper-parameters to the doc.

@thomwolf thomwolf changed the title XLNet-large-cased is not finetuned! XLNet-large-cased: hyper-parameters for fine-tuning on SST-2 Jul 18, 2019
@sakalouski

Hi, how could I fine-tune the model for text generation? Is it possible with just raw text for the fine-tuning?

@stale

stale bot commented Sep 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 23, 2019
@stale stale bot closed this as completed Sep 30, 2019
cng420 pushed a commit to cng420/transformers that referenced this issue Nov 3, 2024
* Add support for decision transformer (Closes huggingface#794)

* Comment out supported decision transformer models

Models are in the `onnx-community` org on HF