Question about the GenerationConfig in commonsense_evaluate.py #6
Comments
By the way, when using
Thanks for your kind reminder. The hyperparameters are not intentionally chosen; they follow https://github.com/AGI-Edgerunners/LLM-Adapters. I understand that such a setting could affect the results, but it is kept the same across all baselines for a fair comparison.
Thanks for your reply.
I have not tried this. I think there may be something deeper in the transformer's decoding process. BTW, we must set the batch size to 1 when decoding (refer to huggingface/transformers#25921).
You can try that.
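The batching caveat above can be illustrated with a minimal, self-contained sketch (plain Python, not the repository's code; the pad id of 0 is assumed for illustration): when prompts of unequal length are batched, they must be padded, and generation must be told to ignore the pad positions via an attention mask. Decoding with batch size 1 sidesteps padding entirely.

```python
# Minimal sketch of why batched decoding needs an attention_mask:
# padded positions must be masked out, or the model attends to pad
# tokens and the generated outputs can change.

PAD_TOKEN_ID = 0  # hypothetical pad id, for illustration only

def pad_batch(sequences, pad_token_id=PAD_TOKEN_ID):
    """Right-pad variable-length token-id lists into a rectangular batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # 1 = real token, 0 = padding; this is what `attention_mask` encodes.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

prompts = [[5, 6, 7, 8], [9, 10]]  # two prompts of unequal length
ids, mask = pad_batch(prompts)
print(ids)   # [[5, 6, 7, 8], [9, 10, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]

# With batch size 1 there is nothing to pad, so no mask is strictly needed:
single_ids, single_mask = pad_batch([prompts[1]])
print(single_mask)  # [[1, 1]]
```

If the mask is omitted for the padded batch above, the pad positions are indistinguishable from real tokens, which is exactly the source of the warning (and potential score drift) discussed in the linked transformers issue.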
Yes, the results are not stable. I guess the reasons are two-fold:
Thanks, and I will try more experiments.
One related response:
Thanks for the fast open-source release.

I find that in commonsense_evaluate.py, lines 52~58, the parameter `do_sample` of `GenerationConfig` is not set, and the default value of `do_sample` is `False`. Then, with `do_sample=False` and `num_beams=4`, the model will generate using beam-search decoding.

Besides, I also find that lines 60~66 may not pass the related `attention_mask`, which can cause a warning in the transformers library.

I don't know whether this behavior is intended, and what the right settings (hyper-parameters in `generate`) are to reproduce the results in Table 4 of this paper.