I believe the logic in extract_answer needs some adjustment. Responses that contain multiple distinct answers shouldn't be marked as correct. For example, I've observed cases where the model simply copies the options from the instruction as its response, like: answer1/answer2/answer3/answer4. This inflates the reported accuracy on the QA tasks beyond the model's actual performance.
Here's the revised extract_answer function:
import re


def extract_answer(args, sentence: str) -> str:
    dataset = args.dataset
    sentence_ = sentence.strip()
    # Collect every answer label that appears in the response.
    if dataset == 'boolq':
        pred_answers = re.findall(r'true|false', sentence_)
    elif dataset == 'piqa':
        pred_answers = re.findall(r'solution1|solution2', sentence_)
    elif dataset in ['social_i_qa', 'ARC-Challenge', 'ARC-Easy', 'openbookqa']:
        pred_answers = re.findall(r'answer1|answer2|answer3|answer4|answer5', sentence_)
    elif dataset == 'hellaswag':
        pred_answers = re.findall(r'ending1|ending2|ending3|ending4', sentence_)
    elif dataset == 'winogrande':
        pred_answers = re.findall(r'option1|option2', sentence_)
    else:
        # Without this branch, pred_answers would be undefined below.
        raise ValueError(f"Unsupported dataset: {dataset}")
    if not pred_answers:
        return ""
    # Deduplicate so that a repeated label still counts as one answer.
    unique_answers = set(pred_answers)
    # Return the answer only if the response names exactly one distinct label.
    if len(unique_answers) == 1:
        return unique_answers.pop()
    return ""
This should make the reported accuracy more faithful by counting a response only when it contains exactly one distinct answer.
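A minimal sketch of the difference on a copied-options response. It assumes the current extractor returns the first regex match; `first_match` below is a hypothetical stand-in for that behavior, not the repository's code. `ANSWER_PATTERNS` and `extract_single_answer` are likewise illustrative names, and only a subset of the datasets is listed.

```python
import re

# Hypothetical lookup table mirroring the regexes in the function above.
ANSWER_PATTERNS = {
    "boolq": r"true|false",
    "piqa": r"solution1|solution2",
    "openbookqa": r"answer1|answer2|answer3|answer4|answer5",
    "hellaswag": r"ending1|ending2|ending3|ending4",
    "winogrande": r"option1|option2",
}


def first_match(dataset: str, sentence: str) -> str:
    """Assumed existing behavior: return the first label found, if any."""
    matches = re.findall(ANSWER_PATTERNS[dataset], sentence.strip())
    return matches[0] if matches else ""


def extract_single_answer(dataset: str, sentence: str) -> str:
    """Proposed behavior: return a label only if exactly one distinct label appears."""
    unique = set(re.findall(ANSWER_PATTERNS[dataset], sentence.strip()))
    return unique.pop() if len(unique) == 1 else ""


copied = "answer1/answer2/answer3/answer4"
print(repr(first_match("openbookqa", copied)))            # 'answer1'
print(repr(extract_single_answer("openbookqa", copied)))  # ''
print(repr(extract_single_answer("openbookqa", "the correct answer is answer2")))  # 'answer2'
```

With first-match extraction, the copied-options response is scored as correct whenever the gold label happens to be answer1; with the single-answer rule it is always scored as wrong.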