Added new extract answer feature #148
@@ -22,6 +22,21 @@
    "temperature": 0,
}


LLM_EXTRACT_CONFIG = {
    "prompt": (
        "You are evaluating answers for a test which has fixed options. "
        "Repeat back which option the proposed answer matches. "
        "GIVE ONLY THE VERBATIM TEXT OF A FIXED OPTION. "
        "If the proposed answer is empty, invalid, or ambiguous, "
        "return an empty string."
        "\n\nOptions:\n{options}"
        "\n\nProposed answer: {proposed_answer}"
    ),
    "model": "gpt-4o-mini",
    "temperature": 0,
}


LLM_SCORE_EVAL_CONFIG = LLM_EVAL_CONFIG | {
    "prompt": (
        "Here is a question, the correct answer to the question, and a rubric for"

Comment on lines +30 to +31:

Can you upstream some of the tests from https://github.com/Future-House/paper-qa/blob/v5.8.0/tests/test_litqa.py#L117 to here? I think we should also have "multiple options are matched" mentioned somewhere. It would be nice if ...

Reply: Yup - will do
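As a rough sketch of the kind of test being requested above (not the actual upstream paper-qa tests; the import path and pytest-asyncio usage are assumptions), the cheap cases that never reach the LLM could be parametrized like this:

import pytest

from aviary.utils import extract_answer_llm  # assumed import path

@pytest.mark.asyncio
@pytest.mark.parametrize(
    ("proposed", "options", "expected"),
    [
        ("A", ["A", "B", "C"], "A"),  # verbatim match short-circuits, no LLM call
        (" b ", ["A", "B", "C"], "B"),  # matching is whitespace- and case-insensitive
        ("", ["A", "B", "C"], None),  # empty answers are rejected up front
    ],
)
async def test_extract_answer_verbatim(proposed, options, expected) -> None:
    assert await extract_answer_llm(proposed, options) == expected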
@@ -88,6 +103,44 @@ def is_coroutine_callable(obj) -> bool:
    return False


async def extract_answer_llm(
    proposed: str,
    options: list[str],
) -> str | None:
    """Extract the answer from a proposed answer and a list of options."""
    if not proposed:
        return None
    for option in options:
        if proposed.strip().casefold() == option.casefold().strip():
            return option

    try:
        from litellm import acompletion
    except ImportError as e:
        raise ImportError(
            "extract_answer_llm requires the 'llm' extra for 'litellm'. Please:"
            " `pip install aviary[llm]`."
        ) from e

    config = LLM_EXTRACT_CONFIG
    prompt = cast(str, config["prompt"]).format(
        options="\n".join(options),
        proposed_answer=proposed,
    )

    response = await acompletion(
        model=config["model"],
        temperature=config["temperature"],
        messages=[{"content": prompt, "role": "user"}],
    )

    extracted = response.choices[0].message.content.strip()
    for option in options:
        if extracted.casefold() == option.casefold().strip():
            return option

    return None


async def eval_answer(
    proposed: str,
    correct: str,

Comment on lines +117 to +123:

Can you look at how ... does this? What this does is let people use local models. Also, feel free to YAGNI on this one.
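For context, here is a minimal usage sketch of the new helper (assuming the aviary[llm] extra is installed and OPENAI_API_KEY is set; the import path is an assumption):

import asyncio

from aviary.utils import extract_answer_llm  # assumed import path

async def main() -> None:
    options = ["Economic factors", "Social unrest", "Political corruption"]
    # A verbatim (case-insensitive) match returns immediately, with no LLM call.
    print(await extract_answer_llm("social unrest", options))  # -> "Social unrest"
    # Free-form answers are routed through gpt-4o-mini via LLM_EXTRACT_CONFIG.
    print(await extract_answer_llm("It was mostly unrest in the streets.", options))

asyncio.run(main())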
@@ -0,0 +1,109 @@
interactions:
- request:
    body:
      '{"messages": [{"content": "You are evaluating answers for a test which
      has fixed options. Repeat back which option the proposed answer matches. GIVE
      ONLY THE VERBATIM TEXT OF A FIXED OPTION. If the proposed answer is empty, invalid,
      or ambiguous, return an empty string.\n\nOptions:\nEconomic factors\nSocial
      unrest\nPolitical corruption\n\nProposed answer: Based on the context given,
      Serif et al. (2026) claim that the overwhelming cause of regime collapse arises
      from economic factors. Yet, most other scholars (Gerald and Robinson for example)
      believe the collapse was due to social unrest because of the prolonged epidemic
      of 2025. I tend to agree with the majority - although I can see both sides.
      Thus my response is that the social unrest was the significant factor in the
      collapse of the regime.", "role": "user"}], "model": "gpt-4o-mini", "temperature":
      0}'
    headers:
      accept:
      - application/json
      accept-encoding:
      - gzip, deflate
      connection:
      - keep-alive
      content-length:
      - "866"
      content-type:
      - application/json
      host:
      - api.openai.com
      user-agent:
      - AsyncOpenAI/Python 1.57.2
      x-stainless-arch:
      - arm64
      x-stainless-async:
      - async:asyncio
      x-stainless-lang:
      - python
      x-stainless-os:
      - MacOS
      x-stainless-package-version:
      - 1.57.2
      x-stainless-raw-response:
      - "true"
      x-stainless-retry-count:
      - "1"
      x-stainless-runtime:
      - CPython
      x-stainless-runtime-version:
      - 3.12.4
    method: POST
    uri: https://api.openai.com/v1/chat/completions
  response:
    body:
      string: !!binary |
        H4sIAAAAAAAAAwAAAP//jJJBTwIxEIXv+yuanlkDLLDITT140OiBxJgYs+m2w1Lpdpp2ViGE/266
        IAsURy85zDfv9c2040YxrhWfMS6XgmTtTHqjHp59/YQyyLn6vJ28Klo/ipfNF6zu7nkvKrD8AEk/
        qiuJtTNAGu0eSw+CILoO8iy7zibjPGtBjQpMlFWO0hGmtbY6HfaHo7Sfp4PpQb1ELSHwGXtLGGNs
        254xp1Ww5jPW7/1UaghBVMBnxybGuEcTK1yEoAMJS7zXQYmWwLbR5yi1MKyxHsJZj4dFE0TMaRtj
        DvXd8VKDlfNYhgM/1hfa6rAsPIiANl4QCB1v6S5h7L0drjnLy53H2lFBuAIbDQeT8d6Pdzvt6PDA
        CEmYU1Heu2BXKCChTTjZDpdCLkF10m6VolEaT0ByMvTvMJe894NrW/3HvgNSgiNQhfOgtDwfuGvz
        EH/cX23HJbeBedgEgrpYaFuBd17v33vhirIUmZxC3i95sku+AQAA//8DAAcy6K79AgAA
    headers:
      CF-Cache-Status:
      - DYNAMIC
      CF-RAY:
      - 8f070b7e9be306ad-SJC
      Connection:
      - keep-alive
      Content-Encoding:
      - gzip
      Content-Type:
      - application/json
      Date:
      - Wed, 11 Dec 2024 17:02:53 GMT
      Server:
      - cloudflare
      Transfer-Encoding:
      - chunked
      X-Content-Type-Options:
      - nosniff
      access-control-expose-headers:
      - X-Request-ID
      alt-svc:
      - h3=":443"; ma=86400
      openai-organization:
      - future-house-xr4tdh
      openai-processing-ms:
      - "244"
      openai-version:
      - "2020-10-01"
      strict-transport-security:
      - max-age=31536000; includeSubDomains; preload
      x-ratelimit-limit-requests:
      - "30000"
      x-ratelimit-limit-tokens:
      - "150000000"
      x-ratelimit-remaining-requests:
      - "29999"
      x-ratelimit-remaining-tokens:
      - "149999790"
      x-ratelimit-reset-requests:
      - 2ms
      x-ratelimit-reset-tokens:
      - 0s
      x-request-id:
      - req_e8f3bb69f3add846e2a40af6c0982db6
    status:
      code: 200
      message: OK
version: 1
@@ -0,0 +1,102 @@
interactions:
- request:
    body:
      '{"messages": [{"content": "You are evaluating answers for a test which
      has fixed options. Repeat back which option the proposed answer matches. GIVE
      ONLY THE VERBATIM TEXT OF A FIXED OPTION. If the proposed answer is empty, invalid,
      or ambiguous, return an empty string.\n\nOptions:\nA\nB\nC\n\nProposed answer:
      A or B", "role": "user"}], "model": "gpt-4o-mini", "temperature": 0}'
    headers:
      accept:
      - application/json
      accept-encoding:
      - gzip, deflate
      connection:
      - keep-alive
      content-length:
      - "380"
      content-type:
      - application/json
      host:
      - api.openai.com
      user-agent:
      - AsyncOpenAI/Python 1.57.2
      x-stainless-arch:
      - arm64
      x-stainless-async:
      - async:asyncio
      x-stainless-lang:
      - python
      x-stainless-os:
      - MacOS
      x-stainless-package-version:
      - 1.57.2
      x-stainless-raw-response:
      - "true"
      x-stainless-retry-count:
      - "1"
      x-stainless-runtime:
      - CPython
      x-stainless-runtime-version:
      - 3.12.4
    method: POST
    uri: https://api.openai.com/v1/chat/completions
  response:
    body:
      string: !!binary |
        H4sIAAAAAAAAA4xSy07DMBC85yusPTcobVpCe0PigARSOSEkhCLH3iYGxzb2FlGq/jty+krVInHx
        YWZnPLP2OmEMlIQZA9FwEq3T6a18mH/OH5ufVRVoNZk+PRfmXovs5e4rz2EQFbZ6R0F71ZWwrdNI
        ypotLTxywug6LPJ8ml9PilFHtFaijrLaUTq2aauMSkfZaJxmRTq82akbqwQGmLHXhDHG1t0ZcxqJ
        3zBj2WCPtBgCrxFmhyHGwFsdEeAhqEDcEAyOpLCG0HTR+7DHxTLwGM0std7hm8M92tbO2yrs+AO+
        UEaFpvTIgzXRM5B10LGbhLG3rs/yJCI4b1tHJdkPNNGwyLd2cNzikdxVBbLE9QXNiVkpkbjSobcO
        EFw0KM8MGQO+lMr2iKRX+TzLJe9tbWXq/9gfCSHQEcrSeZRKXOzbmccv9tfYYcVdYAirQNiWC2Vq
        9M6r7QMvXFlVPBc3WGQVJJvkFwAA//8DABYKnlruAgAA
    headers:
      CF-Cache-Status:
      - DYNAMIC
      CF-RAY:
      - 8f070b799eb5679d-SJC
      Connection:
      - keep-alive
      Content-Encoding:
      - gzip
      Content-Type:
      - application/json
      Date:
      - Wed, 11 Dec 2024 17:02:52 GMT
      Server:
      - cloudflare
      Transfer-Encoding:
      - chunked
      X-Content-Type-Options:
      - nosniff
      access-control-expose-headers:
      - X-Request-ID
      alt-svc:
      - h3=":443"; ma=86400
      openai-organization:
      - future-house-xr4tdh
      openai-processing-ms:
      - "193"
      openai-version:
      - "2020-10-01"
      strict-transport-security:
      - max-age=31536000; includeSubDomains; preload
      x-ratelimit-limit-requests:
      - "30000"
      x-ratelimit-limit-tokens:
      - "150000000"
      x-ratelimit-remaining-requests:
      - "29999"
      x-ratelimit-remaining-tokens:
      - "149999912"
      x-ratelimit-reset-requests:
      - 2ms
      x-ratelimit-reset-tokens:
      - 0s
      x-request-id:
      - req_24d3312a6ad717a657fe3e693bd24613
    status:
      code: 200
      message: OK
version: 1
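These cassettes look like VCR-style recordings meant to be replayed in tests. A minimal sketch of a test that would consume the second one, assuming pytest-recording (which provides the vcr marker) and pytest-asyncio are configured for this repo and the import path is as guessed:

import pytest

from aviary.utils import extract_answer_llm  # assumed import path

@pytest.mark.asyncio
@pytest.mark.vcr  # replays the recorded HTTP interaction instead of calling OpenAI
async def test_extract_answer_llm_ambiguous() -> None:
    # Mirrors the second cassette: "A or B" matches multiple options,
    # so the model returns an empty string and extraction yields None.
    assert await extract_answer_llm("A or B", ["A", "B", "C"]) is None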
Comment: Can you add some statement that focuses the LLM on the message history? Otherwise, in paper-qa, we witnessed the LLM using its innate knowledge.

Reply: There is no message history here (?) Not sure what you mean?

Reply: I guess this function is responsible for both (1) extracting an answer and (2) ensuring it matches a multiple choice option. What we saw in paper-qa was that, in the case of an empty string answer, the LLM would pull on its innate knowledge and could select the correct multiple choice option. So for this: I guess what I should have said was, can you add a statement that focuses the LLM on just the proposed: str, and tries to avoid pulling on any innate knowledge?

Reply: Yea - I was smart and didn't put the question into these, so there's no way it could get confused and try to answer.

Reply (we simultaneously posted): There is no question, so I don't see how it would be possible for it to attempt to answer. I don't know what else I could write to make it more clear in the prompt.

Reply: Ahhh I see, very clever! I follow and you're right. Do you mind documenting that rationale somewhere in the code? Maybe a docstring in extract_answer_llm?
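A sketch of what that docstring addition might look like (the wording here is illustrative, not taken from the PR):

async def extract_answer_llm(
    proposed: str,
    options: list[str],
) -> str | None:
    """Extract the matching fixed option from a proposed answer.

    Note the extraction prompt deliberately omits the original question:
    the LLM sees only the fixed options and the proposed answer. Without
    the question, it cannot fall back on its innate knowledge to answer
    (a failure mode observed in paper-qa); it can only match text.
    """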