Added new extract answer feature #148

Closed
wants to merge 5 commits into from
3 changes: 2 additions & 1 deletion src/aviary/core.py
@@ -40,6 +40,7 @@
    EvalAnswerMode,
    encode_image_to_base64,
    eval_answer,
    extract_answer_llm,
    is_coroutine_callable,
    partial_format,
)
@@ -81,7 +82,7 @@
"argref_by_name",
"encode_image_to_base64",
"eval_answer",
"eval_answer",
"extract_answer_llm",
"fenv",
"is_coroutine_callable",
"join",
53 changes: 53 additions & 0 deletions src/aviary/utils.py
@@ -22,6 +22,21 @@
    "temperature": 0,
}

LLM_EXTRACT_CONFIG = {
    "prompt": (
        "You are evaluating answers for a test which has fixed options. "
Collaborator:

Can you add some statement that focuses the LLM on the message history? Otherwise, in paper-qa, we witnessed the LLM using its innate knowledge.

Contributor Author:

There is no message history here (?) Not sure what you mean?

Collaborator:

I guess this function is responsible for both (1) extracting a letter and (2) ensuring it matches a multiple-choice option.

What we saw in paper-qa was that, in the case of an empty string answer, the LLM would pull on its innate knowledge and could still select the correct multiple-choice option.

So for this:

"Can you add some statement that focuses the LLM on the message history?"

I guess what I should have said was: can you add a statement that focuses the LLM on just the proposed: str, and tries to avoid pulling on any innate knowledge?

Contributor Author:

Yeah - I was smart and didn't put the question into these, so there's no way it could get confused and try to answer.

Contributor Author:

We simultaneously posted.

There is no question, so I don't see how it would be possible for it to attempt to answer. I don't know what else I could write to make it clearer in the prompt.

Collaborator:

Ahhh I see, very clever! I follow and you're right.

Do you mind documenting that rationale somewhere in the code? Maybe a docstring in extract_answer_llm?
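
A minimal sketch of how that rationale could read as a docstring in extract_answer_llm (the exact wording below is an assumption, not text from this PR):

async def extract_answer_llm(
    proposed: str,
    options: list[str],
) -> str | None:
    """Return which fixed option the proposed answer matches, or None if none does.

    The prompt deliberately omits the original question, so the LLM has nothing to
    answer from its own knowledge; it can only match the proposed text against the
    listed options.
    """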

"Repeat back which option the proposed answer matches. "
"GIVE ONLY THE VERBATIM TEXT OF A FIXED OPTION. "
"If the proposed answer is empty, invalid, or ambiguous, "
"return an empty string."
Comment on lines +30 to +31

Collaborator:

Can you upstream some of the tests from https://github.com/Future-House/paper-qa/blob/v5.8.0/tests/test_litqa.py#L117 to here? I think we should also have "multiple options are matched" mentioned somewhere.

It would be nice if paper-qa could just import this function and use it, instead of having its own evaluation LLM prompt.

Contributor Author:

Yup - will do
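
A sketch of the kind of parametrized test that could be upstreamed, including a "multiple options are matched" case. The ids mirror the cassettes added in this PR, but the exact parametrization and the use of pytest-asyncio are assumptions:

import pytest

from aviary.core import extract_answer_llm


@pytest.mark.asyncio
@pytest.mark.parametrize(
    ("proposed", "options", "expected"),
    [
        # Exact (case-insensitive) match short-circuits without calling the LLM
        pytest.param("a", ["A", "B", "C"], "A", id="exact"),
        # Multiple options are matched, so the LLM should return an empty string -> None
        pytest.param("A or B", ["A", "B", "C"], None, id="not exact"),
        # Empty proposed answer never reaches the LLM
        pytest.param("", ["A", "B", "C"], None, id="empty"),
    ],
)
async def test_extract_answer_llm(proposed, options, expected):
    assert await extract_answer_llm(proposed, options) == expected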

"\n\nOptions:\n{options}"
"\n\nProposed answer: {proposed_answer}"
),
"model": "gpt-4o-mini",
"temperature": 0,
}


LLM_SCORE_EVAL_CONFIG = LLM_EVAL_CONFIG | {
    "prompt": (
        "Here is a question, the correct answer to the question, and a rubric for"
@@ -88,6 +103,44 @@ def is_coroutine_callable(obj) -> bool:
    return False


async def extract_answer_llm(
    proposed: str,
    options: list[str],
) -> str | None:
    """Extract the answer from a proposed answer and a list of options."""
    if not proposed:
        return None
    for option in options:
        if proposed.strip().casefold() == option.casefold().strip():
            return option

    try:
        from litellm import acompletion
    except ImportError as e:
        raise ImportError(
"eval_answer requires the 'llm' extra for 'litellm'. Please:"
" `pip install aviary[llm]`."
) from e
Comment on lines +117 to +123

Collaborator:

Can you look at how ToolSelector.__init__ takes acompletion: "Callable[..., Awaitable[ModelResponse]] | None" = None, and apply it as an input arg here?

What this does is let people use local models.

Also, feel free to YAGNI on this one.
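
A sketch of that input-arg approach, mirroring the ToolSelector.__init__ pattern (the Awaitable[Any] type and the fallback import below are assumptions about how it might look, not code from this PR):

from collections.abc import Awaitable, Callable
from typing import Any


async def extract_answer_llm(
    proposed: str,
    options: list[str],
    acompletion: Callable[..., Awaitable[Any]] | None = None,
) -> str | None:
    if acompletion is None:
        # Default to litellm only when no custom completion callable
        # (e.g. a local model client) is supplied.
        from litellm import acompletion
    # ... rest of the body unchanged, calling `await acompletion(...)` as below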

    config = LLM_EXTRACT_CONFIG
    prompt = cast(str, config["prompt"]).format(
        options="\n".join(options),
        proposed_answer=proposed,
    )

    response = await acompletion(
        model=config["model"],
        temperature=config["temperature"],
        messages=[{"content": prompt, "role": "user"}],
    )

    extracted = response.choices[0].message.content.strip()
    for option in options:
        if extracted.casefold() == option.casefold().strip():
            return option

    return None
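
A minimal usage sketch of extract_answer_llm as defined above (the option strings come from the "complex" cassette below; the proposed answer and printed result are illustrative):

import asyncio

from aviary.core import extract_answer_llm

options = ["Economic factors", "Social unrest", "Political corruption"]
match = asyncio.run(
    extract_answer_llm("I believe the regime collapsed due to social unrest.", options)
)
# Expected to be "Social unrest"; None would mean no single option was matched.
print(match)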


async def eval_answer(
    proposed: str,
    correct: str,
109 changes: 109 additions & 0 deletions tests/cassettes/test_extract_answer_llm[complex].yaml
@@ -0,0 +1,109 @@
interactions:
- request:
body:
'{"messages": [{"content": "You are evaluating answers for a test which
has fixed options. Repeat back which option the proposed answer matches. GIVE
ONLY THE VERBATIM TEXT OF A FIXED OPTION. If the proposed answer is empty, invalid,
or ambiguous, return an empty string.\n\nOptions:\nEconomic factors\nSocial
unrest\nPolitical corruption\n\nProposed answer: Based on the context given,
Serif et al. (2026) claim that the overwhelming cause of regime collapse arises
from economic factors. Yet, most other scholars (Gerald and Robinson for example)
believe the collapse was due to social unrest because of the prolonged epidemic
of 2025. I tend to agree with the majority - although I can see both sides.
Thus my response is that the social unrest was the significant factor in the
collapse of the regime.", "role": "user"}], "model": "gpt-4o-mini", "temperature":
0}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate
connection:
- keep-alive
content-length:
- "866"
content-type:
- application/json
host:
- api.openai.com
user-agent:
- AsyncOpenAI/Python 1.57.2
x-stainless-arch:
- arm64
x-stainless-async:
- async:asyncio
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.57.2
x-stainless-raw-response:
- "true"
x-stainless-retry-count:
- "1"
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.12.4
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAAwAAAP//jJJBTwIxEIXv+yuanlkDLLDITT140OiBxJgYs+m2w1Lpdpp2ViGE/266
IAsREy89zDfv9c2024QxrhWfMS6XgmTtTHqjHp59/YQyyLn6vJ28Klo/ipfNF6zu7nkvKrD8AEk/
qiuJtTNAGu0eSw+CILoO8iy7zibjPGtBjQpMlFWO0hGmtbY6HfaHo7Sfp4PpQb1ELSHwGXtLGGNs
254xp1Ww5jPW7/1UaghBVMBnxybGuEcTK1yEoAMJS7zXQYmWwLbR5yi1MKyxHsJZj4dFE0TMaRtj
DvXd8VKDlfNYhgM/1hfa6rAsPIiANl4QCB1v6S5h7L0drjnLy53H2lFBuAIbDQeT8d6Pdzvt6PDA
CEmYU1Heu2BXKCChTTjZDpdCLkF10m6VolEaT0ByMvTvMJe894NrW/3HvgNSgiNQhfOgtDwfuGvz
EH/cX23HJbeBedgEgrpYaFuBd17v33vhirIUmZxC3i95sku+AQAA//8DAAcy6K79AgAA
headers:
CF-Cache-Status:
- DYNAMIC
CF-RAY:
- 8f070b7e9be306ad-SJC
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Wed, 11 Dec 2024 17:02:53 GMT
Server:
- cloudflare
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
openai-organization:
- future-house-xr4tdh
openai-processing-ms:
- "244"
openai-version:
- "2020-10-01"
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-ratelimit-limit-requests:
- "30000"
x-ratelimit-limit-tokens:
- "150000000"
x-ratelimit-remaining-requests:
- "29999"
x-ratelimit-remaining-tokens:
- "149999790"
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_e8f3bb69f3add846e2a40af6c0982db6
status:
code: 200
message: OK
version: 1
102 changes: 102 additions & 0 deletions tests/cassettes/test_extract_answer_llm[not exact].yaml
@@ -0,0 +1,102 @@
interactions:
- request:
body:
'{"messages": [{"content": "You are evaluating answers for a test which
has fixed options. Repeat back which option the proposed answer matches. GIVE
ONLY THE VERBATIM TEXT OF A FIXED OPTION. If the proposed answer is empty, invalid,
or ambiguous, return an empty string.\n\nOptions:\nA\nB\nC\n\nProposed answer:
A or B", "role": "user"}], "model": "gpt-4o-mini", "temperature": 0}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate
connection:
- keep-alive
content-length:
- "380"
content-type:
- application/json
host:
- api.openai.com
user-agent:
- AsyncOpenAI/Python 1.57.2
x-stainless-arch:
- arm64
x-stainless-async:
- async:asyncio
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.57.2
x-stainless-raw-response:
- "true"
x-stainless-retry-count:
- "1"
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.12.4
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAA4xSy07DMBC85yusPTcobVpCe0PigARSOSEkhCLH3iYGxzb2FlGq/jty+krVInHx
YWZnPLP2OmEMlIQZA9FwEq3T6a18mH/OH5ufVRVoNZk+PRfmXovs5e4rz2EQFbZ6R0F71ZWwrdNI
ypotLTxywug6LPJ8ml9PilFHtFaijrLaUTq2aauMSkfZaJxmRTq82akbqwQGmLHXhDHG1t0ZcxqJ
3zBj2WCPtBgCrxFmhyHGwFsdEeAhqEDcEAyOpLCG0HTR+7DHxTLwGM0std7hm8M92tbO2yrs+AO+
UEaFpvTIgzXRM5B10LGbhLG3rs/yJCI4b1tHJdkPNNGwyLd2cNzikdxVBbLE9QXNiVkpkbjSobcO
EFw0KM8MGQO+lMr2iKRX+TzLJe9tbWXq/9gfCSHQEcrSeZRKXOzbmccv9tfYYcVdYAirQNiWC2Vq
9M6r7QMvXFlVPBc3WGQVJJvkFwAA//8DABYKnlruAgAA
headers:
CF-Cache-Status:
- DYNAMIC
CF-RAY:
- 8f070b799eb5679d-SJC
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Wed, 11 Dec 2024 17:02:52 GMT
Server:
- cloudflare
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
openai-organization:
- future-house-xr4tdh
openai-processing-ms:
- "193"
openai-version:
- "2020-10-01"
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-ratelimit-limit-requests:
- "30000"
x-ratelimit-limit-tokens:
- "150000000"
x-ratelimit-remaining-requests:
- "29999"
x-ratelimit-remaining-tokens:
- "149999912"
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_24d3312a6ad717a657fe3e693bd24613
status:
code: 200
message: OK
version: 1