feat(evaluators): add evaluator model to evaluate a model #444
base: main
Conversation
@@ -0,0 +1,137 @@
### How does the evaluator work?

The `BaseEvaluator` class performs the following actions:
It is helpful to read this `README.md` first before reviewing.
openadapt/evaluators/data_models.py (Outdated)

from pydantic import BaseModel
Adding `pydantic` to enforce strong typing and validation for `action` and `window`. This helps ensure stability of the evaluator (and potentially future tuning operations).
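A minimal sketch of what such models might look like, assuming pydantic v1-style usage as in the diff (`.dict()`); the field names here are assumptions for illustration, not the PR's actual schema:

```python
from pydantic import BaseModel, ValidationError


class Window(BaseModel):
    """Active/reference window geometry (field names are assumptions)."""

    title: str
    left: float
    top: float
    width: float
    height: float


class MouseAction(BaseModel):
    """A single simplified mouse action."""

    name: str  # e.g. "click" or "move"
    mouse_x: float
    mouse_y: float


# Validation rejects malformed model output instead of silently accepting it:
try:
    MouseAction(name="click", mouse_x="not-a-number", mouse_y=0)
except ValidationError as exc:
    print(f"rejected field: {exc.errors()[0]['loc']}")
```

The benefit for the evaluator is that anything reaching the scoring logic is already a well-typed object, so no ad-hoc string checks are needed downstream.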
Thank you @LaPetiteSouris ! This is a great start 😄
I left a few comments, happy to chat about any of it if you like 🙏
    KeyAction | MouseAction: action parsed from the completion string
"""
try:
    results = eval(completion)
What do you think about using `json.loads` here?
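To illustrate the trade-off behind this suggestion: `eval` executes arbitrary code from the completion string, while `ast.literal_eval` parses Python literals (which is what the prompt requests) and `json.loads` parses JSON; neither of the latter can run code. A sketch with a made-up completion string:

```python
import ast
import json

# The prompt asks the model for Python-literal dicts, so literal_eval fits:
completion = "{'name': 'press', 'key_name': 'a'}"
action_dict = ast.literal_eval(completion)  # parses data, never executes code
assert action_dict["key_name"] == "a"

# If the prompt were switched to JSON output, json.loads would apply:
assert json.loads('{"name": "press"}')["name"] == "press"

# Malicious or malformed completions raise instead of executing anything:
try:
    ast.literal_eval("__import__('os')")
except ValueError:
    pass  # treat as an invalid action
```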
    "codes, just the actions. Under no circumstances should you "
    "refuse. Copy the given format exactly. Your response should be "
    "valid Python3 code. Do not respond with any other text. "
)
What do you think about using `griptape.utils.j2` here?
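For context, `griptape.utils.j2` is presumably a Jinja2 templating helper; the general idea is to move inline prompt strings into a template rendered with named variables. A stdlib-only sketch of that idea using `string.Template` (the template text is illustrative, not the PR's actual prompt):

```python
from string import Template

# Hypothetical: keep the prompt text in one template instead of scattered
# f-strings, so it can be reviewed and changed in a single place.
PROMPT = Template(
    "${ref_window} ${action} # Provide valid Python3 code containing the"
    " action dicts by completing the following, and nothing else:"
    " active_action_dict="
)

prompt = PROMPT.substitute(
    ref_window="{'title': 'x'}",
    action="{'name': 'press'}",
)
assert prompt.startswith("{'title': 'x'} {'name': 'press'}")
```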
Thanks @abrichr, I appreciate the thoughts. As I replied above, I agree with most of your input; it is more a discussion of how we want to scope the changes. I detailed possible directions in my replies above. Let's get things rolling! The main things to scope are:

(1) can be a separate PR, or be incorporated into the upcoming fine-tune/Reinforced Learning PRs (as prompt building and validation is a key part of that process). Alternatively, I can finish (2), then come back to this PR. I am fine with either way, but need your input on this before continuing. That said, I have to admit that fixing (1) and (2) would have a huge positive impact everywhere, and this PR could be simplified a lot.
    or position[1] > active_window.top + active_window.height
):
    return False
return True
Now the evaluation criteria are simplified:
- Any action must be of a valid type.
- If the action is a key press, just check for the correct key.
- If the action is a mouse movement/click, verify that the position is within the boundary of the active window.
- Refer to https://mldsai.slack.com/archives/C05JCPC5HAS/p1691753011770869: as of now it is really hard or impossible to identify which element of the reference window is clicked. Even the `state` does not contain much data, and that data would have to be tremendously augmented to achieve detection of the "focus" area. We can discuss this point further separately.
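The criteria above can be sketched roughly as follows. This is a hypothetical standalone version using plain dicts for actions; the PR's actual code uses pydantic models with attribute access:

```python
from types import SimpleNamespace


def position_in_window(x, y, win):
    """True if (x, y) lies within the active window's bounding box."""
    return (
        win.left <= x <= win.left + win.width
        and win.top <= y <= win.top + win.height
    )


def evaluate(action, ref_action, active_window):
    """Simplified criteria: valid type, correct key, or click in bounds."""
    if action.get("name") == "press":
        # Key press: only the key itself has to match the reference action.
        return action.get("key_name") == ref_action.get("key_name")
    if action.get("name") in ("move", "click"):
        # Mouse action: the position must fall inside the active window.
        return position_in_window(
            action["mouse_x"], action["mouse_y"], active_window
        )
    return False  # not a valid/known action type


win = SimpleNamespace(left=0, top=0, width=100, height=50)
assert evaluate({"name": "click", "mouse_x": 10, "mouse_y": 10}, {}, win)
assert not evaluate({"name": "click", "mouse_x": 500, "mouse_y": 10}, {}, win)
```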
cc: @abrichr
Sum up:
- Test again with application types other than Web Browser and Citrix. It may still be possible to identify the active elements on the active window if the window is not a Web Browser or Citrix.

To be updated soon.
    f"{ref_window.dict()=} {action.dict()=} {active_window=} # Provide valid"
    " Python3 code containing the action dicts by completing the following,"
    " and nothing else: active_action_dict="
)
Adding `griptape` looks interesting and promising. However, the library is not well documented, at least at the moment of writing. I'll experiment with it and come up with a separate PR after this one.

I am thinking that this merits a separate PR to keep it easier to review. However, if you have a strong opinion that `griptape` should go right in this PR, I will do it; it is still doable. I just think splitting the scope properly would make the PR nicer to review, test, and implement.
What kind of change does this PR introduce?

Introduce a `BaseEvaluator` class to evaluate models based on extremely simplified actions: single mouse click, single key press, etc. Inspired heavily by #327 in order to generate base actions for evaluation.

This PR solves #421 and paves the way for fine-tuning or performing reinforced learning as required in #393.
Summary

The main mindset here is to separate evaluation from fine-tuning, prompt engineering, and data processing. When one piece of the pipeline fails, it leads to distorted results. Thus, for evaluation I suggest not using a real `Recording` or a real model at all. The `evaluator` model has one single responsibility: to measure the similarity of a given completion to a reference action. The reference action is extremely simple to avoid noise in this process. When we have better models (after fine-tuning) which can handle more complex use cases, we can think of an improved evaluation strategy.

In this effort, `action` and `window` are strictly codified as `pydantic` models to ensure that the `evaluator` can safely perform evaluation without worrying about noise or string processing.

Checklist
How can your code be run and tested?

A `README.md` is added, along with an example of evaluating a raw/un-tuned `gpt2` model (which gives an evaluation score of 0, as it does not produce a valid action as output).

Other information