WIP feat(model_evaluation): Add script to evaluate models #420
Conversation
    clean_up_tokenization_spaces=True,
)
active_action_dicts = get_action_dict_from_completion(completion)
logger.debug(f"active_action_dicts=\n{pformat(active_action_dicts)}")
@abrichr Maybe I am wrong here, but using the local gpt-2 model from the transformers package, I do not get many correct predictions. Most of the predictions given by the gpt-2 model cannot be parsed into a correct action_event.
Is there any special trick that I am missing?
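The parsing failure being described can be reproduced with a small sketch. This is an assumption about how `get_action_dict_from_completion` might work (the real implementation is in the PR, not shown here): extract the first `{...}` span from the model output and evaluate it as a Python literal. Small base models like gpt-2 frequently emit text where this recovery fails.

```python
import ast
import re

def parse_action_dict(completion: str):
    """Try to recover an action dict from free-form model output.

    Hypothetical stand-in for get_action_dict_from_completion; returns
    None when no parseable dict literal is found in the completion.
    """
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        return None  # model emitted no dict-shaped span at all
    try:
        return ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None  # dict-shaped but not a valid Python literal
```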
If this is truly the case, then either my understanding of how to build the prompt from an Event is incorrect, or we need to reconsider the way a prompt is built.
Maybe related to #327 and #419
I was thinking about waiting until #419 is solved. We could fine-tune models with around 50% less time commitment by then, because it is hard to reasonably conclude that RL actually helps us without a baseline of what to measure and how to measure it. I.e.: we could spend tons of time fine-tuning the model and then have no way to "prove" whether the resulting model is actually worth it.
Also, most of the code/tests used in #419 will be re-used here to measure the performance gains of the model after reinforcement learning.
WDYT? @abrichr
@LaPetiteSouris thank you for the information regarding GPT-2! We have been testing with GPT-3.5-turbo and GPT-4.
For enforcing output structure, we have looked at implementing Guardrails and LMQL/Guidance, but so far without much luck. If you could assist with this as well, that would be greatly appreciated 🙏
You are correct that we need to implement evaluation metrics. Related: #173 #414
@abrichr I'll take a look into the LMQL/Guidance/Guardrails topic as soon as I wrap up my work on the model evaluation/tuning topic.
    return distance

def _calculate_similarity_per_action(ref_action, prediction_action):
The simple algorithm is to:
- Verify that the 2 actions are of the same type
- If they are both button presses or releases, compare the keys
- If they are mouse movements, calculate the Euclidean distance between the 2 points
The whole point is to compare how similar the reference action is to the predicted action.
from openadapt import crud, models, utils

LOG_LEVEL = "INFO"
MAX_SCREEN_SIZE = (1920, 1080)
What do you think about reading this from openadapt.models.recording: https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L29 ?
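The suggestion can be sketched as follows. The monitor_width/monitor_height attribute names are assumptions about openadapt.models.Recording (the linked models.py#L29); a SimpleNamespace stands in for a real recording row here so the sketch is self-contained.

```python
from types import SimpleNamespace

def get_screen_size(recording) -> tuple[int, int]:
    """Derive the normalizing screen size from the recording itself,
    instead of the hard-coded MAX_SCREEN_SIZE constant.

    Attribute names are assumed, not confirmed against OpenAdapt.
    """
    return (recording.monitor_width, recording.monitor_height)

# Stand-in object for illustration; a real caller would pass a Recording row.
example_recording = SimpleNamespace(monitor_width=2560, monitor_height=1440)
```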
Thanks. I will take that into account.
Closed and superseded by #444
What kind of change does this PR introduce?
Solves #421
Summary
Build a small dataset from the last recording. The dataset contains the following attributes:
- reference_window_dict
- reference_action_dicts
- active_window_dict (which is also reference_window_dict for the training dataset)

Build a simple algorithm to score a prediction. The algorithm is as simple as the following:
- If the reference action and the predicted action are of different types, return 0, as they are not similar
- If the reference action and the predicted action are both of type press or release, compare the pressed/released key. If the same key is pressed or released, return a score of 1, else return a score of 0
- If the reference action and the predicted action are of type mouse movement, calculate the Euclidean distance between the 2 points (reference and predicted). Normalize the distance to 0-1 based on the max size of the screen, then invert the score. I.e.: 2 identical points should have a score of 1, and vice versa
Next Steps
Use reinforcement learning to improve the model, then re-run the above steps to compare whether the score has improved.
Checklist
How can your code be run and tested?
Other information