
WIP feat(model_evaluation): Add script to evaluate models #420

Closed

Conversation

Contributor
@LaPetiteSouris LaPetiteSouris commented Jul 23, 2023

What kind of change does this PR introduce?

Solves #421

Summary

  • Build a small dataset from the last recording. Each entry in the dataset contains the following attributes:
    reference_window_dict, reference_action_dicts, and active_window_dict (which is the same as reference_window_dict for the training dataset).

  • Build a simple algorithm to score a prediction. The algorithm works as follows (see the sketch after this list)

  1. If the reference action and the predicted action are of different types, return 0, as they are not similar

  2. If the reference action and the predicted action are both of type press or release, compare the pressed/released key. If the same key is pressed or released, return a score of 1; otherwise return a score of 0

  3. If the reference action and the predicted action are of type mouse movement, calculate the Euclidean distance between the two points (reference and predicted). Normalize the distance to the range 0-1 based on the maximum screen size, then invert it, i.e. two identical points score 1 and two maximally distant points score 0

  • For each entry in the dataset, build a prompt, ask GPT-2 for a prediction, and then calculate the score.
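
A minimal sketch of the scoring described above. The dict keys (name, key, mouse_x, mouse_y) are assumptions about the action dict layout; the script in this PR may use different names:

import math

MAX_SCREEN_SIZE = (1920, 1080)  # used to normalize mouse distances

def score_action(ref_action: dict, predicted_action: dict) -> float:
    """Return a similarity score in [0, 1] for a reference/predicted action pair."""
    # 1. Actions of different types are not similar at all.
    if ref_action.get("name") != predicted_action.get("name"):
        return 0.0
    # 2. Key press/release: 1 if the same key is involved, else 0.
    if ref_action.get("name") in ("press", "release"):
        return 1.0 if ref_action.get("key") == predicted_action.get("key") else 0.0
    # 3. Mouse movement: Euclidean distance, normalized by the screen diagonal,
    #    then inverted so that identical points score 1.
    if ref_action.get("name") == "move":
        dx = ref_action["mouse_x"] - predicted_action["mouse_x"]
        dy = ref_action["mouse_y"] - predicted_action["mouse_y"]
        distance = math.hypot(dx, dy)
        max_distance = math.hypot(*MAX_SCREEN_SIZE)
        return 1.0 - min(distance / max_distance, 1.0)
    # Other action types are not scored in this sketch.
    return 0.0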

Next Steps

Use reinforcement learning to improve the model, then re-run the steps above to check whether the score improves (see the sketch below).
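
A sketch of the before/after comparison this enables. score_action refers to the sketch in the Summary above; predict_fn, gpt2_predict and fine_tuned_predict are hypothetical callables that build the prompt and return a parsed prediction:

def evaluate(predict_fn, dataset) -> float:
    """Mean similarity score of predict_fn's predictions over the dataset."""
    scores = [
        score_action(entry["reference_action"], predict_fn(entry))
        for entry in dataset
    ]
    return sum(scores) / len(scores)

# baseline = evaluate(gpt2_predict, dataset)
# after_rl = evaluate(fine_tuned_predict, dataset)
# improved = after_rl > baseline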

Checklist

  • My code follows the style guidelines of OpenAdapt
  • I have performed a self-review of my code
  • If applicable, I have added tests to prove my fix is functional/effective
  • I have linted my code locally prior to submission
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. README.md, requirements.txt)
  • New and existing unit tests pass locally with my changes

How can your code be run and tested?

python -m openadapt.models_tuning.fine_tune_models

Other information

clean_up_tokenization_spaces=True,
)
active_action_dicts = get_action_dict_from_completion(completion)
logger.debug(f"active_action_dicts=\n{pformat(active_action_dicts)}")
Contributor Author

@abrichr Maybe I am wrong here, but using the local GPT-2 model from the transformers package, I do not get many correct predictions. Most of the predictions given by the GPT-2 model cannot be parsed into a correct action_event.

Are there any special tricks that I am missing?
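
For context, a minimal sketch of the failure mode, assuming a plain transformers GPT-2 pipeline; the prompt is illustrative only, and ast.literal_eval stands in for get_action_dict_from_completion:

import ast

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Given the window state, the next action dict is: "  # illustrative only
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(
    output_ids[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

try:
    # Expect a Python dict literal after the prompt.
    action_dict = ast.literal_eval(completion[len(prompt):].strip())
except (ValueError, SyntaxError):
    action_dict = None  # free-form GPT-2 text usually lands here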

Contributor Author
@LaPetiteSouris LaPetiteSouris Jul 23, 2023

If this is truly the case, then either my understanding of how to build the prompt from an Event is incorrect, or we need to reconsider the way a prompt is built.

May be related to #327 and #419

I was thinking about waiting until #419 is solved. We could fine-tune models with roughly 50% less time commitment by then, because it is hard to reasonably conclude that reinforcement learning actually helps us without a baseline for what to measure and how to measure it. I.e. we would spend tons of time fine-tuning the model, and then there would be no way to "prove" whether the resulting model is actually worth it.

Also, most of the code/tests used in #419 will be reused here to measure the performance gains of the model after reinforcement learning.

WDYT? @abrichr

Member
@abrichr abrichr Jul 24, 2023

@LaPetiteSouris thank you for the information regarding GPT-2! We have been testing with GPT-3.5-turbo and GPT-4.

For enforcing output structure, we have looked at implementing guardrails and LMQL/Guidance, but so far without much luck. If you could assist with this as well, that would be greatly appreciated 🙏

You are correct that we need to implement evaluation metrics. Related: #173 #414
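
For illustration only (not the Guardrails/LMQL/Guidance API): a plain-Python sketch of the kind of structure check such a wrapper would enforce. The required fields are assumptions about the action dict layout:

REQUIRED_KEYS = {"name"}  # hypothetical minimal schema for an action dict

def is_valid_action_dict(candidate) -> bool:
    """Accept only completions that parsed into a dict with the expected fields."""
    if not isinstance(candidate, dict):
        return False
    if not REQUIRED_KEYS <= candidate.keys():
        return False
    if candidate["name"] in ("press", "release") and "key" not in candidate:
        return False
    if candidate["name"] == "move" and not {"mouse_x", "mouse_y"} <= candidate.keys():
        return False
    return True

# Typical use: retry or discard the completion until is_valid_action_dict passes.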

Contributor Author

@abrichr I'll take a look at the LMQL/Guidance/guardrails topic as soon as I wrap up my work on this model evaluation/tuning topic.

return distance


def _calculate_similarity_per_action(ref_action, prediction_action):
Contributor Author

The simple algorithm is to:

  • Verify that the two actions are of the same type
  • If they are both key presses or releases, compare the keys
  • If they are mouse movements, calculate the Euclidean distance between the two points

Contributor Author

The whole point is to compare how similar the reference action is to the predicted action.

from openadapt import crud, models, utils

LOG_LEVEL = "INFO"
MAX_SCREEN_SIZE = (1920, 1080)
Member
@abrichr abrichr Jul 24, 2023

What do you think about reading this from openadapt.models.recording: https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L29 ?
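
A minimal sketch of that suggestion, assuming the Recording model exposes monitor_width/monitor_height as in the linked models.py and that crud.get_latest_recording is available; attribute names should be double-checked against main:

from openadapt import crud

# Read the screen size from the recording instead of hard-coding MAX_SCREEN_SIZE.
recording = crud.get_latest_recording()
max_screen_size = (recording.monitor_width, recording.monitor_height)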

Contributor Author

Thanks. I will take that into account.

@abrichr abrichr mentioned this pull request Jul 24, 2023
@LaPetiteSouris LaPetiteSouris changed the title WIP feat(model_tunings): Add script to fine-tune models WIP feat(model_tunings): Add script to evaluate models Jul 25, 2023
@LaPetiteSouris LaPetiteSouris changed the title WIP feat(model_tunings): Add script to evaluate models WIP feat(model_evaluation): Add script to evaluate models Jul 25, 2023
Contributor Author
@LaPetiteSouris

Closed and superseded by #444
