WIP feat(model_evaluation): Add script to evaluate models #420
Conversation
    clean_up_tokenization_spaces=True,
)
active_action_dicts = get_action_dict_from_completion(completion)
logger.debug(f"active_action_dicts=\n{pformat(active_action_dicts)}")
@abrichr Maybe I am wrong here, but using the local gpt-2 model from the transformers package, I do not get many correct predictions. Most of the predictions given by the gpt-2 model cannot be parsed into a correct action_event.
Is there any special trick that I am missing?
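The parsing failure being described can be reproduced with a small sketch. This is an assumption about how `get_action_dict_from_completion` might work (the real implementation is in the PR, not shown here): extract the first `{...}` span from the model output and evaluate it as a Python literal. Small base models like gpt-2 frequently emit text where this recovery fails.

```python
import ast
import re

def parse_action_dict(completion: str):
    """Try to recover an action dict from free-form model output.

    Hypothetical stand-in for get_action_dict_from_completion; returns
    None when no parseable dict literal is found in the completion.
    """
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        return None  # model emitted no dict-shaped span at all
    try:
        return ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None  # dict-shaped but not a valid Python literal
```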
If this is truly the case, then either my understanding of how to build the prompt from an Event is incorrect, or we need to reconsider the way a prompt is built.
Maybe related to #327 and #419
I was thinking about waiting until #419 is solved. We could fine-tune models with around 50% less time commitment by then, because it is hard to reasonably conclude that RL actually helps us without a baseline of what to measure and how to measure it. I.e.: we could spend tons of time fine-tuning the model and then have no way to "prove" whether the resulting model is actually worth it.
Also, most of the code/tests used in #419 will be re-used here to measure the performance gains of the model after reinforcement learning.
WDYT? @abrichr
@LaPetiteSouris thank you for the information regarding GPT-2! We have been testing with GPT-3.5-turbo and GPT-4.
For enforcing output structure, we have looked at implementing Guardrails and LMQL/Guidance, but so far without much luck. If you could assist with this as well, that would be greatly appreciated 🙏
You are correct that we need to implement evaluation metrics. Related: #173 #414
@abrichr I'll take a look into the LMQL/Guidance/Guardrails topic as soon as I wrap up my work on the model evaluation/tuning topic.
    return distance

def _calculate_similarity_per_action(ref_action, prediction_action):
The simple algorithm is to:
- Verify that the 2 actions are of the same type
- If they are both button presses or releases, compare the keys
- If they are mouse movements, calculate the Euclidean distance between the 2 points
The whole point is to compare how similar the reference action is to the predicted action.
from openadapt import crud, models, utils

LOG_LEVEL = "INFO"
MAX_SCREEN_SIZE = (1920, 1080)
What do you think about reading this from openadapt.models.recording: https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L29 ?
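The suggestion can be sketched as follows. The monitor_width/monitor_height attribute names are assumptions about openadapt.models.Recording (the linked models.py#L29); a SimpleNamespace stands in for a real recording row here so the sketch is self-contained.

```python
from types import SimpleNamespace

def get_screen_size(recording) -> tuple[int, int]:
    """Derive the normalizing screen size from the recording itself,
    instead of the hard-coded MAX_SCREEN_SIZE constant.

    Attribute names are assumed, not confirmed against OpenAdapt.
    """
    return (recording.monitor_width, recording.monitor_height)

# Stand-in object for illustration; a real caller would pass a Recording row.
example_recording = SimpleNamespace(monitor_width=2560, monitor_height=1440)
```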
Thanks. I will take that into account.
Closed and superseded by #444
What kind of change does this PR introduce?
Solves #421
Summary
Build a small dataset from the last recording. The dataset contains the following attributes:
- reference_window_dict
- reference_action_dicts
- active_window_dict (which is also reference_window_dict for the training dataset)

Build a simple algorithm to score a prediction. The algorithm is as simple as the following:
- If the reference action and the predicted action are of different types, return 0, as they are not similar
- If the reference action and the predicted action are both of type press or release, compare the pressed/released key. If the same key is pressed or released, return a score of 1, else return a score of 0
- If the reference action and the predicted action are of type mouse movement, calculate the Euclidean distance between the 2 points (reference and predicted). Normalize the distance to 0-1 based on the max size of the screen, then invert the score. I.e.: 2 identical points should have a score of 1, and vice versa
Next Steps
Use reinforcement learning to improve the model, then re-run the above steps to compare whether the score has improved.
Checklist
How can your code be run and tested?
Other information