Add explicit dependency checks for normalization details #26

Closed · fmigneault opened this issue Oct 3, 2024 · 2 comments · Fixed by #39
Labels: enhancement (New feature or request)


@fmigneault (Collaborator)

@fmigneault cloned issue crim-ca/mlm-extension#10 on 2024-05-01:

🚀 Feature Request

Within an mlm:input definition, when a norm_type is specified, the corresponding statistics for that normalization technique are expected to be provided.

For example, z-score minimally needs the corresponding mean and stddev, while min-max would require the minimum and maximum statistics.

For the JSON Schema, the dependencies property could be used to check that, when norm_type equals some constant, the corresponding subset of statistics is defined (with minItems: 1 or similar). In the stac_model definition, a @model_validator(mode="after") on ModelInput could be applied, as sketched below.
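
A minimal sketch of what the stac_model side could look like, assuming pydantic v2; the Statistics fields and the norm_type-to-statistics mapping are illustrative and may not match the actual stac_model class definitions:

```python
from typing import Optional

from pydantic import BaseModel, model_validator

# Simplified stand-ins for the stac_model classes; actual field names on
# ModelInput/Statistics may differ, and the mapping below is an assumption.
class Statistics(BaseModel):
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    mean: Optional[float] = None
    stddev: Optional[float] = None

# Statistics assumed to be required by each normalization technique.
REQUIRED_STATS = {
    "z-score": ("mean", "stddev"),
    "min-max": ("minimum", "maximum"),
}

class ModelInput(BaseModel):
    norm_type: Optional[str] = None
    statistics: Optional[Statistics] = None

    @model_validator(mode="after")
    def check_norm_statistics(self) -> "ModelInput":
        required = REQUIRED_STATS.get(self.norm_type or "", ())
        missing = [
            name for name in required
            if self.statistics is None or getattr(self.statistics, name) is None
        ]
        if missing:
            raise ValueError(
                f"norm_type={self.norm_type!r} requires statistics: {missing}"
            )
        return self
```

With such a validator, ModelInput(norm_type="z-score", statistics=Statistics(mean=0.2)) would fail early with a message pointing at the missing stddev.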

🔉 Motivation

Ensure that the intended definition is properly specified.

If the user unintentionally omitted some parameters, this check would ensure the error is caught early, increasing model reusability. Not all cases can necessarily be covered, but the most common normalization techniques could be handled.

📡 Alternatives

There is currently no explicit validation.

📎 Additional context

n/a

@fmigneault (Collaborator, Author)

Thinking more about this, there might be an ambiguity regarding the definitions of norm_type, statistics, and their relationship (and therefore whether a check enforcing the relationship should be defined at all). For example, norm_type: min-max could use precomputed minimum and maximum statistics established from the training dataset to fit the input data to the model's expected ranges, but this is not necessarily the case. The desired outcome of min-max could instead be relative to the input data and computed dynamically, rather than derived from a predefined min-max based on a training dataset.

Therefore, the statistics object should probably remain only informational, since the second case would not need it. Also, maybe other norm_type values should be defined to avoid this ambiguity (e.g.: min-max-absolute vs min-max-relative?).
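
To make the two interpretations concrete, a minimal numpy sketch; the function names mirror the hypothetical min-max-absolute / min-max-relative values suggested above:

```python
import numpy as np

def min_max_absolute(data: np.ndarray, minimum: float, maximum: float) -> np.ndarray:
    """Scale using precomputed statistics, e.g. established from the training dataset."""
    return (data - minimum) / (maximum - minimum)

def min_max_relative(data: np.ndarray) -> np.ndarray:
    """Scale relative to the input itself, computed dynamically per sample."""
    return (data - data.min()) / (data.max() - data.min())
```

Only the first variant would consume the statistics object; the second ignores it entirely.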

@rbavery (Collaborator)

rbavery commented Oct 30, 2024

I agree there's some ambiguity here, but I think we can avoid this by making our definitions clear, possibly by introducing another field. I'd prefer not to make these fields merely informational; we use them on our backend to determine what preprocessing to apply. This metadata is so crucial to reproducing model inference, and so often lost or left unclear, that enforcing it in the schema would be valuable (if norm_type is specified at all).

I think what we currently call min-max normalization is really for scaling values to another precision or to a common range. In practice I think it is always used in a relative manner for imagery, though I could be wrong. But in any case, it's not really normalization (adjusting the distribution relative to the population). This line of thinking follows the definitions used by Kaggle: https://www.kaggle.com/code/alexisbcook/scaling-and-normalization

Maybe we are overloading the norm_type field to define the method for both scaling and normalization, which are not always the same: both may be done, neither may be done, or only one or the other may be applied.

Should we instead have a scale_type field with a reduced set of scaling approaches taken from norm_type, and keep only the norm_type options that change the distribution of the overall dataset? We could also advise that dynamic operations like "min-max-relative" be applied as a processing expression, which I think currently assumes per-sample computation. A rough sketch of the split is below.
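
A rough sketch of what that split could look like on ModelInput; the scale_type field name and the example value sets are hypothetical, not from the spec:

```python
from typing import Optional

from pydantic import BaseModel

class ModelInput(BaseModel):
    # Scaling: map values to another range or precision,
    # e.g. "min-max", "clip" (illustrative values only).
    scale_type: Optional[str] = None
    # Normalization: adjust the distribution relative to the population,
    # e.g. "z-score" (illustrative values only).
    norm_type: Optional[str] = None
    # Dynamic, per-sample operations (like a relative min-max) would be
    # expressed via a processing expression rather than either field.
```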
