Add explicit dependency checks for normalization details #26

Closed · fmigneault opened this issue Oct 3, 2024 · 2 comments · Fixed by #39
Labels: enhancement (New feature or request)


@fmigneault (Collaborator)

@fmigneault cloned issue crim-ca/mlm-extension#10 on 2024-05-01:

🚀 Feature Request

Within an mlm:input definition, when a norm_type is specified, the corresponding statistics for that normalization technique are expected to be provided.

For example, z-score minimally needs the corresponding mean and stddev, while min-max would require the minimum and maximum statistics.

For the JSON Schema, the dependencies property could be used to check that, when norm_type equals some constant, the corresponding subset of statistics is defined (with minItems: 1 or similar). In the stac_model definition, a @model_validator(mode="after") on ModelInput could be applied, as sketched below.
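
A minimal sketch of what the stac_model side could look like, assuming pydantic v2; the Statistics fields and the norm_type-to-statistics mapping are illustrative and may not match the actual stac_model class definitions:

```python
from typing import Optional

from pydantic import BaseModel, model_validator

# Simplified stand-ins for the stac_model classes; actual field names on
# ModelInput/Statistics may differ, and the mapping below is an assumption.
class Statistics(BaseModel):
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    mean: Optional[float] = None
    stddev: Optional[float] = None

# Statistics assumed to be required by each normalization technique.
REQUIRED_STATS = {
    "z-score": ("mean", "stddev"),
    "min-max": ("minimum", "maximum"),
}

class ModelInput(BaseModel):
    norm_type: Optional[str] = None
    statistics: Optional[Statistics] = None

    @model_validator(mode="after")
    def check_norm_statistics(self) -> "ModelInput":
        required = REQUIRED_STATS.get(self.norm_type or "", ())
        missing = [
            name for name in required
            if self.statistics is None or getattr(self.statistics, name) is None
        ]
        if missing:
            raise ValueError(
                f"norm_type={self.norm_type!r} requires statistics: {missing}"
            )
        return self
```

With such a validator, ModelInput(norm_type="z-score", statistics=Statistics(mean=0.2)) would fail early with a message pointing at the missing stddev.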

🔉 Motivation

Ensure that the intended definition is properly specified.

If the user unintentionally omitted some parameters, this check would ensure the error is caught early, increasing model reusability. Not all cases can necessarily be covered, but the most common normalization techniques could be handled.

📡 Alternatives

There is currently no explicit validation.

📎 Additional context

n/a

@fmigneault (Collaborator, Author)

Thinking more about this, there might be an ambiguity regarding the definitions of norm_type, statistics, and their relationship (and therefore whether a check enforcing the relationship should be defined at all). For example, norm_type: min-max could use precomputed minimum and maximum statistics established from the training dataset to fit the input data to the model's expected ranges, but this is not necessarily the case. The desired outcome of min-max could instead be relative to the input data and computed dynamically, rather than derived from a predefined min-max based on a training dataset.

Therefore, the statistics object should probably remain only informational, since the second case would not need it. Also, maybe other norm_type values should be defined to avoid this ambiguity (e.g.: min-max-absolute vs min-max-relative?).
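
To make the two interpretations concrete, a minimal numpy sketch; the function names mirror the hypothetical min-max-absolute / min-max-relative values suggested above:

```python
import numpy as np

def min_max_absolute(data: np.ndarray, minimum: float, maximum: float) -> np.ndarray:
    """Scale using precomputed statistics, e.g. established from the training dataset."""
    return (data - minimum) / (maximum - minimum)

def min_max_relative(data: np.ndarray) -> np.ndarray:
    """Scale relative to the input itself, computed dynamically per sample."""
    return (data - data.min()) / (data.max() - data.min())
```

Only the first variant would consume the statistics object; the second ignores it entirely.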

@rbavery (Collaborator)

rbavery commented Oct 30, 2024

I agree there's some ambiguity here, but I think we can avoid this by making our definitions clear, possibly by introducing another field. I'd prefer not to make these fields merely informational; we use them on our backend to determine what preprocessing to apply. This metadata is so crucial to reproducing model inference, and so often lost or left unclear, that enforcing it in the schema would be valuable (if norm_type is specified at all).

I think what we currently call min-max normalization is really for scaling values to another precision or to a common range. In practice I think it is always used in a relative manner for imagery, though I could be wrong. But in any case, it's not really normalization (adjusting the distribution relative to the population). This line of thinking follows the definitions used by Kaggle: https://www.kaggle.com/code/alexisbcook/scaling-and-normalization

Maybe we are overloading the norm_type field to define the method for both scaling and normalization, which are not always the same: both may be done, neither may be done, or only one or the other may be applied.

Should we instead have a scale_type field with a reduced set of scaling approaches taken from norm_type, and keep only the norm_type options that change the distribution of the overall dataset? We could also advise that dynamic operations like "min-max-relative" be applied as a processing expression, which I think currently assumes per-sample computation. A rough sketch of the split is below.
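
A rough sketch of what that split could look like on ModelInput; the scale_type field name and the example value sets are hypothetical, not from the spec:

```python
from typing import Optional

from pydantic import BaseModel

class ModelInput(BaseModel):
    # Scaling: map values to another range or precision,
    # e.g. "min-max", "clip" (illustrative values only).
    scale_type: Optional[str] = None
    # Normalization: adjust the distribution relative to the population,
    # e.g. "z-score" (illustrative values only).
    norm_type: Optional[str] = None
    # Dynamic, per-sample operations (like a relative min-max) would be
    # expressed via a processing expression rather than either field.
```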
