fix JSON schema with MLM fields + support pydantic/pystac objects #2

Merged
merged 25 commits into from
Apr 9, 2024

fmigneault
Collaborator

PR just to demonstrate the changes from crim-ca#2

@rbavery
Owner

rbavery commented Mar 30, 2024

@fmigneault awesome thanks. I can review this next week when it is ready.

@fmigneault
Collaborator Author

@rbavery
I think I have addressed most of the editorial items.
I'll work on the JSON-schema definition to reflect the descriptions at the start of next week.

@fmigneault
Collaborator Author

@rbavery Almost there with the JSON schema. Only a few definitions about mlm:output and the MLM-roles are left to implement. I made a lot of adjustments to the README to place the sections closer to where they are first mentioned. There was a lot of back and forth between Item/Asset fields when following links.

@fmigneault marked this pull request as ready for review on April 5, 2024, 04:30
@fmigneault changed the title from "[wip] address PR comments" to "fix JSON schema with MLM fields + support pydantic/pystac objects" on Apr 5, 2024
@rbavery merged commit 9d14ac6 into rbavery:validate on Apr 9, 2024
Owner

@rbavery left a comment

@fmigneault this looks great. Some requests to edit the documentation and revised schema

| `detection` | `detection` | Generic detection of the "presence" of objects or entities, with or without positions. |
| `object-detection` | *n/a* | Task corresponding to the identification of positions as bounding boxes of objects detected in the scene. |
| `segmentation` | `segmentation` | Generic task that regroups all types of segmentation tasks consisting of applying labels to pixels. |
| `semantic-segmentation` | *n/a* | Specific segmentation task where all pixels are attributed labels, without consideration of similar instances. |
Owner

Suggested change
| `semantic-segmentation` | *n/a* | Specific segmentation task where all pixels are attributed labels, without consideration of similar instances. |
| `semantic-segmentation` | *n/a* | Specific segmentation task where all pixels are attributed labels, without consideration for segments as unique objects. |

such a model that produces pixel-wise "classifications" should be attributed the `segmentation` task
(and more specifically `semantic-segmentation`) rather than `classification`. To avoid this kind of ambiguity,
it is strongly recommended that `tasks` always aim to provide the most specific definitions possible to explicitly
describe what the model accomplishes.
Owner

great explanation here!
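
The recommendation to prefer the most specific task can be sketched as a small check. This is only an illustrative sketch: the `GENERIC_TO_SPECIFIC` mapping and the `suggest_specific_tasks` helper are hypothetical and not part of the extension, and the generic/specific pairing is an assumption drawn from the task names quoted in this review.

```python
# Hypothetical pairing of generic tasks with more specific variants.
# Assumption: based only on the task names quoted in this review.
GENERIC_TO_SPECIFIC = {
    "detection": ["object-detection"],
    "segmentation": ["semantic-segmentation"],
}


def suggest_specific_tasks(tasks):
    """Return generic tasks that are listed without any specific variant.

    The result maps each such generic task to the specific candidates
    a publisher could consider adding.
    """
    suggestions = {}
    for task in tasks:
        candidates = GENERIC_TO_SPECIFIC.get(task, [])
        if candidates and not any(c in tasks for c in candidates):
            suggestions[task] = candidates
    return suggestions


# A model declared only with the generic task gets a suggestion.
print(suggest_specific_tasks(["segmentation"]))
```

A check like this would only hint at a more specific name; it would not reject a model that legitimately performs the generic task.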

- `MXNet`
- `Keras`
- `Caffe`
- `Weka`
Owner

Should Weka be listed? I've never heard of it or seen it in the wild.

Here's a suggested reordering based on my subjective interpretation of current popularity + longevity.

I also added rgee and spatialRF to showcase some R options. Especially in academia, lots of folks use R, particularly random forest models for semantic segmentation.

I removed Caffe (no updates in 4 years) and MXNet (archived last year). I don't think anyone will publish models for these frameworks.

I removed ONNX since it isn't a training framework, and I think the purpose of this field is to describe the framework used to train the model. This might differ from the inference runtime and format.

Suggested change
- `Weka`
- `PyTorch`
- `TensorFlow`
- `Scikit-learn`
- `Huggingface`
- `Keras`
- `rgee`
- `spatialRF`
- `JAX`
- `PyMC`
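
As a rough sketch of how the suggested names could be carried into the Python side of the schema, the list above can be expressed as a `Literal` type. The `Framework` alias and the `is_known_framework` helper are hypothetical names used for illustration; the names in the enum are copied from the suggestion above and the final schema may differ.

```python
from typing import Literal, get_args

# Framework names copied from the suggested list in this review.
# Assumption: the final enum in the schema may contain other values.
Framework = Literal[
    "PyTorch",
    "TensorFlow",
    "Scikit-learn",
    "Huggingface",
    "Keras",
    "rgee",
    "spatialRF",
    "JAX",
    "PyMC",
]

KNOWN_FRAMEWORKS = get_args(Framework)


def is_known_framework(name: str) -> bool:
    """Check a framework name against the suggested list (case-sensitive)."""
    return name in KNOWN_FRAMEWORKS


print(is_known_framework("PyTorch"))
print(is_known_framework("ONNX"))
```

The match is deliberately exact, so `pytorch` would not be accepted; whether the field should tolerate other spellings or free-form values is a separate design question for the schema.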


### Accelerator Enum
In most cases, this should correspond to common library names of well-established ML frameworks.
Owner

Suggested change
In most cases, this should correspond to common library names of well-established ML frameworks.
This should correspond to the common library name of the well-established ML framework used to train the model.

- `wrap-fill-outliers`
- `wrap-inverse-map`

See [OpenCV - Normalization Flags](https://docs.opencv.org/4.x/d2/de8/group__core__array.html#ga87eef7ee3970f86906d69a92cbf064bd)
Owner

I think this reference to normalization flags needs to be switched with the reference to interpolation.

Long term, I'd be interested in picking a different reference than OpenCV's C++ documentation, since the OpenCV library is lower level than what most folks encounter and the docs are a bit hard to follow (a Python programmer might get confused by the C data types, for example). But I think this is better than us rolling our own.


#### Model Artifact Media-Type

Not all ML framework, libraries or model artifacts provide explicit media-type. When those are not provided, custom
Owner

Suggested change
Not all ML framework, libraries or model artifacts provide explicit media-type. When those are not provided, custom
Not all ML frameworks, libraries or model artifacts provide an explicit media-type. When those are not provided, custom


This value can be used to provide additional details about the specific model artifact being described.
For example, PyTorch offers various strategies for providing model definitions, such as Pickle (`.pt`), TorchScript,
or the compiled approach. Since they all refer to the same ML framework,
Owner

Suggested change
or the compiled approach. Since they all refer to the same ML framework,
or the upcoming [Ahead-Of-Time Compiled .pt2 format](https://pytorch.org/docs/main/torch.compiler_aot_inductor.html). Since they all refer to the same ML framework,


| Artifact Type | Description |
|--------------------|--------------------------------------------------------------------------------------------------------------------------|
| `torch.compile` | A model artifact obtained by [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). |
Owner

Suggested change
| `torch.compile` | A model artifact obtained by [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). |
| `.pt2` | A model artifact obtained using [PyTorch's AOTInductor](https://pytorch.org/docs/main/torch.compiler_aot_inductor.html). |

I think this is a necessary edit, since `torch.compile` is an API for compiling PyTorch `nn.Module`s. It's used to speed up PyTorch code in general, for training or inference. Many of the backend internals that make `torch.compile` work are used in AOTInductor (the tool that creates compiled model artifacts), but they aren't the same thing, and I think here we want to refer to artifacts produced by AOTInductor.

| Artifact Type | Description |
|--------------------|--------------------------------------------------------------------------------------------------------------------------|
| `torch.compile` | A model artifact obtained by [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). |
| `torch.jit.script` | A model artifact obtained by [`TorchScript`](https://pytorch.org/docs/stable/jit.html). |
Owner

Suggested change
| `torch.jit.script` | A model artifact obtained by [`TorchScript`](https://pytorch.org/docs/stable/jit.html). |
| `torchscript` | A model artifact obtained by [`TorchScript Scripting`](https://pytorch.org/docs/stable/jit.html) and/or [`TorchScript Tracing`](https://pytorch.org/docs/stable/generated/torch.jit.trace.html). |

There are two types of graph capture in TorchScript: trace and script. I think we can either enumerate both or leave only one option. Either traced or scripted models can be loaded the same way, so I favor only one field for both of them.

Somewhat confusingly, both can also be used together, though this is not common. https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/

Since TorchScript is being phased out in favor of AOTInductor, I think we shouldn't make this too complex and should only provide them as one field.
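
A minimal sketch of the single-field idea, assuming a consumer wants to collapse the older labels into one value: since scripted and traced modules are both loaded with `torch.jit.load`, one `torchscript` artifact type is enough to pick a loader. The alias set and the `normalize_artifact_type` helper are hypothetical and not part of the extension.

```python
# Hypothetical labels a publisher might have used for TorchScript artifacts.
TORCHSCRIPT_ALIASES = {"torch.jit.script", "torch.jit.trace", "torchscript"}


def normalize_artifact_type(artifact_type: str) -> str:
    """Collapse scripted and traced TorchScript labels into 'torchscript'.

    Any other artifact type is returned unchanged.
    """
    if artifact_type in TORCHSCRIPT_ALIASES:
        return "torchscript"
    return artifact_type


for label in ("torch.jit.script", "torch.jit.trace", ".pt2"):
    print(label, "->", normalize_artifact_type(label))
```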

data_type: DataType


# MLMClassification: TypeAlias = Annotated[
Owner

This commented-out code can be deleted.

4 participants