Skip to content
This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

New Machine Learning Model Extension Version 2.0.alpha schema and (de)serialization, validation package #2

Merged
merged 113 commits into from
Apr 18, 2024

Conversation

rbavery
Copy link

@rbavery rbavery commented Dec 10, 2023

Here's a python package for validation, import, and export of the stac model metadata standard. I'm thinking this should be named stac-model on PYPI?

This uses the old model metadata standard I have been using and ports my pydantic code. I'll update this to get it in-line with the DLM spec and then it'll be ready for a code review.

I've linked everything in the README assuming this will be transferred to the stac-extensions repo pretty soon, which I think will help with visibility and adoption.

cc @fmigneault-crim

@rbavery rbavery changed the title initial package structure with example json initial python package for model metadata validation Dec 10, 2023
@fmigneault
Copy link
Collaborator

@rbavery
Great work! Thanks for this contribution. I will wait until you consider it ready for review.
I agree with stac-model on PyPI if the long-term decision is for this extension to replace ml-model extension.

@fmigneault fmigneault self-assigned this Dec 12, 2023
@fmigneault fmigneault self-requested a review December 12, 2023 16:48
@rbavery
Copy link
Author

rbavery commented Jan 6, 2024

Hey @fmigneault happy new year! Here's an update on progress with the stac_model validator and updates to the extension.

I've made some substantial updates and looking to wrap this PR up early next week. I've made some extensive edits to the schema for more tasks and using common metadata objects where possible (Statistics, Raster, Asset Object).

The hackmd for this README is at https://hackmd.io/@cHP95b4sTDWQdP7uy1Vv7A/rkneCaru6 If you have any time to comment on it in the coming weeks I'd much appreciate it! You can also refer to this Changelog where I've tried to keep track of some substantial updates.

I'll need to make some further changes to make sure this complies with STAC Extension guidelines before this is ready to merge.

Copy link
Collaborator

@fmigneault fmigneault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ rbavery for this first draft!

I didn't dig too much in the code (yet) since I felt there was already enough items to discuss about from the spec itself.

I'm guessing most of the generic files under stac_model such as editorconfig, changes, etc. would be moved at the root? Is this only temporary while working on it from another source repo copy?

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
README.md Outdated

- Examples:
- [Example with a UNet trained with thelper](examples/item.json)
- [Example with a ??? trained with torchgeo](examples/item.json) TODO update example
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previously mentioned UNet is what the contents of https://github.com/crim-ca/dlm-extension/blob/main/examples/model-arch-summary.txt refers to.
This is based on https://github.com/nyoki-mtl/pytorch-segmentation/blob/master/src/models/decoder.py that uses a generic pytorch model.
If you can provide a more relevant model with torchgeo, then I'm fine with updating this reference.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I'd like to use torchgeo for the example since I think it is the most popular framework specific to EO for training deep learning models. They have pretrained models on EO datasets, Sentinel-2, Landsat, and they intend to host others.

I've been working with their 13 band pretrained Sentinel-2 Resnet model and was thinking of proposing this as the new example: https://torchgeo.readthedocs.io/en/stable/tutorials/trainers.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree to use the more popular torchgeo. Just need to update the example with the corresponding output. The Sentinel-2 Resnet is good. I've been using it also.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The https://github.com/rbavery/dlm-extension/blob/validate/stac_model/example.json seems appropriate. Is there a persistent location where the .pt could be hosted?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huggingface is down now but I'll upload the Torchscript version of https://huggingface.co/torchgeo/resnet18_sentinel2_all_moco/ I've been working with that corresponds to the example.json

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coming back to this, I can upload this after remaking a notebook demo with the new schema and a new version of stac_model

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Show resolved Hide resolved
@rbavery
Copy link
Author

rbavery commented Jan 8, 2024

Thanks @ rbavery for this first draft!

I didn't dig too much in the code (yet) since I felt there was already enough items to discuss about from the spec itself.

I'm guessing most of the generic files under stac_model such as editorconfig, changes, etc. would be moved at the root? Is this only temporary while working on it from another source repo copy?

Yes should have specified but let's focus on discussing the spec! I can move this to root once it is finalized.

README.md Outdated Show resolved Hide resolved
Copy link
Author

@rbavery rbavery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmigneault here's my review from rbavery#2, it's looking really comprehsensive. Looking forward to using this extension!

README.md Outdated
| `detection` | `detection` | Generic detection of the "presence" of objects or entities, with or without positions. |
| `object-detection` | *n/a* | Task corresponding to the identification of positions as bounding boxes of object detected in the scene. |
| `segmentation` | `segmentation` | Generic tasks that regroups all types of segmentations tasks consisting of applying labels to pixels. |
| `semantic-segmentation` | *n/a* | Specific segmentation task where all pixels are attributed labels, without consideration of similar instances. |
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `semantic-segmentation` | *n/a* | Specific segmentation task where all pixels are attributed labels, without consideration of similar instances. |
| `semantic-segmentation` | *n/a* | Specific segmentation task where all pixels are attributed labels, without consideration for segments as unique objects. |

README.md Show resolved Hide resolved
- `MXNet`
- `Keras`
- `Caffe`
- `Weka`
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should Weka be listed? I've never heard of it or seen it in the wild.

Here's a suggested reordering based on my subjective interpretation of current popularity + longevity.

I also added rgee and spatialRF to showcase some R options. especially in academia, lots of folks use R, particularly random forest models for semantic segmentation.

I removed Caffe (no updates in 4 years) and MxNet (archived last year). I don't think anyone will publish models for these frameworks.

Removed ONNX since it isn't a training framework and I think the purpose of this field is to describe the framework used to train the model. this might be different than the inference runtime and format.

Suggested change
- `Weka`
- `PyTorch`
- `TensorFlow`
- `Scikit-learn`
- `Huggingface`
- `Keras`
- `rgee`
- `spatialRF`
- `JAX`
- `PyMC`

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weka is not as common as others, but I have seen it once or twice. I prefer to have the option than not.
I think it is a good idea to include ONNX. Remember that MLM is not for training only. It can be used for inference runtime.
I don't think popularity or older frameworks should be ignored. People might want to support older algorithms that work well.
Will add the R-based references, those are good examples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I'm fine to include Weka then and include the old frameworks.

I think it is a good idea to include ONNX. Remember that MLM is not for training only. It can be used for inference runtime.

I totally agree (in fact I don't think MLM is a good fit for describing how to train a model), but we call out that this framework field should denote the framework used for training in the table. ONNX isn't used to train ML models. Instead, that can be denoted in the asset details. I f we include it, someone could get confused and not supply the actual framework used for training.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated
- `wrap-fill-outliers`
- `wrap-inverse-map`

See [OpenCV - Normalization Flags](https://docs.opencv.org/4.x/d2/de8/group__core__array.html#ga87eef7ee3970f86906d69a92cbf064bd)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this reference to normalization flags needs to be switched with the earlier reference to interpolation/resize methods.

long term I'd be interested in picking a different reference than OpenCV's C++ documentation, since the OpenCV lib is lower level than most folks encounter and the docs are a bit hard to follow (python programmer might get confused with the C data types for example). But I think this is better than us rolling our own.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Ideally, MLM would have its own listing and description (with OpenCV or other libs as references for corresponding implementations), but I didn't feel like replicating the formulas provided by OpenCV here. Some normalizations have very subtle differences, so it is not really trivial to describe properly instead of simply showing the actual formula.

I'm open to other references, but unless we got them really quickly and exhaustively, I'd leave them to a follow-up PR not to delay this one further.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes let's do this later, no need to block this

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated

| Artifact Type | Description |
|--------------------|--------------------------------------------------------------------------------------------------------------------------|
| `torch.compile` | A model artifact obtained by [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). |
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `torch.compile` | A model artifact obtained by [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). |
| `.pt2` | A model artifact obtained using [Pytorch's AOTInductor](https://pytorch.org/docs/main/torch.compiler_aot_inductor.html). |

I think this is a necessary edit since torch.compile is an API for compiling pytorch nn.Modules. it's used to speed up Pytorch code in general, for training or inference. Many of the backend internals that make torch.compile work are used in AOTInductor (the tool that creates compiled model artifacts) but they aren't the same thing and I think here we want to refer to artifacts produced by AOTInductor

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some reference mentions of AOTInductor and torch.export. Let me know what you think of the edit.

I must say, though, I'm not 100% convinced about these definitions included in the spec (or rather, how explicit to be about them). They are still fairly prone to change soon. Even the doc mentions that many terms are used interchangeably.

https://pytorch.org/docs/main/torch.compiler.html#torch-compiler

In some cases, the terms torch.compile, TorchDynamo, torch.compiler might be used interchangeably in this documentation.

There is also a lot of ambiguity about torch.export using TorchDynamo under the hood, but then, torch.compile accepts backend='inductor' for TorchInductor (https://pytorch.org/docs/main/torch.compiler.html#torch-compiler), although they are also listed as distinct technologies under torch.compiler. And then, there is torch.export.save/torch.export.save for .pt2 (https://pytorch.org/docs/main/export.html#serialization) as parallel to the torch.save/torch.load for .pt.

I do not want to make the maintenance of the spec a burden for any new subtle change from pytorch. It's a fast-evolving ecosystem. Since this is an open field, I fear it is getting slightly too specific for this framework.

README.md Outdated
| Artifact Type | Description |
|--------------------|--------------------------------------------------------------------------------------------------------------------------|
| `torch.compile` | A model artifact obtained by [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). |
| `torch.jit.script` | A model artifact obtained by [`TorchScript`](https://pytorch.org/docs/stable/jit.html). |
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `torch.jit.script` | A model artifact obtained by [`TorchScript`](https://pytorch.org/docs/stable/jit.html). |
| `torchscript` | A model artifact obtained by [`TorchScript Scripting`](https://pytorch.org/docs/stable/jit.html) and/or [`TorchScript Tracing`](https://pytorch.org/docs/stable/generated/torch.jit.trace.html). |

There are two types of graph capture in torchscript, trace and script. I think we can either enumerate both or leave only one option. Either traced or scripted models can be loaded the same way so I favor only one field for both of them.

Somewhat confusingly, both can also be used together though this is not common. https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/

Since Torchscript is bring phased out in favor of AOTInductor, I think we shouldn't make this too complex and only provide them as one field.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to other comment about the export. I'm not 100% convinced about this. I got somewhat mixed feelings about it.
I find that torchscript might just make it more ambiguous by not being explicit (about the actual function) whether it refers to torch.jit.trace or torch.jit.script, but then, it might also just be a technicality about pytorch that we should omit for MLM since this table is already extremely PyTorch-opinionated.

Also, I would like to remind that, even if things are being phased out, it does not mean we can ignore them. Some code/artifacts exists already, and users might what to represent them with the corresponding old technology.

@fmigneault
Copy link
Collaborator

@rbavery
I do not have the access to push directly to your repo, so you might need to re-merge https://github.com/crim-ca/dlm-extension/tree/validate into yours to reflect the changes here.

I also thought a bit more about the media-type for the artifact, and I think it is better to guide users in a single direction than multiple combinations.
Let me know about this 1fb5f21#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R508-R531

@fmigneault
Copy link
Collaborator

Almost ready to merge!
Worked around most linting issues and tests.
Only got this issue to complete: stac-utils/stac-node-validator#78

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants