how to reference the ML model extension from other STAC items/collections #3
Comments
I think it makes sense to have it for STAC Items as well as Collections. The use case for STAC Items is that, given an output image with classification results, that image's metadata could be uploaded to a STAC API with a definition of the STAC ML Model that produced it, for data lineage. This somewhat overlaps with what the processing extension does, but adds a lot of missing details (its expression object is not sufficient to represent a whole model inference pipeline). An alternative could also be to add an entry to Collection normally does a

I don't think processing should be removed. There are cases where only a simple arithmetic expression to combine bands or perform simple pixel-wise manipulations is applied. Using ML-model for such cases would be overkill and too verbose compared to the simple format that processing offers.

I agree that adding more examples would be the most useful. Otherwise, everything remains very convoluted given the number of extensions combined for this kind of use with ML Model.
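To make the Item-level lineage use case above concrete, here is a minimal sketch of a STAC Item for a classification output that points back to the model that produced it. The relation name `ml-model`, the item ID, and the URL are assumptions made up for illustration; nothing in this thread has settled on them.

```python
import json

# Hypothetical relation type; the thread has not agreed on a name.
MODEL_REL = "ml-model"


def make_prediction_item(item_id, model_href):
    """Build a minimal STAC Item dict for an inference output that links to its model."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": None,
        "properties": {"datetime": "2024-01-01T00:00:00Z"},
        "links": [
            # The lineage reference: which model produced this output.
            {"rel": MODEL_REL, "href": model_href, "type": "application/json"},
        ],
        "assets": {},
    }


# Placeholder ID and URL, for illustration only.
item = make_prediction_item(
    "classified-scene-001",
    "https://example.com/models/landcover-v1.json",
)
print(json.dumps(item, indent=2))
```

A simple band-math product would not need this link at all, which is the case where the processing extension's expression object is probably enough.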
I like this idea! The scenario I'm thinking of is when a user has not one STAC Item but hundreds of thousands of STAC Items. I think it's preferable not to duplicate this relatively complex metadata across many items. Doing so may unnecessarily raise storage and search costs for large collections of items and make viewing an individual STAC Item more cluttered. Maybe we could standardize that the model metadata filename
Since it pertains to model training and validation, do you think the current plan should be to describe the splits in the ML AOI extension? And maybe eventually the ML Model extension could suggest including this at the Collection level?
Agreed that this isn't needed or doesn't need to be recommended for simple pixel manipulations. My concern is that recommending it at all might be confusing for folks looking to describe the minimal amount of metadata needed to discover and run an ML model.
This was my concern with including the processing extension as the top field in the object table for Data Object. But I'm fine with recommending it. I'd like to call out in each Object table which fields are required vs. recommended vs. optional/situational, and reorder them so required fields are at the top. Does that sound helpful?
The way the metadata is stored can be optimized to avoid duplication between items shared by a collection. This is an implementation detail in my opinion. Each STAC Item should report the information from the API individually because doing a STAC search might yield only partial or overlapping results. The STAC Items returned might not all be from the same collection, or might be accessed without going through the collection first. I think it is a good idea to suggest providing the derived link as a best practice. I'm not sure if it should be enforced though, since there is not a clear way to distinguish the model's link from any other
Yes. I had something like that in mind. I'm already using
The only fields I thought were interesting were
and
The others can be ignored since they are optional.
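The point earlier in this comment, that storage can be deduplicated while each Item still reports the model reference on its own, could look roughly like the sketch below. It assumes a hypothetical `ml-model` relation name and a hypothetical server-side mapping from collection ID to model URL; a real STAC API would do this inside its own response layer.

```python
import copy

# Hypothetical relation type; the thread has not agreed on a name.
MODEL_REL = "ml-model"


def attach_model_link(item, model_href_by_collection):
    """Return a copy of `item` carrying its collection's model link, if one is known."""
    out = copy.deepcopy(item)
    href = model_href_by_collection.get(out.get("collection"))
    links = out.setdefault("links", [])
    already = any(
        link.get("rel") == MODEL_REL and link.get("href") == href for link in links
    )
    if href is not None and not already:
        links.append({"rel": MODEL_REL, "href": href, "type": "application/json"})
    return out


# Stored once per collection (placeholder URL), injected per item at response time.
model_href_by_collection = {
    "landcover-predictions": "https://example.com/models/landcover-v1.json",
}
search_results = [
    {"id": "a", "collection": "landcover-predictions", "links": []},
    {"id": "b", "collection": "other-collection", "links": []},
]
enriched = [attach_model_link(i, model_href_by_collection) for i in search_results]
```

Because the link is attached per Item, a search that mixes Items from several collections still returns each one with its own model reference.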
Good point on the role suggestion. I think introducing something like a new role makes sense here, and metadata providers would appreciate a clear requirement. In this PR I introduced a new role for referencing geoparquet, and I think we could do something similar here to introduce an ml-model role: https://github.com/stac-utils/stac-geoparquet/blob/7cac0b08c06bff8773a49f7d4dd420ea777d965a/spec/stac-geoparquet-spec.md#referencing-a-stac-geoparquet-collections-in-a-stac-collection-json

However, I don't think this extension should define an Asset Object with a role field, since Asset Object fields are not searchable. Instead, if I'm reading this right, we would use a Link Object and define a new media type? I'm uncertain if that's the right way to reference this, though.

I'll make a separate issue for the processing discussion. Thanks for your comments!
I think the Asset Object could be sufficient even if it is not searchable. I think the main purpose is to provide data lineage, so that one can understand where model predictions and derived data come from. By accessing these references, it is then possible to reconstruct a pipeline of derived products. I don't think it would be common for someone to use this link to search for all data derived from a given model. However, I'm not against adding more metadata either. A standardized relation type would need to be defined to avoid conflicts with other extensions.
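As a rough sketch of reconstructing a pipeline of derived products by following references, the function below walks lineage links breadth-first over an in-memory set of documents. `derived_from` is a relation type listed in the STAC Item spec; `ml-model` is a hypothetical name, and the document keys are made-up placeholders standing in for fetched JSON.

```python
from collections import deque


def trace_lineage(start_href, documents, rels=("derived_from", "ml-model")):
    """Follow lineage links breadth-first and return the hrefs visited, in order."""
    seen = [start_href]
    queue = deque([start_href])
    while queue:
        href = queue.popleft()
        doc = documents.get(href, {})  # unknown documents are treated as leaves
        for link in doc.get("links", []):
            target = link.get("href")
            if link.get("rel") in rels and target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


# Placeholder documents keyed by href.
documents = {
    "prediction.json": {
        "links": [
            {"rel": "ml-model", "href": "model.json"},
            {"rel": "derived_from", "href": "scene.json"},
            {"rel": "self", "href": "prediction.json"},
        ]
    },
    "model.json": {
        "links": [{"rel": "derived_from", "href": "training-collection.json"}]
    },
}
lineage = trace_lineage("prediction.json", documents)
print(lineage)
```

This only needs the references to be present and resolvable, not searchable, which matches the lineage-first purpose described above.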
I agree that's the main purpose, but there should probably be some flat fields that folks will use to search. I could see folks wanting to search on the enums for tasks and accelerators defined in #2.
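For illustration, searching on flat fields might reduce to something like the filter below. The property names `ml-model:tasks` and `ml-model:accelerator` and the enum values are hypothetical stand-ins, not the names defined in #2.

```python
# Hypothetical flat property names; the actual names are defined in the PR, not here.
TASKS_FIELD = "ml-model:tasks"
ACCELERATOR_FIELD = "ml-model:accelerator"


def filter_models(items, task=None, accelerator=None):
    """Return the items whose flat properties match the requested task and accelerator."""
    matches = []
    for item in items:
        props = item.get("properties", {})
        if task is not None and task not in props.get(TASKS_FIELD, []):
            continue
        if accelerator is not None and props.get(ACCELERATOR_FIELD) != accelerator:
            continue
        matches.append(item)
    return matches


# Placeholder model items with made-up enum values.
models = [
    {"id": "m1", "properties": {TASKS_FIELD: ["segmentation"], ACCELERATOR_FIELD: "cuda"}},
    {"id": "m2", "properties": {TASKS_FIELD: ["classification"], ACCELERATOR_FIELD: "cpu"}},
    {"id": "m3", "properties": {TASKS_FIELD: ["segmentation"], ACCELERATOR_FIELD: "cpu"}},
]
segmentation_on_cpu = filter_models(models, task="segmentation", accelerator="cpu")
```

Keeping these values as top-level properties rather than nested inside an Asset or Link object is what makes this kind of filter straightforward.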
I agree it wouldn't be common here, but I could see it. I was thinking it would be more common to search for the source data given someone has the STAC model extension metadata, or to find an ML model given some source data, which would be common in scenarios where a STAC dataset is published specifically as an ML training dataset.

Should we have two relation types? One could be for referring to source data from the model JSON and the other could be for referring to the model from the source dataset JSON. For the model referring to the source, we could use the existing
For the source referring to the model (in the case of STAC training datasets) we could invent |
Yes. Those definitely need to be distinguished. I believe
For the model, I think simply |
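To illustrate the two-direction idea discussed above, here is a small sketch that cross-links a model document and a training dataset using one relation type per direction. Both relation names are placeholders invented for this sketch; they are not the names proposed in this thread, and the file names are made up.

```python
# Placeholder relation names, invented for this sketch only.
MODEL_TO_SOURCE_REL = "x-model-source-data"
SOURCE_TO_MODEL_REL = "x-trained-model"


def cross_link(model_doc, model_href, dataset_doc, dataset_href):
    """Add one link in each direction between a model document and its training dataset."""
    model_doc.setdefault("links", []).append(
        {"rel": MODEL_TO_SOURCE_REL, "href": dataset_href}
    )
    dataset_doc.setdefault("links", []).append(
        {"rel": SOURCE_TO_MODEL_REL, "href": model_href}
    )


def hrefs_for(doc, rel):
    """List the hrefs of all links on `doc` with the given relation type."""
    return [link["href"] for link in doc.get("links", []) if link.get("rel") == rel]


model_doc = {"id": "landcover-v1", "links": []}
dataset_doc = {"id": "training-dataset", "links": []}
cross_link(model_doc, "model.json", dataset_doc, "training-dataset.json")
```

With distinct names, a client can tell "this model was trained on that dataset" apart from "this dataset has that model trained on it" without inspecting the target document.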
If anyone has time and thoughts on how to catalog ML models, I have a WIP rework of the DLM extension.
The PR: #2
The new README detailing the schema: https://hackmd.io/@cHP95b4sTDWQdP7uy1Vv7A/rkneCaru6
My main questions right now are:
Should this extension only exist for the collection level?
My take is yes, since the ML AOI extension could handle specifying the specific train/val/test splits used to create a model. This inference-focused extension could generally refer to the dataset/collection used with the model once, and reduce the redundancy of duplicating ML model information for each item representing a scene in a STAC Collection.
Is it ok to remove the processing extension?
My thinking here is yes: this is something that can be included in the Collection JSON, but it doesn't need to be a requirement for the ML Model spec. Maybe we could offer examples of pairing essential ML Model extension fields for search and inference with fields from other extensions, like processing, that are more specific to the dataset qualities and might be useful for an ML practitioner to know.