-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ADR to handle dataset information in TIMDEX #125
Conversation
How this addresses that need: * Adds new ADR markdown in 'docs/adrs' Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-193
In this way, the TIMDEX data model is lacking a field that can be used to describe the "spatial type or structure of a dataset". Maybe stated more generally, a field that describes | ||
the structure of the data _inside_ the format noted in `format`. | ||
|
||
It is also worth noting that data values for both facets -- present in fields `dct_format_s` and `gbl_resourceType_sm` from the Aardvark records -- will be controlled values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if ResourceType
is more or less confusing than Form
for the new object name? I do think Form
works and I will support it, but the fact that at least one other use case of this type of data refers to it as ResourceType
seems worth explicitly considering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JPrevost - do you think ResourceType
could still take values formerly used in literary_form
? e.g.
"resource_type":[
{
"value":"Nonfiction",
"kind":"Literary Form"
}
]
I'd also wonder how this would to content_type
which we define in the spec as:
"Type of content of item; e.g., "still image", "text", etc."
Sketching an example:
{
"content_type": "Geospatial data", # this feels pretty well established
"format": "Shapefile",
"resource_type": [
{
"value":"Polygon",
"kind": "Data Type"
},
{
"value":"Vector",
"kind": "Data Type"
}
]
}
Given the GIS data, it seems that whatever we name this field it should allow multiple values (some MIT and OGM records have 3-4 "Data Type" qualities in the metadata). I'd worry that And, that content_type
and resource_type
seem conceptually similar, but content_type
is single value string, while resource_type
is multivalued object.format
kind of sits between them hierarchically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think literary_form
could still collapse into this object with either the name form
or resourceType
. I'm not sure which I prefer yet fwiw. I think form
has a meaning in our current language that is not how we are using it here. I also think that we aren't using it wrong fwiw, but it isn't the first meaning I would intuit when trying to use the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also noting that our current content_type
appears to be an array of Strings and not just a single String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also nothing that our mappings for our sources should be confirmed.
We have thesis records from Alma and DSpace for the same thesis that have different content_types, literary_forms, and formats.
This is not directly related to if we create a new object, but the lack of consistency across sources feels important to note as we consider what these closely related fields mean.
Example:
{
"source": "DSpace@MIT",
"title": "Gardens of resistance",
"contributors": [
{
"kind": "author",
"value": "Jhaveri, Nynika\n (Nynika P.)"
},
{
"kind": "other",
"value": "Massachusetts Institute of Technology. Department of Architecture."
},
{
"kind": "department",
"value": "Massachusetts Institute of Technology. Department of Architecture"
}
],
"contentType": [
"Thesis"
],
"literaryForm": null,
"format": "electronic resource"
},
{
"source": "MIT Alma",
"title": "Gardens of resistance",
"contributors": [
{
"kind": "Not specified",
"value": "Jhaveri, Nynika (Nynika P.)"
},
{
"kind": "Not specified",
"value": "Massachusetts Institute of Technology. Department of Architecture"
}
],
"contentType": [
"Manuscript language material"
],
"literaryForm": "Nonfiction",
"format": null
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also noting that our current
content_type
appears to be an array of Strings and not just a single String.
Good catch! Thanks. Crossing that out from above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in agreement with the core proposal here. I want to consider if Form
is the best label for this new Object.
I may have some very small additional details to add some more context and once I have had a chance to do that and we have considered Form
vs ResourceType
vs ?
I will feel confident in approving this as-written.
Solid write up, thanks!
* "Data Type" | ||
* values like: `[Polygon, Point, Raster, Image, etc.]` | ||
* "Format" | ||
* values like: `[Shapefile, TIFF, GeoTIFF, JPEG, etc.]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again, these feel more like file_formats
, a field which was not used in MITAardvark
but maybe should have been?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that theoretically allow:
content_type = Geospatial data
format = [Polygon, Point, Raster, Image, etc.]
file_formats = [Shapefile, TIFF, GeoTIFF, JPEG, etc.]
Or do we want to keep file_formats
focused on MIME type values like application/pdf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. I think that might be another option to document?
i.e. move [Shapefile, TIFF, GeoTIFF, JPEG, etc.]
to file_formats... and [Polygon, Point, Raster, Image, etc.]
slide into format
... no data model changes required?
(I'm not saying I'd pick that option, but it feels like an option to document)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can write that up and push a commit, if that works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggesting @ehanson8. file_formats
felt like a non-starter because other Transmog sources were writing mimetypes... but would probably be relatively easy to update them.
I'm liking that proposal quite a bit, but did want to note that would mean the "Format" filter in the UI would actually be coming from file_formats
and "Data Type" filter would be coming from format
. Would probably want to document that in the UI why this mapping was selected.
For any other sources that write to format
, I'm assuming we might want to consider if the values they are writing might then belong in file_formats
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed commit with the 4th option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the addition of the fourth option as one to consider in the ADR.
#### Option 4 - Use `file_formats` for current `format` values and `format` for Data Type values | ||
|
||
In this approach, the current `MITAardvark.format` values would shift to the previously unused `MITAardvark.file_formats` property and the Data Type values would be stored in `MITAardvark.format` | ||
|
||
Example: | ||
```json | ||
|
||
{ | ||
"content_type": "Geospatial data", | ||
"format": ["Polygon", "Point", "Raster", "Image"], | ||
"file_formats": ["Shapefile", "TIFF", "GeoTIFF", "JPEG"] | ||
} | ||
``` | ||
|
||
Pros: | ||
* does not require TIMDEX data model changes | ||
|
||
Cons: | ||
* `file_formats` has previously only stored MIME type values, such as `application/pdf` | ||
* may require explanation of the facet mapping in the UI documentation | ||
* may require updates of other transform classes for consistency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding Option 4 pretty appealing now.
One question: what if we had a record for a print map? If the Aardvark record had dct_format_s="Print map"
, are we okay with this value mapping to file_formats
?
Maybe helpful for discussion, this spreadsheet has dct_format_s
and gbl_resourceType_sm
values across all MIT and OGM records.
In support of Option 4, I'm hard pressed to find a value that isn't a digital file format in essence (where even msot of the complex ones could be mapped to "Online resource" or something similar).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question: if we go this route, are we okay losing mimetype as a value in a TIMDEX record? My inclination is that it would be okay as I'd imagine this field is not widely used yet.
If we did want specifics about mimetypes, perhaps that could get added to the links
section? Seems more meaningful for a mimetype to point precisely to an actual resource vs the record as an abstract whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we did want specifics about mimetypes, perhaps that could get added to the
links
section? Seems more meaningful for a mimetype to point precisely to an actual resource vs the record as an abstract whole.
Agree that including it with the actual download link would be more useful, Link.file_type
attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question: what if we had a record for a print map? If the Aardvark record had
dct_format_s="Print map"
, are we okay with this value mapping tofile_formats
?
If we do go this route, we may want to change the name from file_formats
so it's more broadly applicable but I don't have a suggestion yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question: what if we had a record for a print map? If the Aardvark record had
dct_format_s="Print map"
, are we okay with this value mapping tofile_formats
?If we do go this route, we may want to change the name from
file_formats
so it's more broadly applicable but I don't have a suggestion yet
I've had time to look more closely at the spreadsheet linked above, and I don't think we'll encounter "Print Maps", at least in the gismit
or gisogm
sources. So maybe this is moot? I think it's safe to say that the values we get back from dct_format_s
in the MITAardvark files will uniformly be digital in nature. And therefore, file_formats
could be applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we did want specifics about mimetypes, perhaps that could get added to the
links
section? Seems more meaningful for a mimetype to point precisely to an actual resource vs the record as an abstract whole.Agree that including it with the actual download link would be more useful,
Link.file_type
attribute?
If we did repurpose file_formats
, and therefore needed to move those mimetype data points to another field, and we did like putting them on the links
entries, I might propose mimetype
explicitly; no ambiguity there.
If we go that route, I think we'd want to make sure the links made sense though. If we're just pointing someone back to an Alma record, or ASpace page, I'm not sure that an link.mimetype=application/pdf
makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I can weigh in on what to do with the mimetype values at the moment, but as far as the rest of this choice is concerned, I think I'd prefer option 4 over the others - at least as I understand them now.
To be clear, this option would leave the current literary_form
field untouched?
@ghukill and @ehanson8 Thank you for writing up detailed descriptions for possible solutions. Here are my thoughts:
Question: Would it be a big lift if we were to:
Such that....we can specify---
? |
A couple thoughts here... If we did this, I'd propose naming the class As for this change itself, a couple of considerations:
Having been down this road a bit with preservation systems, I think we might want to stay out of the game of trying to juggle "friendly" and mimetype forms. There are python library we could lean on to give us one from the other, but there are always edge cases. Zooming out a bit -- something I think Option 4 kind of raises as a question -- perhaps we don't need mimetypes at all? Or, not needed for To me, the appeal of Option 4 would be:
|
As if this discussion doesn't have enough context, wanted to add a bit more. After some normalization of terms in the GeoHarvester (specifically This gives a pretty good preview of what kind of actual values we're looking to slot into the data model, and see in the UI. One observation: for the OGM I would propose we filter this down in Transmogrifier to only allow the values we want to see for "Data Types", e.g. "Polygon", "Point", etc., and then put the rest into |
Given the considerations you mentioned, it seems like the additional changes I proposed would prove challenging, so please feel ignore! I do like the idea of "updat[ing] sources that do currently write to file_formats to write "human" friendly forms". Curious to see what others think! 🤓 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns with the third option, which I've tried to explain above. Ultimately, though, I don't know that I'm familiar enough with this problem space to attempt to veto any of the four options.
I have a separate question about our change processes which this has brought up for me - how significant an undertaking is it for the data model to change? Several of these options include "pro" statements about not requiring the data model to change, and I want to make sure I understand that prospect accurately. Are we avoiding changes to the data model because:
- Doing so actively causes problems, or
- Because it is a complex process that must be undertaken thoughtfully (and which takes significant time), or
- Because we've yet to make these sorts of data model changes, so the process is completely undefined?
* "Data Type" | ||
* values like: `[Polygon, Point, Raster, Image, etc.]` | ||
* "Format" | ||
* values like: `[Shapefile, TIFF, GeoTIFF, JPEG, etc.]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the addition of the fourth option as one to consider in the ADR.
``` | ||
|
||
Pros: | ||
* allows collapsing of `literary_form` field; noting some shared sentiment that this field might be too source-specific for TIMDEX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I don't know that I feel strongly enough to veto this option, I do have concerns.
While I agree that, in hindsight, literary_form
field seems to over-fit the data model to one data source, I'm not sure how comfortable I am creating a vocabulary that encompasses all of these values: nonfiction
, fiction
, point
, polygon
, raster
, image
Terms like point
, polygon
, and raster
feel like they address the internal structure of the record, but nonfiction
and fiction
are more about the content (akin to genre). Equivalent literary terms feel like they might be prose
, verse
, or drama
(which are values that I don't think appear in any current field?)
image
is a value I also struggle with in this set, as I understand it to be closely related to (if not synonymous with) raster
. I vaguely remember seeing raster data that is not an image, so maybe there's something here I'm missing though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matt-bernhardt would your concern be addressed by the different kind
provided by this Object?
Example:
GIS record
{
"form": [
{
"value": "Polygon",
"kind": "Data Type"
},
{
"value": "Vector",
"kind": "Data Type"
}
]
}
Literary form example
{
"form": [
{
"value": "Fiction",
"kind": "Literary Form"
}
]
}
I remain unsure myself if kind
addresses the "sameness" across fields. When I imagine using this data in a User Interface, I find myself agreeing with your concern that the geo values looks super weird along with the literary form values. We could address that by exposing aggregations that are filtered to the separate kind
values but then it starts to feel slightly weird again of why are we combining things in OpenSearch that we would then split out again in GraphQL and thus nobody would ever use the combined data. It definitely starts to feel like if we are concerned we'll never use the data together -- because it is indeed "different" -- we need to consider if it ever should have been together in the first place. I'm still not sure.
#### Option 4 - Use `file_formats` for current `format` values and `format` for Data Type values | ||
|
||
In this approach, the current `MITAardvark.format` values would shift to the previously unused `MITAardvark.file_formats` property and the Data Type values would be stored in `MITAardvark.format` | ||
|
||
Example: | ||
```json | ||
|
||
{ | ||
"content_type": "Geospatial data", | ||
"format": ["Polygon", "Point", "Raster", "Image"], | ||
"file_formats": ["Shapefile", "TIFF", "GeoTIFF", "JPEG"] | ||
} | ||
``` | ||
|
||
Pros: | ||
* does not require TIMDEX data model changes | ||
|
||
Cons: | ||
* `file_formats` has previously only stored MIME type values, such as `application/pdf` | ||
* may require explanation of the facet mapping in the UI documentation | ||
* may require updates of other transform classes for consistency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I can weigh in on what to do with the mimetype values at the moment, but as far as the rest of this choice is concerned, I think I'd prefer option 4 over the others - at least as I understand them now.
To be clear, this option would leave the current literary_form
field untouched?
Why these changes are being introduced: After multiple discussions about the various options outlined in the first version, it was decided to take another pass at the ADR context and decision. This reflects some discussions to emphasize the data model as an evolving one, in which this decision may feed into future decisions, while hopefully solving an immediate issue for a TIMDEX source. How this addresses that need: * Reworks ADR 0002 Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-200
Please find in this new commit a fairly major rework of the ADR:
The new proposal is a synthesis of previous options, and discussions this week. The data model changes would be minimal or even non-existent, but there is data model discussion in the ADR and implications for future work. Thanks again to all for the engagement! |
I wanted to point out one subtle change that might go unnoticed in this reworked form. Previously, examples for "Data Type" had shown This new proposal puts these values in |
2348b16
to
bb82f62
Compare
Agreed. Including the 'data' suffix feels friendlier to use cases beyond GeoData. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support this decision. While losing Geospatial data
is a shame, I agree that considering content_class
is out of scope of this ADR. (Though, I think I'd support adding that field, too.) In the meantime, if we need to identify whether a record is geospatial, we should be able to infer as much from source
.
Thank you to the DataEng team for your thoughtful consideration of this issue, and to Graham for the clear and detailed write-up!
@ghukill Option 5 sounds like the best option. I agree with @jazairi 's comment that the values for Also--thank you for the detailed ADR! I appreciate your time and effort into capturing all the ideas/learnings that came out of our teamwide discussions. These notes will serve as a helpful reference to understand the reasoning behind our TIMDEX data model(s). 📓 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also in favor of option 5, the added examples were very helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for facilitating this conversation, everyone. Based on the way the document now reads, I'd support choosing option 5 from those laid out. I'm also likely in favor of the future addition of a content_class
field.
The example values across a variety of sources was particularly helpful in coming to terms with this.
I see that the Decision area is still left as TBD, so I'm not sure that I'm in a position to approve as-is - but I think I'm with the others so far who have said they support that choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like my approval to signal a specific option. Could you update the status to Approved
and the decision to Option 5
and dismiss all current reviews.
I do support option 5, but it's unclear if all of the other approvals are approving option 5 or something else because the document doesn't have a decision listed :)
Thanks!
82cac11
to
44e9ef9
Compare
Thanks for the ADR logistical help here @JPrevost! ADR marked as "Accepted", decision logged, and all reviewers have been re-requested. |
Purpose and background context
This PR introduces a new Architecture Decision Record (ADR) that proposes to add a new
form
field to the TIMDEX data model.This discussion was kicked off when seeking to understand what TIMDEX fields to use for facet filters in a new geospatial TIMDEX UI. Reaching a decision on where to store this data in a TIMDEX record will unblock work for DataEng (work to store the data) and EngX (utilize the data for facets).
Rendered markdown for current version of the ADR: https://github.com/MITLibraries/transmogrifier/blob/GDT-193-adr-handle-data-forms/docs/adrs/0002-field-for-data-type-form-information.md
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)