-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gdt 201 data type refactor #136
Conversation
Why these changes are being introduced: * Updating the TIMDEX data model in accordance with ADR 0002-field-for-data-type-form-information.md How this addresses that need: * Remap content_type to the gbl_resourceType_sm value * Remove gbl_resourceType_sm from get_subjects method * Update corresponding unit tests Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-201
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the mapping change looks good. But I left a comment for discussion about whether or not we want to keep ["Not specified"]
as a default value for records now that we're slightly reimagining content_type
.
I would observe that this would have bearing on other sources: if we do remove this default behavior, then other sources would potentially "lose" this default value for content_type
, which to me would be okay.
That said, barring a decision there, looking good.
"https://geodata.libraries.mit.edu/record/abc:123", | ||
content_type=["Geospatial data"], | ||
citation="Test title 1. https://geodata.libraries.mit.edu/record/abc:123", | ||
content_type=["Not specified"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if a value of "Not specified" is what we want for the content_type
?
Took me a minute to track it down, but looks like this is applied by the base Transformer
class here if the source transformer does not set it.
When content_type
was considered a higher level type (if maybe somewhat variable across sources), I think this made sense. But if we interpret it as a slightly lower level type, where it might be okay to be NULL
, we might want to consider removing this logic from the base Transformer and allowing it to be NULL
.
Taking content_type
examples from the table in the ADR, I'm not sure what we gain by having "Not specified"
showing up in those values.
I do think this might make sense for the proposed, higher level content_class
, and that probably should be a required field as well. But as it stands, I believe that content_type
is not required, and therefore we could just leave it set as None
or an empty list []
.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If helpful for discussion, here is a breakdown of the format
/ content_type
combinations from MIT data:
┌───────────────────────────────────────┬───────────┬─────────────┐
│ content_type │ format │ combo_count │
│ varchar[] │ varchar │ int64 │
├───────────────────────────────────────┼───────────┼─────────────┤
│ [Polygon data] │ Shapefile │ 1089 │
│ [Point data] │ Shapefile │ 620 │
│ [Line data] │ Shapefile │ 293 │
│ [Polygon data, Point data] │ Shapefile │ 33 │
│ [Mixed] │ │ 10 │
│ [Line data, Point data] │ Shapefile │ 4 │
│ [Not specified] │ GeoTIFF │ 3 │
│ [Not specified] │ │ 3 │
│ [Line data, Point data, Polygon data] │ Shapefile │ 1 │
│ [Polygon data, Point data, Mixed] │ Shapefile │ 1 │
│ [Mixed] │ Shapefile │ 1 │
├───────────────────────────────────────┴───────────┴─────────────┤
│ 11 rows 3 columns │
└─────────────────────────────────────────────────────────────────┘
And OGM data (with some of the long tail omitted in the results, but looking at all 120k records):
┌───────────────────────────────────────────────┬────────────┬─────────────┐
│ content_type │ format │ combo_count │
│ varchar[] │ varchar │ int64 │
├───────────────────────────────────────────────┼────────────┼─────────────┤
│ [Image data] │ TIFF │ 30775 │
│ [Polygon data] │ Shapefile │ 26702 │
│ [Raster data] │ GeoTIFF │ 10084 │
│ [Line data] │ Shapefile │ 8146 │
│ [Not specified] │ JPEG │ 7094 │
│ [Point data] │ Shapefile │ 7090 │
│ [Topographic maps, Military maps] │ GeoTIFF │ 4129 │
│ [Not specified] │ GeoTIFF │ 4004 │
│ [Fire insurance maps] │ JPEG │ 3871 │
│ [Not specified] │ PDF │ 3009 │
│ [Image data] │ │ 2185 │
│ [Cadastral maps] │ JPEG │ 1782 │
│ [Atlases] │ JPEG │ 1245 │
│ [Raster data] │ MrSID │ 1233 │
│ [Not specified] │ Mixed │ 1223 │
│ [Geological maps] │ JPEG │ 1216 │
│ [Raster data] │ TIFF │ 916 │
│ [Aerial photographs] │ JPEG │ 680 │
│ [Not specified] │ │ 517 │
│ [Thematic maps] │ JPEG │ 453 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ [Raster data] │ PDF │ 1 │
│ [Early maps, World maps] │ JPEG2000 │ 1 │
│ [Zoning maps, Thematic maps] │ TIFF │ 1 │
│ [Mine maps, Topographic maps] │ JPEG │ 1 │
│ [Pictorial maps, Tourist maps] │ JPEG │ 1 │
│ [Aerial views, Pictorial maps, Thematic maps] │ JPEG │ 1 │
│ [Not specified] │ GeoJSON │ 1 │
│ [Table data, Line data] │ Shapefile │ 1 │
│ [Vector data, Raster data] │ Shapefile │ 1 │
│ [Vector data, Point data, Polygon data] │ Shapefile │ 1 │
│ [Nautical charts] │ JPEG2000 │ 1 │
│ [Early maps, Atlases] │ JPEG │ 1 │
│ [Vector data, Point data, Polygon data] │ GeoPackage │ 1 │
│ [Table data, Point data] │ │ 1 │
│ [Cadastral maps, Topographic maps] │ JPEG │ 1 │
│ [Multi-spectral data] │ Mixed │ 1 │
│ [Bathymetric maps, Thematic maps] │ JPEG │ 1 │
│ [Thematic maps, Fire insurance maps] │ JPEG │ 1 │
│ [Topographic maps] │ │ 1 │
│ [Multi-spectral data] │ Shapefile │ 1 │
├───────────────────────────────────────────────┴────────────┴─────────────┤
│ 168 rows (40 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that Not specified
no longer serves the same purpose but I would be inclined to group that change with the creation of the content_class
field rather than make the change here. That feels like a significant data model change (which I would feel about almost changes to the Transformer
class) to throw into this PR without having a fully considered what the replacement looks like and the impact on all the sources. So, yes but after we're ready to properly replace it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might pushback a bit that this is a significant data model change, insofar as it's just removing a default value of "Not specified" when the transformer class does not explicitly set it. By removing this logic, we don't modify any values, just lose this default.
Ran a check against all sources in Dev1, looks like 17,273 records have "Not specified", and here's a breakdown by source:
+----------+------------+
|source |count_star()|
+----------+------------+
|MIT Alma |7913 |
|DSpace@MIT|9360 |
+----------+------------+
But, to your point, it does feel like that change would be more aligned with the content_class
(name still TBD) work. Happy to wait and roll into a different PR! Thanks for considering here.
@@ -66,6 +65,7 @@ def test_aardvark_get_optional_fields_non_field_method_values_success( | |||
): | |||
transformer = MITAardvark("cool-repo", aardvark_record_all_fields) | |||
record = next(transformer) | |||
assert record.content_type == ["Vector data"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥳
@@ -126,7 +126,7 @@ def get_optional_fields(self, source_record: dict) -> dict | None: | |||
fields["alternate_titles"] = self.get_alternate_titles(source_record) or None | |||
|
|||
# content_type | |||
fields["content_type"] = ["Geospatial data"] | |||
fields["content_type"] = source_record.get("gbl_resourceType_sm") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignoring the comment above about whether this be allowed remain None
or a potentially empty list, I think it's a good sign that this change was so slight. Feels like the transformer for Aardvark records is flexible in a good way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given some discussion about removing "Not specified" as part of a different PR, approving this one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! As discussed, there will be a separate effort for the addition of a content_class
field.
Also, PR will need to be rebased with main
prior to merging, I think.
Purpose and background context
Implementing the decision described in ADR #2
How can a reviewer manually see the effects of these changes?
Run the following command and observe that
Vector data
appears incontent_type
:Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES
What are the relevant tickets?
Developer
Code Reviewer(s)