Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gdt 201 data type refactor #136

Merged
merged 2 commits into from
Feb 29, 2024
Merged

Gdt 201 data type refactor #136

merged 2 commits into from
Feb 29, 2024

Conversation

ehanson8
Copy link
Contributor

Purpose and background context

Implementing the decision described in ADR #2

How can a reviewer manually see the effects of these changes?

Run the following command and observe that Vector data appears in content_type:

pipenv run transform -o output/gismit.json -i  tests/fixtures/aardvark/aardvark_record_all_fields.jsonl -s gismit

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Updating the TIMDEX data model in accordance with ADR 0002-field-for-data-type-form-information.md

How this addresses that need:
*  Remap content_type to the gbl_resourceType_sm value
* Remove gbl_resourceType_sm from get_subjects method
* Update corresponding unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-201
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the mapping change looks good. But I left a comment for discussion about whether or not we want to keep ["Not specified"] as a default value for records now that we're slightly reimagining content_type.

I would observe that this would have bearing on other sources: if we do remove this default behavior, then other sources would potentially "lose" this default value for content_type, which to me would be okay.

That said, barring a decision there, looking good.

"https://geodata.libraries.mit.edu/record/abc:123",
content_type=["Geospatial data"],
citation="Test title 1. https://geodata.libraries.mit.edu/record/abc:123",
content_type=["Not specified"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if a value of "Not specified" is what we want for the content_type?

Took me a minute to track it down, but looks like this is applied by the base Transformer class here if the source transformer does not set it.

When content_type was considered a higher level type (if maybe somewhat variable across sources), I think this made sense. But if we interpret it as a slightly lower level type, where it might be okay to be NULL, we might want to consider removing this logic from the base Transformer and allowing it to be NULL.

Taking content_type examples from the table in the ADR, I'm not sure what we gain by having "Not specified" showing up in those values.

I do think this might make sense for the proposed, higher level content_class, and that probably should be a required field as well. But as it stands, I believe that content_type is not required, and therefore we could just leave it set as None or an empty list [].

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If helpful for discussion, here is a breakdown of the format / content_type combinations from MIT data:

┌───────────────────────────────────────┬───────────┬─────────────┐
│             content_type              │  format   │ combo_count │
│               varchar[]               │  varchar  │    int64    │
├───────────────────────────────────────┼───────────┼─────────────┤
│ [Polygon data]                        │ Shapefile │        1089 │
│ [Point data]                          │ Shapefile │         620 │
│ [Line data]                           │ Shapefile │         293 │
│ [Polygon data, Point data]            │ Shapefile │          33 │
│ [Mixed]                               │           │          10 │
│ [Line data, Point data]               │ Shapefile │           4 │
│ [Not specified]                       │ GeoTIFF   │           3 │
│ [Not specified]                       │           │           3 │
│ [Line data, Point data, Polygon data] │ Shapefile │           1 │
│ [Polygon data, Point data, Mixed]     │ Shapefile │           1 │
│ [Mixed]                               │ Shapefile │           1 │
├───────────────────────────────────────┴───────────┴─────────────┤
│ 11 rows                                               3 columns │
└─────────────────────────────────────────────────────────────────┘

And OGM data (with some of the long tail omitted in the results, but looking at all 120k records):

┌───────────────────────────────────────────────┬────────────┬─────────────┐
│                 content_type                  │   format   │ combo_count │
│                   varchar[]                   │  varchar   │    int64    │
├───────────────────────────────────────────────┼────────────┼─────────────┤
│ [Image data]                                  │ TIFF       │       30775 │
│ [Polygon data]                                │ Shapefile  │       26702 │
│ [Raster data]                                 │ GeoTIFF    │       10084 │
│ [Line data]                                   │ Shapefile  │        8146 │
│ [Not specified]                               │ JPEG       │        7094 │
│ [Point data]                                  │ Shapefile  │        7090 │
│ [Topographic maps, Military maps]             │ GeoTIFF    │        4129 │
│ [Not specified]                               │ GeoTIFF    │        4004 │
│ [Fire insurance maps]                         │ JPEG       │        3871 │
│ [Not specified]                               │ PDF        │        3009 │
│ [Image data]                                  │            │        2185 │
│ [Cadastral maps]                              │ JPEG       │        1782 │
│ [Atlases]                                     │ JPEG       │        1245 │
│ [Raster data]                                 │ MrSID      │        1233 │
│ [Not specified]                               │ Mixed      │        1223 │
│ [Geological maps]                             │ JPEG       │        1216 │
│ [Raster data]                                 │ TIFF       │         916 │
│ [Aerial photographs]                          │ JPEG       │         680 │
│ [Not specified]                               │            │         517 │
│ [Thematic maps]                               │ JPEG       │         453 │
│       ·                                       │  ·         │           · │
│       ·                                       │  ·         │           · │
│       ·                                       │  ·         │           · │
│ [Raster data]                                 │ PDF        │           1 │
│ [Early maps, World maps]                      │ JPEG2000   │           1 │
│ [Zoning maps, Thematic maps]                  │ TIFF       │           1 │
│ [Mine maps, Topographic maps]                 │ JPEG       │           1 │
│ [Pictorial maps, Tourist maps]                │ JPEG       │           1 │
│ [Aerial views, Pictorial maps, Thematic maps] │ JPEG       │           1 │
│ [Not specified]                               │ GeoJSON    │           1 │
│ [Table data, Line data]                       │ Shapefile  │           1 │
│ [Vector data, Raster data]                    │ Shapefile  │           1 │
│ [Vector data, Point data, Polygon data]       │ Shapefile  │           1 │
│ [Nautical charts]                             │ JPEG2000   │           1 │
│ [Early maps, Atlases]                         │ JPEG       │           1 │
│ [Vector data, Point data, Polygon data]       │ GeoPackage │           1 │
│ [Table data, Point data]                      │            │           1 │
│ [Cadastral maps, Topographic maps]            │ JPEG       │           1 │
│ [Multi-spectral data]                         │ Mixed      │           1 │
│ [Bathymetric maps, Thematic maps]             │ JPEG       │           1 │
│ [Thematic maps, Fire insurance maps]          │ JPEG       │           1 │
│ [Topographic maps]                            │            │           1 │
│ [Multi-spectral data]                         │ Shapefile  │           1 │
├───────────────────────────────────────────────┴────────────┴─────────────┤
│ 168 rows (40 shown)                                            3 columns │
└──────────────────────────────────────────────────────────────────────────┘

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that Not specified no longer serves the same purpose but I would be inclined to group that change with the creation of the content_class field rather than make the change here. That feels like a significant data model change (which I would feel about almost changes to the Transformerclass) to throw into this PR without having a fully considered what the replacement looks like and the impact on all the sources. So, yes but after we're ready to properly replace it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might pushback a bit that this is a significant data model change, insofar as it's just removing a default value of "Not specified" when the transformer class does not explicitly set it. By removing this logic, we don't modify any values, just lose this default.

Ran a check against all sources in Dev1, looks like 17,273 records have "Not specified", and here's a breakdown by source:

+----------+------------+
|source    |count_star()|
+----------+------------+
|MIT Alma  |7913        |
|DSpace@MIT|9360        |
+----------+------------+

But, to your point, it does feel like that change would be more aligned with the content_class (name still TBD) work. Happy to wait and roll into a different PR! Thanks for considering here.

@@ -66,6 +65,7 @@ def test_aardvark_get_optional_fields_non_field_method_values_success(
):
transformer = MITAardvark("cool-repo", aardvark_record_all_fields)
record = next(transformer)
assert record.content_type == ["Vector data"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

@@ -126,7 +126,7 @@ def get_optional_fields(self, source_record: dict) -> dict | None:
fields["alternate_titles"] = self.get_alternate_titles(source_record) or None

# content_type
fields["content_type"] = ["Geospatial data"]
fields["content_type"] = source_record.get("gbl_resourceType_sm")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring the comment above about whether this be allowed remain None or a potentially empty list, I think it's a good sign that this change was so slight. Feels like the transformer for Aardvark records is flexible in a good way.

@ghukill ghukill self-requested a review February 29, 2024 14:42
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given some discussion about removing "Not specified" as part of a different PR, approving this one!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! As discussed, there will be a separate effort for the addition of a content_class field.

Also, PR will need to be rebased with main prior to merging, I think.

@ehanson8 ehanson8 merged commit 208fa36 into main Feb 29, 2024
4 checks passed
@ehanson8 ehanson8 deleted the GDT-201-data-type-refactor branch February 29, 2024 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants