Merge pull request #125 from MITLibraries/GDT-193-adr-handle-data-forms

ADR to handle dataset information in TIMDEX
MITLibraries · Feb 27, 2024 · eb13daf · eb13daf
2 parents 285a02d + 44e9ef9
commit eb13daf
Showing 1 changed file with 253 additions and 0 deletions.
diff --git a/docs/adrs/0002-field-for-data-type-form-information.md b/docs/adrs/0002-field-for-data-type-form-information.md
@@ -0,0 +1,253 @@
+# 2. Data Model Fields to Support "Format" and "Data Type" UI Filters
+
+Follows: https://mitlibraries.github.io/guides/misc/adr.html
+
+Date: 2024-02-15
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Identify TIMDEX fields to store information about GIS resources that will support "Format" and "Data Type" UI facet filters for the GDT project.
+
+### Overview
+
+With the addition of geospatial data to TIMDEX, and a new "geo" focused UI, the question was raised as to what fields should be used for two new UI facet filters:
+
+  * "Data Type"
+    * values like: `[Polygon, Point, Raster, Image, ...]`
+  * "Format"
+    * values like: `[Shapefile, TIFF, GeoTIFF, JPEG, ...]`
+
+This would mirror functionality in the [current "Geoweb" application](https://geodata.mit.edu/) that is slated to be sunsetted.
+
+The source records come from a new [GeoHarvester](https://github.com/MITLibraries/geo-harvester), which normalizes geospatial metadata from a variety of formats to the
+geospatial, discovery scoped metadata format [Aardvark](https://opengeometadata.org/ogm-aardvark/).  Transmogrifier then converts these Aardvark records into TIMDEX records.
+
+Data for both UI filters will be controlled terms in the Aardvark records created by GeoHarvester:
+
+  * UI filter "Format" from Aardvark field `dct_format_s`:
+    * https://opengeometadata.org/ogm-aardvark/#format-values (e.g. "Shapefile", "GeoTIFF", "TIFF" etc.)
+  * UI filter "Data Type" from Aardvark field `gbl_resourceType_sm` (from either list):
+    * https://opengeometadata.org/ogm-aardvark/#resource-type-values-loc (e.g. `[Atlases, Cadastral maps, Quadrangle maps]`)
+    * https://opengeometadata.org/ogm-aardvark/#resource-type-values-ogm (e.g. `[Polygon data, Raster data, Line data, Point data]`)
+
+The first iteration mapped in the following way:
+  * `TIMDEX.content_type` = "Geospatial data" (static value for all records)
+  * `TIMDEX.format` = values from `Aardvark.dct_format_s`
+  * `TIMDEX.subjects@kind="Subject scheme not provided"` = values from `Aardvark.gbl_resourceType_sm`
+
+This resulted in a TIMDEX record like:
+```json
+{
+    "content_type": "Geospatial data",
+    "format": "Shapefile",
+    "subjects": [
+        {
+            "value": "Polygon data",
+            "kind": "Subject scheme not provided"
+        },
+        {
+            "value": "Vector data",
+            "kind": "Subject scheme not provided"
+        }
+    ]
+}
+```
+
+#### 1- The static value "Geospatial data" for `content_type` removes `content_type` as an option for these other data points
+
+It was originally thought that `content_type` was a very high level bucket, allowing various TIMDEX sources to share a facet with values like,
+[Geospatial data, Language materials, Musical sound recording, Projected medium, ...].
+
+However, on further analysis it was observed that `content_type` _could_ be a good fit for values like `Polygon data` or `Raster data`, as the long tail of the current `content_type` data shows quite a few examples
+of hierarchically and thematically similar values for other sources like [Thesis, Article, Diaries, Scrapbooks, ...].  These values are concerned with the conceptual or structural makeup of the resource,
+not the container or external form which field `format` is more aligned with.
+
+#### 2- DSpace is currently the only TIMDEX source that writes to `format` with a hardcoded value `electronic resource`
+
+Analyzing the use of `format` revealed that _only_ the DSpace source was writing to this TIMDEX field.  This obscured the possibility that some external/container oriented terms
+currently found in `content_type` -- from various pre-existing TIMDEX sources -- might make more sense as `format` values.  And, that other TIMDEX sources could begin writing to this field for 
+the first time.
+
+Better understanding this, the external container values found in `Aardvark.dct_format_s` like "Shapefile" or "GeoTIFF" seems aligned with the `format` field.  And, what should
+not be ignored, this results in naming parity across all hops: `Aardvark.dct_format_s --> TIMDEX.format --> UI:Format`.
+
+#### 3- The "resource type" data (from `Aardvark.gbl_resourceType_sm`) are arguably not subjects, and furthermore do not have a qualifier that allows cherry-picking them for aggregation
+
+While it was initially convenient to place values from `Aardvark.gbl_resourceType_sm` in `subjects`, it was generally agreed upon that they are not subjects.  Furthermore, even
+with a `kind` qualifier, it would be awkward at best to access them, or at worst reduce the meaning and usefulness of `subjects` across all TIMDEX sources.
+
+Still thinking that `content_type` was tied up with the static "Geospatial data" value, this was a primary blocker.  `content_type` is ultimately a better fit for these values, is already
+multivalued, and has precedence with other source mappings like source `datacite` where a source metadata field `ResourceType.resourceTypeGeneral` maps to `content_type` (note the metadata 
+field naming parity with Aaardvark).
+
+#### 4- Where do we encode "Geospatial data" then?
+
+Considering all above, where then _can_ we encode "Geospatial data" to differentiate 100k GIS records from 1m ebooks in Alma?  How can a user quickly cull _all_ of TIMDEX down to just records that
+likely have geospatial data (i.e. GIS layers) without needing to immediately jump to much more granular distinctions like "Polygon data" or "Raster data", thereby also removing relevant records from their
+search/browsing?
+
+This might indicate an area where the TIMDEX data model could grow: adding a higher level, highly controlled, "categorical" field that would group records at this level.  An
+example is Primo, which has a "Resource Type" filter which filters at this level (ignoring the naming collision of Aardvark's `gbl_resourceType_sm` field which has a narrower scope).  While 
+TIMDEX is not Primo, it is also not _not_ Primo insofar as it aggregates metadata records from fairly disparate platforms and sources.  This act of aggregation might require an additional level of 
+categorical hierarchy to filter records by.  
+
+A potentially important note: we might _not_ expect to find the values/terms for this high level, categorical field in the metadata records themselves, because their descriptive focus is not 
+concerned with that level of description.
+
+Examples:
+- a GIS FGDC XML file does not need to indicate "Geospatial data" anywhere in the record
+- an ArchivesSpace EAD record does not need to indicate "Archival material" anywhere in the record
+- Alma is trickier: MARC records often _do_ attempt to classify at this high level, and some of these have ended up in `content_type` and likely could be "pushed up" to this new field
+
+Given the presence of `content_type` already, to model after other aggregation platforms and metadata schemas, this new field might be something along the lines of `content_class`.  These values may 
+be static per TIMDEX source (e.g. ASpace = "Archival Materials", GIS = "Geospatial data") or the source's metadata records may contribute to the logic (e.g. MARC's `006` and `008` field relationship 
+that might provide `content_class` level terms like ["Books", "Music", "Visual Materials", "Maps", ...]).
+
+It is likely beyond the scope of this ADR to propose the addition of that field, but introducing it as a thought experiment takes some conceptual pressure off of `content_type` which could operate
+as intended, and to some degree already is, as more granular "type" or "form" information about the resource.
+
+### Possible Solutions
+
+#### Option 1- Use `subjects` with `kind="Data Type"`
+
+In this approach, "Data Type" values would be stored as `subjects` with `kind="Data Type"`.
+
+Example:
+```json
+{
+    "subjects": [
+        {
+            "value": "Polygon",
+            "kind": "Data Type"
+        },
+        {
+            "value": "Vector",
+            "kind": "Data Type"
+        }
+    ]
+}
+```
+
+Pros:
+  * does not require a change to TIMDEX data model anywhere
+
+Cons:
+  * these "Data Type" values don't feel like subjects; they are not really _about_ the resource so much as describing its type/structure/form 
+
+
+#### Option 2- Create new, multivalued string field `form`
+
+In this approach, "Data Type" value would be stored in a new, multivalued string field `form`:
+
+Example:
+```json
+{
+    "form": ["Polygon", "Vector"]
+}
+```
+
+Pros:
+  * purely additive change to data model
+  * simple, top level property makes aggregations very simple
+
+Cons:
+  * still require, and sit along next to, `literary_form` field for describing text sources as "Fiction" or "Nonfiction"
+
+
+#### Option 3- Create new, multivalued objects field `form`; collapse `literary_form` into this
+
+In this approach, "Data Type" value would be stored in a new, multivalued object field `form`:
+
+Example:
+```json
+{
+    "form": [
+        {
+            "value": "Polygon",
+            "kind": "Data Type"
+        },
+        {
+            "value": "Vector",
+            "kind": "Data Type"
+        }
+    ]
+}
+```
+
+Pros:
+  * allows collapsing of `literary_form` field; noting some shared sentiment that this field might be too source-specific for TIMDEX
+  * like other object fields, leaves the door open for adding a `uri` property at a later time
+
+Cons:
+  * would require reworking the transformations + re-indexing any sources that use `literary_form`
+  * nested field type, a bit harder to query for aggregations
+
+#### Option 4 - Use `file_formats` for current `format` values and `format` for Data Type values
+
+In this approach, the current `MITAardvark.format` values would shift to the previously unused `MITAardvark.file_formats` property and the Data Type values would be stored in `MITAardvark.format`
+
+Example:
+```json
+
+{
+    "content_type": "Geospatial data",
+    "format": ["Polygon", "Point", "Raster", "Image"],
+    "file_formats": ["Shapefile", "TIFF", "GeoTIFF", "JPEG"]
+}
+```
+
+Pros:
+  * does not require TIMDEX data model changes
+
+Cons:
+  * `file_formats` has previously only stored MIME type values, such as `application/pdf`
+  * may require explanation of the facet mapping in the UI documentation
+  * may require updates of other transform classes for consistency
+
+#### Option 5 - Map `format` to "Format" filter, map `content_type` to "Data Type" filter
+
+In this approach, there would be **no** immediate data model changes.  As outlined above, both the pre-existing `format` and `content_type` fields would be sufficient
+for mapping data from the Aardvark records in such a way to support "Format" and "Data Type" UI filters.
+
+This option **does** implicitly propose a new higher level TIMDEX field, something along the lines of `content_class`, but this is not an immediate requirement, and it 
+might be helpful to decouple that from this decision at hand.
+
+- GIS TIMDEX sources
+  - continue to map Aardvark `dct_format_s` to TIMDEX `format`, driving the new "Format" UI filter
+  - map Aardvark `gbl_resourceType_sm` to TIMDEX `content_type`, driving the new "Data Type" UI filter
+    - this "Data Type" UI filter, pulling from `content_type`, likely only makes sense in the Geo TIMDEX UI context
+    - a "Content Type" facet filter across all TIMDEX records would have values like ["Short stories", "Polygon data", "Thesis", ...] sitting next to each other
+  - accept that "Geospatial data" is not an available, high level value to facet on until a field like `content_class` is modeled and added
+
+- Other TIMDEX sources
+  - DSpace
+    - remove blanket `format=electronic resource` and instead derive `format` values from available `file_formats` mimetypes for friendly forms
+    - e.g. `application/pdf` suggests "PDF", or `text/csv` suggests "CSV", to name a couple example
+    - there are python libraries that can handle 90% of these conversions, if a friendly form is not present in the library
+
+Examples across multiple sources:
+
+| `source`       | `content_class :str` [_proposed but not exists_] | `content_type :list[str]`   | `format :str` | `file_formats :list[str]`      |
+|----------------|--------------------------------------------------|-----------------------------|---------------|--------------------------------|
+| gismit/gisogm  | Geospatial data                                  | [Raster data, Image data]   | GeoTIFF       | [image/tiff]                   |
+| gismit/gisogm  | Geospatial data                                  | [Polygon data, Vector data] | Shapefile     | [application/zipped-shapefile] |
+| alma           | Language material                                | [Short stories]             | EPub Ebook    | [application/epub+zip]         |
+| alma           | Cartographic material                            | [Quadrangle map]            | Scanned map   | [image/jpeg]                   |
+| dspace         | Language material                                | [Academic thesis]           | PDF           | [application/pdf]              |
+
+## Decision
+
+The decision is to go with Option 5: Map `format` to "Format" filter, map `content_type` to "Data Type" filter.
+
+## Consequences
+
+For GIS records, both the "Format" and "Data Type" UI filters will mirror those same filters in the legacy "Geoweb" system, with the same or highly similar values.
+
+For other TIMDEX sources, given the decision outlined above, there should be little to no effect.  However, as discussed, the inability to encode "Geospatial data" may suggest that a new, higher level
+TIMDEX categorical field would be beneficial, and this would require work in Transmogrifier to provide a meaningful value for this new field for all sources.