Updated NMDC metadata to include inlined biosample metadata descriptors. #117

jeff-cohere · 2025-03-19T22:27:37Z

This PR creates a distinction between "file" and "data" descriptors. The former category includes descriptors that (ehm) describe files to be moved in transfers, whereas the latter includes in-line data which is placed into the transfer manifest. This lets us include NMDC biosample metadata, which is not stored in files within NMDC. These metadata are stored in a simple JSON format which is transplanted to the manifest as is.
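To make the file/data distinction concrete, here is a minimal sketch rather than the DTS implementation. It assumes the Frictionless convention that a resource carries either a `path` (a file) or in-line `data`. The helper names, the `bytes` field, and the example identifiers are all hypothetical.

```python
import json


def file_descriptor(name, path, nbytes):
    """Describe a file that the transfer will move (hypothetical helper)."""
    return {"name": name, "path": path, "bytes": nbytes}


def data_descriptor(name, data):
    """Describe in-line data carried inside the manifest itself (hypothetical helper)."""
    return {"name": name, "data": data}


def is_file_descriptor(descriptor):
    """A descriptor with a path and no in-line data refers to a file."""
    return "path" in descriptor and "data" not in descriptor


def build_manifest(descriptors):
    """Collect descriptors of both kinds into one manifest (assumed layout)."""
    return {"name": "manifest", "resources": list(descriptors)}


def files_to_transfer(manifest):
    """Only file descriptors name something that must actually be moved."""
    return [r["path"] for r in manifest["resources"] if is_file_descriptor(r)]


# Example: one file to move plus in-line biosample metadata copied as is.
manifest = build_manifest([
    file_descriptor("reads", "example/reads.fastq.gz", 1024),
    data_descriptor("biosample", {"id": "example-biosample"}),
])
manifest_json = json.dumps(manifest, indent=2)
```

In this sketch the in-line metadata rides along in the serialized manifest, while only the `path` entries feed the transfer itself.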

Update: well, that was fun. I'm sure there are still some minor issues remaining, but I think we at least know how we are expected to retrieve credit and biosample metadata related to data objects now. Of note:

- We extract these metadata via a new endpoint that relies on each data object's workflow execution ID.
- Analysis data objects have workflow execution IDs, but raw data objects do not. Says Alicia Clum of this: "If it is the raw data, not sure why you'd want this b/c NMDC isn't supposed to be a 'data repository'."
- For search by study, we start by fetching the study's metadata and then we get all the data objects from the study: two API calls total, since we don't need biosample metadata. This is pretty zippy.
- For filtered search and by-file-id lookup, the only way to get a file's metadata is by looking it up via the file's workflow execution ID (where it exists). In testing, I've found that each file often has its own workflow execution ID. Tracing metadata therefore needs one network roundtrip per file, even when several files belong to the same study, because we can't tell that in advance. So these queries can take a while. It may not matter too much, because by-file-id lookup is used only to check whether files are staged and during the transfer process (which is also slow!). But it still feels suboptimal.
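The per-file lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `fetch` stands in for a call to the new endpoint, and the function name and the `workflow_execution_id` key are hypothetical. The memoization by workflow execution ID is also an assumption; it only saves roundtrips when files happen to share an ID, which the testing above suggests is uncommon.

```python
def trace_metadata(files, fetch):
    """Map each file ID to its metadata, or None when no lookup is possible.

    files: list of dicts with an "id" key and an optional
           "workflow_execution_id" key (hypothetical field names).
    fetch: callable taking a workflow execution ID and returning a
           metadata dict; each call models one network roundtrip.
    """
    cache = {}   # workflow execution ID -> metadata already fetched
    result = {}
    for f in files:
        wf_id = f.get("workflow_execution_id")
        if wf_id is None:
            # Raw data objects have no workflow execution ID, so there
            # is nothing to look up.
            result[f["id"]] = None
            continue
        if wf_id not in cache:
            cache[wf_id] = fetch(wf_id)  # one roundtrip per distinct ID
        result[f["id"]] = cache[wf_id]
    return result
```

When every file has its own workflow execution ID, this degenerates to one roundtrip per file, which matches the slow path described in the last item above.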

Anyway: onward!

github-actions · 2025-03-26T22:48:40Z

PR Preview Action v1.6.0
🚀 View preview at https://kbase.github.io/dts/pr-preview/pr-117/
Built to branch `gh-pages` at 2025-04-09 14:50 UTC. Preview will be ready when the GitHub Pages deployment is complete.

jeff-cohere added enhancement, NMDC, metadata labels Mar 19, 2025

ialarmedalien approved these changes Mar 19, 2025

jeff-cohere force-pushed the frictionless-overhaul branch from c1adb08 to df3d850 Compare March 26, 2025 15:02

Base automatically changed from frictionless-overhaul to main March 26, 2025 15:05

jeff-cohere force-pushed the nmdc-md-updates branch from edaba82 to 5d836d5 Compare March 26, 2025 22:48

jeff-cohere added 9 commits April 9, 2025 07:42

Separating Frictionless descriptors into data and file variants.

55728f7

NMDC biosample metadata is now captured in-line in a DTS manifest.

a3cf97c

Preliminary inline NMDC biosample metadata (still needs more testing).

0a9b7fa

Rewiring NMDC database to use new related_resources endpoint.

21890ad

Reworked how metadata is retrieved for search (once again, with feeling!).

f134eb0

Fixing error propagation.

cc179e5

Database and endpoint configs are now validated on startup.

0fc0fde

A few more error-related touch-ups.

e0b6127

Fixed some oversights uncovered by aggressive config validation.

a145ae7

jeff-cohere force-pushed the nmdc-md-updates branch from cbbb403 to a145ae7 Compare April 9, 2025 14:50

jeff-cohere merged commit 93d21ab into main Apr 9, 2025
5 checks passed

jeff-cohere deleted the nmdc-md-updates branch April 9, 2025 14:54
