Updated NMDC metadata to include inlined biosample metadata descriptors. #117

Merged
jeff-cohere merged 9 commits into main from nmdc-md-updates
Apr 9, 2025

jeff-cohere
Collaborator

@jeff-cohere jeff-cohere commented Mar 19, 2025

This PR creates a distinction between "file" and "data" descriptors. The former category includes descriptors that (ehm) describe files to be moved in transfers, whereas the latter includes in-line data which is placed into the transfer manifest. This lets us include NMDC biosample metadata, which is not stored in files within NMDC. These metadata are stored in a simple JSON format which is transplanted to the manifest as is.
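As a rough illustration of the file/data split described above, here is a minimal sketch in Python. The field names (`name`, `path`, `bytes`, `data`, `resources`), the helper functions, and the sample values are all assumptions made for this example, not the actual manifest layout or code in this PR.

```python
import json


def file_descriptor(name, path, nbytes):
    """Describe a file that the transfer must actually move (hypothetical fields)."""
    return {"name": name, "path": path, "bytes": nbytes}


def data_descriptor(name, data):
    """Describe in-line data that is copied into the manifest as is (hypothetical fields)."""
    return {"name": name, "data": data}


def build_manifest(descriptors):
    """Return the list of paths to transfer and a manifest holding every descriptor."""
    # Only file descriptors carry a path, so only they produce transfer work.
    files_to_move = [d["path"] for d in descriptors if "path" in d]
    # Data descriptors ride along in the manifest with their JSON payload untouched.
    manifest = {"resources": list(descriptors)}
    return files_to_move, manifest


descriptors = [
    file_descriptor("reads", "hypothetical/reads.fastq.gz", 1024),
    data_descriptor("biosample", {"id": "hypothetical-biosample-id", "depth_m": 0.5}),
]
files_to_move, manifest = build_manifest(descriptors)
print(json.dumps(manifest, indent=2))
```

In this sketch the biosample entry never shows up in `files_to_move`, which is the point of the distinction: it has no file to move, only JSON to embed.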

Update: well, that was fun. I'm sure there are still some minor issues remaining, but I think we at least know how we are expected to retrieve credit and biosample metadata related to data objects now. Of note:

  • We extract these metadata via a new endpoint that relies on each data object's workflow execution ID.
  • Analysis data objects have workflow execution IDs, but raw data objects do not. Says Alicia Clum of this: "If it is the raw data, not sure why you'd want this b/c NMDC isn't supposed to be a 'data repository'."
  • For search by study, we start by fetching the study's metadata and then we get all the data objects from the study--two API calls total, since we don't need biosample metadata. This is pretty zippy.
  • For filtered search and by-file-id lookup, the only way to get a file's metadata is by looking it up via the file's workflow execution ID (where it exists). In testing, I've found that often each file has its own workflow execution ID. Tracing metadata needs a network roundtrip for each file, even when multiple files belong to the same study, because we have no way of knowing that up front. So these queries can take a while. It may not matter too much, because by-file-id lookup is used only to check whether files are staged and during the transfer process (which is also slow!). But it still feels suboptimal.
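To make the per-file lookup cost concrete, here is a small Python sketch of tracing metadata through workflow execution IDs. It is not the code in this PR: the `workflow_execution_id` key, the `trace_metadata` helper, and the stand-in `fake_fetch` function (which replaces the real network call) are all hypothetical. The sketch adds a simple cache so that files sharing an ID cost one roundtrip; as noted above, files often have their own IDs, so such a cache would help only in the cases where IDs repeat.

```python
def trace_metadata(data_objects, fetch_by_workflow_execution):
    """Map each data object's ID to its metadata, or None if it has no workflow execution ID."""
    cache = {}   # workflow execution ID -> metadata already fetched
    result = {}
    for obj in data_objects:
        wfe_id = obj.get("workflow_execution_id")
        if wfe_id is None:
            # Raw data objects have no workflow execution ID, so there is nothing to look up.
            result[obj["id"]] = None
            continue
        if wfe_id not in cache:
            # One roundtrip per distinct workflow execution ID.
            cache[wfe_id] = fetch_by_workflow_execution(wfe_id)
        result[obj["id"]] = cache[wfe_id]
    return result


calls = []


def fake_fetch(wfe_id):
    """Stand-in for the network call; records each lookup so roundtrips can be counted."""
    calls.append(wfe_id)
    return {"biosample": f"biosample-for-{wfe_id}"}


objects = [
    {"id": "file-1", "workflow_execution_id": "wfe-a"},
    {"id": "file-2", "workflow_execution_id": "wfe-b"},
    {"id": "file-3", "workflow_execution_id": "wfe-a"},
    {"id": "raw-1"},
]
metadata = trace_metadata(objects, fake_fetch)
print(len(calls), "lookups for", len(objects), "data objects")
```

When every file has a distinct workflow execution ID, the number of lookups equals the number of analysis files, which matches the slowness described in the last bullet.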

Anyway: onward!

@jeff-cohere jeff-cohere added the enhancement (New feature or request), NMDC, and metadata labels Mar 19, 2025
@jeff-cohere jeff-cohere force-pushed the frictionless-overhaul branch from c1adb08 to df3d850 Compare March 26, 2025 15:02
Base automatically changed from frictionless-overhaul to main March 26, 2025 15:05

github-actions bot commented Mar 26, 2025

PR Preview Action v1.6.0

🚀 View preview at
https://kbase.github.io/dts/pr-preview/pr-117/

Built to branch gh-pages at 2025-04-09 14:50 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@jeff-cohere jeff-cohere merged commit 93d21ab into main Apr 9, 2025
5 checks passed
@jeff-cohere jeff-cohere deleted the nmdc-md-updates branch April 9, 2025 14:54