Milestone - Define EnvO value sets for supported environmental extensions (3.2) #468

Open
ssarrafan opened this issue Oct 6, 2023 · 23 comments
@ssarrafan
Contributor

ssarrafan commented Oct 6, 2023

A key part of the schema is the allocation of different metadata elements to different environmental
packages (e.g., ‘depth’ is a required metadata element for soil and sediment samples, and conversely
‘altitude’ is required for aerial samples). In the Pilot, we directly adopted the MIxS environment packages,
and extended them with fields required by EMSL and JGI. While this provided a foundation, we identified
many areas where the MIxS environmental packages are too rigid, or are at suboptimal levels of
granularity. In collaboration with the GSC and the broader research community, we will support the
development of more specific packages for a variety of ecosystems (e.g., environments like wetlands,
mangroves or complex riparian systems should have their own package extensions, and the schema allows
for progressive refinement or crossing of packages), and continue to improve existing packages based on
community feedback. To address a common community challenge in navigating ontologies, each of these
environmental packages will be supported by defined EnvO value sets (cross-sections of the ontology with
key terms relevant for a specific environment) such that data submitters can provide precise and accurate
descriptive terms through a simple dropdown, without having to navigate the whole EnvO structure
(Submission Portal, Milestone 3.2).
Page 28

see #469 #470 #471

@ssarrafan ssarrafan converted this from a draft issue Oct 6, 2023
@mslarae13
Contributor

This should be environmental extensions. We need to make this correction across lots of things.

Do we have target extensions? Is it all the ones currently on the subport? (Which is basically all)

@ssarrafan
Contributor Author

> This should be environmental extensions. We need to make this correction across lots of things.
>
> Do we have target extensions? Is it all the ones currently on the subport? (Which is basically all)

@cmungall can you respond to Montana's questions please.

@cmungall

cmungall commented Nov 3, 2023

We should prioritize BER-relevant ones

@ssarrafan
Contributor Author

This is a Q4 milestone. Updating issue to Q4.

@ssarrafan ssarrafan moved this from Q1 2023 Oct-Dec to Q4 2024 Jul-Sep in Year 2 & 3 Milestones Jan 5, 2024
@mslarae13 mslarae13 moved this to ⚖️ Indirect in Submission Portal Tracking Apr 18, 2024
@aclum aclum changed the title Milestone - Define EnvO value sets for supported environmental packages (3.2) Milestone - Define EnvO value sets for supported environmental extensions (3.2) Jul 19, 2024
@aclum
Contributor

aclum commented Jul 19, 2024

@mslarae13 and @turbomam will get together to discuss this.

@mslarae13 mslarae13 moved this from ⚖️ Indirect to ❌ Discuss: Priority or Blocked in Submission Portal Tracking Jul 19, 2024
@aclum
Contributor

aclum commented Jul 23, 2024 via email

@turbomam
Member

turbomam commented Jul 23, 2024

Thanks @aclum and @mslarae13 for tending to this. I have been thinking about different ways to keep track of our intentions, the implementations, and whether a value set is complete. There's probably no one perfect way of doing it.

I think we should decide

  • do we have some value sets that can be considered exemplars to guide the ones we haven't done yet
  • will we need to use classes from any ontologies other than EnvO and PO for the environments that have been marked as high priority?
  • do we anticipate that we may need to request some classes in EnvO in order to build value sets with adequate coverage for our collaborators?
    • where can we find the kinds of values we will need to cover? Values that are already in MongoDB? Values that appear in submissions but haven't made their way into MongoDB yet? Our Postgres representation of NCBI Biosamples?
  • beyond the exemplars above, what technologies will we use to generate these value sets (including expert human review)
  • what practices will we use to refine them in the future
  • what file in what repo will serve as the source of truth for the value sets
  • (somewhat independent) how will the value sets be presented to Submission Portal users, or how will the input from the submitters be validated

I have created a table that relates @mslarae13 's recent prioritization list with some other knowledge about the environments/Extensions/DH Interfaces. I would like to include most of this information in whatever progress tracking system we use. Since the table is wide, maybe we should move it to a Google Sheet or a repo-checked-in TSV, instead of embedding it in an issue like this.

| MIxS Environment name | submission portal DhInterface name | harmonizerApi.ts status | priority in #468 (comment) | env_broad_scale | env_local_scale | env_medium |
|---|---|---|---|---|---|---|
| HumanAssociated | | disabled | | | | |
| HumanGut | | disabled | | | | |
| HumanOral | | disabled | | | | |
| HumanSkin | | disabled | | | | |
| HumanVaginal | | disabled | | | | |
| PlantAssociated | PlantAssociatedInterface | published | high | | | |
| Sediment | SedimentInterface | published | high | | | |
| Soil | SoilInterface | published | high | | | |
| Water | WaterInterface | published | high | | | |
| Air | AirInterface | published | low | | | |
| BuiltEnvironment | BuiltEnvInterface | published | low | | | |
| HostAssociated | HostAssociatedInterface | published | low | | | |
| HydrocarbonResourcesCores | HcrCoresInterface | published | low | | | |
| HydrocarbonResourcesFluidsSwabs | HcrFluidsSwabsInterface | published | low | | | |
| MicrobialMatBiofilm | BiofilmInterface | published | low | | | |
| MiscellaneousNaturalOrArtificialEnvironment | MiscEnvsInterface | published | low | | | |
| WastewaterSludge | WastewaterSludgeInterface | | low | | | |
| Agriculture | | | | | | |
| FoodAnimalAndAnimalFeed | | | | | | |
| FoodFarmEnvironment | | | | | | |
| FoodFoodProductionFacility | | | | | | |
| FoodHumanFoods | | | | | | |
| SymbiontAssociated | | | | | | |
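
If the tracking table were moved to a repo-checked-in TSV, as suggested above, it could be queried with a few lines of Python. This is only a minimal sketch: the column names (`mixs_environment`, `dh_interface`, `harmonizer_status`, `priority`) and the inline excerpt are assumptions made for illustration, not an agreed file format.

```python
import csv
import io

# Hypothetical TSV excerpt; the column names are illustrative, not an agreed format.
TRACKING_TSV = (
    "mixs_environment\tdh_interface\tharmonizer_status\tpriority\n"
    "Soil\tSoilInterface\tpublished\thigh\n"
    "Water\tWaterInterface\tpublished\thigh\n"
    "Air\tAirInterface\tpublished\tlow\n"
    "WastewaterSludge\tWastewaterSludgeInterface\t\tlow\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def high_priority(rows):
    """Return the environment names whose priority column is 'high'."""
    return [row["mixs_environment"] for row in rows if row["priority"] == "high"]


rows = load_rows(TRACKING_TSV)
print(high_priority(rows))
```

A plain TSV keeps the table diffable in pull requests, which is one reason it might be preferable to embedding a wide table in an issue.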

@turbomam
Member

turbomam commented Jul 23, 2024

@pkalita-lbl you can see that I have tracked the DhInterface name from submission-schema/schemasheets/tsv_in/classes.tsv and the status from harmonizerApi.ts in my table above

I didn't include the excel_worksheet_name annotations from your new

but the table is intended to do some of the mapping that we have been talking about.

I'm a little surprised that WastewaterSludgeInterface appears in many places in the submission-schema repo (and @mslarae13 included it in her prioritization list, albeit as low) but it doesn't appear in harmonizerApi.ts
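
A mismatch like this one could be caught mechanically by comparing the expected interface names against the text of the file that is supposed to mention them. The sketch below is hypothetical: `missing_interfaces` is a made-up helper, and `sample_ts` is a stand-in string, not the real contents or structure of harmonizerApi.ts.

```python
import re


def missing_interfaces(expected_names, source_text):
    """Return the expected interface names that never appear as whole words in source_text."""
    found = set(re.findall(r"\b\w+Interface\b", source_text))
    return sorted(name for name in expected_names if name not in found)


# Hypothetical stand-in for the file contents; the real file may look quite different.
sample_ts = "SoilInterface: {...}, WaterInterface: {...}"
expected = ["SoilInterface", "WaterInterface", "WastewaterSludgeInterface"]

print(missing_interfaces(expected, sample_ts))
```

In practice the expected names would be read from the DhInterface column of the tracking table rather than hard-coded.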

@aclum
Contributor

aclum commented Jul 23, 2024

> will we need to use classes from any ontologies other than EnvO and PO for the environments that have been marked as high priority?

  • I'd say no. UBERON was the other ontology Chris wanted to allow but it is not applicable to the high priority environments.

@mslarae13
Contributor

mslarae13 commented Jul 23, 2024

Updating @turbomam 's table (IN PROGRESS)

| MIxS Environment name | submission portal DhInterface name | harmonizerApi.ts status | priority in #468 (comment) | env_broad_scale | env_local_scale | env_medium |
|---|---|---|---|---|---|---|
| HumanAssociated | | disabled | | | | |
| HumanGut | | disabled | | | | |
| HumanOral | | disabled | | | | |
| HumanSkin | | disabled | | | | |
| HumanVaginal | | disabled | | | | |
| PlantAssociated | PlantAssociatedInterface | published | high | | | |
| Sediment | SedimentInterface | published | high | | | |
| Soil | SoilInterface | published | high | | | |
| Water | WaterInterface | published | high | | | |
| Air | AirInterface | published | low | | | |
| BuiltEnvironment | BuiltEnvInterface | published | low | | | |
| HostAssociated | HostAssociatedInterface | published | low | | | |
| HydrocarbonResourcesCores | HcrCoresInterface | published | low | | | |
| HydrocarbonResourcesFluidsSwabs | HcrFluidsSwabsInterface | published | low | | | |
| MicrobialMatBiofilm | BiofilmInterface | published | low | | | |
| MiscellaneousNaturalOrArtificialEnvironment | MiscEnvsInterface | published | low | | | |
| WastewaterSludge | WastewaterSludgeInterface | unpublished (need to add) | low | | | |
| Agriculture | | | high | | | |
| FoodAnimalAndAnimalFeed | | | low | | | |
| FoodFarmEnvironment | | | low | | | |
| FoodFoodProductionFacility | | | low | | | |
| FoodHumanFoods | | | low | | | |
| SymbiontAssociated | | | low | | | |

@mslarae13
Contributor

mslarae13 commented Jul 23, 2024

The following extensions are NOT in NMDC. I'm not sure why, and we need to check what version of MIxS we're using. I'll make a separate issue for that, but for this milestone and the squad addressing it, we'll skip these extensions:

Agriculture
FoodAnimalAndAnimalFeed
FoodFarmEnvironment
FoodFoodProductionFacility
FoodHumanFoods
SymbiontAssociated

Edit: this issue exists, which is similar. The NMDC submission-schema and nmdc-schema don't seem to be aware of slots that are unique to these extensions, which makes me conclude we don't use v6.

@ssarrafan
Contributor Author

@mslarae13 @aclum @cmungall thanks for all the updates on this issue. Will this be done by September? This is due this quarter.

@mslarae13
Contributor

> Will this be done by September? This is due this quarter.

@ssarrafan that's the goal

@ssarrafan
Contributor Author

Per @cmungall Patrick is not needed for this issue. Discussed at meeting today with Alicia, Emiley, Chris.
FYI @mslarae13

@mslarae13
Contributor

@ssarrafan due date for this is still end of September, right?

@ssarrafan
Contributor Author

> @ssarrafan due date for this is still end of September, right?

Yes so far.

@mslarae13 mslarae13 linked a pull request Aug 28, 2024 that will close this issue
@mslarae13 mslarae13 removed a link to a pull request Aug 28, 2024
@mslarae13
Contributor

We discussed env_broad_scale for soil this week.
We'll work on medium next week.

I would like to add some clarity around what we mean by "done" :) because "complete and in production" is no longer likely by Monday (when we're pushing to production).

  • By the end of September, env_broad_scale, env_local_scale, and env_medium terms will be identified for SOIL from manual evaluation of OAK queries
  • Broad and medium OAK queries will be updated for SOIL based on the evaluation (@turbomam does that sound doable by the end of September?)

Early October

  • Env local will be evaluated for SOIL & OAK query updated
  • ENV triad enums will be updated to match those that are provided by the query for SOIL
  • OCTOBER production release will include these queries.

EARLY NOVEMBER

  • start query eval for sediment

@ssarrafan @emileyfadrosh @cmungall @lamccue
Unfortunately, this does mean the milestone won't be complete by September. IF we can crank through local by the end of the month, then it's kinda done, just isn't on production.

Can we reschedule this milestone?

@ssarrafan
Contributor Author

Discussed this milestone with @turbomam today and he said he would bring it up at the squad meeting on Wednesday. Montana has already created great sub-issues that can be linked here. They will discuss how much more review is needed on Wednesday and then update this milestone with the new estimated timeline.
FYI @mslarae13 @cmungall

@mslarae13
Contributor

mslarae13 commented Oct 2, 2024

@ssarrafan @turbomam @cmungall
See the comment above for the new timeline; I already sorted this out, unless early October is no longer doable. The release is at the end of October, though, and we wouldn't make the change until after the Berkeley rollout anyway.

@sierra-moxon
Member

sierra-moxon commented Oct 4, 2024

Updates for DOE report, replicated in part from https://github.com/microbiomedata/issues/blob/ADR-define-envo-value-sets/decisions/0015-env-triad-terms.md

For the soil environment (MIxS Extension) specifically:

  • env_broad_scale will exclude aquatic biome [ENVO:00002030]
  • env_local_scale (in progress)
  • env_medium (in progress) - will descend from "soil" term

In addition to these programmatic extractions, we collate metadata, including whether the term has been used in NCBI or GOLD biosamples and whether it is currently used at NMDC directly via biosample submission. Our team of ENVO ontologists, data analysts, and subject matter experts in the field then reviews the programmatically generated lists. They use their knowledge to constrain the lists further to value sets small enough for users to navigate in submissions.
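
The extraction rules above (take the descendants of a root term, drop an excluded subtree such as aquatic biome, then attach usage metadata for reviewers) can be illustrated with a toy hierarchy. This is a minimal sketch under stated assumptions, not the project's actual OAK-based pipeline: only ENVO:00002030 comes from this thread, every `EX:` identifier is a made-up placeholder, and the usage counts are invented.

```python
from collections import deque

# Toy is-a hierarchy mapping each parent to its children. Only ENVO:00002030
# (aquatic biome) comes from the thread; every "EX:" identifier is a placeholder.
CHILDREN = {
    "EX:biome": ["EX:terrestrial_biome", "ENVO:00002030"],
    "EX:terrestrial_biome": ["EX:forest_biome", "EX:grassland_biome"],
    "ENVO:00002030": ["EX:marine_biome", "EX:freshwater_biome"],
}


def descendants(root, children, excluded_roots=()):
    """Collect all descendants of root breadth-first, never descending into excluded terms.

    A term reachable through some other, non-excluded parent would still be kept.
    """
    excluded = set(excluded_roots)
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child in excluded or child in seen:
                continue
            seen.add(child)
            queue.append(child)
    return seen


def annotate_usage(terms, usage_counts):
    """Pair each candidate term with its usage count, most used first, ties by identifier."""
    pairs = ((term, usage_counts.get(term, 0)) for term in terms)
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


candidates = descendants("EX:biome", CHILDREN, excluded_roots=["ENVO:00002030"])
usage = {"EX:forest_biome": 12, "EX:grassland_biome": 3}  # hypothetical counts
print(annotate_usage(candidates, usage))
```

The annotated list is only a candidate set; the expert review described above is what narrows it to the final value set.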

@ssarrafan
Contributor Author

DOE report today update and moving to Q2:
The soil value sets have been completed and are implemented in the Submission Portal. Sediment and water values are near complete, while the plant value sets require additional terms from the Plant Ontology (PO) that require some modifications to the process. As part of the work this quarter, we refactored the code that automatically generated candidate value sets from queries and mappings, making the overall process more transparent and reproducible. Anticipated completion: FY25 Q2.

@ssarrafan ssarrafan moved this from Q4 2024 Jul-Sep to Q2 FY2025 in Year 2 & 3 Milestones Jan 10, 2025
@sierra-moxon
Member

This is slated to be complete on Friday Jan 31, 2025

Labels
None yet
Projects
Status: 🏗 In Progress
Status: Q2 FY2025
Development

No branches or pull requests

6 participants