ICAT to PaNOSC Data Model Conversion #265

Closed
3 tasks
MRichards99 opened this issue Sep 23, 2021 · 7 comments · Fixed by #300
Labels: enhancement, expands-search-api (issues relating to the ExPaNDS Search API section of this repo)

Comments

@MRichards99
Collaborator

MRichards99 commented Sep 23, 2021

Description:
Once the data model for the Search API has been created, there needs to be some way of converting between the PaNOSC and ICAT data models. I believe we will need to convert both ways (i.e. from PaNOSC to ICAT and from ICAT to PaNOSC) to support query filters and then to convert the data from ICAT into a format suitable to be outputted in the Search API.

The current plan for mapping ICAT entities to PaNOSC entities (credit to @louise-davies):

  • Affiliation (TBA on ICAT) <-> Affiliation
  • Dataset <-> Dataset
  • Investigation <-> Document
  • Datafile <-> File
  • Instrument <-> Instrument
  • InvestigationUser <-> Member
  • InvestigationParameter <-> Parameter
  • User <-> Person
  • Sample <-> Sample
  • Technique (TBA on ICAT) <-> Technique
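
The pairing in the list above can be written down as a simple lookup table. This is only a minimal sketch: the dictionary and function names are illustrative assumptions and are not taken from the repository.

```python
# Entity-level mapping from the list above (ICAT name -> PaNOSC name).
# Affiliation and Technique were still "TBA" on the ICAT side when listed.
ICAT_TO_PANOSC = {
    "Affiliation": "Affiliation",
    "Dataset": "Dataset",
    "Investigation": "Document",
    "Datafile": "File",
    "Instrument": "Instrument",
    "InvestigationUser": "Member",
    "InvestigationParameter": "Parameter",
    "User": "Person",
    "Sample": "Sample",
    "Technique": "Technique",
}

# The list is one-to-one, so it can be inverted for the other direction.
PANOSC_TO_ICAT = {panosc: icat for icat, panosc in ICAT_TO_PANOSC.items()}


def panosc_entity_for(icat_entity: str) -> str:
    """Return the PaNOSC entity name for an ICAT entity name (hypothetical helper)."""
    return ICAT_TO_PANOSC[icat_entity]
```

Because every ICAT entity maps to exactly one PaNOSC entity, a single table is enough to support both directions of the conversion.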

The property isPublic exists on the Dataset and Document entities in the PaNOSC data model. Since the anon user will be used in ICAT to get data, we can assume this value will always be true. However, the releaseDate could be used in ICAT to determine whether a piece of data is actually public or not.
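
One way the releaseDate check could look is sketched below. The function name and the choice to treat a missing release date as "not public" are assumptions made for illustration, not decisions recorded in this issue.

```python
from datetime import datetime, timezone
from typing import Optional


def is_public(release_date: Optional[datetime],
              now: Optional[datetime] = None) -> bool:
    """Treat data as public once its release date has passed.

    Assumptions: both datetimes are timezone-aware, and a missing
    release date means the data is not (yet) public.
    """
    if release_date is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return release_date <= now
```

Passing `now` explicitly keeps the check deterministic in tests.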

I'm not entirely sure how we should go about converting between the data models. Perhaps each PaNOSC entity class could have a function called convert_to_icat() that constructs an ICAT entity in Python ICAT (using client.create())? There would then need to be something similar to convert from ICAT to PaNOSC, although I'm less sure how that would work at this stage.
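
A rough sketch of that idea follows. Only convert_to_icat() is named in this issue; the base class name, convert_from_icat(), and the use of plain dictionaries in place of Python ICAT entity objects are assumptions made so the example runs without an ICAT server. The File field names follow the File mapping table later in this thread.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict


class PanoscEntity(ABC):
    """Hypothetical base class: each PaNOSC entity implements both conversions."""

    @abstractmethod
    def convert_to_icat(self) -> Dict[str, Any]:
        """Return ICAT attribute values for this entity."""

    @classmethod
    @abstractmethod
    def convert_from_icat(cls, icat_data: Dict[str, Any]) -> "PanoscEntity":
        """Build the PaNOSC entity from ICAT attribute values."""


@dataclass
class File(PanoscEntity):
    id: int
    name: str
    path: str
    size: int

    def convert_to_icat(self) -> Dict[str, Any]:
        # PaNOSC path/size correspond to ICAT Datafile location/fileSize.
        return {"id": self.id, "name": self.name,
                "location": self.path, "fileSize": self.size}

    @classmethod
    def convert_from_icat(cls, icat_data: Dict[str, Any]) -> "File":
        return cls(id=icat_data["id"], name=icat_data["name"],
                   path=icat_data["location"], size=icat_data["fileSize"])
```

Keeping both directions on the same class puts each field mapping in one place, which should make a round-trip test per entity straightforward.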

There should also be tests to verify that the conversion works correctly.

Acceptance criteria:

  • Add mechanism to convert from PaNOSC to ICAT data model
  • Add mechanism to convert from ICAT to PaNOSC data model
  • Add tests for each conversion method
@MRichards99 added the enhancement and expands-search-api labels Sep 23, 2021
@louise-davies
Member

I think in our discussion on this last week we decided that an ICAT Dataset is a PaNOSC Dataset, and that an Investigation is a Document

@MRichards99
Collaborator Author

@louise-davies thanks for the clarification, I've just updated the list to reflect that :)

@MRichards99
Collaborator Author

@louise-davies @agbeltran @RKrahl @axelboc-esrf @andygotz @antolinos

I have been looking at how the PaNOSC data model can be mapped to ICAT. At STFC, ISIS is our primary use case, so my suggested mappings are based mostly on that facility. Since other facilities in the ICAT collaboration may wish to use our implementation of the search API, we would like your thoughts on the specific field mappings that I have suggested below.

Feel free to comment here with your thoughts, this could also be a good discussion to have at this week's ICAT collaboration meeting?

Affiliation

The mapping for this entity is based on icatproject/icat.server#248. I haven't been following the schema discussions closely so apologies if that is not the most current implementation for ICAT Server.

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| name | affiliation.name | |
| pid | affiliation.pid | |
| address | | This is not mandatory for PaNOSC |
| city | | This is not mandatory for PaNOSC |
| country | | This is not mandatory for PaNOSC |

Dataset

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| pid | dataset.doi | |
| title | dataset.name | |
| isPublic | evaluate based on dataset.createTime | Search API will only use open data to start with anyway so this could be set to always True or if the anon user can see the dataset, then assume True |
| creationDate | dataset.createTime | |
| size | unsure | Fetching the dataset size would mean sending a getSize request to TopCAT |

Document

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| pid | investigation.doi | |
| isPublic | evaluate based on startDate or releaseDate | ISIS define open data as over 3 years old |
| type | investigation.type.name | |
| title | investigation.name | |
| summary | investigation.summary | |
| doi | investigation.doi | Note that it's already used in the pid field |
| startDate | investigation.startDate | |
| endDate | investigation.endDate | |
| releaseDate | investigation.releaseDate | |
| license | | Not stored in ICAT but isn't mandatory for search API, ignore? |
| keywords | keywords.name | PaNOSC data model specifies this as a list of strings so the search API can iterate through keywords.name |
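
The keywords row could be handled along these lines. This is a sketch only, assuming the ICAT investigation arrives as a dictionary whose "keywords" entry is a list of records with a "name" field; the function name is hypothetical.

```python
from typing import Any, Dict, List


def keyword_names(investigation: Dict[str, Any]) -> List[str]:
    """Collect keywords.name values into the list of strings PaNOSC expects."""
    # A missing or empty keywords entry yields an empty list.
    return [keyword["name"] for keyword in investigation.get("keywords") or []]
```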

File

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| id | datafile.id | |
| name | datafile.name | |
| path | datafile.location | |
| size | datafile.fileSize | |

Instrument

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| id | instrument.id | |
| name | instrument.name | |
| facility | instrument.facility.name | Could either use name or fullName? |

Member

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| role | investigationUser.role | |

Parameter

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| name | investigationParameter.type.name | |
| value | investigationParameter.numericValue or stringValue or dateTimeValue | Value can be a number or a string so multiple ICAT fields can be checked, ICAT also has dateTimeValue which we could check and convert into a string if there's no stringValue or numericValue |
| unit | investigationParameter.type.units | Could maybe use unitsFullName, ISIS stores "None" for units and null for unitsFullName. |
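
The fallback described for the value field could be sketched as below. It assumes the ICAT parameter is available as a dictionary keyed by attribute name, and the function name is hypothetical. The order (numericValue, then stringValue, then dateTimeValue) follows the comment in the table.

```python
from datetime import datetime
from typing import Any, Dict, Optional, Union


def parameter_value(param: Dict[str, Any]) -> Optional[Union[float, str]]:
    """Pick the first populated ICAT value field for a PaNOSC Parameter."""
    if param.get("numericValue") is not None:
        return param["numericValue"]
    if param.get("stringValue") is not None:
        return param["stringValue"]
    date_value = param.get("dateTimeValue")
    if isinstance(date_value, datetime):
        # PaNOSC values are numbers or strings, so dates become ISO 8601 text.
        return date_value.isoformat()
    return date_value  # already a string, or None if nothing is set
```

Checking `is not None` rather than truthiness keeps a legitimate numeric value of 0 from being skipped.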

Person

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| id | user.id | Anon user doesn't seem to have permissions to query users on ISIS directly, I could only see users when including them through a different entity |
| fullName | user.fullName | |
| orcid | user.orcidId | |
| researcherId | unsure | Not sure what this is or if we store it in ICAT. It's not mandatory so maybe best to ignore? |
| firstName | user.givenName | |
| lastName | user.familyName | |

Sample

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| name | sample.name | |
| pid | sample.pid | Doesn't seem to be used in ISIS |
| description | sample.parameters.type.description | parameters will be a list so could be hard to choose one if multiple parameters are associated with a single sample. This field isn't mandatory, maybe ignore? |

Technique

| PaNOSC Field Name | ICAT Field Mapping | Comment |
| --- | --- | --- |
| pid | technique.pid | |
| name | technique.name | |

MRichards99 added a commit that referenced this issue Oct 25, 2021
- Comments represent mappings to ICAT data model
- This commit also adds a base class for a search API entity (`PaNOSCAttribute`) which has two abstract methods, which will be used to convert between data models when implemented
@antolinos
Collaborator

Hi @MRichards99

Thanks for this.

I got a question about the mapping of the dataset:

pid dataset.doi

Does it mean that all datasets need to have a DOI to be exported via panosc API? We do not mint a DOI for each dataset, and that will not happen because we have too many of them.

Besides, I was wondering whether this mapping could be made configurable. Even though what you propose makes a lot of sense, it does not correspond exactly to what we need, and I can imagine the same will be true for others.
I had a quick look, and fields like technique.name and sample.description are not stored that way for us today. They also require a recent version of ICAT, which we are not yet running.
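
A configurable mapping of the kind requested here could be sketched as follows. The JSON layout, the helper names, and the dictionary representation of ICAT records are all assumptions for illustration; this is not the format of the mapping file used in the repository.

```python
import json
from typing import Any, Dict

# Hypothetical per-facility mapping: PaNOSC field -> dotted ICAT attribute path.
MAPPING_JSON = """
{
  "Dataset": {"pid": "doi", "title": "name", "creationDate": "createTime"},
  "Document": {"pid": "doi", "title": "name", "type": "type.name"}
}
"""


def resolve(record: Dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'type.name' through nested dictionaries."""
    value: Any = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # attribute not stored at this facility
        value = value[key]
    return value


def convert(entity_name: str, icat_record: Dict[str, Any],
            mappings: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Build a PaNOSC-shaped dictionary using the configured field mapping."""
    return {panosc_field: resolve(icat_record, icat_path)
            for panosc_field, icat_path in mappings[entity_name].items()}


mappings = json.loads(MAPPING_JSON)
```

With this shape, a facility that does not mint dataset DOIs could point "pid" at whichever ICAT attribute holds its identifier by editing the file, without touching the conversion code.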

@RKrahl
Contributor

RKrahl commented Nov 8, 2021

Hi @antolinos

> I got a question about the mapping of the dataset:
> pid dataset.doi
>
> Does it mean that all datasets need to have a DOI to be exported via panosc API?

No. Just because that attribute in the ICAT schema is named doi does not mean that only a DOI may be stored in it. By the way, the data model in the PaNOSC search API is notoriously inconsistent, in particular with respect to PIDs.

@antolinos
Collaborator

> Just because that attribute in the ICAT schema is named doi does not mean that only a DOI may be stored in it

Then, am I supposed to fill the column DOI with something that is not a DOI? If it is the case I don't think it is a very clean approach and might be error-prone at medium, long-term.

@RKrahl
Contributor

RKrahl commented Nov 8, 2021

> Then, am I supposed to fill the column DOI with something that is not a DOI?

If your dataset does not have a DOI, but any other PID, yes. Where else would you put it? Btw. that attribute name in the ICAT schema is legacy. We discussed it several times and agreed that it would better be named pid instead. That is why the newer PID attributes in Instrument, ParameterType, Sample, and Study introduced in ICAT 4.10.0 have been named pid (rf. icatproject/icat.server#198 and icatproject/icat.server#216).

> If it is the case I don't think it is a very clean approach and might be error-prone at medium, long-term.

The attribute names are invisible to users. The admins, who are the only ones to see them, should be professional enough to deal with such legacy names.

MRichards99 added a commit that referenced this issue Dec 15, 2021
MRichards99 added a commit that referenced this issue Dec 15, 2021
- Entity name will be the format used in the mapping json file
MRichards99 added a commit that referenced this issue Dec 20, 2021
- I've added TODOs where there are things I'm still slightly unsure about
MRichards99 added a commit that referenced this issue Dec 21, 2021
- Added filename of the 'actual' mapping file to git ignore
VKTB pushed a commit that referenced this issue Jan 7, 2022
VKTB added a commit that referenced this issue Jan 17, 2022
VKTB added a commit that referenced this issue Jan 17, 2022
VKTB added a commit that referenced this issue Jan 24, 2022
VKTB added a commit that referenced this issue Jan 24, 2022
VKTB added a commit that referenced this issue Jan 27, 2022
A Document cannot have parameters that have dataset
VKTB added a commit that referenced this issue Jan 27, 2022
VKTB added a commit that referenced this issue Jan 27, 2022
VKTB added a commit that referenced this issue Jan 27, 2022
VKTB added a commit that referenced this issue Jan 28, 2022
Co-authored-by: Matthew Richards <32678030+MRichards99@users.noreply.github.com>
VKTB added a commit that referenced this issue Jan 31, 2022
MRichards99 added a commit that referenced this issue Jan 31, 2022
MRichards99 added a commit that referenced this issue Jan 31, 2022