-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update storage of annotations #684
Comments
Sounds very reasonable to me. We want to make a push for a cobrapy 1.0 release after this summer. There will be the GSoC projects, you want to work on a better SBML parser, and we will also contribute in a more focused manner. So then it is okay to introduce breaking changes. |
I also think that sounds reasonable. Might also think of a pandas DataFrame for annotations here with columns database, relation and id. |
There needs to be a decision on how annotations should look as a data structure. Important points for the annotations are:
My suggestion is:
The allowed qualifier types are hereby (ignore the libsbml part, this is just for serialization)
All existing annotations without qualifier/predication are assumed to have a This is very easy to implement on the SBML read and write part, but in addition probably other parsers have to be updated (json, mat, ?) and especially the tests have to be updated (because things like writing old annotations to SBML and reading them will fail tests due to the changed format). |
Hi all, |
Hi all, @matthiaskoenig @draeger @Midnighter @cdiener @ChristianLieven and other members of cobrapy.
Now, though it handles the biological qualifier, it makes the storing of the following annotation format very complex.
Now if we parse it in the above format, it will look as follows:
The "uniport" provider contains two same pairs of resources, and now there is no way to divide these alternative sets of annotations to get back the original format. All the resources are listed corresponding to the "provider" name and do not seem to have any relationship with each other, and hence hard to distinguish.
So here also, the parsing is not so easy to above format because the nested annotation have to be accommodated.
So, according to me, simple dictionaries for handling the annotation with providers as key and array of pairs of qualifier and identifier as value are insufficient for handling these complex annotation structures. We need proper dedicated classes to handle the annotation and special methods to verify its internal content. Talking about the JSON representation, I would go with the thought of releasing a version 2 for JSON representation of SBML models. It will have support for handling the model history and the notes part also (along with the incorporation of additional SBML packages features like groups and fbc-v3) and a new representation of the annotation part. I really can't figure out why "providers" are used as keys instead of biological qualifiers (which is a better option because than the tree structure would be the same as that of SBML). With qualifiers as keys, it will be very easy to handle all the above issues listed and also to incorporate any further addons to annotations in the future. Its structure would be somewhat like follows:
The string ‘ ... ’ is a placeholder for zero or more elements of the same form as the immediately preceding element and [NESTED_CONTENT] represents the nested annotation and has the same format and restrictions as the "qualifier". So, suppose we have the following SBML annotation that is to be represented in JSON:
Its JSON representation would be as follows:
Now though this structure will be completely different from the existing format, it will be the most general format handling all the existing issues and can also incorporate any future add-ons. To add support for handling the already existing JSON-SBML models, what best we can do is to enable the reading of both formats from JSON but write only in the above format. Default values for qualifiers (i.e "is") and resolver (i.e "http://identifiers.org/") will be used to parse the current JSON format (version1 ) to the above format (version 2). |
Thanks for working this out, @Hemant27031999! Question: Would splitting the identifier from the rest of the URI give much of an advantage? |
I was also thinking about it. Separating the identifier from the main URI is just bringing extra overhead and I really can't figure out what is the purpose of doing so? I Did it because, in the current version, they are separated first and then put into the JSON. I thought that maybe, the members wanted to be more verbose about the provider and identifier. |
First of all, thank you for a well worked-out proposal. This looks very promising!
To me, this has always been about mapping data. A very typical format for metabolite concentrations, gene expression, or protein abundances is a simple table where one column is an identifier in a particular namespace and a second column contains values with a particular unit. I want it to be as easy as possible to decide which components of the model are affected by a given data set. @Hemant27031999 can you provide a link to your actual proposal as well, please? I do have some thoughts on a software design for handling input/output. I would like to see what you have planned and discuss it with you and @matthiaskoenig. |
Sure @Midnighter, here is the link to the google doc: https://docs.google.com/document/d/1f0PknoPUpcWRHS8jljEOX5439us4ZD9qjNlbbfGk8Y8/edit?usp=sharing |
I think the separation of provider and identifier is historically just there to push it into a key-value format. The key-value format was there since SBML is just one of the supported file formats and JSON and YAML do not impose as many restrictions on annotations as SBML. Many other libraries (BIGG, ME models, MICOM soon) use the JSON format because it is faster but also because it allows to add custom annotations to the cobra objects and this is usually done in a key-value store way. So I would keep that in mind and try to find a schema that is compatible with all file formats and not only SBML. The nested format definitely looks complicated and I am not sure what to do about that. The scheme suggested by @Hemant27031999 would probably work since it would basically be a dictionary but I would not limit it to the bqbiol tags. |
@cdiener: Just a very small comment on the side; custom annotations have been part of SBML since its inception. Here is an example from the SBML Level 1 Version 1 specification from March 2nd 2001 showing how to implement arbitrarily complex key-value pairs and more:
(Note: SBML Level 1 is, of course, outdated and should not be used anymore, but this feature, custom annotations, has been part of every SBML release.) Now, with the next version of the FBC package, key-value pairs will become even more straightforward to use. But @Hemant27031999's project aims at developing a lossless representation of SBML in COBRApy-compliant JSON so that in the end, it won't matter anymore if the serialization goes to XML or JSON as long as all required data structures are adequately represented. |
@draeger Oh, that is very cool. Could you point me to a doc link how to do that in SBML 3? |
@cdiener, regarding the key-value storing way, we are going to have that also in the JSON format as this is one of the newly introduced feature in fbc-v3 package, the support of which is to be added in the JSON as part of milestone 1.1 discussed in the proposal. What I have discussed in this thread is only that part of annotation which refers to the external resources (basically the CVTerm class of libSBML).
I took the reference of SBML L3V2 specification doc for this part which only says that the nested content has the same format and restrictions as RELATION_ELEMENT (page 101). |
Sure. In SBML Level 3 the element is called Here is an example from SBML Level 3 Version 2 Release 2, page 17, lines 9-12 that corresponds to the one above:
|
Thank you! Will play around with that for MICOM a bit... |
@cdiener and @Hemant27031999: It is always the preferred way to use a data structure that is defined in the specification directly. The example for using custom annotations is only recommended to use in two fundamental cases:
In the any case, one should write some sort of a documentation what exactly goes there. Using the new key-value pair data structure from the upcoming FBC Version 3 package is going to be the preferred way of storing that kind of information. |
Is there any description of best practices regarding annotations in the cobrapy documentation? I looked and couldn't find any. |
@danolson1 not in the cobrapy documentation as far as I'm aware (it's been a while that I've read it comprehensively) but Memote has an entire module dedicated to best-practise annotations. We don't mention it in-depth in the publication but the source code gives you some indication: To summarize:
MIRIAM => http://www.ebi.ac.uk/miriam/ now called https://identifiers.org/ Hope that helps! |
Yep exactly like @ChristianLieven said. MEMOTE is awesome for checking if your model is well annotated and will tell you what should be added as well. |
I have a related problem with the SBML file output in cobrapy 0.25.0 which ensures that MEMOTE does not recognize the annotations which are stored in the SBML file as such:
Now when loading this model all the annotations are in the notes of the metabolites, so I ran the following code: cobra.io.write_sbml_model(model, 'test_annotation.xml')` And ran a snapshot with memote and it still does not recognize it and I found out why:
There is a / instead of a : after each database entry and the file (which was generated with CarveMe) has some misnomers for the database identifiers. I can fix the database identifiers but not the way that cobrapy converts these key:value pairs to key/value in the SBML file. |
I figured out why it converts the key:value pairs to key/value. My values are strings and not lists. Why is this a thing? And if it is necessary, why is this not in the documentation? |
@dtmvandenberg I think this would have been better in a discussion since it isn't really related to the issue which discusses the new annotation interface... Both forms are valid on identifiers.org, The reason MIRIAM was not mentioned explicitly is that we wanted to support all kinds of annotations, not just MIRIAM-compliant ones, but honestly that did not work very well. The new interface will change annotations quite a bit and will clearly separate standardized from custom annotations. I agree that the docs for the new annotations should explain what MIRIAM-compliant annotations are and link to the registry. SBML is quite adamant that annotations should never be stored in the notes. Which software/tool created the notes like this? I thought CARVEME propagates annotations from BIGG which stores them in the right format... |
@cdiener Thanks for the quick and clear reply. My apologies for posting it in the wrong place, I thought the suggestions in my comments were most suited to this topic. I did use Carveme 1.50 to generate the original model and did not change anything about the annotations yet. It was also only modified in cobrapy. I checked the original model and see the same problem. |
No problem, this was just so it does not get ignored. That is interesting with the notes in CARVEME. Maybe that is for compatibility with reframed... |
Hi all,
there are currently some issues with how annotations are stored in cobrapy.
[1] MIRIAM qualifiers are dropped
RDF annotations are triples of the form (subject - predicate - object), with the object being the database which the subject (element in cobrapy) is annotated to. An crucial part of the annotation is the predicate which defines the relationship between the annotated element and the annotation.
http://co.mbine.org/standards/qualifiers
It makes a big difference (for all algorithms working with annotations) if something has a
is
relationship orhasPart
, orisHomologTo
.So instead of storing things like
We should also store the predicates (otherwise a round trip will change the annotations).
[2] Database ids should be identifiers.org collection identifiers
https://www.ebi.ac.uk/miriam/main/collections
http://identifiers.org/documentation
Basically most of the annotations reference a database collections from identifiers.org. The lookup key should be the collection namespace. This is the case for most databases, but is not the case for
SBO
currently which should besbo
.This is a problem because not all users use the sboTerm for sbo annotations, but the alternative use of a RDF annotation to
sbo
collections is also valid. It should not matter if one wrote an SBO term or a SBO annotation, these should be treated the same.Solution: change
SBO
tosbo
key in annotation map.This would affect code working with annotations.
The text was updated successfully, but these errors were encountered: