gff3 spec compatibility #91
Comments
It is what I've been doing so far. I was suggesting that your parser read 'ID' rather than 'Name' when parsing gff3, since this is what the gff3 specs usually lead tools to expect, to avoid the ID/Name replacement that you recommend in your one-liner. It does not solve the problem of the fasta sequence at the end of prokka gff3 files, but it would solve the problem we regularly have of .gff files with no unique (or sometimes no) 'Name' field. If it is too cumbersome, don't mind this issue, it was merely a suggestion.
Adelme
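The workaround described in this comment (copying 'ID' into 'Name' and dealing with the FASTA block that Prokka appends) could be sketched roughly as follows. This is not the one-liner from the AMRFinderPlus documentation, which is not quoted in this thread; the function name `normalize_gff` is hypothetical, and dropping everything from the `##FASTA` directive onward is an assumption about what the downstream tool needs.

```python
def normalize_gff(lines):
    """Return GFF3 lines with Name set to the ID value and any embedded FASTA dropped.

    Hypothetical helper for illustration only; not part of AMRFinderPlus.
    """
    out = []
    for line in lines:
        # Assumption: the sequence block after the ##FASTA directive is not wanted.
        if line.startswith("##FASTA"):
            break
        line = line.rstrip("\n")
        # Keep comments, directives and blank lines unchanged.
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 9:
            fields = [f for f in cols[8].split(";") if f]
            ident = next((f[3:] for f in fields if f.startswith("ID=")), None)
            if ident is not None:
                # Replace any existing Name with the unique ID value.
                fields = [f for f in fields if not f.startswith("Name=")]
                fields.append("Name=" + ident)
                cols[8] = ";".join(fields)
        out.append("\t".join(cols))
    return out
```

Features without an 'ID' attribute are passed through untouched, so the sketch never invents an identifier.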
Hi Adelme,
Thanks for your kind words and the suggestion. It's something we thought a lot about when we first enabled the combined protein and nucleotide modes for AMRFinderPlus. And I also found the NCBI formats to be annoying :-)
From my perspective the problem is not the GFF but the FASTA file: the NCBI GFF files contain a perfectly good ID that is unique and fits the requirements of the format, but it doesn't appear in the protein FASTA files that NCBI releases. As I'm sure you know, there isn't an accepted standard for FASTA IDs, so NCBI created their own standard long before my time. NCBI annotation FASTA files generally just have universal RefSeq accessions, as though you had downloaded individual RefSeq protein sequences, rather than something specific to the annotation run.
I fully understand the frustration of having to write a wrapper or jump through hoops to get AMRFinderPlus to run on your sequences/annotations. It would bug me too, except that I'm almost always running AMRFinderPlus on NCBI sequences. As you might have noticed we added an option (
Would you be willing to send us some examples (GFF + protein FASTA files) that you encounter frequently, so we can think about how to handle this in the clearest, most universal, and simplest way? I think you have a good point.
Arjun
Hi,
Thank you for your answer and clarification. Indeed you are right, there is a big discrepancy in what is used as the identifier in protein fasta files, and I see how this can be a difficult problem to solve. In any case, I've listed some sources of gff/faa files I've used regularly in the past years, apart from RefSeq, with a loose description of the formats they seem to follow for faa identifiers compared to what's in the gff files.
I've put together a pair of gff/faa examples for each source in the
Hopefully that can be of help, but it does seem like a universal solution would be difficult to achieve.
Adelme
That was very generous and helpful. Thank you! We'll take a look and see what we can come up with.
Arjun
We've added an
For details see the documentation for the --annotation_format option.
Thanks again,
Hi,
Hi @axbazin,
Please do let us know if you have issues. I managed to download and run one or two other examples from each of the data sources / annotation tools you listed to develop and test against, but that's the extent of my experience with most of them, so I'm a little nervous that we made some assumptions based on those examples that don't always hold.
Arjun
Hello,
Thank you for your tool, it is incredibly practical!
There is one cumbersome aspect, however: the need for a special 'AMRFinder-gff3' format, which makes it incompatible with the output of most gff3-generating tools or platforms (Prokka, Rast, Bakta, MicroScope ...), with the exception of RefSeq gff3.
The gff3 specification states that 'ID' should be used as the unique identifier, while 'Name' should be used as a display name and has no uniqueness requirement whatsoever. It seems that in amrfinder you have followed the opposite rule and chosen 'Name' as the unique identifier field.
It would be quite practical to be able to use any gff3 file with your tool without modifying it. Something like reading 'ID' when 'Name' is not unique or not present, or an equivalent rule, would retain compatibility with your current behavior and be quite a time-saver for the future!
Adelme
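The fallback proposed above (use 'Name' when it is present and unique, otherwise fall back to 'ID') can be illustrated with a minimal sketch. This is not how AMRFinderPlus parses gff3; the function names `parse_attributes` and `feature_identifiers` are hypothetical, and the rule of treating a 'Name' as usable only when it occurs exactly once in the file is one possible reading of the suggestion.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column (tag=value;tag=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            # GFF3 attribute values are percent-encoded.
            attrs[tag.strip()] = unquote(value.strip())
    return attrs


def feature_identifiers(gff_lines):
    """Return one identifier per feature line.

    Hypothetical rule: use 'Name' if present and unique in the file,
    otherwise fall back to 'ID' (None if neither is usable).
    """
    records = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more features
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            records.append(parse_attributes(columns[8]))

    name_counts = Counter(a["Name"] for a in records if "Name" in a)
    identifiers = []
    for attrs in records:
        name = attrs.get("Name")
        if name is not None and name_counts[name] == 1:
            identifiers.append(name)
        else:
            identifiers.append(attrs.get("ID"))
    return identifiers
```

Because a unique 'Name' still takes priority, files that already work with a Name-based parser would yield the same identifiers under this rule.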