gff3 spec compatibility #91

axbazin · 2022-06-17T08:39:54Z

Hello,

Thank you for your tool, it is incredibly practical!
One aspect is cumbersome, however: it requires a special 'AMRFinder-gff3' format, which makes it incompatible with most gff3-generating tools and platforms (Prokka, RAST, Bakta, MicroScope ...), with the exception of RefSeq gff3.

The gff3 specification states that 'ID' should be used as the unique identifier, while 'Name' is a display name with no uniqueness requirement whatsoever. It seems that amrfinder follows the opposite rule and uses 'Name' as the unique identifier field.

It would be quite practical to be able to use any gff3 file with your tool without modifying it. For example, reading 'ID' if 'Name' is not unique or not present (or something equivalent) would retain compatibility with your current behavior and be quite a time-saver for the future!
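
To make the suggestion concrete, here is a minimal Python sketch of that fallback rule. It is only an illustration of the idea, not how amrfinder parses gff3; the function names `parse_attributes` and `choose_identifiers` and the example rows are made up for this comment.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column (tag=value;tag=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[tag.strip()] = unquote(value.strip())
    return attributes


def choose_identifiers(attribute_dicts: list) -> list:
    """Use 'Name' when it is present and unique in the file, else fall back to 'ID'."""
    name_counts = Counter(a["Name"] for a in attribute_dicts if "Name" in a)
    chosen = []
    for a in attribute_dicts:
        name = a.get("Name")
        if name is not None and name_counts[name] == 1:
            chosen.append(name)       # current behavior is preserved
        else:
            chosen.append(a.get("ID"))  # hypothetical fallback
    return chosen


# Made-up attribute columns: a duplicated Name, a missing Name, a unique Name.
rows = [
    "ID=cds_1;Name=blaTEM;product=beta-lactamase",
    "ID=cds_2;Name=blaTEM;product=beta-lactamase",
    "ID=cds_3;product=hypothetical protein",
    "ID=cds_4;Name=geneX",
]
identifiers = choose_identifiers([parse_attributes(r) for r in rows])
print(identifiers)  # ['cds_1', 'cds_2', 'cds_3', 'geneX']
```

Files whose 'Name' values are already unique would be read exactly as they are today; only the ambiguous or missing cases would switch to 'ID'.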

Adelme

vbrover · 2022-06-17T13:54:38Z

Will this help?
https://github.com/ncbi/amr/wiki/Tips-and-tricks#using-prokka-or-RAST-gff-files-with-amrfinderplus

axbazin · 2022-06-17T14:17:17Z

It is what I've been doing so far.

I was suggesting that your parser read 'ID' rather than 'Name' when parsing gff3, since this is what the gff3 spec expects. That would avoid the ID/Name replacement that you recommend in your one-liner.

It would not fix the problem of the FASTA sequence at the end of Prokka gff3 files, but it would solve the problem we regularly have of .gff files without a unique (or sometimes any) 'Name' field.
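
For the embedded-sequence part, the workaround on our side looks roughly like the Python sketch below. It assumes the sequence block is introduced by the `##FASTA` directive described in the gff3 spec; the helper name `strip_fasta_section` and the example lines are made up for illustration.

```python
def strip_fasta_section(gff_lines):
    """Return the lines of a GFF3 file that come before the ##FASTA directive."""
    kept = []
    for line in gff_lines:
        if line.strip() == "##FASTA":
            break  # everything from here on is embedded sequence data
        kept.append(line)
    return kept


# Made-up example of a gff3 file with the sequence appended at the end.
example = [
    "##gff-version 3",
    "contig_1\tsketch\tCDS\t1\t90\t.\t+\t0\tID=cds_1;product=hypothetical protein",
    "##FASTA",
    ">contig_1",
    "ATGAAACGCATTAGCACCACC",
]
annotation_only = strip_fasta_section(example)
print(len(annotation_only))  # 2
```

The same loop works on an open file handle, so the annotation part can be written to a new file without loading the sequence into memory.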

If it is too cumbersome, don't mind this issue, it was merely a suggestion.

Adelme

evolarjun · 2022-06-21T16:38:27Z

Hi Adelme,

Thanks for your kind words and the suggestion.

It's something we thought a lot about when we first enabled the combined protein and nucleotide modes for AMRFinderPlus. And I also found the NCBI formats to be annoying :-) From my perspective it's not the GFF that's the problem but the FASTA file: there is a perfectly good ID in the NCBI GFF files, which is unique and fits the requirements of the format. It just doesn't appear in the protein FASTA files that NCBI releases.

As I'm sure you know there isn't an accepted standard for FASTA IDs, so NCBI created their own standard long before my time. NCBI annotation FASTA files generally just have universal RefSeq accessions as though you had just downloaded individual RefSeq protein sequences, rather than something that is specific to the annotation run.

I fully understand the frustration of having to write a wrapper or jump through hoops to get AMRFinderPlus to run on your sequences/annotations. It would bug me too, except that I'm almost always running AMRFinderPlus on NCBI sequences. As you might have noticed, we added an option (--pgap) for the output of our annotation pipeline (PGAP) run independently of the internal pipeline.

Would you be willing to send us some examples (GFF + protein FASTA files) that you encounter frequently, so we can think about how to handle this in the clearest, simplest, and most universal way? I think you have a good point.

Arjun

axbazin · 2022-06-22T09:21:38Z

Hi,

Thank you for your answer and clarification.

Indeed you are right, there is a big discrepancy in what is used as the identifier in protein fasta files, and I can see why this is a difficult problem to solve.

In any case, I've listed some sources of gff/faa files that I've used regularly over the past years, apart from RefSeq, with a loose description of the format their faa identifiers seem to follow compared to what's in the gff files.

| Source | gff Name field | gff field(s) used in protein faa identifiers |
| --- | --- | --- |
| prokka (https://github.com/tseemann/prokka) | gene names, not unique | ">" ID product |
| bakta (https://github.com/oschwengers/bakta) | identical to the product field, not unique | ">" ID product |
| PATRIC (https://www.patricbrc.org/) | None, they use 'gene', not unique | ">" ID "\|" locus_tag product "["species"]" |
| MicroScope (https://mage.genoscope.cns.fr/microscope/) | None, they use 'gene', not unique | ">" locus_tag "\| ID:" ID "\|"gene"\|" product "["species"]" |
| https://pseudomonas.com/ | 'name', not unique and it looks like what is usually in the 'product' field | ">" Alias "\| ref:" Dbxref=RefSeq |
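
As a rough illustration of how differently the linking value has to be pulled out of each faa header, here is a Python sketch that follows the loose patterns in the table literally. The function name `linking_value`, the source labels, and the example headers are all made up; real files may differ in spacing and content. PATRIC is left out because the pattern alone does not say where the ID ends.

```python
import re


def linking_value(header: str, source: str):
    """Return (gff_field, value) taken from a faa header, per the table above."""
    text = header.lstrip(">").strip()
    if source in ("prokka", "bakta"):
        # ">" ID product: the ID is the first whitespace-delimited token.
        return ("ID", text.split(None, 1)[0])
    if source == "microscope":
        # ">" locus_tag "| ID:" ID "|" gene "|" product: the ID follows "ID:".
        match = re.search(r"ID:\s*([^|\s]+)", text)
        return ("ID", match.group(1)) if match else None
    if source == "pseudomonas":
        # ">" Alias "| ref:" Dbxref: the Alias is everything before the first "|".
        return ("Alias", text.split("|", 1)[0].strip())
    raise ValueError(f"no header pattern sketched for source: {source}")


# Made-up headers shaped like the patterns in the table.
examples = [
    (">ABCD_00001 hypothetical protein", "prokka"),
    (">LOCUS_0001 | ID:12345 |geneA| some product [Some species]", "microscope"),
    (">ALIAS0001 | ref:XX_000000.1", "pseudomonas"),
]
results = [linking_value(header, source) for header, source in examples]
print(results)  # [('ID', 'ABCD_00001'), ('ID', '12345'), ('Alias', 'ALIAS0001')]
```

Each branch names a different gff attribute, which is why a single rule for matching faa identifiers to gff records seems hard to find.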

I've put together a pair of gff/faa examples for each source in the annotation_files.tar.gz file that should be attached to the issue, as you requested.

Hopefully that can be of help, but it does seem like a universal solution would be difficult to achieve.

Adelme

evolarjun · 2022-06-22T22:14:59Z

That was very generous and helpful. Thank you!

We'll take a look and see what we can come up with.

Arjun

evolarjun · 2022-08-15T13:08:57Z

We've added an --annotation_format feature to AMRFinderPlus in version 3.10.40. The release hasn't yet wound its way through bioconda, and there are some delays in building the docker image, but that all should be complete by the end of the day. @axbazin please give it a try when you get a chance and let us know if you find any problems.

For details see the Documentation for the --annotation_format option.

Thanks again,
Arjun

axbazin · 2022-08-16T09:01:46Z

Hi,
Thanks a lot for this update, this will make things so much more practical when using annotation tools!
I will test this asap and provide feedback if I find any problem.
Adelme

evolarjun · 2022-08-17T13:38:03Z

Hi @axbazin,

Please do let us know if you have issues. I managed to download and run one or two other examples from each of the data sources / annotation tools you listed to develop and test against, but that's the extent of my experience with most of them, so I'm a little nervous that we made some assumptions based on those examples that don't always hold.

Arjun

evolarjun added the enhancement (New feature or request) label Jul 15, 2022

evolarjun closed this as completed Aug 15, 2022
