-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: augur translate
not faithful to gff3
standard, can't annotate nucs, which are required.
#953
Comments
I'd propose making two functions
I had this thought too! We have 2 TB tests ( Parsing GFFBioPython doesn't have a parser but recommends gffutils. gffutils has a small set of dependencies but uses a database layer which seemed a bit overkill (although doesn't seem slow in my testing): # using the TB GFF which is in this repo & the gffutils incantation in their tutorial
db = gffutils.create_db("Mtb_H37Rv_NCBI_Annot.gff", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
# are all features read? GFF file has 8407 non-comment lines
len(list(db.all_features())) # 8407
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start # 1
region.stop # 4411532
region.strand # '+'
# The same, but for MPX genome MT903344 which Cornelius references above
db = gffutils.create_db("MT903344.gff3", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
len(list(db.all_features())) # 381, which matches my expectation
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start #1
region.end # 197233
region.strand # '+' name vs locus_tag vs geneWhichever library we use we're going to have to make some decisions (in code or via arguments) about what name to use for features -- id vs name vs locus_tag -- see nextstrain/augurlinos#4 for some discussion about this... 5 years ago! Looking at genes (
Differences in how we treat GFF vs GenBank
|
partially resolves BUG: `augur translate` not faithful to `gff3` standard, can't annotate nucs, which are required. #953
I just ran into this issue, too, trying to use the Nextclade gene map GFF as an input to augur translate. After thinking about this more and reading the comments above, I had a couple of additional thoughts:
As an example of how we might properly format our GFF3 files for parsing in Augur (and elsewhere), here is the gene map for SARS-CoV-2 from
And here is what I would propose it should look like instead:
My general feeling now is that we might:
|
I forgot to include context of what I was doing to run into this same issue: I've been developing an updated Nextstrain introductory tutorial to expand the current "Zika tutorial" and use SARS-CoV-2 instead. I was hoping to use The immediate blocking issue in the new tutorial case is that |
Looking at other Nextclade datasets today, I noticed that most others follow the general GFF3 pattern we'd expect. For example, the H3N2 HA gene map looks like this:
This file validates as expected with the GFF3 validator. Also, this file works well with gffutils. For example, I can run the following code to extract the HA1 sequence from the corresponding reference FASTA file: import gffutils
# Try to use any valid qualifier key as an id in the given order.
db = gffutils.create_db("genemap.gff", ":memory:", id_spec=["locus_tag", "gene", "gene_name"]
# Get coordinates for HA1.
db["HA1"]
# Get sequence for HA1 from a FASTA file.
db["HA1"].sequence("reference.fasta") I like that gffutils lets us specify a list of valid qualifier keys to use as the gene id. This feature would eliminate the need to manually check for each separate key we support in On the other hand, I don't see any functionality in gffutils that parses the >>> from BCBio import GFF
>>> record = list(GFF.parse("tests//builds/tb/data/Mtb_H37Rv_NCBI_Annot.gff"))[0]
>>> record.annotations
{'gff-version': ['3'],
'sequence-region': [('MTB_anc', 0, 4411532)],
'species': ['https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=83332']} I haven't found any examples of GFF files where people set the |
+1 for "spend[ing] some time with the GFF3 specification" |
gffutils doesn't parse any directives/pragmas, but it does extract them. Parsing is a simple matter after that:
|
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
Related to #881
Currently,
augur translate
does not implementgff3
standard properly. Confusingly, if you use.gff
files generated automatically by Genbank from a given.gb
file, augur translate will behave differently.Importantly,
nuc
s won't be annotated if you use a genemap fromgff
rather thangb
which is clearly a bug.The solution is to:
a) make augur translate accept standard gff3 files, outputting nuc annotations etc
b) make augur translate behave identical whether it reads in a
.gb
or.gff
annotation (that are identical in information).I had opened #950 for a stopgap fix for the "no nuc annotation with .gff" bug, but the actual solution is to read
.gff
properly and refactoraugur translate
to be standards conforming.Below are comments made on the issue and PR respectively.
I got extremely confused by this bug. I encountered it as the following error in a workflow that uses augur translate and just has a .gff as input, not a .gb with nuc annotation.
From @huddlej:
From @emmahodcroft
The text was updated successfully, but these errors were encountered: