-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Latest alignment to Pathogen repo guide and add frequency panel #88
Conversation
This commit updates the `ingest` directory to match the latest version of the: https://github.com/nextstrain/pathogen-repo-guide/tree/b4001735fa55448f1426ff90cca93c667ca121ce/ingest
Since auspice automatically detects "authors" and we prefer the abbreviated authors list displayed, change "abbr_authors" to "authors" and "authors" to "full_authors". This matches the pathogen-repo-guide
To match the pathogen repo guide, change: * `genbank_accession` to `accession` * `genbank_accession_rev` to `accession_version` There should be a subsequent change in the phylogenetic workflow
To be more consistent with the pathogen-repo-guide and the ingest workflow
Use the combination of `accession` and `url` fields in the phylogenetic build. This change follows a similar change to the ingest workflow.
073bcf2
to
a56685a
Compare
98d7a19
to
65520e5
Compare
Something is off in the frequency calculations. Here's a view from https://nextstrain.org/staging/dengue/trials/2024bfreq/dengue/all/genome?d=tree,frequencies&f_region=South%20America that just filters to South American sequences. I can mouse-over and clearly see a large number of samples from the last five years. However, none of these samples are showing up in the frequencies panel. You can see a similar issue without even filtering. It looks there's a single sample that's contributing to the KDE frequencies post 2020. |
Thank you! I was also expecting frequencies to sum up to 100%. The code changes to add frequencies panel are here, in case someone sees what's wrong faster than I can: I'm not entirely certain what is causing the empty regions:
I was mainly following frequency parameters for yellow-fever and measles but feel like I'm missing a nuance somewhere. |
output: | ||
tip_freq = "auspice/dengue_{serotype}_{gene}_tip-frequencies.json" | ||
params: | ||
strain_id = config["strain_id_field"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely certain what is causing the empty regions
I was curious what was going on here, so I took a look. There is mismatch in the strain id used in the main JSON and the frequencies JSON so Auspice cannot link the strains to their frequencies.
Since the main JSON is getting it's strain name set to the display_strain_field
in final_strain_name
, the frequencies JSON needs to use the same field here:
strain_id = config["strain_id_field"], | |
strain_id = config["display_strain_field"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the underlying issue was correct, but the suggestion caused CI to fail. This because the newick tree uses the accession as id so it cannot be matched with the metadata strain.
Separately discussed with @j23414 to remove set_final_strain_name.py
as suggested in nextstrain/public#5 so that both use the accession as the strain id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just logging that @joverlee521 and I chatted that augur frequencies takes tree.json
file, which is ID'd on accession
to match against the metadata.tsv
file. Ergo, it might be time to remove set_final_strain_name.py
so that both the main JSON and the frequencies JSON use the accession as the strain id.
This falls inline with nextstrain/public#5, see nextstrain/zika#72 as an example
I'll make the changes and re-run tests, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good to know that Auspice warns about this! I didn't look at the console logs, I was just inspecting the frequencies JSON and noticed the strain names were all accessions.
34bad54
to
357eea4
Compare
…al auspice JSON file @j23414 initially believed that the Augur frequency function could utilize the 'accession' strain IDs to match the tree.json file. However, @joverlee521 pointed out that the frequencies JSON actually requires matching the final 'strain' IDs as found in the Auspice JSON file. Otherwise Auspice cannot link the strains to their frequencies. Co-authored-by: Jover Lee <joverlee521@gmail.com>
Remove set_final_strain_name.py so that both the main JSON and the frequencies JSON use the accession as the strain id. This simplifies the augur frequencies call and the "accession" vs "strain" confusion.
357eea4
to
b8b2fa3
Compare
This looks good to me now. Thanks for fixing things up. |
Description of proposed changes
Since I was already in this repository to add the frequency panel, I decided to initiate an update to the workflow to align it with the latest version of the Pathogen Repo Guide.
Draft Frequencies Plot
Added frequencies plots from Actions run: https://github.com/nextstrain/dengue/actions/runs/12433607449
full genome frequencies:
E gene frequencies:
Feedback welcome
Related issue(s)
config
todefaults
#77Checklist