Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latest alignment to Pathogen repo guide and add frequency panel #88

Merged
merged 10 commits into from
Jan 4, 2025

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Dec 18, 2024

Description of proposed changes

Since I was already in this repository to add the frequency panel, I decided to initiate an update to the workflow to align it with the latest version of the Pathogen Repo Guide.

Draft Frequencies Plot

Added frequencies plots from Actions run: https://github.com/nextstrain/dengue/actions/runs/12433607449

full genome frequencies:

E gene frequencies:

Feedback welcome

Related issue(s)

Checklist

  • Checks pass

Since auspice automatically detects "authors" and we prefer the abbreviated authors
list displayed, change "abbr_authors" to "authors" and "authors" to "full_authors".

This matches the pathogen-repo-guide
To match the pathogen repo guide, change:

* `genbank_accession` to `accession`
* `genbank_accession_rev` to `accession_version`

There should be a subsequent change in the phylogenetic workflow
To be more consistent with the pathogen-repo-guide and the ingest workflow
Use the combination of `accession` and `url` fields in the phylogenetic build.
This change follows a similar change to the ingest workflow.
@j23414 j23414 force-pushed the pathogen-repo-guide branch from 073bcf2 to a56685a Compare December 18, 2024 17:38
@j23414 j23414 changed the title WIP: Latest alignment to Pathogen repo guide WIP: Latest alignment to Pathogen repo guide and add frequency panel Dec 20, 2024
@j23414 j23414 force-pushed the pathogen-repo-guide branch from 98d7a19 to 65520e5 Compare December 20, 2024 20:58
@j23414 j23414 changed the title WIP: Latest alignment to Pathogen repo guide and add frequency panel Latest alignment to Pathogen repo guide and add frequency panel Dec 20, 2024
@j23414 j23414 marked this pull request as ready for review December 20, 2024 20:58
@trvrb
Copy link
Member

trvrb commented Dec 20, 2024

Something is off in the frequency calculations. Here's a view from https://nextstrain.org/staging/dengue/trials/2024bfreq/dengue/all/genome?d=tree,frequencies&f_region=South%20America that just filters to South American sequences. I can mouse-over and clearly see a large number of samples from the last five years. However, none of these samples are showing up in the frequencies panel.

dengue

You can see a similar issue without even filtering. It looks there's a single sample that's contributing to the KDE frequencies post 2020.

Screenshot 2024-12-20 at 1 17 41 PM

@trvrb
Copy link
Member

trvrb commented Dec 20, 2024

Compare measles example here that without normalizing frequencies you should have things sum to 100% without any filters in place:

Screenshot 2024-12-20 at 1 24 51 PM

@j23414
Copy link
Contributor Author

j23414 commented Dec 20, 2024

Thank you! I was also expecting frequencies to sum up to 100%. The code changes to add frequencies panel are here, in case someone sees what's wrong faster than I can:

I'm not entirely certain what is causing the empty regions:

  • perhaps unassigned lineage strains?
  • perhaps this requires fine tuning the bandwidth parameters? I'm looking at the augur frequencies documentation
  • perhaps drop the max-date 6M filter (might be why recent samples are missing)

I was mainly following frequency parameters for yellow-fever and measles but feel like I'm missing a nuance somewhere.

output:
tip_freq = "auspice/dengue_{serotype}_{gene}_tip-frequencies.json"
params:
strain_id = config["strain_id_field"],
Copy link
Contributor

@joverlee521 joverlee521 Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely certain what is causing the empty regions

I was curious what was going on here, so I took a look. There is mismatch in the strain id used in the main JSON and the frequencies JSON so Auspice cannot link the strains to their frequencies.

Since the main JSON is getting it's strain name set to the display_strain_field in final_strain_name, the frequencies JSON needs to use the same field here:

Suggested change
strain_id = config["strain_id_field"],
strain_id = config["display_strain_field"],

Copy link
Contributor

@joverlee521 joverlee521 Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the underlying issue was correct, but the suggestion caused CI to fail. This because the newick tree uses the accession as id so it cannot be matched with the metadata strain.

Separately discussed with @j23414 to remove set_final_strain_name.py as suggested in nextstrain/public#5 so that both use the accession as the strain id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just logging that @joverlee521 and I chatted that augur frequencies takes tree.json file, which is ID'd on accession to match against the metadata.tsv file. Ergo, it might be time to remove set_final_strain_name.py so that both the main JSON and the frequencies JSON use the accession as the strain id.

This falls inline with nextstrain/public#5, see nextstrain/zika#72 as an example

I'll make the changes and re-run tests, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice debugging! Whenever things look strange in Auspice I'd suggest first checking the console logs which often contain useful information (i.e. the browser's developer console when viewing Auspice). In this case it had hundreds of warnings logged - perhaps Jover did just this?

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good to know that Auspice warns about this! I didn't look at the console logs, I was just inspecting the frequencies JSON and noticed the strain names were all accessions.

@j23414 j23414 force-pushed the pathogen-repo-guide branch from 34bad54 to 357eea4 Compare December 31, 2024 23:22
j23414 and others added 2 commits December 31, 2024 15:28
…al auspice JSON file

@j23414 initially believed that the Augur frequency function could utilize the 'accession' strain IDs to match the tree.json file. However, @joverlee521 pointed out that the frequencies JSON actually requires matching the final 'strain' IDs as found in the Auspice JSON file. Otherwise Auspice cannot link the strains to their frequencies.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Remove set_final_strain_name.py so that both the main JSON and the
frequencies JSON use the accession as the strain id. This simplifies the
augur frequencies call and the "accession" vs "strain" confusion.
@j23414 j23414 force-pushed the pathogen-repo-guide branch from 357eea4 to b8b2fa3 Compare December 31, 2024 23:30
@trvrb
Copy link
Member

trvrb commented Jan 4, 2025

This looks good to me now. Thanks for fixing things up.

@j23414 j23414 merged commit 8b784e8 into main Jan 4, 2025
13 checks passed
@j23414 j23414 deleted the pathogen-repo-guide branch January 4, 2025 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants