Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature skip prokka #64

Merged
merged 15 commits into from
Oct 16, 2024
Merged

Conversation

fischer-hub
Copy link
Collaborator

Add new parameters annotation_file and protein_fasta_file that both have to be provided to skip the default annotation with Prokka. Instead the provided annotation GFF and translated CDS protein FAA are used for the downstream pipeline.

Here are some example GFF and FAA files that are based on the test data set included in ./data.
ribap_test_data.zip

After unzipping them to ./data the new parameters can be tested as follows:

nextflow run ribap.nf --fasta 'data/*.fasta' --annotation_file 'data/*.gff' --protein_fasta_file 'data/*.faa' -profile conda,slurm

I also changed the code of the roary and strain_ids processes to accept the .gff3 file extension as it is used by e.g. Bakta.
As mentioned in #35, it might be interesting to introduce some GFF format check with this option to make sure the downstream processes don't crash. I tried this with genomeTools gffValidator but I think this might introduce unnecessary issues. For example the default Bakta GFF3 annotation files run through the pipeline without any issues with the new parameters, but the gffValidator raises an error on the 4th line already. So we might need some other solution or even some custom script checking for the parts of the GFF file that we actually need later on. Just as an idea though.

Since I ran into a similar issue as reported in #58 while testing I also added as small check that sets the chunk size to 1 if blastTable has smaller length than #cores or there are no target strains different from the query strains, resulting in a crash of the script. I think this only happens when running with a very small number of genomes like 2-3 so maybe not so relevant but the user now gets a small error message instead of the python 'out of range' error.

closes #35

@hoelzer
Copy link
Contributor

hoelzer commented Aug 28, 2024

Awesome @fischer-hub thx! Will test and check.

We can make the GFF validation a new issue then, just to keep that documented. But good idea. probably an own script checking for the formatting so down-stream tasks work. This is the way.

@hoelzer
Copy link
Contributor

hoelzer commented Aug 28, 2024

From a first code review and test: looks good! What is important: that we dont accidentally mix up the wrong FAA and GFF files. But I think this can anyway not happen in the workflow because we only provide Roary with the gff_ch (and the GFF also includes the whole genome sequence from which Roary extracts the CDS). The faa_ch is later used in mmseqs to also calculate the alignments.

The combine_ch then looks like this:

[95, /scratch/hoelzerm/git/ribap-fischer-hub/work/0c/1ad46932b1967bc5afd99bb9834484/95, /scratch/hoelzerm/git/ribap-fischer-hub/work/5d/2a349b68eb982c7568220589ccf6d0/strain_ids.txt, [/scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cav_10DC88_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cav_11DC096_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cga_08-1274-3_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cga_12-4358_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Ctr_A-HAR-13_RENAMED.gff]]

and the combine_roary_ilp.py script should take care of matching the correct files belonging to the same strain.

I just wanted to double-check with you. Bc I e.g. saw that you removed the value name here:

https://github.com/hoelzer-lab/ribap/pull/64/files#diff-ee63ef103c15d600628d1e4d262091219d58c3706387d27bc4a9fbcd3cf94c33L12

but which I think we anyway did not use.

Or in other words: how is the matching of the results done? Do the GFF and FAA files need to be named like the FASTA genome files + the _RENAMED part?


Another thing: Is the --fasta input of the genomes actually still needed when we anyway skip the annotation step? So one can run the pipeline either via

  1. genome FASTA input (will run annotation, --fasta
  2. protein FASTA and annotation GFF input (will skip annotation, --annotation_file, --protein_fasta_file)

EDIT: ah probably it is needed currently. Because then the FASTA genome input files are renamed in the first step and then they match the additionally provided FAA/GFF files?


@fischer-hub
Copy link
Collaborator Author

fischer-hub commented Aug 29, 2024

Not sure how the assignment of annotation / seqs works I would figure its done via the seq headers maybe? Will check the code again, however the samplename I removed from the Prokka output channel was not used anywhere downstream and caused issues when Prokka was skipped, so I removed it. But since the combine_ilp process anyway runs on all results at once I would think it uses file/seq names to match to the samples!

EDIT:
After looking into the code again, the protein fasta files are only used by mmseqs2 which combines all sequences before clustering:

    cat *.faa > mmseq2/all_proteins.fa

The gff files are used by Roary which runs on all gffs to build the pangenome graph and by combine_roary_ilp.py which pulls feature names and annotations from the gff files. I think we don't get into the situation of mismatching gff and protein fasta in the first place :)

I added a commit to skip the renaming and nucleotide fasta processes when annotation gff and protein fasta are provided, and also added an information print in case someone provides all 3 parameters that the nucleotide fasta will be ignored.

Btw, what did the comparison of the outputs say?

…mation if additionally nucleotyide fasta file is provided
@hoelzer
Copy link
Contributor

hoelzer commented Sep 2, 2024

Hey,

regarding "mismatching protein faa and annotation GTF": ok! Thanks for checking... I also thought so and that we implemented everything in a way that we can not mix up files of different samples. But always good to double-check.

Regarding the comparison:


I just tested the Brucella melitensis data set from the paper and compared the output from the paper with the one when I provide the Prokka GFF/FAA (and the genomes) directly:

nextflow run /scratch/hoelzerm/git/ribap-fischer-hub/ribap.nf --fasta 'data/genomes/Brucella/melitensis/*.fna' --annotation_file 'data/prokka/brucella/*.gff' --protein_fasta_file 'data/prokka/brucella/*.faa' -profile singularity,slurm --singularityCacheDir /biolibs/mf1/singularity/ribap/ --output 2024-08-28-results-skip-prokka-test

The output of the run skipping Prokka:

We summarized 7627 Roary clusters into 5677 RIBAP groups with the help of the pairwise ILPs. Of those new 5677 groups, 2449 build core gene set including all 71 input genomes.

The output I produced before:

We summarized 7644 Roary clusters into 5675 RIBAP groups with the help of the pairwise ILPs. Of those new 5675 groups, 2444 build core gene set including all 71 input genomes.

So there are slight differences BUT there might be also slight differences in the pipeline (tool versions, ...).

And I noticed smt by accident: In the above command, I inputed 24 genomes via --fasta (only Brucella melitensis) BUT I provided FAA and GFF files for all 71 Brucella species (including the 24 melitensis but also other species). Nevertheless, the pipeline run trough and apparently on all 71 strains:

[ad/5b4b9a] process > RIBAP:rename (Brucella_melitensis_strain_BwIM_TUR_39_full_genome) [100%] 24 of 24, cached: 24 ✔
[35/ae1422] process > RIBAP:strain_ids                                                  [100%] 1 of 1, cached: 1 ✔
[47/94fcd7] process > RIBAP:roary (1)                                                   [100%] 5 of 5, cached: 5 ✔
[0a/31fc78] process > RIBAP:mmseqs2                                                     [100%] 1 of 1, cached: 1 ✔
[4d/e6a193] process > RIBAP:mmseqs2tsv                                                  [100%] 1 of 1, cached: 1 ✔
[d0/c849c2] process > RIBAP:ilp_refinement (7)                                          [100%] 9 of 9, cached: 1 ✔
[d4/7fb4e4] process > RIBAP:combine_roary_ilp (1)                                       [100%] 1 of 1 ✔
[26/70062f] process > RIBAP:prepare_msa (1)                                             [100%] 1 of 1 ✔
[51/3a0e77] process > RIBAP:mafft (62)                                                  [100%] 74 of 74 ✔
[91/4b8274] process > RIBAP:fasttree (73)                                               [100%] 74 of 74 ✔
[0d/bc4411] process > RIBAP:nw_display (74)                                             [100%] 74 of 74 ✔
[7b/3ccbc6] process > RIBAP:generate_html (1)                                           [100%] 1 of 1 ✔
[e0/39406a] process > RIBAP:generate_upsetr_input (1)                                   [100%] 1 of 1 ✔
[85/be58fd] process > RIBAP:upsetr (1)                                                  [100%] 1 of 1 ✔
Completed at: 31-Aug-2024 23:09:56
Duration    : 1d 13h 51m 59s
CPU hours   : 1'220.5 (6.4% cached)
Succeeded   : 235
Cached      : 33

@fischer-hub So again this is related to one of my questions above: is the --fasta input of the genomes actually necessary when one is using FAA/GFF input and skipping Prokka?

@hoelzer
Copy link
Contributor

hoelzer commented Sep 2, 2024

I will pull your changes and run another test, now really only on the 24 B. melentitis strains and I do it once w/ Prokka and once skipping Prokka.

@hoelzer
Copy link
Contributor

hoelzer commented Sep 3, 2024

Here are the results of the head-to-head test on 24 Brucella melitensis samples:

nextflow run ribap-fischer-hub/ribap.nf --fasta 'data/genomes/Brucella/melitensis/*.fna' -profile singularity,slurm --output 2024-09-02-results-brucella-melitensis-with-prokka
#Completed at: 02-Sep-2024 14:46:04
#Duration    : 3h 55m 52s
#CPU hours   : 133.6
#Succeeded   : 265

We summarized 3474 Roary clusters into 3396 RIBAP groups with the help of the pairwise ILPs.
Of those new 3396 groups, 3035 build core gene set including all 24 input genomes.

# w/o prokka
nextflow run ribap-fischer-hub/ribap.nf --fasta 'data/genomes/Brucella/melitensis/*.fna' --annotation_file '2024-09-02-results-brucella-melitensis-with-prokka/01-prokka/*/*.gff' --protein_fasta_file '2024-09-02-results-brucella-melitensis-with-prokka/01-prokka/*/*.faa' -profile singularity,slurm --output 2024-09-02-results-skip-prokka-test
# Completed at: 02-Sep-2024 18:49:38
# Duration    : 3h 45m 55s
# CPU hours   : 107.4
# Succeeded   : 217

We summarized 3471 Roary clusters into 3394 RIBAP groups with the help of the pairwise ILPs.
Of those new 3394 groups, 3037 build core gene set including all 24 input genomes.

So again, slight differences. Not sure why... it might be that there are cases of gene duplicates that can be resolved in different ways (not deterministically).

Here are the final RIBAP holy tables (based on ILP-combined Roary 95% results):

2024-09-02-results-brucella-melitensis-skip-prokka.csv
2024-09-02-results-brucella-melitensis-with-prokka.csv

@fischer-hub I was not sure how to compare them easily... maybe you have an idea? I dont expect we get exact matching results (based on the summaries) but would be good to know how much different.

And anyway, I think this difference then does not come from the switch to "use prokka" vs "skip prokka" but are intrinsic to the current method implementation.

@fischer-hub
Copy link
Collaborator Author

Thanks for testing! Will look into the CSVs. When you provided the annotation yourself, were these GFF files produced by Prokka too? Im just wondering if the difference in the final result is because of different input annotations or as you said some non-deterministic effect in e.g. the ILPs.

@hoelzer
Copy link
Contributor

hoelzer commented Sep 3, 2024

I first run RIBAP with Prokka. And then I run it a second time without Prokka but giving the Prokka files from the previous run as input. So the Prokka files (and gene IDs) should be exactly the same

@hoelzer
Copy link
Contributor

hoelzer commented Oct 16, 2024

I will merge this now but would still consider this a wip/testing feature.

@hoelzer hoelzer merged commit 812b1f8 into hoelzer-lab:main Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Start ribap from own annotation
2 participants