-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature skip prokka #64
Conversation
Awesome @fischer-hub thx! Will test and check. We can make the GFF validation a new issue then, just to keep that documented. But good idea. probably an own script checking for the formatting so down-stream tasks work. This is the way. |
From a first code review and test: looks good! What is important: that we dont accidentally mix up the wrong FAA and GFF files. But I think this can anyway not happen in the workflow because we only provide Roary with the The [95, /scratch/hoelzerm/git/ribap-fischer-hub/work/0c/1ad46932b1967bc5afd99bb9834484/95, /scratch/hoelzerm/git/ribap-fischer-hub/work/5d/2a349b68eb982c7568220589ccf6d0/strain_ids.txt, [/scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cav_10DC88_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cav_11DC096_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cga_08-1274-3_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Cga_12-4358_RENAMED.gff, /scratch/hoelzerm/git/ribap-fischer-hub/test-data/Ctr_A-HAR-13_RENAMED.gff]] and the I just wanted to double-check with you. Bc I e.g. saw that you removed the value name here: but which I think we anyway did not use. Or in other words: how is the matching of the results done? Do the GFF and FAA files need to be named like the FASTA genome files + the Another thing: Is the
EDIT: ah probably it is needed currently. Because then the FASTA genome input files are renamed in the first step and then they match the additionally provided FAA/GFF files? |
Not sure how the assignment of annotation / seqs works I would figure its done via the seq headers maybe? Will check the code again, however the samplename I removed from the Prokka output channel was not used anywhere downstream and caused issues when Prokka was skipped, so I removed it. But since the combine_ilp process anyway runs on all results at once I would think it uses file/seq names to match to the samples! EDIT:
The gff files are used by I added a commit to skip the renaming and nucleotide fasta processes when annotation gff and protein fasta are provided, and also added an information print in case someone provides all 3 parameters that the nucleotide fasta will be ignored. Btw, what did the comparison of the outputs say? |
…mation if additionally nucleotyide fasta file is provided
Hey, regarding "mismatching protein faa and annotation GTF": ok! Thanks for checking... I also thought so and that we implemented everything in a way that we can not mix up files of different samples. But always good to double-check. Regarding the comparison: I just tested the Brucella melitensis data set from the paper and compared the output from the paper with the one when I provide the Prokka GFF/FAA (and the genomes) directly: nextflow run /scratch/hoelzerm/git/ribap-fischer-hub/ribap.nf --fasta 'data/genomes/Brucella/melitensis/*.fna' --annotation_file 'data/prokka/brucella/*.gff' --protein_fasta_file 'data/prokka/brucella/*.faa' -profile singularity,slurm --singularityCacheDir /biolibs/mf1/singularity/ribap/ --output 2024-08-28-results-skip-prokka-test The output of the run skipping Prokka:
The output I produced before:
So there are slight differences BUT there might be also slight differences in the pipeline (tool versions, ...). And I noticed smt by accident: In the above command, I inputed 24 genomes via [ad/5b4b9a] process > RIBAP:rename (Brucella_melitensis_strain_BwIM_TUR_39_full_genome) [100%] 24 of 24, cached: 24 ✔
[35/ae1422] process > RIBAP:strain_ids [100%] 1 of 1, cached: 1 ✔
[47/94fcd7] process > RIBAP:roary (1) [100%] 5 of 5, cached: 5 ✔
[0a/31fc78] process > RIBAP:mmseqs2 [100%] 1 of 1, cached: 1 ✔
[4d/e6a193] process > RIBAP:mmseqs2tsv [100%] 1 of 1, cached: 1 ✔
[d0/c849c2] process > RIBAP:ilp_refinement (7) [100%] 9 of 9, cached: 1 ✔
[d4/7fb4e4] process > RIBAP:combine_roary_ilp (1) [100%] 1 of 1 ✔
[26/70062f] process > RIBAP:prepare_msa (1) [100%] 1 of 1 ✔
[51/3a0e77] process > RIBAP:mafft (62) [100%] 74 of 74 ✔
[91/4b8274] process > RIBAP:fasttree (73) [100%] 74 of 74 ✔
[0d/bc4411] process > RIBAP:nw_display (74) [100%] 74 of 74 ✔
[7b/3ccbc6] process > RIBAP:generate_html (1) [100%] 1 of 1 ✔
[e0/39406a] process > RIBAP:generate_upsetr_input (1) [100%] 1 of 1 ✔
[85/be58fd] process > RIBAP:upsetr (1) [100%] 1 of 1 ✔
Completed at: 31-Aug-2024 23:09:56
Duration : 1d 13h 51m 59s
CPU hours : 1'220.5 (6.4% cached)
Succeeded : 235
Cached : 33 @fischer-hub So again this is related to one of my questions above: is the |
I will pull your changes and run another test, now really only on the 24 B. melentitis strains and I do it once w/ Prokka and once skipping Prokka. |
Here are the results of the head-to-head test on 24 Brucella melitensis samples: nextflow run ribap-fischer-hub/ribap.nf --fasta 'data/genomes/Brucella/melitensis/*.fna' -profile singularity,slurm --output 2024-09-02-results-brucella-melitensis-with-prokka
#Completed at: 02-Sep-2024 14:46:04
#Duration : 3h 55m 52s
#CPU hours : 133.6
#Succeeded : 265
We summarized 3474 Roary clusters into 3396 RIBAP groups with the help of the pairwise ILPs.
Of those new 3396 groups, 3035 build core gene set including all 24 input genomes.
# w/o prokka
nextflow run ribap-fischer-hub/ribap.nf --fasta 'data/genomes/Brucella/melitensis/*.fna' --annotation_file '2024-09-02-results-brucella-melitensis-with-prokka/01-prokka/*/*.gff' --protein_fasta_file '2024-09-02-results-brucella-melitensis-with-prokka/01-prokka/*/*.faa' -profile singularity,slurm --output 2024-09-02-results-skip-prokka-test
# Completed at: 02-Sep-2024 18:49:38
# Duration : 3h 45m 55s
# CPU hours : 107.4
# Succeeded : 217
We summarized 3471 Roary clusters into 3394 RIBAP groups with the help of the pairwise ILPs.
Of those new 3394 groups, 3037 build core gene set including all 24 input genomes. So again, slight differences. Not sure why... it might be that there are cases of gene duplicates that can be resolved in different ways (not deterministically). Here are the final RIBAP holy tables (based on ILP-combined Roary 95% results): 2024-09-02-results-brucella-melitensis-skip-prokka.csv @fischer-hub I was not sure how to compare them easily... maybe you have an idea? I dont expect we get exact matching results (based on the summaries) but would be good to know how much different. And anyway, I think this difference then does not come from the switch to "use prokka" vs "skip prokka" but are intrinsic to the current method implementation. |
Thanks for testing! Will look into the CSVs. When you provided the annotation yourself, were these GFF files produced by Prokka too? Im just wondering if the difference in the final result is because of different input annotations or as you said some non-deterministic effect in e.g. the ILPs. |
I first run RIBAP with Prokka. And then I run it a second time without Prokka but giving the Prokka files from the previous run as input. So the Prokka files (and gene IDs) should be exactly the same |
I will merge this now but would still consider this a wip/testing feature. |
Add new parameters
annotation_file
andprotein_fasta_file
that both have to be provided to skip the default annotation withProkka
. Instead the provided annotationGFF
and translated CDS proteinFAA
are used for the downstream pipeline.Here are some example
GFF
andFAA
files that are based on the test data set included in./data
.ribap_test_data.zip
After unzipping them to
./data
the new parameters can be tested as follows:I also changed the code of the
roary
andstrain_ids
processes to accept the.gff3
file extension as it is used by e.g.Bakta
.As mentioned in #35, it might be interesting to introduce some
GFF
format check with this option to make sure the downstream processes don't crash. I tried this with genomeTools gffValidator but I think this might introduce unnecessary issues. For example the defaultBakta GFF3
annotation files run through the pipeline without any issues with the new parameters, but thegffValidator
raises an error on the 4th line already. So we might need some other solution or even some custom script checking for the parts of theGFF
file that we actually need later on. Just as an idea though.Since I ran into a similar issue as reported in #58 while testing I also added as small check that sets the chunk size to 1 if
blastTable
has smaller length than #cores or there are no target strains different from the query strains, resulting in a crash of the script. I think this only happens when running with a very small number of genomes like 2-3 so maybe not so relevant but the user now gets a small error message instead of the python 'out of range' error.closes #35