Snakemake pipeline to handle the installation and running of all pipelines in the pandora paper. It also produces all plots in the paper.
The version used in the pandora paper review has tag pandora_paper_update_31_03_2021
.
The version used in the pandora paper submission has tag pandora_paper_tag1
.
If you are interested only on the paper plots package, it can be viewed here or downloaded here.
- python 3.6+;
- singularity 2.4.1+;
cd installation_pipeline && ./setup.sh
git checkout pandora_paper_update_31_03_2021
scripts/submit_lsf.sh <input_data_folder> <pipeline_output>
git checkout pandora_paper_update_31_03_2021
scripts/submit_lsf_4_way.sh <input_data_folder> <pipeline_output>
Scripts scripts/package_output.sh
and scripts/package_output_4way.sh
package the 20- and 4-way results, respectively,
computed with the previous commands. It creates two zip packages, pandora1_paper_analysis_output_20_way.zip
and
pandora1_paper_analysis_output_4_way.zip
, respectively, containing the data that is needed to reproduce the paper plots
(and some other additional data).
The two previously described packages were already pre-computed and made available. To download, extract, and use them to create the plots in the paper, please do:
git checkout pandora_paper_update_31_03_2021
cd scripts/csv_to_paper_plots && ./generate_paper_plots.sh
The pandora1_paper_analysis_output_20_way.zip
package contains data that
can be reprocessed and replotted. Besides the data, some plots are also included to give a general sense/idea of the data,
and how the plots could look like, but it is by no means the best visualisation of the data.
In what follows is a short description of each file and its fields, to help on the understanding of the files in these
packages.
In many plots regarding recall, we have different ways to measure recall. It is easier to explain them using examples.
Suppose we have three genes, each with one SNP between them. The first gene is rare, present in 2/20 genomes.
The second gene is at an intermediate frequency, in 10/20 genomes.
The third is a strict core gene, present in all genomes.
The SNP in the first gene has alleles A
, C
at 50% frequency (1 A
and 1 C
).
The SNP in the second gene has alleles G
, T
at 50% frequency (5 G
s and 5 T
s).
The SNP in the third gene has alleles A
, T
with 15 A
s and 5 T
s.
Suppose a variant caller found the SNP in the first gene, detecting the two correct alleles.
For the second gene’s SNP, it detected only one G
and one T
, failing to detect either allele in the other 8 genomes.
For the third gene's SNP, it detected all the 5 T
s, but no A
. Here, the:
-
Total Allelic Recall (
TAR
, referenced asrecall_wrt_truth_probes
in the files) would benumber of alleles found / number of total alleles = (2+2+5)/(2+10+20) = 0.28
. The main issue withTAR
is that core variants contribute to a lot more alleles, and thus weight, in the recall calculation than rare variants. For example, if we are in a panel with 20 genomes, one core variant can have the same weight as 10 rare variants. This means that our recall measurement is biased towards tools that recover core variants well. For this reason, this measure is generally not considered in the paper, except when it is needed to compute recall restricted to single samples (which is the only measure defined in this case). To deal with this core variant bias, we have two other measures. -
Pan-Variant Recall (
PVR
, referenced asrecall_wrt_variants_where_all_allele_seqs_were_found
in the files) would be:(1 + 1 + 0) / 3 = 0.66
- i.e. score a1
if both alleles are found, irrespective of how often, and0
otherwise. This measure gives the same weight for core and rare variants. -
Average Allelic Recall (
AvgAR
, referenced asrecall_wrt_variants_found_wrt_alleles
) would be:(2/2 + 2/10 + 5/20) / 3 = 0.48
- i.e. for each variant, thenumber of alleles found / total number of alleles
. This is a balance between the two previous measures. It does not overweight core variants (all variants have a recall value in[0, 1]
), and it rewards tools finding more alleles of a variant.
For more details, please see the paper manuscript.
These folders/files refer to sequencing-technology-dependent analyses, present in the folders
illumina_analysis
, illumina_analysis_with_filters
, nanopore_analysis
, and nanopore_analysis_with_filters
.
For details about the filters applied in these results, see the Methods section in the paper manuscript.
Provides for each tool and sample, the error rate the pipeline computed for that tool on that sample. Can be mainly used to check if there is a sample with issue (i.e. a sample where all tools consistently perform bad). These values are not filtered by genotype confidence.
Main file: enrichment_of_FPs.csv
Preview:
GT,step_GT,precision,error_rate,nb_of_correct_calls,nb_of_total_calls,sample,tool,coverage,coverage_threshold,strand_bias_threshold,gaps_threshold
0,0,0.994254362790382,0.00574563720961796,173504.3460874612,174507.0,063_STEC,pandora_illumina_nodenovo_global_genotyping,100x,0,0.0,1.0
0,0,0.9936651659229959,0.006334834077003859,184938.97335125625,186118.0,CFT073,pandora_illumina_nodenovo_global_genotyping,100x,0,0.0,1.0
0,0,0.9941746924963222,0.005825307503677824,198630.1385126102,199794.0,Escherichia_coli_MINF_1A,pandora_illumina_nodenovo_global_genotyping,100x,0,0.0,1.0
Fields explanation:
Important fields:
precision
anderror_rate
: the precision and error rate of the tool for that particular sample;sample
andtool
: the sample and tool took into consideration;
Other fields:
GT
andstep_GT
: ignore, not relevant here;nb_of_correct_calls
andnb_of_total_calls
: raw values used to computeprecision
anderror_rate
;coverage,coverage_threshold,strand_bias_threshold,gaps_threshold
: details about the run;
Provides data about the amount of FP, TP, FN, and TN loci found by pandora.
Main files:
gene_classification.csv
: contains the counts of FP, TP, FN, and TN loci (columns) for pandora in all samples;gene_classification_by_sample.csv
: contains the counts of FP, TP, FN, and TN loci (columns) for pandora in each sample (rows);gene_classification_by_gene_length.csv
: contains the counts of FP, TP, FN, and TN loci (1st column) split by locus length (remaining columns), from locus up to 100, 200, 300, ... 4100+ bps. The header is composed of the 3 first lines (a pandas multirow header);gene_classification_by_gene_length_normalised.csv
: the same asgene_classification_by_gene_length.csv
, but with proportions instead of counts;
Provides data about the edit distance between locus sequences in the references (pandora inferred VCF-reference + the other 24 references) and in the 20 samples.
Main file:
gene_sample_ref_ED_nbsamples_zam.csv
Preview:
gene_name,sample,ref,edit_distance,edit_distance_labels,nb_of_samples
GC00003501,063_STEC,NC_011993.1,0.010101010101010102,0.02,20
GC00003501,CFT073,NC_011993.1,0.0,0.01,20
GC00003501,Escherichia_coli_MINF_1A,NC_011993.1,0.060606060606060615,0.07,20
GC00003501,Escherichia_coli_MINF_1D,NC_011993.1,0.0505050505050505,0.060000000000000005,20
Fields explanation:
gene_name
,sample
,ref
,edit_distance
: self explanatory;nb_of_samples
: the number of samples the given gene was identified in;edit_distance_labels
: ignore;
Shows a breakdown of the precision of each tool (snippy, samtools, medaka, etc) against pandora. We have one plot for each reference the tool used, and the precision on each of the samples. The background colour is coloured by the reference clade. Pandora precision is shown as lines. There is one csv for each tool.
Main files: precision_per_ref_per_clade_{tool}_pandora.csv
, tool == snippy
or samtools
or medaka
or nanopolish
Preview:
tool,sample,precision,ref,tool_and_ref
Medaka,Escherichia_coli_MINF_1A,0.831789544979302,C4 (CP010121.1),Medaka / C4 (CP010121.1)
Pandora illumina no denovo,063_STEC,0.994254362790382,PRG,Pandora illumina no denovo / PRG
Fields explanation:
tool,sample,precision,ref
: self explanatory;tool_and_ref
: tool and ref concatenated
Same as precision_per_ref_per_clade, but for recall instead of precision.
The recall considered here is the Total Allelic Recall (recall_wrt_truth_probes
), as it is the only defined for a single sample.
Main files: recall_per_ref_per_clade_{tool}_pandora.csv
, tool == snippy
or samtools
or medaka
or nanopolish
Similar to recall_per_ref_per_clade, but we have data split by frequency (or number of samples) the
allele of the pangenome variant is in. The field nb_of_samples
represent this.
Main files: recall_per_ref_per_nb_of_samples_per_clade.{tool}_pandora.nb_of_samples_{nb}.csv
, tool == snippy
or samtools
or medaka
or nanopolish
, nb
ranges from 2 to 20;
The input csv for this (precision_per_sample.tsv
) is actually the same as enrichment_of_FPs.
Just the plots are different. Could well be merged.
Main file: precision_per_sample.tsv
Similar to precision_per_sample. It features all the three types of recall discussed on The different types of recall. These values are not filtered by genotype confidence.
Main file: recall_per_sample.tsv
Preview:
GT step_GT recalls_wrt_truth_probes nbs_of_truth_probes_found nbs_of_truth_probes_in_total recalls_wrt_variants_where_all_allele_seqs_were_found recalls_wrt_variants_found_wrt_alleles nbs_variants_where_all_allele_seqs_were_found nbs_variants_found_wrt_alleles nbs_variants_total tool coverage coverage_threshold strand_bias_threshold gaps_threshold sample
0 0 0.8659094229759785 320028 369586 0.0 0.06964684242136669 0 25740.497903143234 369586 samtools_NZ_CP018109.1 100x 0 Not_App Not_App Escherichia_coli_MINF_9A
0 0 0.955989627165643 335104 350531 0.0 0.06855860932737412 0 24031.849327524455 350530 samtools_NC_007779.1 100x 0 Not_App Not_App H131800734
0 0 0.8728209202393062 325193 372577 0.0 0.06779792683112228 0 25259.609195324887 372572 snippy_NC_022648.1 100x 0 Not_App Not_App Escherichia_coli_MSB1_4I
Fields explanation:
Important fields:
recalls_wrt_truth_probes, recalls_wrt_variants_where_all_allele_seqs_were_found, recalls_wrt_variants_found_wrt_alleles
: the three recall values;tool
andsample
: the tool and the sample;
Other fields:
GT
andstep_GT
: ignore, not relevant here;coverage coverage_threshold, strand_bias_threshold, gaps_threshold
: the filtering options;nbs_of_truth_probes_found, nbs_of_truth_probes_in_total, nbs_variants_where_all_allele_seqs_were_found, nbs_variants_found_wrt_alleles, nbs_variants_total
: values used to compute the 3 recalls;
Data about the recall of each tool split by the pangenome variants frequency (nb of samples a pangenome variant is in).
The recalls considered here are: i) Pan-Variant Recall (PVR
, recall_wrt_variants_where_all_allele_seqs_were_found
);
ii) Average Allelic Recall (AvgAR
, recall_wrt_variants_found_wrt_alleles
).
Main files:
recall_per_nb_of_samples_pvr.plot_data.csv
(forPVR
);recall_per_nb_of_samples_avgar.plot_data.csv
(forAvgAR
);
Preview (for PVR
, AvgAR
is similar):
coverage,tool,coverage_threshold,strand_bias_threshold,gaps_threshold,NB_OF_SAMPLES,recall,recall_PVR,NUMBER_OF_SAMPLES,total_nb_of_PanVar,nb_of_found_PanVar,cumulative_nb_of_found_PanVar,colour
100x,pandora_illumina_nodenovo_global_genotyping,0,0.0,1.0,2,,0.4584300168900254,2,63351,29041.999999999996,29041.999999999996,orange
100x,pandora_illumina_withdenovo_global_genotyping,0,0.0,1.0,2,,0.4929203958895677,2,63351,31227.000000000004,31227.000000000004,blue
100x,snippy_NC_011993.1,0,Not_App,Not_App,2,,0.30694069549020536,2,63351,19445.0,19445.0,"rgba(255,0,0,0.3)"
Fields explanation:
Important fields:
tool
: the considered tool;recall_PVR
: thePVR
;NB_OF_SAMPLES
orNUMBER_OF_SAMPLES
: represent the recall at the given frequency. Duplicated fields;nb_of_found_PanVar
: the number of pangenomes variations found. Can be used to do the plot with absolute counts;cumulative_nb_of_found_PanVar
: the number of pangenomes variations found cumulatively (i.e. withNB_OF_SAMPLES
less than the number of samples of this record). Can be used to do the plot with cumulative absolute counts;
Other fields:
total_nb_of_PanVar
: total number of pangenome variations at that frequency;coverage,coverage_threshold,strand_bias_threshold,gaps_threshold
: filtering options;colour
: ignore;recall
: empty field, to be removed;
For AvgAR
, we have recall_AvgAR
instead of recall_PVR
.
The ROC data for the considered tools
Preview:
tool coverage coverage_threshold strand_bias_threshold gaps_threshold step_GT error_rate nb_of_correct_calls nb_of_total_calls recalls_wrt_truth_probes nbs_of_truth_probes_found nbs_of_truth_probes_in_total recalls_wrt_variants_where_all_allele_seqs_were_found recalls_wrt_variants_found_wrt_alleles nbs_variants_where_all_allele_seqs_were_found nbs_variants_found_wrt_alleles nbs_variants_total
0 pandora_illumina_nodenovo_global_genotyping 100x 0 0.0 1.0 0 0.0059087704272498 3857627.6795571432 3880557.0 0.8938190838656083 6208302 6945815 0.7444157818552333 0.8497949421959974 460276 525432.4617344962 618305
1 pandora_illumina_nodenovo_global_genotyping 100x 0 0.0 1.0 1 0.005903728676744491 3857100.4914081306 3880007.0 0.8937897705294634 6206385 6943898 0.7441683311634226 0.8494682071001245 460123 525230.4397910425 618305
2 pandora_illumina_nodenovo_global_genotyping 100x 0 0.0 1.0 2 0.005896919341595419 3856687.518114524 3879565.0 0.8937732947457525 6205308 6942821 0.7438966205998657 0.8492857257272203 459955 525117.6106457689 618305
Fields explanation:
Important fields:
tool,coverage
: the considered tool and the read coverage we used;error_rate,recalls_wrt_truth_probes,recalls_wrt_variants_where_all_allele_seqs_were_found,recalls_wrt_variants_found_wrt_alleles
: the error rate and the 3 recall measures;coverage_threshold,strand_bias_threshold,gaps_threshold
: the filters that were applied;
Other fields:
step_GT
: the genotype filtering step used. The lower (higher) the value, the more lenient (strict) the GT filtering was. 0 is unfiltered, and the max value is the max filtering we set for the pipeline;nb_of_correct_calls,nb_of_total_calls,nbs_of_truth_probes_found,nbs_of_truth_probes_in_total,nbs_variants_where_all_allele_seqs_were_found,nbs_variants_found_wrt_alleles,nbs_variants_total
: fields used to compute the error rate and the recalls;
These folders/files refer to sequencing-technology-independent analyses, present in the folder technology_independent_analysis
.
Provides data to check how many pangenome variants are in 2, 3, 4, ... samples.
Main file: pangenome_variations_per_nb_of_samples.csv
Preview:
PVID,NUMBER_OF_SAMPLES
0,2
1,4
2,2
3,20
Fields explanation:
PVID
: the identifier of the pangenome variation;NUMBER_OF_SAMPLES
: the number of samples the specified pangenome variation is in;
Provides data to build the 2-SNP heatmap (sharing of variants present in precisely 2 genomes).
Main file: two_SNP_heatmap_count_df.csv
Preview:
FIRST_SAMPLE,SECOND_SAMPLE,count
063_STEC,063_STEC,0
063_STEC,CFT073,4138
063_STEC,H131800734,67
063_STEC,MINF_1A,1964
Fields explanation:
count
is the number of pangenome variants with frequency 2 found between the FIRST_SAMPLE
and SECOND_SAMPLE
.