Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Snippy_Variants QC outputs to Snippy_Tree and Snippy_Sreamline workflow outputs #592

Merged
merged 16 commits into from
Nov 6, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions docs/workflows/phylogenetic_construction/snippy_streamline.md
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,32 @@ For all cases:

`Snippy_Variants` aligns reads for each sample against the reference genome. As part of `Snippy_Streamline`, the only output from this workflow is the `snippy_variants_outdir_tarball` which is provided in the set-level data table. Please see the full documentation for [Snippy_Variants](./snippy_variants.md) for more information.

??? task "snippy_variants (qc_metrics output)"

##### snippy_variants {#snippy_variants}

This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

**Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.

??? task "Snippy_Tree workflow"

##### Snippy_Tree {#snippy_tree}
Expand All @@ -188,6 +214,7 @@ For all cases:
| snippy_centroid_version | String | Centroid version used |
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_concatenated_variants | File | The concatenated variants file |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
| snippy_final_tree | File | Final phylogenetic tree produced by Snippy_Streamline |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,34 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a

**If reference genomes have multiple contigs, they will not be compatible with using Gubbins** to mask recombination in the phylogenetic tree. The automatic selection of a reference genome by the workflow may result in a reference with multiple contigs. In this case, an alternative reference genome should be sought.

### Workflow Tasks

??? task "snippy_variants (qc_metrics output)"

##### snippy_variants {#snippy_variants}

This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

**Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.

### Inputs

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
Expand Down Expand Up @@ -117,6 +145,7 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
| snippy_centroid_samplename | String | Name of the centroid sample |
| snippy_centroid_version | String | Centroid version used |
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
| snippy_concatenated_variants | File | The concatenated variants file |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
Expand Down
27 changes: 27 additions & 0 deletions docs/workflows/phylogenetic_construction/snippy_tree.md
Original file line number Diff line number Diff line change
Expand Up @@ -306,12 +306,39 @@ Sequencing data used in the Snippy_Tree workflow must:
| Task | task_shared_variants.wdl |
| Software Source Code | [task_shared_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/phylogenetic_inference/utilities/task_shared_variants.wdl) |

??? task "snippy_variants (qc_metrics output)"

##### snippy_variants {#snippy_variants}

This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:

- **samplename**: The name of the sample.
- **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
- **total_reads**: The total number of reads in the sample.
- **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
- **variants_total**: The total number of variants detected between the sample and the reference genome.
- **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
- **#rname**: Reference sequence name (e.g., chromosome or contig name).
- **startpos**: Starting position of the reference sequence.
- **endpos**: Ending position of the reference sequence.
- **numreads**: Number of reads covering the reference sequence.
- **covbases**: Number of bases with coverage.
- **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
- **meandepth**: Mean depth of coverage over the reference sequence.
- **meanbaseq**: Mean base quality over the reference sequence.
- **meanmapq**: Mean mapping quality over the reference sequence.

These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.

**Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.

### Outputs

| **Variable** | **Type** | **Description** |
|---|---|---|
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_concatenated_variants | File | Concatenated snippy_results file across all samples in the set |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
| snippy_final_tree | File | Newick tree produced from the final alignment. Depending on user input for core_genome, the tree could be a core genome tree (default when core_genome is true) or whole genome tree (if core_genome is false) |
Expand Down
7 changes: 7 additions & 0 deletions docs/workflows/phylogenetic_construction/snippy_variants.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,13 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f

`Snippy_Variants` uses the snippy tool to align reads to the reference and call SNPs, MNPs and INDELs according to optional input parameters. The output includes a file of variants that is then queried using the `grep` bash command to identify any mutations in specified genes or annotations of interest. The query string MUST match the gene name or annotation as specified in the GenBank file and provided in the output variant file in the `snippy_results` column.

Additionally, `Snippy_Variants` extracts quality control (QC) metrics from the Snippy output for each sample. These per-sample QC metrics are saved in TSV files (`snippy_variants_qc_metrics`). The QC metrics include:

- **Percentage of reads aligned to the reference genome** (`snippy_variants_percent_reads_aligned`).
- **Percentage of the reference genome covered at or above the specified depth threshold** (`snippy_variants_percent_ref_coverage`).

These per-sample QC metrics can be combined into a single file (`snippy_combined_qc_metrics`) in downstream workflows, such as `snippy_tree_wf`, providing an overview of QC metrics across all samples.

### Outputs

!!! tip "Visualize your outputs in IGV"
Expand Down
40 changes: 37 additions & 3 deletions tasks/gene_typing/variant_detection/task_snippy_variants.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -89,19 +89,52 @@ task snippy_variants {
if [ "$reference_length" -eq 0 ]; then
echo "Could not compute percent reference coverage: reference length is 0" > PERCENT_REF_COVERAGE
else
# compute percent reference coverage
echo $reference_length_passed_depth $reference_length | awk '{ print ($1/$2)*100 }' > PERCENT_REF_COVERAGE
echo $reference_length_passed_depth $reference_length | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_REF_COVERAGE
fi

# Compute percentage of reads aligned
reads_aligned=$(cat READS_ALIGNED_TO_REFERENCE)
total_reads=$(samtools view -c "~{samplename}/~{samplename}.bam")
echo $total_reads > TOTAL_READS
if [ "$total_reads" -eq 0 ]; then
echo "Could not compute percent reads aligned: total reads is 0" > PERCENT_READS_ALIGNED
else
echo $reads_aligned $total_reads | awk '{ print ($1/$2)*100 }' > PERCENT_READS_ALIGNED
echo $reads_aligned $total_reads | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_READS_ALIGNED
fi

# Create QC metrics file
line_count=$(wc -l < "~{samplename}/~{samplename}_coverage.tsv")
# Check the number of lines in the coverage file, to consider scenarios e.g. for V. cholerae that has two chromosomes and therefore coverage metrics per chromosome
if [ "$line_count" -eq 2 ]; then
head -n 1 "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_HEADER
sed -n '2p' "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_VALUES
elif [ "$line_count" -gt 2 ]; then
# Multiple chromosomes (header + multiple data lines)
header=$(head -n 1 "~{samplename}/~{samplename}_coverage.tsv")
output_header=""
output_values=""
# while loop to iterate over each line in the coverage file
while read -r line; do
if [ -z "$output_header" ]; then
output_header="$header"
output_values="$line"
else
output_header="$output_header\t$header"
output_values="$output_values\t$line"
fi
done < <(tail -n +2 "~{samplename}/~{samplename}_coverage.tsv")
echo "$output_header" | tr ' ' '\t' > COVERAGE_HEADER
echo "$output_values" | tr ' ' '\t' > COVERAGE_VALUES
else
# Coverage file has insufficient data
echo "Coverage file has insufficient data." > COVERAGE_HEADER
echo "" > COVERAGE_VALUES
fi

# Build the QC metrics file
echo -e "samplename\treads_aligned_to_reference\ttotal_reads\tpercent_reads_aligned\tvariants_total\tpercent_ref_coverage\t$(cat COVERAGE_HEADER)" > "~{samplename}/~{samplename}_qc_metrics.tsv"
echo -e "~{samplename}\t$reads_aligned\t$total_reads\t$(cat PERCENT_READS_ALIGNED)\t$(cat VARIANTS_TOTAL)\t$(cat PERCENT_REF_COVERAGE)\t$(cat COVERAGE_VALUES)" >> "~{samplename}/~{samplename}_qc_metrics.tsv"

>>>
output {
String snippy_variants_version = read_string("VERSION")
Expand All @@ -120,6 +153,7 @@ task snippy_variants {
String snippy_variants_ref_length = read_string("REFERENCE_LENGTH")
String snippy_variants_ref_length_passed_depth = read_string("REFERENCE_LENGTH_PASSED_DEPTH")
String snippy_variants_percent_ref_coverage = read_string("PERCENT_REF_COVERAGE")
File snippy_variants_qc_metrics = "~{samplename}/~{samplename}_qc_metrics.tsv"
String snippy_variants_percent_reads_aligned = read_string("PERCENT_READS_ALIGNED")
}
runtime {
Expand Down
4 changes: 3 additions & 1 deletion workflows/phylogenetics/wf_snippy_streamline.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,8 @@ workflow snippy_streamline {
tree_name = tree_name_updated,
snippy_variants_outdir_tarball = snippy_variants_wf.snippy_variants_outdir_tarball,
samplenames = samplenames,
reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta])
reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta]),
snippy_variants_qc_metrics = snippy_variants_wf.snippy_variants_qc_metrics
}
call versioning.version_capture {
input:
Expand Down Expand Up @@ -111,5 +112,6 @@ workflow snippy_streamline {
File? snippy_filtered_metadata = snippy_tree_wf.snippy_filtered_metadata
File? snippy_concatenated_variants = snippy_tree_wf.snippy_concatenated_variants
File? snippy_shared_variants_table = snippy_tree_wf.snippy_shared_variants_table
File? snippy_combined_qc_metrics = snippy_tree_wf.snippy_combined_qc_metrics
}
}
4 changes: 3 additions & 1 deletion workflows/phylogenetics/wf_snippy_streamline_fasta.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,8 @@ workflow snippy_streamline_fasta {
tree_name = tree_name_updated,
snippy_variants_outdir_tarball = snippy_variants_wf.snippy_variants_outdir_tarball,
samplenames = samplenames,
reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta])
reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta]),
snippy_variants_qc_metrics = snippy_variants_wf.snippy_variants_qc_metrics
}
call versioning.version_capture {
input:
Expand Down Expand Up @@ -106,5 +107,6 @@ workflow snippy_streamline_fasta {
File? snippy_filtered_metadata = snippy_tree_wf.snippy_filtered_metadata
File? snippy_concatenated_variants = snippy_tree_wf.snippy_concatenated_variants
File? snippy_shared_variants_table = snippy_tree_wf.snippy_shared_variants_table
File? snippy_combined_qc_metrics = snippy_tree_wf.snippy_combined_qc_metrics
}
}
12 changes: 12 additions & 0 deletions workflows/phylogenetics/wf_snippy_tree.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ workflow snippy_tree_wf {
Boolean use_gubbins = true
Boolean core_genome = true
Boolean call_shared_variants = true
Array[File]? snippy_variants_qc_metrics

String? data_summary_terra_project
String? data_summary_terra_workspace
Expand Down Expand Up @@ -186,6 +187,14 @@ workflow snippy_tree_wf {
concatenated_file_name = tree_name_updated
}
}
if (defined(snippy_variants_qc_metrics)) {
call file_handling.cat_files as concatenate_qc_metrics {
input:
files_to_cat = select_first([snippy_variants_qc_metrics]),
concatenated_file_name = tree_name_updated + "_combined_qc_metrics.tsv",
skip_extra_headers = true
}
}
call versioning.version_capture {
input:
}
Expand Down Expand Up @@ -233,5 +242,8 @@ workflow snippy_tree_wf {
# shared snps outputs
File? snippy_concatenated_variants = concatenate_variants.concatenated_variants
File? snippy_shared_variants_table = shared_variants.shared_variants_table

# combined qc metrics
File? snippy_combined_qc_metrics = concatenate_qc_metrics.concatenated_files
}
}
1 change: 1 addition & 0 deletions workflows/standalone_modules/wf_snippy_variants.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ workflow snippy_variants_wf {
File snippy_variants_coverage_tsv = snippy_variants.snippy_variants_coverage_tsv
Int snippy_variants_num_variants = snippy_variants.snippy_variants_num_variants
Float snippy_variants_percent_ref_coverage = snippy_variants.snippy_variants_percent_ref_coverage
File snippy_variants_qc_metrics = snippy_variants.snippy_variants_qc_metrics
Float snippy_variants_percent_reads_aligned = snippy_variants.snippy_variants_percent_reads_aligned
# snippy gene query outputs
String? snippy_variants_query = snippy_gene_query.snippy_variants_query
Expand Down