theiagen · sage-wright · Nov 6, 2024 · Aug 22, 2024 · Aug 22, 2024 · Aug 22, 2024
@@ -169,6 +169,32 @@ For all cases:
 
     `Snippy_Variants` aligns reads for each sample against the reference genome. As part of `Snippy_Streamline`, the only output from this workflow is the `snippy_variants_outdir_tarball` which is provided in the set-level data table. Please see the full documentation for [Snippy_Variants](./snippy_variants.md) for more information.
 
+??? task "snippy_variants (qc_metrics output)"
+
+    ##### snippy_variants {#snippy_variants}
+
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
+
 ??? task "Snippy_Tree workflow"
 
     ##### Snippy_Tree {#snippy_tree}
@@ -188,6 +214,7 @@ For all cases:
 | snippy_centroid_version | String | Centroid version used |
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment  |
 | snippy_concatenated_variants | File | The concatenated variants file |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
 | snippy_final_tree | File | Final phylogenetic tree produced by Snippy_Streamline |

@@ -37,6 +37,34 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
 
     **If reference genomes have multiple contigs, they will not be compatible with using Gubbins** to mask recombination in the phylogenetic tree. The automatic selection of a reference genome by the workflow may result in a reference with multiple contigs. In this case, an alternative reference genome should be sought.
 
+### Workflow Tasks
+
+??? task "snippy_variants (qc_metrics output)"
+
+    ##### snippy_variants {#snippy_variants}
+
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
+
 ### Inputs
 
 | **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
@@ -117,6 +145,7 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
 | snippy_centroid_samplename | String | Name of the centroid sample |
 | snippy_centroid_version | String | Centroid version used |
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment  |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_concatenated_variants | File | The concatenated variants file |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |

@@ -306,12 +306,39 @@ Sequencing data used in the Snippy_Tree workflow must:
         | Task | task_shared_variants.wdl |
         | Software Source Code | [task_shared_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/phylogenetic_inference/utilities/task_shared_variants.wdl) |
 
+??? task "snippy_variants (qc_metrics output)"
+
+    ##### snippy_variants {#snippy_variants}
+
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
+
 ### Outputs
 
 | **Variable** | **Type** | **Description** |
 |---|---|---|
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment  |
 | snippy_concatenated_variants | File | Concatenated snippy_results file across all samples in the set |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
 | snippy_final_tree | File | Newick tree produced from the final alignment. Depending on user input for core_genome, the tree could be a core genome tree (default when core_genome is true) or whole genome tree (if core_genome is false) |

@@ -58,6 +58,13 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f
 
 `Snippy_Variants` uses the snippy tool to align reads to the reference and call SNPs, MNPs and INDELs according to optional input parameters. The output includes a file of variants that is then queried using the `grep` bash command to identify any mutations in specified genes or annotations of interest. The query string MUST match the gene name or annotation as specified in the GenBank file and provided in the output variant file in the `snippy_results` column.
 
+Additionally, `Snippy_Variants` extracts quality control (QC) metrics from the Snippy output for each sample. These per-sample QC metrics are saved in TSV files (`snippy_variants_qc_metrics`). The QC metrics include:
+
+- **Percentage of reads aligned to the reference genome** (`snippy_variants_percent_reads_aligned`).
+- **Percentage of the reference genome covered at or above the specified depth threshold** (`snippy_variants_percent_ref_coverage`).
+
+These per-sample QC metrics can be combined into a single file (`snippy_combined_qc_metrics`) in downstream workflows, such as `snippy_tree_wf`, providing an overview of QC metrics across all samples.
+
 ### Outputs
 
 !!! tip "Visualize your outputs in IGV"

@@ -89,19 +89,52 @@ task snippy_variants {
     if [ "$reference_length" -eq 0 ]; then
       echo "Could not compute percent reference coverage: reference length is 0" > PERCENT_REF_COVERAGE
     else
-      # compute percent reference coverage
-      echo $reference_length_passed_depth $reference_length | awk '{ print ($1/$2)*100 }' > PERCENT_REF_COVERAGE
+      echo $reference_length_passed_depth $reference_length | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_REF_COVERAGE
     fi
 
     # Compute percentage of reads aligned
     reads_aligned=$(cat READS_ALIGNED_TO_REFERENCE)
     total_reads=$(samtools view -c "~{samplename}/~{samplename}.bam")
+    echo $total_reads > TOTAL_READS
     if [ "$total_reads" -eq 0 ]; then
       echo "Could not compute percent reads aligned: total reads is 0" > PERCENT_READS_ALIGNED
     else
-      echo $reads_aligned $total_reads | awk '{ print ($1/$2)*100 }' > PERCENT_READS_ALIGNED
+      echo $reads_aligned $total_reads | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_READS_ALIGNED
     fi
 
+    # Create QC metrics file
+    line_count=$(wc -l < "~{samplename}/~{samplename}_coverage.tsv")
+    # Check the number of lines in the coverage file, to consider scenarios e.g. for V. cholerae that has two chromosomes and therefore coverage metrics per chromosome
+    if [ "$line_count" -eq 2 ]; then
+      head -n 1 "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_HEADER
+      sed -n '2p' "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_VALUES
+    elif [ "$line_count" -gt 2 ]; then
+      # Multiple chromosomes (header + multiple data lines)
+      header=$(head -n 1 "~{samplename}/~{samplename}_coverage.tsv")
+      output_header=""
+      output_values=""
+      # while loop to iterate over each line in the coverage file
+      while read -r line; do
+        if [ -z "$output_header" ]; then
+          output_header="$header"
+          output_values="$line"
+        else
+          output_header="$output_header\t$header"
+          output_values="$output_values\t$line"
+        fi
+      done < <(tail -n +2 "~{samplename}/~{samplename}_coverage.tsv")
+      echo "$output_header" | tr ' ' '\t' > COVERAGE_HEADER
+      echo "$output_values" | tr ' ' '\t' > COVERAGE_VALUES
+    else
+      # Coverage file has insufficient data
+      echo "Coverage file has insufficient data." > COVERAGE_HEADER
+      echo "" > COVERAGE_VALUES
+    fi
+
+    # Build the QC metrics file
+    echo -e "samplename\treads_aligned_to_reference\ttotal_reads\tpercent_reads_aligned\tvariants_total\tpercent_ref_coverage\t$(cat COVERAGE_HEADER)" > "~{samplename}/~{samplename}_qc_metrics.tsv"
+    echo -e "~{samplename}\t$reads_aligned\t$total_reads\t$(cat PERCENT_READS_ALIGNED)\t$(cat VARIANTS_TOTAL)\t$(cat PERCENT_REF_COVERAGE)\t$(cat COVERAGE_VALUES)" >> "~{samplename}/~{samplename}_qc_metrics.tsv"
+
   >>>
   output {
     String snippy_variants_version = read_string("VERSION")
@@ -120,6 +153,7 @@ task snippy_variants {
     String snippy_variants_ref_length = read_string("REFERENCE_LENGTH")
     String snippy_variants_ref_length_passed_depth = read_string("REFERENCE_LENGTH_PASSED_DEPTH")
     String snippy_variants_percent_ref_coverage = read_string("PERCENT_REF_COVERAGE")
+    File snippy_variants_qc_metrics = "~{samplename}/~{samplename}_qc_metrics.tsv"
     String snippy_variants_percent_reads_aligned = read_string("PERCENT_READS_ALIGNED")
   }
   runtime {

@@ -50,7 +50,8 @@ workflow snippy_streamline {
       tree_name = tree_name_updated,
       snippy_variants_outdir_tarball = snippy_variants_wf.snippy_variants_outdir_tarball,
       samplenames = samplenames,
-      reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta])
+      reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta]),
+      snippy_variants_qc_metrics = snippy_variants_wf.snippy_variants_qc_metrics
   }
   call versioning.version_capture {
     input:
@@ -111,5 +112,6 @@ workflow snippy_streamline {
     File? snippy_filtered_metadata = snippy_tree_wf.snippy_filtered_metadata
     File? snippy_concatenated_variants = snippy_tree_wf.snippy_concatenated_variants
     File? snippy_shared_variants_table = snippy_tree_wf.snippy_shared_variants_table
+    File? snippy_combined_qc_metrics = snippy_tree_wf.snippy_combined_qc_metrics
   }
 }
@@ -48,7 +48,8 @@ workflow snippy_streamline_fasta {
       tree_name = tree_name_updated,
       snippy_variants_outdir_tarball = snippy_variants_wf.snippy_variants_outdir_tarball,
       samplenames = samplenames,
-      reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta])
+      reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta]),
+      snippy_variants_qc_metrics = snippy_variants_wf.snippy_variants_qc_metrics
   }
   call versioning.version_capture {
     input:
@@ -106,5 +107,6 @@ workflow snippy_streamline_fasta {
     File? snippy_filtered_metadata = snippy_tree_wf.snippy_filtered_metadata
     File? snippy_concatenated_variants = snippy_tree_wf.snippy_concatenated_variants
     File? snippy_shared_variants_table = snippy_tree_wf.snippy_shared_variants_table
+    File? snippy_combined_qc_metrics = snippy_tree_wf.snippy_combined_qc_metrics
   }
 }
@@ -23,6 +23,7 @@ workflow snippy_tree_wf {
     Boolean use_gubbins = true
     Boolean core_genome = true
     Boolean call_shared_variants = true
+    Array[File]? snippy_variants_qc_metrics
 
     String? data_summary_terra_project
     String? data_summary_terra_workspace
@@ -186,6 +187,14 @@ workflow snippy_tree_wf {
         concatenated_file_name = tree_name_updated
     }
   }
+  if (defined(snippy_variants_qc_metrics)) {
+    call file_handling.cat_files as concatenate_qc_metrics {
+      input:
+        files_to_cat = select_first([snippy_variants_qc_metrics]),
+        concatenated_file_name = tree_name_updated + "_combined_qc_metrics.tsv",
+        skip_extra_headers = true
+    }
+  }
   call versioning.version_capture {
     input:
   }
@@ -233,5 +242,8 @@ workflow snippy_tree_wf {
     # shared snps outputs
     File? snippy_concatenated_variants = concatenate_variants.concatenated_variants
     File? snippy_shared_variants_table = shared_variants.shared_variants_table
+
+    # combined qc metrics
+    File? snippy_combined_qc_metrics = concatenate_qc_metrics.concatenated_files
   }
 }
@@ -70,6 +70,7 @@ workflow snippy_variants_wf {
     File snippy_variants_coverage_tsv = snippy_variants.snippy_variants_coverage_tsv
     Int snippy_variants_num_variants = snippy_variants.snippy_variants_num_variants
     Float snippy_variants_percent_ref_coverage = snippy_variants.snippy_variants_percent_ref_coverage
+    File snippy_variants_qc_metrics = snippy_variants.snippy_variants_qc_metrics
     Float snippy_variants_percent_reads_aligned  = snippy_variants.snippy_variants_percent_reads_aligned
     # snippy gene query outputs
     String? snippy_variants_query = snippy_gene_query.snippy_variants_query