Skip to content

Commit

Permalink
Update to v0.12.9
Browse files Browse the repository at this point in the history
  • Loading branch information
ctsa committed Jan 10, 2025
1 parent 6a86094 commit 4b0a447
Show file tree
Hide file tree
Showing 17 changed files with 328 additions and 118 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
# Change Log

## v0.12.9 - 2025-01-10

### Added

- CR-418 Provide BAM output of assembled SV contig alignments in the joint-call step
- BAM output reflects all assembled SV haplotypes used during genotyping, which can be useful for reviewing SV calls

## v0.12.8 - 2024-12-13

### Fixed
Expand Down
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "sawfish"
version = "0.12.8"
version = "0.12.9"
authors = ["Chris Saunders <csaunders@pacificbiosciences.com>"]
description = "Structural variant analysis for mapped PacBio HiFi reads"
edition = "2021"
Expand Down
2 changes: 1 addition & 1 deletion LICENSE-THIRDPARTY.json
Original file line number Diff line number Diff line change
Expand Up @@ -1702,7 +1702,7 @@
},
{
"name": "sawfish",
"version": "0.12.8",
"version": "0.12.9",
"authors": "Chris Saunders <csaunders@pacificbiosciences.com>",
"repository": null,
"license": null,
Expand Down
54 changes: 45 additions & 9 deletions docs/user_guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,7 +102,8 @@ Just as in the single-sample case, note that the reference fasta and all 3 sampl

### Joint call step

The primary user output of the sawfish SV caller is the SV VCF produced by the joint-call step. This file lists all SVs in VCF 4.2 format. Details of the SV representation in this file are provided below.
The primary output of the joint-calling step is the genotyped SV calls for all samples in VCF 4.2 format, written to `${OUTPUT_DIR}/genotyped.sv.vcf.gz`.
Details of the SV representation in this file are provided below.

#### Quality scores

Expand All @@ -121,7 +122,7 @@ The following filters may be applied to each VCF record:
- `MaxScoringDepth` - Read depth at the SV locus exceeds 1000x, so all scoring and genotyping steps were disabled.
- `InvBreakpoint` - This breakpoint is represented as part of a separate VCF inversion record (the inversion record shares the same EVENT ID)

#### SV Types
#### SV types

Notes on formatting and representation of SVs are listed below for each major type.

Expand Down Expand Up @@ -149,7 +150,7 @@ When an inversion is found, a VCF record will be output using the `<INV>` symbol

#### Overlapping SV formatting

All sawfish SVs are output so that only one allele is described in each VCF record, even if an overlapping SV allele is output at the same locus. The internal SV calling model accounts for up to 2 overlapping alleles per sample during genotyping and quality scoring, reads supporting a 2nd alternate allele at any given locus will be counted as support the reference in output fields such as allele depth (`AD`). This protocol matches standard SV caller formatting conventions. Users interested in a more detailed output format, such as representing overlapping read support on the VCF `<*>` allele can request this for prioritization.
All sawfish SVs are output so that only one allele is described in each VCF record, even if an overlapping SV allele is output at the same locus. The internal SV calling model accounts for up to 2 overlapping alleles per sample during genotyping and quality scoring. Reads which support a 2nd alternate allele at any given locus will be counted as supporting the reference in output fields such as allele depth (`AD`). This protocol matches standard SV caller formatting conventions. Users interested in a more detailed output format, such as representing overlapping read support on the VCF `<*>` allele can request this for prioritization.

#### Phasing

Expand Down Expand Up @@ -185,18 +186,53 @@ Note that the number of read QNAME entries should often match the supporting AD

### Discover step

The discover step produces a number of output files in the discover output directory used by sawfish during the subsequent joint calling step. Although these are not intended for direct use, some of the important files are described here:
The discover step produces a number of output files in the discover output directory used by sawfish during the subsequent joint calling step. Although these are not intended for direct use, some of the important files are summarized here:

- `assembly.regions.bed` - Describes each region of the genome targeted for assembly.
- `contig.alignment.bam` - This is a BAM file containing the SV locus contigs aligned back to the genome to create candidate SVs for each sample.
- `candidate.sv.bcf` - These are the candidate SVs expressed in a simplified format for each sample. These are used as input for joint genotyping together with the aligned candidate contigs.
- `discover.settings.json` - Various parameters from the discover step (either user input or default) are recorded in this file. Some of the paths to files like the sample bam and reference fasta will be reused in the joint call step.

### Debug outputs
### Debug outputs from either step

In either run step, the following files are produced to help debug problematic runs:
1. `${OUTPUT_DIR}/sawfish.log` - High level logging output
2. `${OUTPUT_DIR}/run_stats.json` - Run statistics and component timings
In either run step, the following files are produced to help debug problematic runs or SV calls:

- `${OUTPUT_DIR}/sawfish.log` - High level logging output
- `${OUTPUT_DIR}/run_stats.json` - Run statistics and component timings
- `${OUTPUT_DIR}/contig.alignment.bam` - Contigs for assembled SV haplotypes aligned back to the reference genome

#### Contig alignment format

SV haplotype contig alignments are output to `${OUTPUT_DIR}/contig.alignment.bam` in either the discover or joint-call steps, and can be useful for reviewing SV calls. For instance,
this file can be viewed in alignment browsers such as IGV.

Aligned contigs are provided for all single breakpoint SV calls. To find the contig for a given SV, locate the SV's VCF ID field, such as `sawfish:0:2803:1:2`,
and take the prefix from this ID that includes the first three digits, in this case `sawfish:0:2803:1` -- this is `QNAME` value of the corresponding SV
haplotype alignment(s) in the contig alignment BAM file.

Contigs are not available for multi-breakpoint events such as inversions. However, contigs are available for each individual breakpoint
comprising the event.

In addition to standard sequence and alignment information, each contig BAM record includes a custom aux field called `sf` which provides a list of key/value
properties associated with the contig, for instance:

sf:Z:n_reads:15;hq_range:1500-2311;

The properties are:

- `n_reads` - The number or reads used to assemble the contig
- `hq_range` - The high-quality assembled region of the contig, prior to appending any flanking read sequence anchors.

An example contig alignment bam record is:

```
sawfish:0:92:0 0 chr1 1649635 20 211=1X108=2I38=1X201=1X86=20I89=1I12=1X281=6D17=1X65=1D22=1X359=1X129=1X154=53D99=1X210=1X140=1X134=1X73=1X55=1X400=1X186=1X133=1X233=1X44=1D74=1635D9= * 0 0 TCCCTAATGAGAAATAAAGTGTCATGCAAAGAAACCTCACTTCAAAAATTTCACATGAAGCCGGGCACGGAGGCTTATGCCTGTAATCCTAGCACTTTGGGAGGCTGAGGCGGGCGGATCACCTGAGGTCAGGAGTTCAAGGCCATCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCTGGGCGTGGTGGCAGACACCTGTAATCCCAGCTACTAAGCGAGGCTGAGGCAGAAGAATTGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCACGCCACTGCACTACAGCCTGGGCAAAAAAAAAAAAAAAAAACCCACGTGAAACTGAAATTAAGGCCGGGCGCGGTGGCTCACGCCTGTAATTCCAGCACTCTGGGAGGCCGAGGTGGGCGGATCACAAGGTCAGATCGGGACCATCCTGGCTAACACGGTGAAACCCCATCTCTACTAAAAATACAAAAAATTAGCTGGGTGTGGTGGCGGGCACCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCGGGAGAATAGCGTGAACCCGGGAGATGGAATTTGCAGTGAGCTGAGATTGCGCCACTGTACTCCAGCCTGGGTGACAAGCAAGACTCCGTCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAGAAATTAAATCAAGAACAGTAAATATTTAAATAAATATTTAAATAATGATGTTAACGTTAAGTAATCCTAATTTTTCTTTTTTTTCTTTTTTTTTTTTTTGAGATGGAGTCTTGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCCTCCCGTGTTCACACCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGTACTACAGGCGCCTGCCACCACGCCCGGCTAATTTTTTGTCTTTTTAGTAGAGACGGGGTTCCACCATGTTAGCCAGGATGGTCTTGATGTTCTGACCTCGTGATCTGCCGGCCTCGGCCTCCCAAAGTGTGGGGATTACAGTTGTGAGCCACCGCGCCCGGCCTTTTTTTTTTTTTTTTTAAAAGAGACAGGGTCTCGCTATATTGGCCAGGCTGGTCTTGAACTCCCGGACTCAATTGATCCTCCAAGTGCTGGGATTACAGGCCTGAGCCACTGCACCCAGCCGAATAATCATGATTTTATGTTAAATAAAAAACTTTGAAAATAAAAAACTATCTGCAGTAAGCGTCTAATTATGAAGAAAGAAGAAAAAAGAAAAACAATTCTGCTATCACAGAAGAGCAGAATTGTAATATTCGTCTTTTAAAACTTTTCCATACTGAATAAACTATAATTATCAGTTTTATAATACAAAAATCACTCTTCACAAAGACTACAGAACAAAGCTTTGCTATCAGTGGGCTTCTCCACTGTGCAATGAAGCCACATTAATTAATCAAGTGTATTTATAATCATGACATTTCAATCGGGCTCCAGGTCCAATTTTCCTAACACCCGTAAGAACCTCTTGATGTTGGTACGAGATCAAACTGCTCAAGCCAAACCCATTCTTTGGACTTGAGCAAATACCCATTTTGGGGTGTGTTTTTCTCCTATACTTGTTGAATTCAGGTCATTTTAAATGTAAACAAACTGCTCCCAAACAATATAATGGGGGAGAGAAAACCCCAGAGGAAAAATGGACTAGCCATTCTGAATGGTCTGTGACACACGCACGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTCGGTCTGTGATACACGCATGCTTTCAGCTGGAGTTTGCTCTCTGTAGCCCCTCTGAATGGTCTGTGACACATGCACGCTTTCAGCTAGAGTACTCTCTCTATAGCCCTTCTGAATGGTCTGTGACACACGCATGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTCGGTCTGGGACACATGCATGCTTTTAGCTAGAGTTTGCTCTGTATAGCCCTTCTGAACGGTCTGTGACACACGCATGCTTTCAGCTGGAGTTTGCTCTCTATAGCCCCTCTGAATGGTCTGTGACACACGCATGCTTTCAGATAGAGTATTCTCTCTATAGCCCTTCTGAATGGTCTGTAACACACGCAAGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTTGGTCTGTGACACACGCATGCTTTTAGCTAGAGTTTGCTCTGTATAGCCCTTCTGAATGGTCTGTGACACATGCATGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTCAGTCTGTGACACACACATGCTTTTAGCTAGAGTTTGCTCTGTATAGCCCTTCTGAATGGTCTGTGACACACGCGTGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTTGGTCTGTGACACACGCATGCTTTTAGCTAGTTTGCTCTCATAGCCCTTCTGAACGGTCTGTGACACATGCATGCTTTCAGCTATTCTCTCTATAGCCATTGTGAATGGTCTGTGACACACGCACGCTTTCAGCTAGAGTTTGCTCTTTCTGGTTTTTGGTCTGTGACACACGCATGCTTTCAGCTAGAGTTTGCTCTCTCTGGTTTTCGGTCTGTGACGCACGCATGCTTTTAGCTAGAGTATTCTCTCTATAGCCATTCTGAACGGTCTGTGACACACGTATGCTTTCAGCTAGAGTTTGCTTTCTCTGGTTTTTCAGTGGTGCTCTGGGGAAGGCAGAAGAGTAGGAACAGGAAAGAAACCACACTTGAACATGATGTCAAAGAAAGTAAATGCTTCTGTACCCCCTTCTGCTGAATGGCTACGATGCCTACGTTTCTCTTTTCTCTTTTCATCTTTTCTGTGATGAGCTTTTTCTTTCCGAGACATTTGCTGGGGTGGTTTGATGGCCAAAGAATCATCTTCTTCTCCTCTGAAATAAAACACAACAGCACTGCGTCATGCTTGAGAAAGTGCAAAGAAGCATCAGGCTATTATAAGGTTTCTTCAACCCAGAAAAATGCATGATTCAGACAGGAACAAAGCTGAAACATCATTTAAAAAATTACATTAATTCTCCAACTTTAGGCATCTTTTTTTTCTTTTTTTCTTTTTTTTAGACAGTCTCGCTCTGTTGCCCGGGCTGTAGTGGCACGATCTCGGCTCACTGCAATCTCCACCCTCCGGGTTCATGCCATTCTCTTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCGCCACGCTGGCTAATTTTTGTATTTTTAGTAGAGATGGGGTTTTACCATGTTAGCCAGGATGGTCTTGGTCTCCTGACCTCATGATCCGCCCACCTCGGTCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACTGCGCCCGGCCTGTATTTATTTTTTTGAGACGGAGTCTCGCTCTGTTGCCCAGGCTGGAATGCAGTGGTACGATCTCGGCTCATTGCAACCTCCCCTTCCAGTCCCAGGTTCAAGCAATTCTCCTGCCTCTGCCTCAGGAGTAGCTGGGATTACAGGCATGCGCCACCACACCCGGCTAATTTTTTATTTTTAGTAGAGACGGGGTTTCACCATATTGGTCAGGCTGGTCTCAAACTTGTGACATCATGATCCACCCACCTCG * sf:Z:n_reads:12;hq_range:1500-2103;
```

In this case, the VCF record for the corresponding SV call generated from the contig is:

```
chr1 1651421 sawfish:0:92:0:0 GTAGCCCCTCTGAACGGTCTGTGACACACGCATGCTTTCAGCTAGAGTACTCTA G 504 PASS SVTYPE=DEL;END=1651474;SVLEN=-53;HOMLEN=13;HOMSEQ=TAGCCCCTCTGAA GT:GQ:PL:AD 0/1:387:537,0,387:9,12
```

## Usage Details

Expand Down
8 changes: 8 additions & 0 deletions src/cli/joint_call.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,14 @@ pub struct JointCallSettings {
///
#[arg(long)]
pub report_supporting_reads: bool,

/// Disable filtering out unused contigs and trimming alignment artifacts in
/// the aligned contig output.
///
/// This is only intended for internal debugging cases
///
#[arg(hide = true, long)]
pub no_contig_cleanup: bool,
}

/// Validate settings and update to parameters that can't be processed automatically by clap.
Expand Down
72 changes: 43 additions & 29 deletions src/contig_output.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ impl std::fmt::Debug for ContigAlignmentSegment {

/// Assembly contig alignment information
///
#[derive(Clone)]
pub struct ContigAlignmentInfo {
/// Contig source sample
pub sample_index: usize,
Expand Down Expand Up @@ -100,15 +101,15 @@ fn get_sa_tag_segment(
)
}

pub fn convert_contig_alignment_to_bam_record(
fn convert_contig_alignment_to_bam_record(
chrom_list: &ChromList,
upper_pkg_name: &str,
pkg_name: &str,
contig_alignment: ContigAlignmentInfo,
) -> bam::Record {
const CONTIG_MAPQ: u8 = 20;
let qname = format!(
"{}_contig:{}:{}:{}",
upper_pkg_name,
"{}:{}:{}:{}",
pkg_name,
contig_alignment.sample_index,
contig_alignment.cluster_index,
contig_alignment.assembly_index
Expand All @@ -122,6 +123,7 @@ pub fn convert_contig_alignment_to_bam_record(
contig_alignment.seq.as_slice(),
&quals,
);
record.unset_unmapped();
record.set_tid(this_segment.tid);
record.set_pos(this_segment.pos);
record.set_mapq(CONTIG_MAPQ);
Expand Down Expand Up @@ -170,48 +172,66 @@ pub fn convert_contig_alignment_to_bam_record(
record
}

/// Create bam header for the sawfish contig alignments
///
/// This is a simple header containing just the contig info, and the sawfish commandline as a "PG" entry
///
fn get_contig_alignment_header(chrom_list: &ChromList, pkg_name: &str) -> bam::header::Header {
let mut new_header = bam::header::Header::new();

let mut hd_record = bam::header::HeaderRecord::new(b"HD");
hd_record.push_tag(b"VN", "1.6");
hd_record.push_tag(b"SO", "coordinate");
new_header.push_record(&hd_record);

for chrom_info in chrom_list.data.iter() {
let mut sq_record = bam::header::HeaderRecord::new(b"SQ");
sq_record.push_tag(b"SN", &chrom_info.label);
sq_record.push_tag(b"LN", chrom_info.length);
new_header.push_record(&sq_record);
}

let cmdline = std::env::args().collect::<Vec<_>>().join(" ");
let mut pg_record = bam::header::HeaderRecord::new(b"PG");
pg_record.push_tag(b"PN", pkg_name);
pg_record.push_tag(b"ID", format!("{pkg_name}-{PROGRAM_VERSION}"));
pg_record.push_tag(b"VN", PROGRAM_VERSION);
pg_record.push_tag(b"CL", &cmdline);

new_header.push_record(&pg_record);
new_header
}

/// Completes writing (and closing) the contig bam file itself
///
fn write_contig_alignments_bam(
filename: &Path,
thread_count: usize,
bam_header: bam::HeaderView,
chrom_list: &ChromList,
contig_alignments: Vec<ContigAlignmentInfo>,
) {
let pkg_name = env!("CARGO_PKG_NAME");

// Setup new bam writer for debug output:
let mut bam_writer = {
let mut output_bam_header: bam::header::Header =
bam::header::Header::from_template(&bam_header);
let cmdline = std::env::args().collect::<Vec<_>>().join(" ");

let mut pg_record = bam::header::HeaderRecord::new(b"PG");
pg_record.push_tag(b"PN", pkg_name);
pg_record.push_tag(b"ID", format!("{pkg_name}-{PROGRAM_VERSION}"));
pg_record.push_tag(b"VN", PROGRAM_VERSION);
pg_record.push_tag(b"CL", &cmdline);
output_bam_header.push_record(&pg_record);

let output_bam_header = get_contig_alignment_header(chrom_list, pkg_name);
bam::Writer::from_path(filename, &output_bam_header, bam::Format::Bam).unwrap()
};
bam_writer.set_threads(thread_count).unwrap();

let upper_pkg_name = pkg_name.to_uppercase();

for contig_alignment in contig_alignments.into_iter() {
let record =
convert_contig_alignment_to_bam_record(chrom_list, &upper_pkg_name, contig_alignment);
let record = convert_contig_alignment_to_bam_record(chrom_list, pkg_name, contig_alignment);
bam_writer.write(&record).unwrap();
}
}

/// The bam_header should not already have a sawfish PG entry, to avoid conflicting PG entires which cause problems in IGV
///
pub fn write_contig_alignments(
output_dir: &Path,
thread_count: usize,
bam_header: bam::HeaderView,
chrom_list: &ChromList,
contig_alignment_type_label: &str,
mut contig_alignments: Vec<ContigAlignmentInfo>,
) {
// Sort contig alignments
Expand All @@ -224,17 +244,11 @@ pub fn write_contig_alignments(
let filename = output_dir.join(discover::CONTIG_ALIGNMENT_FILENAME);

info!(
"Writing contig alignments to bam file: '{}'",
"Writing {contig_alignment_type_label} contig alignments to bam file: '{}'",
filename.display()
);

write_contig_alignments_bam(
&filename,
thread_count,
bam_header,
chrom_list,
contig_alignments,
);
write_contig_alignments_bam(&filename, thread_count, chrom_list, contig_alignments);

// Index bam file
//
Expand Down
22 changes: 20 additions & 2 deletions src/joint_call/get_refined_svs.rs
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,11 @@ fn get_high_quality_contig_range(record: &bam::Record) -> Option<IntRange> {
x
}

/// Parse BAM record from sawfish contig alignment file into the sample_assembles structure to use in haplotype merging and joint GT steps
/// Parse BAM record from sawfish contig alignment file into the sample_assembles structure to use in haplotype merging
/// and joint GT steps
///
/// Results are accumulated in the sample_assemblies structure. The primary index for this struct is cluster_index, so
/// that it can easily be looked up from any SV cluster_index.
///
fn update_sample_assemblies_from_bam_record(
chrom_list: &ChromList,
Expand All @@ -94,7 +98,9 @@ fn update_sample_assemblies_from_bam_record(
let words = qname.split(':').collect::<Vec<_>>();
assert!(words.len() >= 4);

// Ignore sample index
// Ignore sample index. The joint call routine will add its own sample indexes based on the input order
// of the discover directories
//
//let sample_index = words[1].parse::<usize>().unwrap();

let cluster_index = words[2].parse::<usize>().unwrap();
Expand Down Expand Up @@ -126,6 +132,7 @@ fn update_sample_assemblies_from_bam_record(
chrom_index: record.tid() as usize,
contig_alignment,
high_quality_contig_range: high_quality_contig_range.clone(),
is_fwd_strand: true, // The contig output format requires fwd-strand for the primary alignment
}
};

Expand Down Expand Up @@ -155,6 +162,7 @@ fn update_sample_assemblies_from_bam_record(
chrom_index: segment.chrom_index,
contig_alignment,
high_quality_contig_range,
is_fwd_strand: segment.is_fwd_strand,
};
cluster_assembly
.contig_alignments
Expand Down Expand Up @@ -269,6 +277,11 @@ fn parse_bnd_alt_allele(chrom_list: &ChromList, alt_allele: &[u8], homlen: i64)
/// RefinedSV plus auxiliary data which will be added to the SVGroup
struct AnnotatedRefinedSV {
refined_sv: RefinedSV,

/// Assembly index or indexes used to associate the contig back to the SV call
///
/// This provides a way to clarify which contigs are associated with the variant in overlapping contig regions.
///
contig_map: Vec<usize>,
}

Expand Down Expand Up @@ -491,6 +504,10 @@ fn process_bcf_record_to_anno_refine_sv(
}
}

/// Attempt to convert the full set of AnnotatedRefinedSVs from the same cluster_index into a single-sample SVGroup
///
/// The group_arsvs will always be cleared by this method.
///
fn process_rsv_cluster_to_sv_group(
contigs: &SampleAssemblies,
expected_copy_number_regions: Option<&GenomeRegionsByChromIndex>,
Expand Down Expand Up @@ -649,6 +666,7 @@ fn process_rsv_cluster_to_sv_group(
}
}

/// Take a set of annotated refined SVs from the same cluster, and try to convert these into an SVGroup
fn process_rsv_cluster(
contigs: &SampleAssemblies,
expected_copy_number_regions: Option<&GenomeRegionsByChromIndex>,
Expand Down
Loading

0 comments on commit 4b0a447

Please sign in to comment.