This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add-lancet-vardict-ngscheck-links-clin-harm #44

Merged 12 commits on Sep 23, 2019
34 changes: 30 additions & 4 deletions content/03.methods.md
@@ -64,14 +64,26 @@ Alignments were futher processed using following the Broad Institute's Best Prac
Duplicates were marked using Samblaster [@doi:10/f6kft3] v0.1.24, and BAMs were merged and sorted using Sambamba [@doi:10/gfzsfw] v0.6.3.
Lastly, the resultant BAMs were processed using the BaseRecalibrator submodule of the Broad Institute's Genome Analysis Toolkit (GATK) [@url:https://software.broadinstitute.org/gatk/] v4.0.3.0.

### Germ Line Single Nucleotide Variant Calling
### Quality Control of Sequencing Data
NGSCheckMate [@doi:10.1093/nar/gkx193] was run on matched tumor/normal CRAMs to confirm sample matches and remove mismatched samples from the dataset.
CRAM inputs were preprocessed using bcftools to filter and call SNPs with default parameters [@url:https://github.com/parklab/NGSCheckMate], and the resulting VCFs were used to run NGSCheckMate using [this workflow](https://github.com/d3b-center/ngs_checkmate_wf) in the D3b GitHub repository.
Per the authors' guidelines, sample pairs with a correlation coefficient <= 0.61 at sequencing depths > 10 were predicted to be mismatched.
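
The cutoff rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NGSCheckMate implementation: the function name `is_predicted_mismatch` and its arguments are hypothetical, and returning `None` at low depth is a choice made for this sketch. Only the 0.61 cutoff and the depth > 10 condition come from the text.

```python
CORRELATION_CUTOFF = 0.61  # cutoff stated in the text
MIN_DEPTH = 10             # the cutoff is applied at depths greater than this


def is_predicted_mismatch(correlation, depth):
    """Classify a tumor/normal pair (hypothetical helper).

    Returns True if the pair is predicted to be mismatched, False if it is
    predicted to match, and None when the depth is too low for this cutoff
    to apply (an assumption of this sketch).
    """
    if depth <= MIN_DEPTH:
        return None
    return correlation <= CORRELATION_CUTOFF
```

For example, `is_predicted_mismatch(0.45, 30)` flags the pair as mismatched, while `is_predicted_mismatch(0.92, 30)` does not.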

### Somatic Single Nucleotide Variant Calling
#### SNV and INDEL calling

We used Strelka2 [@doi:10/gdwrp4] v2.9.3 and Mutect2 from GATK v4.1.1.0.
Strelka2 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the author.
Mutect2 was run following Broad best practices outlined from their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl].
We used four variant callers to call SNVs and INDELs from panel, WXS, and WGS data: Strelka2, Mutect2, Lancet, and VarDict.
The same input interval BED files were used for both panel and WXS data.
Strelka2 [@doi:10/gdwrp4] v2.9.3 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the authors.
The final Strelka2 VCF was filtered for PASS variants.
Mutect2 from GATK v4.1.1.0 was run following the Broad Institute's best practices as outlined in their Workflow Description Language (WDL) workflow [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl].
The final Mutect2 VCF was filtered for PASS variants.
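
The PASS filtering applied to the Strelka2 and Mutect2 VCFs can be illustrated as follows. This is a minimal sketch that assumes a plain-text, tab-delimited VCF already read into lines. The function name `filter_pass` is hypothetical, and the actual workflow may use a dedicated tool for this step.

```python
def filter_pass(vcf_lines):
    """Yield header lines unchanged and only records whose FILTER is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines are kept as-is.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # In the VCF format, the seventh column (index 6) is FILTER.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line
```

Applied to an open file handle, `list(filter_pass(handle))` would return the header plus the PASS records only.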
We ran Lancet [@doi:10.1038/s42003-018-0023-9] v1.0.7 [@url:https://github.com/nygenome/lancet].
For input intervals, a reference BED was created using only the UTR, exome, and start/stop codon features of the GENCODE 31 reference.
Per recommendations by the New York Genome Center, the calling input intervals were augmented with PASS variant calls from Strelka2 and Mutect2 as validation.
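
One way to picture the interval augmentation is the sketch below, which converts variant positions to BED coordinates and adds them to an existing interval list. It is an illustration under assumptions, not the workflow's code: the helper names, the `padding` parameter, and the choice not to merge overlapping intervals are all hypothetical. The only conventions relied on are that VCF positions are 1-based and BED intervals are 0-based and half-open.

```python
def variant_to_bed(chrom, pos, ref, padding=0):
    """Convert a 1-based VCF position to a 0-based, half-open BED interval.

    `padding` (hypothetical) widens the interval on both sides.
    """
    start = max(0, pos - 1 - padding)
    end = pos - 1 + len(ref) + padding
    return (chrom, start, end)


def augment_intervals(bed_intervals, pass_variants, padding=0):
    """Add intervals for PASS variants to a list of BED intervals.

    `bed_intervals` holds (chrom, start, end) tuples and `pass_variants`
    holds (chrom, pos, ref) tuples. Duplicates are dropped and the result
    is sorted lexicographically by chromosome name, then by coordinates.
    Overlapping intervals are not merged in this sketch.
    """
    extra = [variant_to_bed(c, p, r, padding) for c, p, r in pass_variants]
    return sorted(set(bed_intervals) | set(extra))
```

A SNV at `chr1:500` therefore becomes the BED interval `("chr1", 499, 500)`.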
VarDictJava [@doi:10.1093/nar/gkw227] v1.58 [@url:https://github.com/AstraZeneca-NGS/VarDictJava] was run using the hg38 FASTA reference with the same BED intervals used for Mutect2.
Parameters and filtering followed BCBIO standards except that variants with a variant allele frequency (VAF) >= 0.05 (instead of >= 0.10) were retained.
The final VCF was filtered for PASS variants with TYPE=StronglySomatic.
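
The retention criteria above can be combined into a single predicate, sketched below. This is illustrative only: the function names are hypothetical, VAF is assumed here to be the fraction of reads supporting the alternate allele, and the `TYPE` key is taken verbatim from the text rather than from the VarDictJava documentation. In the actual workflow the VAF threshold is applied during calling and the PASS filter afterwards, whereas this sketch checks all three conditions together.

```python
MIN_VAF = 0.05  # threshold stated in the text (bcbio default stated as 0.10)


def vaf(alt_reads, depth):
    """Variant allele frequency, assumed to be alt-supporting reads / depth."""
    return alt_reads / depth if depth > 0 else 0.0


def keep_record(filter_field, info_field, alt_reads, depth, min_vaf=MIN_VAF):
    """Return True if a record meets all three criteria (hypothetical helper)."""
    # Parse a semicolon-delimited INFO string into a dict; flags map to True.
    info = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in info_field.split(";")
    )
    return (
        filter_field == "PASS"
        and info.get("TYPE") == "StronglySomatic"
        and vaf(alt_reads, depth) >= min_vaf
    )
```

With these defaults, a PASS record with 5 supporting reads out of 100 is retained, whereas it would be dropped with `min_vaf=0.10`.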

#### VCF annotation and MAF creation

@@ -88,6 +100,7 @@ We used Manta SV [@doi:10/gf3ggb] v1.4.0 for structural variant (SV) calls.
Manta SV calling was also limited to regions used in Strelka2.
We also ran LUMPY SV [@doi:10/gf3ggc] v0.2.13 in express mode using default parameters.
The hg38 reference used was also limited to canonical chromosome regions.
The [somatic DNA workflow](https://github.com/kids-first/kf-somatic-workflow) for SNV, INDEL, copy number, and SV calling can be found in the D3b GitHub repository.

### Gene Expression Abundance Estimation
We used STAR [@doi:10/f4h523] v2.6.1d to align paired-end RNA-seq reads.
@@ -104,6 +117,7 @@ For both these tools we used aligned BAM and chimeric SAM files from STAR as inp
We ran STAR-Fusion with default parameters and annotated all fusion calls with GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz provided in the STAR-Fusion release.
For Arriba, we used a blacklist file (blacklist_hg38_GRCh38_2018-11-04.tsv.gz) from the Arriba release tarballs to remove recurrent fusion artifacts and transcripts present in healthy tissue.
We also provided Arriba with strandedness information or set it to auto-detection for polyA samples.
The [RNA expression and fusion workflows](https://github.com/kids-first/kf-rnaseq-workflow) can be found in the D3b GitHub repository.

#### Fusion prioritization

@@ -117,6 +131,18 @@ We annotated putative driver fusions and prioritized fusions lists with kinases,
We also added chimerDB [@doi:10.1093/nar/gkw1083] annotations to both the driver and prioritized fusion lists.

### Clinical Data Harmonization
#### WHO Classification of Disease Types

The `disease_type_old` field in the `pbta-histologies.tsv` file contains the diagnosis denoted from the patient's pathology report.
The `disease_type_new` field in the `pbta-histologies.tsv` file includes updates to `disease_type_old`, including for any diagnosis denoted as "Other", as well as changes made on the basis of a molecular alteration.
The `broad_histology` field denotes the broad 2016 WHO classification [@doi:10.1007/s00401-016-1545-1] for each tumor.
The `short_histology` is an abbreviated version of the `broad_histology`.

#### Molecular Subtyping
Medulloblastoma subtypes SHH, WNT, Group 3, and Group 4 were predicted using an [RNA expression classifier](https://github.com/PichaiRaman/MedulloClassifier) on the RSEM FPKM data.

#### Survival
Overall survival was calculated as days since initial diagnosis.

#### Prediction of participants' genetic sex
