AlexsLemonade · jharenza · Sep 23, 2019
diff --git a/content/03.methods.md b/content/03.methods.md
@@ -64,14 +64,26 @@ Alignments were futher processed using following the Broad Institute's Best Prac
 Duplicates were marked using Samblaster[@doi:10/f6kft3] v0.1.24, BAMs merged and sorted using Sambamba [@doi:10/gfzsfw] v0.6.3.
 Lastly, resultant BAMs were processing using Broad's Genome Analysis Tool Kit (GATK) [@url:https://software.broadinstitute.org/gatk/] v4.0.3.0, BaseRecalibrator submodule.
 
-### Germ Line Single Nucleotide Variant Calling
+### Quality Control of Sequencing Data
+NGSCheckMate [@doi:10.1093/nar/gkx193] was run on matched tumor/normal CRAMs to confirm sample matches and remove mismatched samples from the dataset.
+CRAM inputs were preprocessed with bcftools to filter and call SNPs using default parameters [@url:https://github.com/parklab/NGSCheckMate], and the resulting VCFs were used to run NGSCheckMate using [this workflow](https://github.com/d3b-center/ngs_checkmate_wf) in the D3b GitHub repository.
+Per the authors' guidelines, sample pairs with a correlation coefficient <= 0.61 at sequencing depths > 10 were predicted to be mismatched.
 
 ### Somatic Single Nucleotide Variant Calling
 #### SNV and INDEL calling
 
-We used Strelka2 [@doi:10/gdwrp4] v2.9.3 and Mutect2 from GATK v4.1.1.0.
-Strelka2 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the author.
-Mutect2 was run following Broad best practices outlined from their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl].  
+We used four variant callers to call SNVs and INDELs from panel, WXS, and WGS data: Strelka2, Mutect2, Lancet, and VarDict.
+The same input interval BED files were used for both panel and WXS data.
+Strelka2 [@doi:10/gdwrp4] v2.9.3 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the authors.
+The final Strelka2 VCF was filtered for PASS variants.
+Mutect2 from GATK v4.1.1.0 was run following Broad best practices outlined in their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl].
+The final Mutect2 VCF was filtered for PASS variants. 
+Lancet [@doi:10.1038/s42003-018-0023-9] v1.0.7 [@url:https://github.com/nygenome/lancet] was also run.
+For input intervals, a reference BED was created using only the UTR, exome, and start/stop codon features of the GENCODE 31 reference.
+Per recommendations by the New York Genome Center, the calling input intervals were augmented with PASS variant calls from Strelka2 and Mutect2 as validation.
+VarDictJava [@doi:10.1093/nar/gkw227] v1.58 [@url:https://github.com/AstraZeneca-NGS/VarDictJava] was run using the hg38 FASTA reference with the same BED intervals used for Mutect2.
+Parameters and filtering followed BCBIO standards except that variants with a variant allele frequency (VAF) >= 0.05 (instead of >= 0.10) were retained.  
+The final VCF was filtered for PASS variants with TYPE=StronglySomatic.
 
 #### VCF annotation and MAF creation
 
@@ -88,6 +100,7 @@ We used Manta SV [@doi:10/gf3ggb] v1.4.0 for structural variant (SV) calls.
 Manta SV calling was also limited to regions used in Strelka2.
 We also ran LUMPY SV [@doi:10/gf3ggc] v0.2.13 in express mode using default parameters. 
 The hg38 reference used was also limited to canonical chromosome regions.
+The [somatic DNA workflow](https://github.com/kids-first/kf-somatic-workflow) for SNV, INDEL, copy number, and SV calling can be found in the D3b GitHub repository.
 
 ### Gene Expression Abundance Estimation
 We used STAR [@doi:10/f4h523] v2.6.1d to align paired-end RNA-seq reads.
@@ -104,6 +117,7 @@ For both these tools we used aligned BAM and chimeric SAM files from STAR as inp
 We ran STAR-Fusion with default parameters and annotated all fusion calls with GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz provided in the STAR-fusion release. 
 For Arriba, we used a blacklist file (blacklist_hg38_GRCh38_2018-11-04.tsv.gz) from the Arriba release tarballs to remove recurrent fusion artifacts and transcripts present in healthy tissue.
 We also provided Arriba with strandedness information or set it to auto-detection for polyA samples.
+The [RNA expression and fusion workflows](https://github.com/kids-first/kf-rnaseq-workflow) can be found in the D3b GitHub repository.
 
 #### Fusion prioritization
 
@@ -117,6 +131,18 @@ We annotated putative driver fusions and prioritized fusions lists with kinases,
 We also added chimerDB [@doi:10.1093/nar/gkw1083] annotations to both driver and prioritized fusion list.
 
 ### Clinical Data Harmonization
+#### WHO Classification of Disease Types
+
+The `disease_type_old` field in the `pbta-histologies.tsv` file contains the diagnosis denoted from the patient's pathology report.
+The `disease_type_new` field in the `pbta-histologies.tsv` file includes updates to `disease_type_old`, including updates to any diagnosis denoted as "Other" as well as changes made on the basis of a molecular alteration.
+The `broad_histology` field denotes the broad 2016 WHO classification [@doi:10.1007/s00401-016-1545-1] for each tumor.
+The `short_histology` is an abbreviated version of the `broad_histology`.
+
+#### Molecular Subtyping
+Medulloblastoma subtypes SHH, WNT, Group 3, and Group 4 were predicted using an [RNA expression classifier](https://github.com/PichaiRaman/MedulloClassifier) on the RSEM FPKM data.
+
+#### Survival
+Overall survival was calculated as days since initial diagnosis.
 
 #### Prediction of participants' genetic sex