diff --git a/content/03.methods.md b/content/03.methods.md index 34a12b40..12ebd360 100644 --- a/content/03.methods.md +++ b/content/03.methods.md @@ -64,14 +64,31 @@ Alignments were futher processed using following the Broad Institute's Best Prac Duplicates were marked using Samblaster[@doi:10/f6kft3] v0.1.24, BAMs merged and sorted using Sambamba [@doi:10/gfzsfw] v0.6.3. Lastly, resultant BAMs were processing using Broad's Genome Analysis Tool Kit (GATK) [@url:https://software.broadinstitute.org/gatk/] v4.0.3.0, BaseRecalibrator submodule. -### Germ Line Single Nucleotide Variant Calling +### Quality Control of Sequencing Data + +NGSCheckmate [doi:10.1093/nar/gkx193] was peformed on matched tumor/normal CRAMs to confirm sample matches and remove mis-matched samples from the dataset. +CRAM inputs were preprocessed using bcftools to filter and call 20k common SNPs using default parameters[@url:https://github.com/parklab/NGSCheckMate] and the resulting VCFs were used to run NGSCheckmate using [this workflow](https://github.com/d3b-center/ngs_checkmate_wf) in the D3b GitHub repository. +Per author guidelines, <= 0.61 was used as a correlation coefficient cutoff at sequencing depths >10 to predict mismatched samples. +For RNA-Seq, read strandedness was determined by running the [`infer_experiment.py` script](http://rseqc.sourceforge.net/#infer-experiment-py) on the first 200k mapped reads. +If calculated strandedness did not match strandedness information received from the sequencing center, samples were removed from analysis. +We required at least 60% of RNA-Seq reads mapped to the human reference or samples were removed from analysis. ### Somatic Single Nucleotide Variant Calling #### SNV and INDEL calling -We used Strelka2 [@doi:10/gdwrp4] v2.9.3 and Mutect2 from GATK v4.1.1.0. -Strelka2 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the author. -Mutect2 was run following Broad best practices outlined from their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl]. +We used four variant callers to call SNVs and INDELS from targeted DNA panel, WXS, and WGS data: Strelka2, Mutect2, Lancet, and VarDict. +The same input interval BED files were used for both panel and WXS data. +Strelka2 [@doi:10/gdwrp4] v2.9.3 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the authors. +The final Strelka2 VCF was filtered for PASS variants. +Mutect2 from GATK v4.1.1.0 was run following Broad best practices outlined from their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl]. +The final Mutect2 VCF was filtered for PASS variants. +Lancet [@doi:10.1038/s42003-018-0023-9] v1.0.7 [@url:https://github.com/nygenome/lancet] was run using default parameters, unless noted below. +For input intervals to Lancet, a reference BED was created by using only the UTR, exome, and start/stop codon features of the GENCODE 31 reference. +Per recommendations by the New York Genome Center, the Lancet input intervals were augmented with PASS variant calls from Strelka2 and Mutect2 as validation. +VarDictJava [@doi:10.1093/nar/gkw227] v1.58 [@url:https://github.com/AstraZeneca-NGS/VarDictJava] was run using the hg38 fasta reference with the same BED intervals used for Mutect2. +Parameters and filtering followed BCBIO standards except that variants with a variant allele frequency (VAF) >= 0.05 (instead of >= 0.10) were retained. +The 0.05 VAF increased the true positive rate for INDELs and decreased the false positive rate for SNVs when using VarDict in consensus calling. +The final VCF was filtered for PASS variants with TYPE=StronglySomatic. #### VCF annotation and MAF creation @@ -88,6 +105,7 @@ We used Manta SV [@doi:10/gf3ggb] v1.4.0 for structural variant (SV) calls. Manta SV calling was also limited to regions used in Strelka2. We also ran LUMPY SV [@doi:10/gf3ggc] v0.2.13 in express mode using default parameters. The hg38 reference used was also limited to canonical chromosome regions. +The somatic DNA workflow for SNV, INDEL, copy number, and SV calling can be found in the [KidsFirst Github repository](https://github.com/kids-first/kf-somatic-workflow). ### Gene Expression Abundance Estimation We used STAR [@doi:10/f4h523] v2.6.1d to align paired-end RNA-seq reads. @@ -104,6 +122,7 @@ For both these tools we used aligned BAM and chimeric SAM files from STAR as inp We ran STAR-Fusion with default parameters and annotated all fusion calls with GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz provided in the STAR-fusion release. For Arriba, we used a blacklist file (blacklist_hg38_GRCh38_2018-11-04.tsv.gz) from the Arriba release tarballs to remove recurrent fusion artifacts and transcripts present in healthy tissue. We also provided Arriba with strandedness information or set it to auto-detection for polyA samples. +The RNA expression and fusion workflows can be found in the [KidsFirst GitHub repository](https://github.com/kids-first/kf-rnaseq-workflow). #### Fusion prioritization @@ -118,6 +137,24 @@ We also added chimerDB [@doi:10.1093/nar/gkw1083] annotations to both driver and ### Clinical Data Harmonization +#### WHO Classification of Disease Types + +The `disease_type_old` field in the `pbta-histologies.tsv` file contains the diagnosis denoted from the patient's pathology report. +The `disease_type_new` field in the `pbta-histologies.tsv` file includes updates to `disease_type_old` and these changes are documented in the `Notes`. +For instance, any diagnosis denoted as "Other" in `disease_type_old` was modified to capture the pathology report diagnosis in `disease_type_new`. +Additionally, `disease_type_old` was modified to `disease_type_new` if the presence of specific molecular alterations defined a biospecimen as having an alternate diagnosis. +The `broad_histology` denotes the broad 2016 WHO classification [doi:10.1007/s00401-016-1545-1] for each tumor. +The `short_histology` is an abbreviated version of the `broad_histology`. + +#### Molecular Subtyping + +The `molecular_subtype` column in the `pbta-histologies.tsv` file contains molecular subtype information derived as described below. +Medulloblastoma subtypes SHH, MYC, Group 3, and Group 4 were predicted using an [RNA expression classifier](https://github.com/PichaiRaman/MedulloClassifier) on the RSEM FPKM data. + +#### Survival + +Overall survival was calculated as days since initial diagnosis. + #### Prediction of participants' genetic sex The clinical metadata provided included a reported gender.