Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Commit

Permalink
Merge pull request #44 from AlexsLemonade/update-methods
Browse files Browse the repository at this point in the history
add-lancet-vardict-ngscheck-links-clin-harm
  • Loading branch information
Jo Lynne authored Sep 23, 2019
2 parents ac9b7cd + 08e0dd1 commit 4e0099d
Showing 1 changed file with 41 additions and 4 deletions.
45 changes: 41 additions & 4 deletions content/03.methods.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,14 +64,31 @@ Alignments were futher processed using following the Broad Institute's Best Prac
Duplicates were marked using Samblaster[@doi:10/f6kft3] v0.1.24, BAMs merged and sorted using Sambamba [@doi:10/gfzsfw] v0.6.3.
Lastly, resultant BAMs were processing using Broad's Genome Analysis Tool Kit (GATK) [@url:https://software.broadinstitute.org/gatk/] v4.0.3.0, BaseRecalibrator submodule.

### Germ Line Single Nucleotide Variant Calling
### Quality Control of Sequencing Data

NGSCheckmate [doi:10.1093/nar/gkx193] was peformed on matched tumor/normal CRAMs to confirm sample matches and remove mis-matched samples from the dataset.
CRAM inputs were preprocessed using bcftools to filter and call 20k common SNPs using default parameters[@url:https://github.com/parklab/NGSCheckMate] and the resulting VCFs were used to run NGSCheckmate using [this workflow](https://github.com/d3b-center/ngs_checkmate_wf) in the D3b GitHub repository.
Per author guidelines, <= 0.61 was used as a correlation coefficient cutoff at sequencing depths >10 to predict mismatched samples.
For RNA-Seq, read strandedness was determined by running the [`infer_experiment.py` script](http://rseqc.sourceforge.net/#infer-experiment-py) on the first 200k mapped reads.
If calculated strandedness did not match strandedness information received from the sequencing center, samples were removed from analysis.
We required at least 60% of RNA-Seq reads mapped to the human reference or samples were removed from analysis.

### Somatic Single Nucleotide Variant Calling
#### SNV and INDEL calling

We used Strelka2 [@doi:10/gdwrp4] v2.9.3 and Mutect2 from GATK v4.1.1.0.
Strelka2 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the author.
Mutect2 was run following Broad best practices outlined from their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl].
We used four variant callers to call SNVs and INDELS from targeted DNA panel, WXS, and WGS data: Strelka2, Mutect2, Lancet, and VarDict.
The same input interval BED files were used for both panel and WXS data.
Strelka2 [@doi:10/gdwrp4] v2.9.3 was run using default parameters on human genome reference hg38, canonical chromosomes only (chr1-22, X,Y,M), as recommended by the authors.
The final Strelka2 VCF was filtered for PASS variants.
Mutect2 from GATK v4.1.1.0 was run following Broad best practices outlined from their Workflow Description Language (WDL) [@url:https://github.com/broadinstitute/gatk/blob/4.1.1.0/scripts/mutect2_wdl/mutect2.wdl].
The final Mutect2 VCF was filtered for PASS variants.
Lancet [@doi:10.1038/s42003-018-0023-9] v1.0.7 [@url:https://github.com/nygenome/lancet] was run using default parameters, unless noted below.
For input intervals to Lancet, a reference BED was created by using only the UTR, exome, and start/stop codon features of the GENCODE 31 reference.
Per recommendations by the New York Genome Center, the Lancet input intervals were augmented with PASS variant calls from Strelka2 and Mutect2 as validation.
VarDictJava [@doi:10.1093/nar/gkw227] v1.58 [@url:https://github.com/AstraZeneca-NGS/VarDictJava] was run using the hg38 fasta reference with the same BED intervals used for Mutect2.
Parameters and filtering followed BCBIO standards except that variants with a variant allele frequency (VAF) >= 0.05 (instead of >= 0.10) were retained.
The 0.05 VAF increased the true positive rate for INDELs and decreased the false positive rate for SNVs when using VarDict in consensus calling.
The final VCF was filtered for PASS variants with TYPE=StronglySomatic.

#### VCF annotation and MAF creation

Expand All @@ -88,6 +105,7 @@ We used Manta SV [@doi:10/gf3ggb] v1.4.0 for structural variant (SV) calls.
Manta SV calling was also limited to regions used in Strelka2.
We also ran LUMPY SV [@doi:10/gf3ggc] v0.2.13 in express mode using default parameters.
The hg38 reference used was also limited to canonical chromosome regions.
The somatic DNA workflow for SNV, INDEL, copy number, and SV calling can be found in the [KidsFirst Github repository](https://github.com/kids-first/kf-somatic-workflow).

### Gene Expression Abundance Estimation
We used STAR [@doi:10/f4h523] v2.6.1d to align paired-end RNA-seq reads.
Expand All @@ -104,6 +122,7 @@ For both these tools we used aligned BAM and chimeric SAM files from STAR as inp
We ran STAR-Fusion with default parameters and annotated all fusion calls with GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz provided in the STAR-fusion release.
For Arriba, we used a blacklist file (blacklist_hg38_GRCh38_2018-11-04.tsv.gz) from the Arriba release tarballs to remove recurrent fusion artifacts and transcripts present in healthy tissue.
We also provided Arriba with strandedness information or set it to auto-detection for polyA samples.
The RNA expression and fusion workflows can be found in the [KidsFirst GitHub repository](https://github.com/kids-first/kf-rnaseq-workflow).

#### Fusion prioritization

Expand All @@ -118,6 +137,24 @@ We also added chimerDB [@doi:10.1093/nar/gkw1083] annotations to both driver and

### Clinical Data Harmonization

#### WHO Classification of Disease Types

The `disease_type_old` field in the `pbta-histologies.tsv` file contains the diagnosis denoted from the patient's pathology report.
The `disease_type_new` field in the `pbta-histologies.tsv` file includes updates to `disease_type_old` and these changes are documented in the `Notes`.
For instance, any diagnosis denoted as "Other" in `disease_type_old` was modified to capture the pathology report diagnosis in `disease_type_new`.
Additionally, `disease_type_old` was modified to `disease_type_new` if the presence of specific molecular alterations defined a biospecimen as having an alternate diagnosis.
The `broad_histology` denotes the broad 2016 WHO classification [doi:10.1007/s00401-016-1545-1] for each tumor.
The `short_histology` is an abbreviated version of the `broad_histology`.

#### Molecular Subtyping

The `molecular_subtype` column in the `pbta-histologies.tsv` file contains molecular subtype information derived as described below.
Medulloblastoma subtypes SHH, MYC, Group 3, and Group 4 were predicted using an [RNA expression classifier](https://github.com/PichaiRaman/MedulloClassifier) on the RSEM FPKM data.

#### Survival

Overall survival was calculated as days since initial diagnosis.

#### Prediction of participants' genetic sex

The clinical metadata provided included a reported gender.
Expand Down

0 comments on commit 4e0099d

Please sign in to comment.