This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Added methods for sex determination tool #40

Merged
merged 15 commits into from
Sep 16, 2019
13 changes: 13 additions & 0 deletions content/03.methods.md
@@ -87,3 +87,16 @@ We annotated putative driver fusions and prioritized fusions lists with kinases,
We also added chimerDB [@doi:10.1093/nar/gkw1083] annotations to both driver and prioritized fusion list.

### Clinical Data Harmonization

#### Prediction of participants' sex

The clinical metadata provided included a reported gender.
We used genetic data, in concert with the reported gender, to predict participant sex so that we could identify sexually dimorphic outcomes.
This analysis could also reveal samples that may have been contaminated in certain circumstances.
We used the idxstats utility from SAMtools [@pmid:19505943] to obtain the reference sequence length and the number of mapped reads for the X and Y chromosomes.
We used the fraction of total normalized X and Y chromosome reads that were attributed to the Y chromosome as a summary statistic.
We reviewed this statistic in the context of reported gender and determined that a fraction of less than 0.2 clearly delineated female samples.
Samples with fractions greater than 0.4 were predicted to be male.
Samples with values in the range [0.2, 0.4] were marked as unknown.
Collaborator

Suggested change
- Samples with values in the range [0.2, 0.4] were marked as unknown.
+ Samples with values in the range [0.2, 0.4] were deemed as contaminated and removed from the dataset.

Collaborator

this reminds me - we never added manuscript text about NGScheckmate, which is our major means of QC...I will add a ticket!

Collaborator

If there is another mechanism that also suggests that these samples are contaminated, that might not be the best place to discuss filtering due to contamination. An alternative explanation could be that some of these samples are from individuals with more than two sex chromosomes. I don't know that we want to go into that level of detail in this paragraph, but I don't think this is the point where we want to say that the samples are contaminated.

Collaborator

Hmm good point. This sample was really removed due to NGScheckmate.

We ran this analysis through [CWL](https://github.com/d3b-center/sex-determination-tool) on Cavatica.
Resulting calls were added to the clinical metadata as `germline_sex_estimate`.
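
For readers following the methods text above, the Y-fraction statistic and the 0.2 / 0.4 thresholds could be sketched roughly as follows. This is an illustrative sketch only, not the code in the linked CWL tool: the function names, the `chrX`/`chrY` reference names, the label strings, and the choice to normalize mapped reads by reference sequence length are all assumptions. The only external fact relied on is the `samtools idxstats` output layout (tab-separated reference name, sequence length, mapped reads, unmapped reads).

```python
from typing import Dict, Tuple


def parse_idxstats(text: str) -> Dict[str, Tuple[int, int]]:
    """Map reference name -> (sequence length, mapped reads).

    Expects the tab-separated text printed by `samtools idxstats`:
    name, sequence length, mapped reads, unmapped reads.
    """
    stats = {}
    for line in text.strip().splitlines():
        name, length, mapped, _unmapped = line.split("\t")
        stats[name] = (int(length), int(mapped))
    return stats


def y_fraction(stats: Dict[str, Tuple[int, int]],
               x_name: str = "chrX", y_name: str = "chrY") -> float:
    """Fraction of normalized X + Y reads attributed to Y.

    Assumption: "normalized" means mapped reads divided by the
    reference sequence length of that chromosome.
    """
    x_len, x_mapped = stats[x_name]
    y_len, y_mapped = stats[y_name]
    x_norm = x_mapped / x_len
    y_norm = y_mapped / y_len
    total = x_norm + y_norm
    if total == 0:
        raise ValueError("no reads mapped to X or Y")
    return y_norm / total


def predict_sex(fraction: float,
                female_below: float = 0.2,
                male_above: float = 0.4) -> str:
    """Apply the thresholds from the methods text (labels are hypothetical)."""
    if fraction < female_below:
        return "Female"
    if fraction > male_above:
        return "Male"
    return "Unknown"  # values in [0.2, 0.4]


if __name__ == "__main__":
    # Made-up idxstats-style text for illustration; not real data.
    example = "chrX\t1000\t900\t0\nchrY\t500\t10\t0\n*\t0\t0\t5"
    print(predict_sex(y_fraction(parse_idxstats(example))))
```

With real data, the text passed to `parse_idxstats` would come from running `samtools idxstats` on an indexed BAM file; reference names may be `X`/`Y` rather than `chrX`/`chrY`, depending on the genome build.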