This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 83
/
Copy path02-add-ploidy-consensus.Rmd
227 lines (183 loc) · 7.88 KB
/
02-add-ploidy-consensus.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: "Add ploidy column, status to consensus SEG file"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: Chante Bethell and Jaclyn Taroni for ALSF CCDL
date: 2020
params:
base_run:
label: "1/0 to run with base histology"
value: 0
input: integer
---
The `pbta-histologies.tsv` file contains a `tumor_ploidy` column, which is tumor ploidy as inferred by ControlFreeC.
The copy number information should be interpreted in the light of this information (see: [current version of Data Formats file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/0e642ef0602ad04b8d0fbd80be626d4374fee928/doc/data-formats.md#a-note-on-ploidy)).
This notebook adds ploidy information to the consensus SEG file and adds a status column that defines gain and loss broadly.
**Note** that the consensus SEG file does not have copy number information for sex chromosomes.
## Usage
This notebook is intended to be run from the command line with the following (assumes you are in the root directory of the repository):
```
Rscript -e "rmarkdown::render('analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd', clean = TRUE)"
```
## Set up
### Libraries and functions
```{r}
library(tidyverse)
```
### Files and directories
```{r}
scratch_dir <- file.path("..", "..", "scratch")
output_dir <- file.path(scratch_dir, "cytoband_status", "segments")
if(!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
```
```{r}
# TODO: the consensus SEG file is not currently in the data download -- when it
# gets included we will have to change the file path here
consensus_seg_file <- file.path("..", "copy_number_consensus_call", "results",
"pbta-cnv-consensus.seg.gz")
if ( params$base_run ==0 ){
histologies_file <- file.path("..", "..", "data", "pbta-histologies.tsv")
} else {
histologies_file <- file.path("..", "..", "data", "pbta-histologies-base.tsv")
}
consensus_seg_df <- read_tsv(consensus_seg_file)
histologies_df <- read_tsv(histologies_file,
col_types = cols(
molecular_subtype = col_character()
))
```
## Add ploidy and status
As noted above, the sex chromosomes do not have copy number information included in the consensus SEG file, but other copy number values can be missing.
From the [consensus SEG file methods](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/0e642ef0602ad04b8d0fbd80be626d4374fee928/analyses/copy_number_consensus_call#consensus-cnv-creation):
> The `copy.num` column is the weighted median of CNVkit segment values where they exist, or Control-FREEC values in the absence of CNVkit data. Because some software (notably GISTIC) requires all samples to have the same regions called, the copy number variants from `cnv_consensus.tsv` are supplementented with "neutral" segments where no call was made. These include all non-variant regions present in `ref/cnv_callable.bed` The neutral regions are assigned copy.num 2, except on chrX and chrY, where the copy number is left NA.
For segments with missing `copy.num`, what chromosomes are they on?
```{r}
consensus_seg_df %>%
filter(is.na(copy.num)) %>%
group_by(chrom) %>%
tally()
```
The majority are on the sex chromosomes, which is expected.
We can remove these, as we can not add status using the ploidy information without the `copy.num` information.
We need to add in the `tumor_ploidy` column from the histologies file.
```{r}
add_ploidy_df <- consensus_seg_df %>%
filter(!is.na(copy.num)) %>%
inner_join(select(histologies_df,
Kids_First_Biospecimen_ID,
tumor_ploidy,
germline_sex_estimate),
by = c("ID" = "Kids_First_Biospecimen_ID")) %>%
select(-tumor_ploidy, -germline_sex_estimate, everything())
```
### Handle autosomes first
```{r}
add_autosomes_df <- add_ploidy_df %>%
# x and y chromosomes must be handled differently
filter(!(chrom %in% c("chrX", "chrY"))) %>%
mutate(status = case_when(
# when the copy number is less than inferred ploidy, mark this as a loss
copy.num < tumor_ploidy ~ "loss",
# if copy number is higher than ploidy, mark as a gain
copy.num > tumor_ploidy ~ "gain",
copy.num == tumor_ploidy ~ "neutral"
))
```
### Handle sex chromosomes
```{r}
# this logic is consistent with what is observed in the controlfreec file
# specifically, in samples where germline sex estimate = Female, X chromosomes
# appear to be treated the same way as autosomes
add_xy_df <- add_ploidy_df %>%
filter(chrom %in% c("chrX", "chrY")) %>%
mutate(status = case_when(
germline_sex_estimate == "Male" & copy.num < (0.5 * tumor_ploidy) ~ "loss",
germline_sex_estimate == "Male" & copy.num > (0.5 * tumor_ploidy) ~ "gain",
germline_sex_estimate == "Male" & copy.num == (0.5 * tumor_ploidy) ~ "neutral",
# there are some instances where there are chromosome Y segments are in
# samples where the germline_sex_estimate is Female
chrom == "chrY" & germline_sex_estimate == "Female" & copy.num > 0 ~ "unknown",
chrom == "chrY" & germline_sex_estimate == "Female" & copy.num == 0 ~ "neutral",
# mirroring controlfreec, X treated same as autosomes
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num < tumor_ploidy ~ "loss",
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num > tumor_ploidy ~ "gain",
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num == tumor_ploidy ~ "neutral"
))
add_status_df <- bind_rows(add_autosomes_df, add_xy_df) %>%
select(-germline_sex_estimate) %>%
rename(Kids_First_Biospecimen_ID = ID)
```
### Does `seg.mean` agree with `status`?
```{r}
add_status_df %>%
filter(!is.na(seg.mean)) %>%
ggplot(aes(x = status, y = seg.mean, group = status)) +
geom_jitter()
```
```{r}
add_status_df %>%
filter(!is.na(seg.mean)) %>%
group_by(status) %>%
summarize(mean = mean(seg.mean), sd = sd(seg.mean))
```
### Write to file
```{r}
output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status.tsv")
write_tsv(add_status_df, output_file)
```
### Prepare separate bed files for losses/gains for bedtools coverage function
```{r}
bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
group_by(Kids_First_Biospecimen_ID) %>%
arrange(chrom, loc.start, loc.end)
losses_bed_status_df <- bed_status_df %>%
filter(status == "loss")
gains_bed_status_df <- bed_status_df %>%
filter(status == "gain")
# make lists of data frames by sample
bed_status_list <- bed_status_df %>%
group_split()
names(bed_status_list) <- bed_status_df %>%
group_keys() %>%
pull(Kids_First_Biospecimen_ID)
bed_loss_list <- losses_bed_status_df %>%
group_split()
names(bed_loss_list) <- losses_bed_status_df %>%
group_keys() %>%
pull(Kids_First_Biospecimen_ID)
bed_gain_list <- gains_bed_status_df %>%
group_split()
names(bed_gain_list) <- gains_bed_status_df %>%
group_keys() %>%
pull(Kids_First_Biospecimen_ID)
```
### Write to file
```{r}
temp <- purrr::imap(bed_status_list,
~ write_tsv(.x,
file.path(
output_dir, paste0("consensus_callable.", .y, ".bed")
),
col_names = FALSE))
temp_loss <- purrr::imap(bed_loss_list,
~ write_tsv(.x,
file.path(
output_dir, paste0("consensus_loss.", .y, ".bed")
),
col_names = FALSE))
temp_gain <- purrr::imap(bed_gain_list,
~ write_tsv(.x,
file.path(
output_dir, paste0("consensus_gain.", .y, ".bed")
),
col_names = FALSE))
```
## Session Info
```{r}
sessionInfo()
```