-
Notifications
You must be signed in to change notification settings - Fork 83
PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) #284
PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) #284
Conversation
- add analysis to `.circleci`
- add overexpressed genes column to final results and annotation - comment out all code after plotting the initial heatmap as this is able to run locally but fails `circleci`
…into atrt-molecular-subtyping
- separated data filtering from plotting (to include only data filtering script `ATRT-molecular-subtyping` in `.circleci`) - Add `get_mode` custom function for plot annotations
- convert plots to png for display in `display-heatmaps.md` - minor edit to headings in `display-heatmaps.md`
@cbethell thanks for working on this! One quick note - I would switch to using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cbethell,
I have some general comments before digging into the code line by line.
The first thing that jumps out for me with this PR is that the final tabular format has many entries for an individual sample and this appears to be mostly due to the "long format" of the z-score column. Because this data is meant for human consumption, rather than plotting, I think presenting in wide format will be more effective. That's what I was after with the table over on #244.
I had (probably) written up that comment on #244 before I had a full understanding/appreciation of the sample_id
mapping that would be required like in oncoprint-plotting
here and focal-cn-file-preparation
here. So it might be better to present the table as:
Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | age at diagnosis (days) | reported gender | primary site | location summary | TYR expression z-score | ... | ... | SHH ssGSEA score (rank?) | Notch ssGSEA score (rank?) | Focal SMARCB1 status | Focal SMARCA4 status | chr22q loss | Tumor Mutation Burder (rank?) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PT_XXXXXXXX | 7316-XXX | BS_XXXXXXXX, BS_XXXXXXXX | 800 | Female | Cerebellum/Posterior Fossa | infratentorial | 4.235 | ... | ... | 25 | 18 | loss | neutral | ... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Where the biospecimen identifier column contains a collapsed list of all Kids_First_Biospecimen_ID
associated with a sample_id
. You can see an example of how to do this with dplyr::summarize
(using different variables) here.
I have some suggestions for the values that need to be z-scored that might make this wide format easier to accomplish. I would take a slightly different approach where you subset the matrix to include only the genes/scores you are interested in (IIRC the ssGSEA matrix has gene sets as rows) and use sweep
to get z-score values (example in fusion_filtering
here). (Note this requires a numeric matrix, so for any kind of identifier to be removed.)
So you would have:
identifier of some sort as rownames |
BS_XXXXXXXX | BS_XXXXXXXX | BS_XXXXXXXX | ... | ... |
---|---|---|---|---|---|
TYR | 0.834890 | 1.455240 | -3.329482 | ... | ... |
MYCN | ... | ... | ... | ... | ... |
And you could transpose this and make the biospecimen identifiers a column:
Kids_First_Biospecimen_ID | TYR | MYCN |
---|---|---|
BS_XXXXXXXX | 0.834890 | ... |
BS_XXXXXXXX | 1.455240 | ... |
BS_XXXXXXXX | -3.329482 | ... |
... | ... | ... |
... | ... | ... |
... | ... | ... |
And now this is a lot closer to what needs to be in the wide format data frame. You would repeat this process for the ssGSEA scores.
In general, I would aim to get the data from each of the required steps in a form that's close to what you want, merge sample_id
to each individual data type using the biospecimen identifiers, and then merge everything together using the sample_id
column. You'll mostly likely end up with multiple biospecimen identifier columns that you'll need to clean up this way, probably with the help of dplyr::summarize
.
Since you're hardcoding the file names rather than passing them as arguments, I don't necessarily see a benefit to the data prep steps being in a script rather than a notebook that can be browsed by someone interested in how we did the subtyping later. You can also take a similar approach to displaying things with flextable
as I did in analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd
.
Please let me know if you have questions, there is a lot of info in there ☝️ !
dir.create(plots_dir) | ||
} | ||
|
||
# Source script with data preparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't source this, use a shell script instead and number these scripts in accordance with the order that they should be run in.
"pbta-gene-expression-rsem-fpkm.stranded.rds" | ||
)) | ||
|
||
# Read in focal CN data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a TODO about what focal CN data you'll be using because we'll get consensus files #128.
…into atrt-molecular-subtyping
- convert `ATRT-molecular-subtyping-data-prep.R` script into notebook `01-ATRT-molecular-subtyping-data-prep.Rmd` - number scripts - add `run-molecular-subtyping-ATRT.sh` to run the script and notebook in this module - use `flextable` to display tables in data prep notebook - convert `display-heatmaps.md` into README.md to explain the content of this module and display the heatmaps produced - update `.circleci` appropriately
I made the changes suggested in @jaclyn-taroni's comment above.
RE the above suggestion, I was not successful in my attempt thus far in using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cbethell, the steps in the data prep notebook are looking good. I had a few more comments on how to wrangle the data. I also don't think flextable
is always the best way to display the data.frame in that file. You might also want to take a look at DT
which I learned about on over on analyses/collapse-rnaseq
here for the final table and use the standard notebook display for interim products.
Can we split the plotting step into its own PR? As you do that, I would consider if you can do your annotation data.frame prep in the first notebook and write that to file. I would probably remove the README and the shell script for the moment as well. If you find that you can reduce the number of lines in 02-ATRT-molecular-subtyping-plotting.R
by changing the approach to the annotation data.frame prep, those 3 things can potentially go into the same PR. Thanks!
"Parietal Lobe;Temporal Lobe", | ||
"Frontal Lobe;Parietal Lobe", | ||
"Frontal Lobe;Meninges/Dura;Parietal Lobe;Spinal Cord- Cervical;Spinal Cord- Lumbar/Thecal Sac;Spinal Cord- Thoracic;Temporal Lobe;Ventricles" | ||
# Not sure how the line above should be broken up as `Frontal Lobe` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in these cases this gets coded as NA
or something else and the primary_site
also gets included in your final table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
TRUE ~ "neutral"), | ||
SMARCA4_focal_status = dplyr::case_when(gene_symbol == "SMARCA4" ~ status, | ||
TRUE ~ "neutral"), | ||
Overexpressed_gene_sets = dplyr::case_when( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you tell me a bit more about including this variable? Given the name Overexpressed_gene_sets
, I would expect this to capture information about the z-score values but this appears to be related to loss status in the CN file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you would be correct in expecting it to capture information about the z-scores. The logic here is indeed incorrect. However, I am not yet sure how to best determine a cutoff for overexpression using z-scores.
TRUE ~ "NA" | ||
) | ||
) %>% | ||
dplyr::select(-gene_symbol) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably drop status
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If included status
is for display, I'd probably leave gene_symbol
here and take it out at line 279.
# Join ATRT expression data with focal CN data | ||
atrt_expression_cn_df <- atrt_expression_df %>% | ||
dplyr::left_join(cn_metadata, by = "sample_id") %>% | ||
dplyr::select(-status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see you drop it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think somewhere in this code block you want to summarize the copy number information such that where a sample_id
had both loss
and gain
entries for SMARCB1_focal_status
it becomes gain, loss
in the SMARCB1_focal_status
so you don't have duplicate sample_id
. (Probably also made simpler by removing the overexpressed gene set step if you don't need it.) You will probably have to ungroup and join with the other metadata after this summarization step.
Kids_First_Participant_ID, | ||
biospecimen_id, | ||
status) %>% | ||
dplyr::filter(sample_id %in% atrt_expression_df$sample_id) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also filter here to the the relevant gene_symbol
.
|
||
```{r} | ||
# Join tumor mutuation data with metadata | ||
tmb_df <- tmb_df %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you using the subset files here by any chance? Is that why all of tmb
is NA
in the final table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this point, you can delete the data/snv-consensus_11122019
folder you have locally and rerun bash download-data.sh
or you could check out the branch over on #295 which allows you to overwrite the testing files I suspect you have locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct.
HALLMARK_MYC_TARGETS_V1 = mean(HALLMARK_MYC_TARGETS_V1), | ||
HALLMARK_MYC_TARGETS_V2 = mean(HALLMARK_MYC_TARGETS_V2), | ||
HALLMARK_NOTCH_SIGNALLING = mean(HALLMARK_NOTCH_SIGNALING), | ||
tmb = mean(tmb), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that you have to do all of these mean
steps because you have duplicate sample identifiers. If that's correct, I'd try summarizing the CN data before joining it with the other data.
# Save final data.frame | ||
final_df <- atrt_expression_cn_tmb_df %>% | ||
dplyr::group_by(sample_id) %>% | ||
dplyr::summarise(Kids_First_Biospecimen_ID = paste(sort(unique( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try pasting the entries in the multiple Kids_First_Biospecimen_ID.*
columns together into a new column with mutate
?
- remove plotting script, shell script and `README.md` from this PR - remove plots from this PR - adjust the way the tables are displayed in the html output - refactor the way some of the data are summarized - remove `Overexpressed_gene_sets` column - improve list of target genes (using the literature referenced in the initial ticket)
I addressed @jaclyn-taroni's most recent comments in the last commit. |
…into atrt-molecular-subtyping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very close. A couple things I'd like to see changed before merging:
- Some changes to what gets moved to the downstream plotting step and what interim products get saved in service of having more flexibility downstream.
- Removing the copy number information for genes outside of SMARCA4 and SMARCB1 for now. There is quite a bit of information in the table and I understand this to be outside of what is detailed in Proposed Analysis: Molecularly subtype ATRT tumors #244.
- Move to using the v11 files since we're going to get all of the v11 changes in ASAP.
file.path( | ||
root_dir, | ||
"data", | ||
"snv-consensus_11122019", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you switch to using v11 files here? Might as well make the change before getting it merged since we'll prioritize getting the v11 stuff in.
# Make a data.frame with only the expression values for ATRT samples | ||
stranded_expression_df <- stranded_expression %>% | ||
as.data.frame() %>% | ||
dplyr::select(column_names$.) | ||
|
||
# Create a correlation matrix of the expression data | ||
initial_mat <- cor(stranded_expression_df, method = "pearson") | ||
|
||
# Write initial matrix to file | ||
readr::write_rds(initial_mat, | ||
file.path(results_dir, "initial_heatmap_matrix.RDS")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this correlation step to the plotting script and consider instead saving the z-scored ATRT only matrix as part of this script -- that way you have more flexibility and can also perform something like UMAP downstream if you'd like.
) | ||
|
||
# Filter expression data for target overexpressed genes | ||
stranded_expression_wide_df <- stranded_expression %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about moving to using the collapsed file that's included in v11? Would need to change lines 78-84.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea. I can make this change and commit once v11 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha 👍
final_mat <- cor(atrt_expression_mat, method = "pearson") | ||
|
||
# Write final matrix to file | ||
readr::write_rds(final_mat, | ||
file.path(results_dir, "final_heatmap_matrix.RDS")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would take this file prep out as well. This is similar to what you have for the initial matrix but also includes ssGSEA scores, do I have that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct. Will do.
gene_symbol %in% tyr_genes | | ||
gene_symbol %in% shh_genes | | ||
gene_symbol %in% myc_genes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there's a gene_symbol
column that's collapsed in the final table and I find it confusing. I assume that this comes from this step because FOXC1 is included. Can we take these genes out of the copy number analyses altogether? I understand #244 to reference these genes in the context of expression but not copy number alterations.
SMARCA4_focal_status = paste(sort(unique( | ||
SMARCA4_focal_status | ||
)), collapse = ", "), | ||
gene_symbol = paste(sort(unique( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes okay, this is consistent with my assumption above.
)) | ||
|
||
# Make the annotation data.frame a Heatmap Annotation object | ||
initial_ha_atrt <- HeatmapAnnotation(df = initial_annotation_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably save the data.frame
instead of the output of ComplexHeatmap::HeatmapAnnotation
because it's more flexible downstream and you don't need to use the ComplexHeatmap
package here.
…into atrt-molecular-subtyping
- remove step creating correlation matrix - save annotations for upcoming heatmaps as data.frame instead of as Heatmap Annotation objects - remove copy number information for genes outside of `SMARCA4` and `SMARCB1` - switch to using collapsed expression file
- z-score log transformed dat - filter CN data - c() identifiers for final table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM, anything else I noticed I addressed with fe685aa
Purpose/implementation Section
To molecularly subtype ATRT samples into SHH, MYC, and TYR.
What scientific question is your analysis addressing?
What are the samples that fit into the ATRT molecular subtypes?
What was your approach?
ComplexHeatmap
package to plot an initial Heatmap which can be found atplots/intial_heatmap.pdf
.A slightly more detailed plan can be found here.
What GitHub issue does your pull request address?
This PR addresses issue #244.
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
supratentorial
andinfratentorial
regions using values inprimary_site
from the metadata. I used this figure as a reference.However, I was unclear how to classify a specific two values in the
primary_site
column (of which I made a comment referencing in the script).Is there anything that you want to discuss further?
I have SV data left to incorporate (this is also noted in the script).
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
This is still a work in progress, but the resulting figures and table of results can be viewed in the meantime and any relevant feedback is welcome.
Results
What types of results are included (e.g., table, figure)?
Heatmap plots:
plots/initial_heatmap.pdf
plots/final_annotated_heatmap.pdf
These plots can be viewed in
display-heatmaps.md
here.Table of results (mimicking the table format in this comment):
results/ATRT_molecular_subtypes.tsv.gz
What is your summary of the results?
The results are not yet conclusive enough to identify the samples that belong to each subtype. However, the heatmap suggests that there are indeed atleast three clusters represented in the data.
Reproducibility Checklist
PR Checklist