Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) #284

Merged
merged 19 commits into from
Dec 3, 2019

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Nov 20, 2019

Purpose/implementation Section

To molecularly subtype ATRT samples into SHH, MYC, and TYR.

What scientific question is your analysis addressing?

What are the samples that fit into the ATRT molecular subtypes?

What was your approach?

  1. I used the ComplexHeatmap package to plot an initial Heatmap which can be found at plots/intial_heatmap.pdf.
  2. I then joined the metadata, RNA expression, focal CN, the ssGSEA pathway, and tumor mutation burden data.
  3. I then plotted a final heatmap annotated with the joined information.

A slightly more detailed plan can be found here.

What GitHub issue does your pull request address?

This PR addresses issue #244.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • I manually defined the supratentorial and infratentorial regions using values in primary_site from the metadata. I used this figure as a reference.
    However, I was unclear how to classify a specific two values in the primary_site column (of which I made a comment referencing in the script).
  • I am not sure I produced the correlation matrix (for the heatmap) in the most correct way. Some feedback here would be helpful.
  • I produced a table of results as outlined in this comment, but did not clearly outline which samples fit into which subtype. Are there any suggestions on how to efficiently separate these samples into subtypes and represent this information in the results tsv file using the information represented on the heatmap and on this comment.
  • Is there any obvious refactoring needed?

Is there anything that you want to discuss further?

I have SV data left to incorporate (this is also noted in the script).

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

This is still a work in progress, but the resulting figures and table of results can be viewed in the meantime and any relevant feedback is welcome.

Results

What types of results are included (e.g., table, figure)?

Heatmap plots:

  • plots/initial_heatmap.pdf
  • plots/final_annotated_heatmap.pdf

These plots can be viewed in display-heatmaps.md here.

Table of results (mimicking the table format in this comment):

  • results/ATRT_molecular_subtypes.tsv.gz

What is your summary of the results?

The results are not yet conclusive enough to identify the samples that belong to each subtype. However, the heatmap suggests that there are indeed atleast three clusters represented in the data.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

PR Checklist

  • Run a linter
  • Set the seed
  • Comments and/or documentation up to date
  • Double check your paths
  • Spell check any Rmd file or md file
  • Restart R and run all notebooks fresh and save

- add analysis to `.circleci`
- add overexpressed genes column to final results and annotation
- comment out all code after plotting the initial heatmap as this is able to run locally but fails `circleci`
- separated data filtering from plotting (to include only data filtering script `ATRT-molecular-subtyping` in `.circleci`)
- Add `get_mode` custom function for plot annotations
- convert plots to png for display in `display-heatmaps.md`
- minor edit to headings in `display-heatmaps.md`
@cbethell cbethell marked this pull request as ready for review November 22, 2019 21:05
@cbethell cbethell changed the title WIP: Molecular Subtyping - ATRT Molecular Subtyping - ATRT Nov 22, 2019
@jharenza
Copy link
Collaborator

@cbethell thanks for working on this! One quick note - I would switch to using germline_sex_estimate in case any of the reported_genders were incorrect.

@cbethell
Copy link
Contributor Author

@cbethell thanks for working on this! One quick note - I would switch to using germline_sex_estimate in case any of the reported_genders were incorrect.

Thank you @jharenza. I have incorporated and committed that change.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbethell,

I have some general comments before digging into the code line by line.

The first thing that jumps out for me with this PR is that the final tabular format has many entries for an individual sample and this appears to be mostly due to the "long format" of the z-score column. Because this data is meant for human consumption, rather than plotting, I think presenting in wide format will be more effective. That's what I was after with the table over on #244.

I had (probably) written up that comment on #244 before I had a full understanding/appreciation of the sample_id mapping that would be required like in oncoprint-plotting here and focal-cn-file-preparation here. So it might be better to present the table as:

Kids_First_Participant_ID sample_id Kids_First_Biospecimen_ID age at diagnosis (days) reported gender primary site location summary TYR expression z-score ... ... SHH ssGSEA score (rank?) Notch ssGSEA score (rank?) Focal SMARCB1 status Focal SMARCA4 status chr22q loss Tumor Mutation Burder (rank?)
PT_XXXXXXXX 7316-XXX BS_XXXXXXXX, BS_XXXXXXXX 800 Female Cerebellum/Posterior Fossa infratentorial 4.235 ... ... 25 18 loss neutral ...
... ... ... ... ... ... ... ... ... ... ... ... ... ...

Where the biospecimen identifier column contains a collapsed list of all Kids_First_Biospecimen_ID associated with a sample_id. You can see an example of how to do this with dplyr::summarize (using different variables) here.

I have some suggestions for the values that need to be z-scored that might make this wide format easier to accomplish. I would take a slightly different approach where you subset the matrix to include only the genes/scores you are interested in (IIRC the ssGSEA matrix has gene sets as rows) and use sweep to get z-score values (example in fusion_filtering here). (Note this requires a numeric matrix, so for any kind of identifier to be removed.)

So you would have:

identifier of some sort as rownames BS_XXXXXXXX BS_XXXXXXXX BS_XXXXXXXX ... ...
TYR 0.834890 1.455240 -3.329482 ... ...
MYCN ... ... ... ... ...

And you could transpose this and make the biospecimen identifiers a column:

Kids_First_Biospecimen_ID TYR MYCN
BS_XXXXXXXX 0.834890 ...
BS_XXXXXXXX 1.455240 ...
BS_XXXXXXXX -3.329482 ...
... ... ...
... ... ...
... ... ...

And now this is a lot closer to what needs to be in the wide format data frame. You would repeat this process for the ssGSEA scores.

In general, I would aim to get the data from each of the required steps in a form that's close to what you want, merge sample_id to each individual data type using the biospecimen identifiers, and then merge everything together using the sample_id column. You'll mostly likely end up with multiple biospecimen identifier columns that you'll need to clean up this way, probably with the help of dplyr::summarize.

Since you're hardcoding the file names rather than passing them as arguments, I don't necessarily see a benefit to the data prep steps being in a script rather than a notebook that can be browsed by someone interested in how we did the subtyping later. You can also take a similar approach to displaying things with flextable as I did in analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd.

Please let me know if you have questions, there is a lot of info in there ☝️ !

dir.create(plots_dir)
}

# Source script with data preparation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't source this, use a shell script instead and number these scripts in accordance with the order that they should be run in.

"pbta-gene-expression-rsem-fpkm.stranded.rds"
))

# Read in focal CN data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a TODO about what focal CN data you'll be using because we'll get consensus files #128.

- convert `ATRT-molecular-subtyping-data-prep.R` script into notebook `01-ATRT-molecular-subtyping-data-prep.Rmd`
- number scripts
- add `run-molecular-subtyping-ATRT.sh` to run the script and notebook in this module
- use `flextable` to display tables in data prep notebook
- convert `display-heatmaps.md` into README.md to explain the content of this module and display the heatmaps produced
- update `.circleci` appropriately
@cbethell
Copy link
Contributor Author

I made the changes suggested in @jaclyn-taroni's comment above.
The plots can now be viewed in the README.md file here.

You'll mostly likely end up with multiple biospecimen identifier columns that you'll need to clean up this way, probably with the help of dplyr::summarize.

RE the above suggestion, I was not successful in my attempt thus far in using dplyr::summarize to clean up the multiple columns so I am sure this particular section can use some refactoring, which I am still working on.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbethell, the steps in the data prep notebook are looking good. I had a few more comments on how to wrangle the data. I also don't think flextable is always the best way to display the data.frame in that file. You might also want to take a look at DT which I learned about on over on analyses/collapse-rnaseq here for the final table and use the standard notebook display for interim products.

Can we split the plotting step into its own PR? As you do that, I would consider if you can do your annotation data.frame prep in the first notebook and write that to file. I would probably remove the README and the shell script for the moment as well. If you find that you can reduce the number of lines in 02-ATRT-molecular-subtyping-plotting.R by changing the approach to the annotation data.frame prep, those 3 things can potentially go into the same PR. Thanks!

"Parietal Lobe;Temporal Lobe",
"Frontal Lobe;Parietal Lobe",
"Frontal Lobe;Meninges/Dura;Parietal Lobe;Spinal Cord- Cervical;Spinal Cord- Lumbar/Thecal Sac;Spinal Cord- Thoracic;Temporal Lobe;Ventricles"
# Not sure how the line above should be broken up as `Frontal Lobe` is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in these cases this gets coded as NA or something else and the primary_site also gets included in your final table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

TRUE ~ "neutral"),
SMARCA4_focal_status = dplyr::case_when(gene_symbol == "SMARCA4" ~ status,
TRUE ~ "neutral"),
Overexpressed_gene_sets = dplyr::case_when(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell me a bit more about including this variable? Given the name Overexpressed_gene_sets, I would expect this to capture information about the z-score values but this appears to be related to loss status in the CN file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you would be correct in expecting it to capture information about the z-scores. The logic here is indeed incorrect. However, I am not yet sure how to best determine a cutoff for overexpression using z-scores.

TRUE ~ "NA"
)
) %>%
dplyr::select(-gene_symbol) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably drop status here too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If included status is for display, I'd probably leave gene_symbol here and take it out at line 279.

# Join ATRT expression data with focal CN data
atrt_expression_cn_df <- atrt_expression_df %>%
dplyr::left_join(cn_metadata, by = "sample_id") %>%
dplyr::select(-status)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see you drop it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think somewhere in this code block you want to summarize the copy number information such that where a sample_id had both loss and gain entries for SMARCB1_focal_status it becomes gain, loss in the SMARCB1_focal_status so you don't have duplicate sample_id. (Probably also made simpler by removing the overexpressed gene set step if you don't need it.) You will probably have to ungroup and join with the other metadata after this summarization step.

Kids_First_Participant_ID,
biospecimen_id,
status) %>%
dplyr::filter(sample_id %in% atrt_expression_df$sample_id) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also filter here to the the relevant gene_symbol.


```{r}
# Join tumor mutuation data with metadata
tmb_df <- tmb_df %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using the subset files here by any chance? Is that why all of tmb is NA in the final table?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this point, you can delete the data/snv-consensus_11122019 folder you have locally and rerun bash download-data.sh or you could check out the branch over on #295 which allows you to overwrite the testing files I suspect you have locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct.

HALLMARK_MYC_TARGETS_V1 = mean(HALLMARK_MYC_TARGETS_V1),
HALLMARK_MYC_TARGETS_V2 = mean(HALLMARK_MYC_TARGETS_V2),
HALLMARK_NOTCH_SIGNALLING = mean(HALLMARK_NOTCH_SIGNALING),
tmb = mean(tmb),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that you have to do all of these mean steps because you have duplicate sample identifiers. If that's correct, I'd try summarizing the CN data before joining it with the other data.

# Save final data.frame
final_df <- atrt_expression_cn_tmb_df %>%
dplyr::group_by(sample_id) %>%
dplyr::summarise(Kids_First_Biospecimen_ID = paste(sort(unique(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try pasting the entries in the multiple Kids_First_Biospecimen_ID.* columns together into a new column with mutate ?

- remove plotting script, shell script and `README.md` from this PR
- remove plots from this PR
- adjust the way the tables are displayed in the html output
- refactor the way some of the data are summarized
- remove `Overexpressed_gene_sets` column
- improve list of target genes (using the literature referenced in the initial ticket)
@cbethell
Copy link
Contributor Author

I addressed @jaclyn-taroni's most recent comments in the last commit.
The html output can now be viewed here.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very close. A couple things I'd like to see changed before merging:

  • Some changes to what gets moved to the downstream plotting step and what interim products get saved in service of having more flexibility downstream.
  • Removing the copy number information for genes outside of SMARCA4 and SMARCB1 for now. There is quite a bit of information in the table and I understand this to be outside of what is detailed in Proposed Analysis: Molecularly subtype ATRT tumors #244.
  • Move to using the v11 files since we're going to get all of the v11 changes in ASAP.

file.path(
root_dir,
"data",
"snv-consensus_11122019",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you switch to using v11 files here? Might as well make the change before getting it merged since we'll prioritize getting the v11 stuff in.

Comment on lines 205 to 215
# Make a data.frame with only the expression values for ATRT samples
stranded_expression_df <- stranded_expression %>%
as.data.frame() %>%
dplyr::select(column_names$.)

# Create a correlation matrix of the expression data
initial_mat <- cor(stranded_expression_df, method = "pearson")

# Write initial matrix to file
readr::write_rds(initial_mat,
file.path(results_dir, "initial_heatmap_matrix.RDS"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this correlation step to the plotting script and consider instead saving the z-scored ATRT only matrix as part of this script -- that way you have more flexibility and can also perform something like UMAP downstream if you'd like.

)

# Filter expression data for target overexpressed genes
stranded_expression_wide_df <- stranded_expression %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving to using the collapsed file that's included in v11? Would need to change lines 78-84.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. I can make this change and commit once v11 is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit before v11 is merged provided you grab the v11 data locally (by checking out yuankunzhu:release/v11 and rerunning the download step). We can wait to merge the pull request until #293 and #303 are merged/done. I am not planning on putting anything through before that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha 👍

Comment on lines 503 to 507
final_mat <- cor(atrt_expression_mat, method = "pearson")

# Write final matrix to file
readr::write_rds(final_mat,
file.path(results_dir, "final_heatmap_matrix.RDS"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would take this file prep out as well. This is similar to what you have for the initial matrix but also includes ssGSEA scores, do I have that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. Will do.

Comment on lines 348 to 350
gene_symbol %in% tyr_genes |
gene_symbol %in% shh_genes |
gene_symbol %in% myc_genes |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there's a gene_symbol column that's collapsed in the final table and I find it confusing. I assume that this comes from this step because FOXC1 is included. Can we take these genes out of the copy number analyses altogether? I understand #244 to reference these genes in the context of expression but not copy number alterations.

SMARCA4_focal_status = paste(sort(unique(
SMARCA4_focal_status
)), collapse = ", "),
gene_symbol = paste(sort(unique(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes okay, this is consistent with my assumption above.

))

# Make the annotation data.frame a Heatmap Annotation object
initial_ha_atrt <- HeatmapAnnotation(df = initial_annotation_df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably save the data.frame instead of the output of ComplexHeatmap::HeatmapAnnotation because it's more flexible downstream and you don't need to use the ComplexHeatmap package here.

cbethell and others added 3 commits December 2, 2019 09:11
- remove step creating correlation matrix
- save annotations for upcoming heatmaps as data.frame instead of as Heatmap Annotation objects
- remove copy number information for genes outside of `SMARCA4` and `SMARCB1`
- switch to using collapsed expression file
@cbethell cbethell changed the title Molecular Subtyping - ATRT PR 1 of 2: Molecular Subtyping - ATRT Dec 2, 2019
@cbethell cbethell changed the title PR 1 of 2: Molecular Subtyping - ATRT PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) Dec 2, 2019
- z-score log transformed dat
- filter CN data
- c() identifiers for final table
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM, anything else I noticed I addressed with fe685aa

@jaclyn-taroni jaclyn-taroni merged commit 0b5ed04 into AlexsLemonade:master Dec 3, 2019
@cbethell cbethell deleted the atrt-molecular-subtyping branch January 14, 2020 14:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants