PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) #284

cbethell · 2019-11-20T16:38:54Z

Purpose/implementation Section

To molecularly subtype ATRT samples into SHH, MYC, and TYR.

What scientific question is your analysis addressing?

What are the samples that fit into the ATRT molecular subtypes?

What was your approach?

I used the ComplexHeatmap package to plot an initial heatmap, which can be found at plots/initial_heatmap.pdf.
I then joined the metadata, RNA expression, focal CN, the ssGSEA pathway, and tumor mutation burden data.
I then plotted a final heatmap annotated with the joined information.

A slightly more detailed plan can be found here.

What GitHub issue does your pull request address?

This PR addresses issue #244.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

I manually defined the supratentorial and infratentorial regions using values in primary_site from the metadata. I used this figure as a reference.
However, I was unclear how to classify two specific values in the primary_site column (I left a comment about these in the script).
I am not sure I produced the correlation matrix for the heatmap correctly. Some feedback here would be helpful.
I produced a table of results as outlined in this comment, but did not clearly outline which samples fit into which subtype. Are there any suggestions on how to efficiently separate these samples into subtypes and represent that in the results TSV file, using the information shown on the heatmap and in this comment?
Is there any obvious refactoring needed?

Is there anything that you want to discuss further?

I have SV data left to incorporate (this is also noted in the script).

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

This is still a work in progress, but the resulting figures and table of results can be viewed in the meantime and any relevant feedback is welcome.

Results

What types of results are included (e.g., table, figure)?

Heatmap plots:

plots/initial_heatmap.pdf
plots/final_annotated_heatmap.pdf

These plots can be viewed in display-heatmaps.md here.

Table of results (mimicking the table format in this comment):

results/ATRT_molecular_subtypes.tsv.gz

What is your summary of the results?

The results are not yet conclusive enough to identify the samples that belong to each subtype. However, the heatmap suggests that there are at least three clusters represented in the data.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

PR Checklist

Run a linter
Set the seed
Comments and/or documentation up to date
Double check your paths
Spell check any Rmd file or md file
Restart R and run all notebooks fresh and save

- add analysis to `.circleci`

- add overexpressed genes column to final results and annotation
- comment out all code after plotting the initial heatmap as this is able to run locally but fails `circleci`

…into atrt-molecular-subtyping

- separated data filtering from plotting (to include only data filtering script `ATRT-molecular-subtyping` in `.circleci`)
- Add `get_mode` custom function for plot annotations

- convert plots to png for display in `display-heatmaps.md`
- minor edit to headings in `display-heatmaps.md`

jharenza · 2019-11-22T22:24:52Z

@cbethell thanks for working on this! One quick note - I would switch to using germline_sex_estimate in case any of the reported_genders were incorrect.

- re-run plots

cbethell · 2019-11-25T14:23:12Z

@cbethell thanks for working on this! One quick note - I would switch to using germline_sex_estimate in case any of the reported_genders were incorrect.

Thank you @jharenza. I have incorporated and committed that change.

jaclyn-taroni

Hi @cbethell,

I have some general comments before digging into the code line by line.

The first thing that jumps out for me with this PR is that the final tabular format has many entries for an individual sample and this appears to be mostly due to the "long format" of the z-score column. Because this data is meant for human consumption, rather than plotting, I think presenting in wide format will be more effective. That's what I was after with the table over on #244.

I had (probably) written up that comment on #244 before I had a full understanding/appreciation of the sample_id mapping that would be required like in oncoprint-plotting here and focal-cn-file-preparation here. So it might be better to present the table as:

Kids_First_Participant_ID	sample_id	Kids_First_Biospecimen_ID	age at diagnosis (days)	reported gender	primary site	location summary	TYR expression z-score	...	...	SHH ssGSEA score (rank?)	Notch ssGSEA score (rank?)	Focal SMARCB1 status	Focal SMARCA4 status	chr22q loss	Tumor Mutation Burden (rank?)
PT_XXXXXXXX	7316-XXX	BS_XXXXXXXX, BS_XXXXXXXX	800	Female	Cerebellum/Posterior Fossa	infratentorial	4.235	...	...	25	18	loss	neutral	...
...	...	...	...	...	...	...	...	...	...	...	...	...	...

Where the biospecimen identifier column contains a collapsed list of all Kids_First_Biospecimen_ID associated with a sample_id. You can see an example of how to do this with dplyr::summarize (using different variables) here.
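As a rough illustration of the collapsing step described above, here is a minimal sketch. The toy data frame and the name `collapsed_df` are made up for this example and are not taken from the project; only the column names come from the proposed table.

```r
library(dplyr)

# Hypothetical toy data: two biospecimens map to the same sample_id
id_df <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B"),
  sample_id = c("7316-1", "7316-1", "7316-2"),
  Kids_First_Biospecimen_ID = c("BS_2", "BS_1", "BS_3"),
  stringsAsFactors = FALSE
)

# One row per sample_id, with all biospecimen identifiers pasted together
collapsed_df <- id_df %>%
  group_by(Kids_First_Participant_ID, sample_id) %>%
  summarize(
    Kids_First_Biospecimen_ID = paste(
      sort(unique(Kids_First_Biospecimen_ID)),
      collapse = ", "
    )
  ) %>%
  ungroup()
```

Sorting before pasting keeps the collapsed string stable regardless of the row order in the input.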

I have some suggestions for the values that need to be z-scored that might make this wide format easier to accomplish. I would take a slightly different approach where you subset the matrix to include only the genes/scores you are interested in (IIRC the ssGSEA matrix has gene sets as rows) and use sweep to get z-score values (example in fusion_filtering here). (Note this requires a numeric matrix, so any kind of identifier column needs to be removed first.)

So you would have:

identifier of some sort as `rownames`	BS_XXXXXXXX	BS_XXXXXXXX	BS_XXXXXXXX	...	...
TYR	0.834890	1.455240	-3.329482	...	...
MYCN	...	...	...	...	...

And you could transpose this and make the biospecimen identifiers a column:

Kids_First_Biospecimen_ID	TYR	MYCN
BS_XXXXXXXX	0.834890	...
BS_XXXXXXXX	1.455240	...
BS_XXXXXXXX	-3.329482	...
...	...	...
...	...	...
...	...	...

And now this is a lot closer to what needs to be in the wide format data frame. You would repeat this process for the ssGSEA scores.
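The subset, z-score, and transpose steps above could look something like the sketch below. It assumes the z-score is computed per gene across samples (row-wise), which the comment does not state explicitly, and the matrix values and object names (`expr_mat`, `z_df`) are invented for illustration.

```r
# Hypothetical numeric matrix: genes as rows, biospecimens as columns
expr_mat <- matrix(
  c(1, 2, 3,
    10, 20, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("TYR", "MYCN"), c("BS_1", "BS_2", "BS_3"))
)

# Row-wise z-score (an assumption): subtract each gene's mean,
# then divide by that gene's standard deviation
gene_means <- rowMeans(expr_mat)
gene_sds <- apply(expr_mat, 1, sd)
z_mat <- sweep(expr_mat, 1, gene_means, FUN = "-")
z_mat <- sweep(z_mat, 1, gene_sds, FUN = "/")

# Transpose so biospecimens are rows and make the identifiers a column
z_df <- data.frame(
  Kids_First_Biospecimen_ID = colnames(z_mat),
  t(z_mat),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

The same pattern would apply to the ssGSEA matrix once it is subset to the gene sets of interest.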

In general, I would aim to get the data from each of the required steps in a form that's close to what you want, merge sample_id to each individual data type using the biospecimen identifiers, and then merge everything together using the sample_id column. You'll most likely end up with multiple biospecimen identifier columns that you'll need to clean up this way, probably with the help of dplyr::summarize.

Since you're hardcoding the file names rather than passing them as arguments, I don't necessarily see a benefit to the data prep steps being in a script rather than a notebook that can be browsed by someone interested in how we did the subtyping later. You can also take a similar approach to displaying things with flextable as I did in analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd.

Please let me know if you have questions, there is a lot of info in there ☝️ !

jaclyn-taroni · 2019-11-25T13:34:01Z

analyses/molecular-subtyping-ATRT/ATRT-molecular-subtyping-plotting.R

+  dir.create(plots_dir)
+}
+
+# Source script with data preparation


Don't source this, use a shell script instead and number these scripts in accordance with the order that they should be run in.

jaclyn-taroni · 2019-11-25T14:09:26Z

analyses/molecular-subtyping-ATRT/ATRT-molecular-subtyping.R

+    "pbta-gene-expression-rsem-fpkm.stranded.rds"
+  ))
+
+# Read in focal CN data


This needs a TODO about what focal CN data you'll be using because we'll get consensus files #128.

…into atrt-molecular-subtyping

- convert `ATRT-molecular-subtyping-data-prep.R` script into notebook `01-ATRT-molecular-subtyping-data-prep.Rmd`
- number scripts
- add `run-molecular-subtyping-ATRT.sh` to run the script and notebook in this module
- use `flextable` to display tables in data prep notebook
- convert `display-heatmaps.md` into README.md to explain the content of this module and display the heatmaps produced
- update `.circleci` appropriately

cbethell · 2019-11-26T17:58:58Z

I made the changes suggested in @jaclyn-taroni's comment above.
The plots can now be viewed in the README.md file here.

You'll most likely end up with multiple biospecimen identifier columns that you'll need to clean up this way, probably with the help of dplyr::summarize.

Regarding the suggestion above: my attempt to use dplyr::summarize to clean up the multiple columns has not been successful so far, so this section can likely use some refactoring, which I am still working on.

jaclyn-taroni

Hi @cbethell, the steps in the data prep notebook are looking good. I had a few more comments on how to wrangle the data. I also don't think flextable is always the best way to display the data.frame in that file. You might also want to take a look at DT, which I learned about over on analyses/collapse-rnaseq here, for the final table and use the standard notebook display for interim products.

Can we split the plotting step into its own PR? As you do that, I would consider if you can do your annotation data.frame prep in the first notebook and write that to file. I would probably remove the README and the shell script for the moment as well. If you find that you can reduce the number of lines in 02-ATRT-molecular-subtyping-plotting.R by changing the approach to the annotation data.frame prep, those 3 things can potentially go into the same PR. Thanks!

jaclyn-taroni · 2019-11-27T01:56:24Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+    "Parietal Lobe;Temporal Lobe",
+    "Frontal Lobe;Parietal Lobe",
+    "Frontal Lobe;Meninges/Dura;Parietal Lobe;Spinal Cord- Cervical;Spinal Cord- Lumbar/Thecal Sac;Spinal Cord- Thoracic;Temporal Lobe;Ventricles"
+    # Not sure how the line above should be broken up as `Frontal Lobe` is


Maybe in these cases this gets coded as NA or something else and the primary_site also gets included in your final table?

Sounds good to me.
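One way the NA coding suggested here might look, as a sketch only: the two site vectors below are short, hypothetical stand-ins for the full lists in the notebook, and `location_summary` borrows its name from the proposed table rather than from the actual code.

```r
library(dplyr)

# Hypothetical, incomplete site lists; the real lists live in the notebook
supratentorial_sites <- c("Frontal Lobe", "Parietal Lobe;Temporal Lobe")
infratentorial_sites <- c("Cerebellum/Posterior Fossa")

metadata_df <- data.frame(
  sample_id = c("7316-1", "7316-2", "7316-3"),
  primary_site = c(
    "Frontal Lobe",
    "Cerebellum/Posterior Fossa",
    "Frontal Lobe;Spinal Cord- Cervical"
  ),
  stringsAsFactors = FALSE
)

# Sites that match neither list are coded as NA; primary_site is kept
# so a reader of the final table can still see the original value
location_df <- metadata_df %>%
  mutate(
    location_summary = case_when(
      primary_site %in% supratentorial_sites ~ "supratentorial",
      primary_site %in% infratentorial_sites ~ "infratentorial",
      TRUE ~ NA_character_
    )
  )
```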

jaclyn-taroni · 2019-11-27T02:03:15Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+                                            TRUE ~ "neutral"),
+    SMARCA4_focal_status = dplyr::case_when(gene_symbol == "SMARCA4" ~ status,
+                                            TRUE ~ "neutral"),
+    Overexpressed_gene_sets = dplyr::case_when(


Can you tell me a bit more about including this variable? Given the name Overexpressed_gene_sets, I would expect this to capture information about the z-score values but this appears to be related to loss status in the CN file.

Yes, you would be correct in expecting it to capture information about the z-scores. The logic here is indeed incorrect. However, I am not yet sure how to best determine a cutoff for overexpression using z-scores.

jaclyn-taroni · 2019-11-27T02:13:57Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+      TRUE ~ "NA"
+    )
+  ) %>%
+  dplyr::select(-gene_symbol) %>%


Probably drop status here too?

If included status is for display, I'd probably leave gene_symbol here and take it out at line 279.

jaclyn-taroni · 2019-11-27T02:14:14Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+# Join ATRT expression data with focal CN data
+atrt_expression_cn_df <- atrt_expression_df %>%
+  dplyr::left_join(cn_metadata, by = "sample_id") %>%
+  dplyr::select(-status)


Ah I see you drop it here.

I think somewhere in this code block you want to summarize the copy number information such that where a sample_id had both loss and gain entries for SMARCB1_focal_status it becomes gain, loss in the SMARCB1_focal_status so you don't have duplicate sample_id. (Probably also made simpler by removing the overexpressed gene set step if you don't need it.) You will probably have to ungroup and join with the other metadata after this summarization step.
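A minimal sketch of that summarization, assuming a long copy number table with one row per call. The data frames here are toy examples, not the real `cn_metadata`.

```r
library(dplyr)

# Hypothetical long-format CN calls: 7316-1 has both a loss and a gain
cn_df <- data.frame(
  sample_id = c("7316-1", "7316-1", "7316-2"),
  SMARCB1_focal_status = c("loss", "gain", "neutral"),
  stringsAsFactors = FALSE
)

# Collapse to one row per sample_id before joining anything else
cn_summary_df <- cn_df %>%
  group_by(sample_id) %>%
  summarize(
    SMARCB1_focal_status = paste(
      sort(unique(SMARCB1_focal_status)),
      collapse = ", "
    )
  ) %>%
  ungroup()

# Hypothetical metadata to join after the summarization step
metadata_df <- data.frame(
  sample_id = c("7316-1", "7316-2"),
  primary_site = c("Frontal Lobe", "Cerebellum/Posterior Fossa"),
  stringsAsFactors = FALSE
)

joined_df <- metadata_df %>%
  left_join(cn_summary_df, by = "sample_id")
```

Because the CN data has a single row per `sample_id` before the join, the join itself cannot introduce duplicate rows.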

jaclyn-taroni · 2019-11-27T02:24:54Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+                Kids_First_Participant_ID,
+                biospecimen_id,
+                status) %>%
+  dplyr::filter(sample_id %in% atrt_expression_df$sample_id) %>%


I would also filter here to the relevant gene_symbol.

jaclyn-taroni · 2019-11-27T02:41:04Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+
+```{r}
+# Join tumor mutuation data with metadata
+tmb_df <- tmb_df %>%


Are you using the subset files here by any chance? Is that why all of tmb is NA in the final table?

Regarding this point, you can delete the data/snv-consensus_11122019 folder you have locally and rerun bash download-data.sh or you could check out the branch over on #295 which allows you to overwrite the testing files I suspect you have locally.

You are correct.

jaclyn-taroni · 2019-11-27T02:43:29Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+  HALLMARK_MYC_TARGETS_V1 = mean(HALLMARK_MYC_TARGETS_V1),
+  HALLMARK_MYC_TARGETS_V2 = mean(HALLMARK_MYC_TARGETS_V2),
+  HALLMARK_NOTCH_SIGNALLING = mean(HALLMARK_NOTCH_SIGNALING),
+  tmb = mean(tmb),


I assume that you have to do all of these mean steps because you have duplicate sample identifiers. If that's correct, I'd try summarizing the CN data before joining it with the other data.

jaclyn-taroni · 2019-11-27T02:44:29Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+# Save final data.frame
+final_df <- atrt_expression_cn_tmb_df %>%
+  dplyr::group_by(sample_id) %>%
+  dplyr::summarise(Kids_First_Biospecimen_ID = paste(sort(unique(


Try pasting the entries in the multiple Kids_First_Biospecimen_ID.* columns together into a new column with mutate ?
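For what it's worth, a sketch of the mutate approach. It assumes the duplicate columns carry dplyr's default join suffixes (`.x` and `.y`); the actual column names in the notebook may differ, and the toy data is invented.

```r
library(dplyr)

# Hypothetical result of a join that left two identifier columns behind
joined_df <- data.frame(
  sample_id = c("7316-1", "7316-2"),
  Kids_First_Biospecimen_ID.x = c("BS_1", "BS_3"),
  Kids_First_Biospecimen_ID.y = c("BS_2", NA),
  stringsAsFactors = FALSE
)

# Paste the two columns into one, skipping the paste when .y is missing
tidy_df <- joined_df %>%
  mutate(
    Kids_First_Biospecimen_ID = if_else(
      is.na(Kids_First_Biospecimen_ID.y),
      Kids_First_Biospecimen_ID.x,
      paste(
        Kids_First_Biospecimen_ID.x,
        Kids_First_Biospecimen_ID.y,
        sep = ", "
      )
    )
  ) %>%
  select(-Kids_First_Biospecimen_ID.x, -Kids_First_Biospecimen_ID.y)
```

The `is.na()` guard avoids ending up with a literal "NA" in the pasted string.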

- remove plotting script, shell script and `README.md` from this PR
- remove plots from this PR
- adjust the way the tables are displayed in the html output
- refactor the way some of the data are summarized
- remove `Overexpressed_gene_sets` column
- improve list of target genes (using the literature referenced in the initial ticket)

cbethell · 2019-11-27T17:58:30Z

I addressed @jaclyn-taroni's most recent comments in the last commit.
The html output can now be viewed here.

…into atrt-molecular-subtyping

jaclyn-taroni

This is very close. A couple things I'd like to see changed before merging:

Some changes to what gets moved to the downstream plotting step and what interim products get saved in service of having more flexibility downstream.
Removing the copy number information for genes outside of SMARCA4 and SMARCB1 for now. There is quite a bit of information in the table and I understand this to be outside of what is detailed in Proposed Analysis: Molecularly subtype ATRT tumors #244.
Move to using the v11 files since we're going to get all of the v11 changes in ASAP.

jaclyn-taroni · 2019-12-01T17:11:28Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+    file.path(
+      root_dir,
+      "data",
+      "snv-consensus_11122019",


Can you switch to using v11 files here? Might as well make the change before getting it merged since we'll prioritize getting the v11 stuff in.

jaclyn-taroni · 2019-12-01T17:18:24Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+# Make a data.frame with only the expression values for ATRT samples
+stranded_expression_df <- stranded_expression %>%
+  as.data.frame() %>%
+  dplyr::select(column_names$.)
+
+# Create a correlation matrix of the expression data
+initial_mat <- cor(stranded_expression_df, method = "pearson")
+
+# Write initial matrix to file
+readr::write_rds(initial_mat,
+                file.path(results_dir, "initial_heatmap_matrix.RDS"))


I would move this correlation step to the plotting script and consider instead saving the z-scored ATRT only matrix as part of this script -- that way you have more flexibility and can also perform something like UMAP downstream if you'd like.
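To make the split concrete, a small sketch: the data prep step writes the z-scored matrix, and the downstream step reads it back and computes the correlation. The matrix is random toy data, the file is a temporary file, and base `saveRDS`/`readRDS` are used here in place of the `readr::write_rds` call in the diff.

```r
set.seed(2019)

# Hypothetical z-scored matrix: genes as rows, ATRT biospecimens as columns
z_mat <- matrix(
  rnorm(20),
  nrow = 5,
  dimnames = list(paste0("gene_", 1:5), paste0("BS_", 1:4))
)

# Data prep step: save the matrix itself, not the correlation
rds_file <- tempfile(fileext = ".RDS")
saveRDS(z_mat, rds_file)

# Plotting step (downstream): read it back and compute what is needed
z_mat_in <- readRDS(rds_file)
cor_mat <- cor(z_mat_in, method = "pearson")
```

Keeping the matrix as the saved product means the same file could feed a heatmap, UMAP, or anything else later on.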

jaclyn-taroni · 2019-12-01T17:20:03Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+  )
+
+# Filter expression data for target overexpressed genes
+stranded_expression_wide_df <- stranded_expression %>%


What do you think about moving to using the collapsed file that's included in v11? Would need to change lines 78-84.

I like this idea. I can make this change and commit once v11 is merged.

You can commit before v11 is merged provided you grab the v11 data locally (by checking out yuankunzhu:release/v11 and rerunning the download step). We can wait to merge the pull request until #293 and #303 are merged/done. I am not planning on putting anything through before that.

Gotcha 👍

jaclyn-taroni · 2019-12-01T17:22:34Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+final_mat <- cor(atrt_expression_mat, method = "pearson")
+
+# Write final matrix to file
+readr::write_rds(final_mat,
+                file.path(results_dir, "final_heatmap_matrix.RDS"))


I would take this file prep out as well. This is similar to what you have for the initial matrix but also includes ssGSEA scores, do I have that right?

Yes, you are correct. Will do.

jaclyn-taroni · 2019-12-01T17:25:31Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+    gene_symbol %in% tyr_genes |
+      gene_symbol %in% shh_genes |
+      gene_symbol %in% myc_genes |


I see there's a gene_symbol column that's collapsed in the final table and I find it confusing. I assume that this comes from this step because FOXC1 is included. Can we take these genes out of the copy number analyses altogether? I understand #244 to reference these genes in the context of expression but not copy number alterations.

jaclyn-taroni · 2019-12-01T18:25:55Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+    SMARCA4_focal_status = paste(sort(unique(
+      SMARCA4_focal_status
+    )), collapse = ", "),
+    gene_symbol = paste(sort(unique(


Ah yes okay, this is consistent with my assumption above.

jaclyn-taroni · 2019-12-01T18:40:00Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+  ))
+
+# Make the annotation data.frame a Heatmap Annotation object
+initial_ha_atrt <- HeatmapAnnotation(df = initial_annotation_df)


I would probably save the data.frame instead of the output of ComplexHeatmap::HeatmapAnnotation because it's more flexible downstream and you don't need to use the ComplexHeatmap package here.

…into atrt-molecular-subtyping

- remove step creating correlation matrix
- save annotations for upcoming heatmaps as data.frame instead of as Heatmap Annotation objects
- remove copy number information for genes outside of `SMARCA4` and `SMARCB1`
- switch to using collapsed expression file

…btyping

- z-score log transformed data
- filter CN data
- c() identifiers for final table

jaclyn-taroni

👍 LGTM, anything else I noticed I addressed with fe685aa

cbethell added 6 commits November 20, 2019 10:47

Add ATRT-molecular-subtyping.R script

99afccf

- add analysis to `.circleci`

Add markdown displaying heatmaps

82eac34

- add overexpressed genes column to final results and annotation
- comment out all code after plotting the initial heatmap as this is able to run locally but fails `circleci`

Update file paths

fc73a44

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

e6724b2

…into atrt-molecular-subtyping

Add ATRT-molecular-subtyping-plotting.R

9331063

- separated data filtering from plotting (to include only data filtering script `ATRT-molecular-subtyping` in `.circleci`)
- Add `get_mode` custom function for plot annotations

Convert pdf plots to png plots

8cfc145

- convert plots to png for display in `display-heatmaps.md`
- minor edit to headings in `display-heatmaps.md`

cbethell marked this pull request as ready for review November 22, 2019 21:05

cbethell changed the title ~~WIP: Molecular Subtyping - ATRT~~ Molecular Subtyping - ATRT Nov 22, 2019

cbethell requested a review from jaclyn-taroni November 22, 2019 21:05

Change selection for reported_gender to germline_sex_estimate

db56fa0

- re-run plots

jaclyn-taroni reviewed Nov 25, 2019

cbethell added 2 commits November 26, 2019 12:18

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

b6983a8

…into atrt-molecular-subtyping

cbethell added 2 commits November 26, 2019 13:09

Remove log2 transformation of ssGSEA data

47b9de8

Update comment and remove unused custom function in 02 plotting script

6128430

jaclyn-taroni reviewed Nov 27, 2019

jaclyn-taroni mentioned this pull request Nov 27, 2019

Fix: use unzip -o in download script #295

Merged

cbethell added 2 commits November 28, 2019 10:22

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

b986565

…into atrt-molecular-subtyping

Load in ComplexHeatmap library and add sessionInfo

3679217

jaclyn-taroni reviewed Dec 1, 2019

cbethell and others added 3 commits December 2, 2019 09:11

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

1238c90

…into atrt-molecular-subtyping

Use v11 files

4e19a78

- remove step creating correlation matrix
- save annotations for upcoming heatmaps as data.frame instead of as Heatmap Annotation objects
- remove copy number information for genes outside of `SMARCA4` and `SMARCB1`
- switch to using collapsed expression file

Merge branch 'master' into atrt-molecular-subtyping

130faa3

cbethell changed the title ~~Molecular Subtyping - ATRT~~ PR 1 of 2: Molecular Subtyping - ATRT Dec 2, 2019

cbethell changed the title ~~PR 1 of 2: Molecular Subtyping - ATRT~~ PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) Dec 2, 2019

cbethell mentioned this pull request Dec 2, 2019

PR 2 of 2: Molecular Subtyping - ATRT (Plotting) #306

Merged

2 tasks

jaclyn-taroni added 2 commits December 2, 2019 19:42

Merge remote-tracking branch 'upstream/master' into atrt-molecular-su…

ba4936a

…btyping

Various fixes; rerun

fe685aa

- z-score log transformed data
- filter CN data
- c() identifiers for final table

jaclyn-taroni approved these changes Dec 3, 2019

jaclyn-taroni merged commit 0b5ed04 into AlexsLemonade:master Dec 3, 2019

cbethell deleted the atrt-molecular-subtyping branch January 14, 2020 14:26
