Proposed Analysis: Comparative RNA-Seq analysis #229

GeoffLyle · 2019-11-05T00:30:20Z

Scientific goals

Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data.
Generate gene outlier thresholds for ribodeplete samples.
List outlier genes for each ribodeplete sample.
Discover trends in outlier genes by tumor subtype.
Investigate possible targeted therapeutics based on outlier expression profile.

Proposed methods

The correlations will be calculated using pairwise Spearman correlation of the RNA-Seq gene expression profiles.
The gene expression profile data will be filtered using a developed method described in Vaske et al.
Note: This process is the precursor to creating a TumorMap (https://tumormap.ucsc.edu/). While programmatic creation of a TumorMap is not currently open access, we can add instructions for users on how to manually generate a TumorMap on the website.
Outlier thresholds will be calculated according to the Tukey method.
Discovery of outlier genes and possible targeted therapeutics will be based on the Treehouse CARE method. https://github.com/UCSC-Treehouse/CARE
- The CARE method will be used to find outliers using the entire population as a cohort as well as subsamples from the population to disentangle sample specific outliers from tumor subtype outliers.
Application of novel transfer learning method trained on CCLE response data to Open-PBTA samples will predict response to therapeutics.

Required input data

pbta-gene-expression-kallisto.polya.rds
pbta-gene-expression-kallisto.stranded.rds

Is it possible to get RSEM TPM values for this data?
That would make it easier for us to adapt existing algorithms to work with the Open-PBTA data.

Proposed timeline

8 weeks

Relevant literature

Method published in [Vaske et al. Jama Open Network. 2019.] (https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2753519)

The text was updated successfully, but these errors were encountered:

jharenza · 2019-11-05T13:28:31Z

hi @GeoffLyle! Thanks for proposing this analysis! We currently have RSEM FPKM values available in our data release: pbta-gene-expression-rsem-fpkm.polya.rds and pbta-gene-expression-rsem-fpkm.stranded.rds. Since FPKM and TPM from RSEM are highly correlated, will this do, or does your analysis require TPM? If the latter, I can file an issue to add the TPM data during our next sprint, which starts Nov. 13 and the data could be expected sometime around the end of next week.

GeoffLyle · 2019-11-05T20:07:41Z

@jharenza I believe our analysis could work with the FPKM or Kallisto TPM values. However, our current workflow is set up tot use RSEM TPM values and our learning model was trained on TPM data. Since our team is much more comfortable working with RSEM TPM, it would be great if that data could be provided. By the end of next week works perfectly well for our timeline. Thanks!

jharenza · 2019-11-05T20:54:41Z

OK, will add to our sprint, thanks!

GeoffLyle · 2019-11-06T18:45:22Z

Hi @jharenza, I talked with our bioinformatician and she informed me that we are interested in both Transcript Level and Gene Level RSEM TPM. Would it possible to generate both of these files during your sprint?

jharenza · 2019-11-07T00:17:42Z

@GeoffLyle will do both!

jaclyn-taroni · 2019-11-07T20:32:49Z

Hi @GeoffLyle I'm looking forward to these analyses. Thank you for proposing them and thanks @jharenza for your team's help with the RSEM TPM.

I wanted to ask about how you were planning to split up the analyses for submission. Would each bullet point under scientific goals be it's own pull request or will some steps be grouped together?

I also wanted to note that we are planning to flesh out the molecular subtype information (#19) which may be of scientific interest here. If so, the column molecular_subtype in the pbta-histologies.tsv that is included in the data download would be the one that gets filled in with more information and is therefore the one you may want use during development.

GeoffLyle · 2019-11-07T23:25:49Z

@jaclyn-taroni Our submission of analyses will follow the scientific goals bullet points. Our first analysis will definitely be Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data. The Generate gene outlier thresholds for ribodeplete samples. and List outlier genes for each ribodeplete sample. may be two separate analyses or combined into one pull request depending on the speed and ease of engineering our algorithm to work with the data. The Discover trends in outlier genes by tumor subtype. and Investigate possible targeted therapeutics based on outlier expression profile. are broad goals we are still determining how to implement. Typically this analysis is done on a N-of-1 patient sample so I will have to explore the data to determine what trends there are and how best to present them. Also we have some modeling software our graduate students have been working on that we would like to adapt to the Open-PBTA data and submit as an analysis.

In short, the first 3 scientific goals will be submitted as 2 or 3 pull requests. The last two scientific goals may be submitted in 2+ pull requests.

Thanks for the heads up on the pbta-histologies.tsv update. We planned to look into differences by disease and histology. It would be interesting to see if there is a signal when grouping by molecular subtype.

jaclyn-taroni · 2019-11-08T13:05:26Z

Sounds good @GeoffLyle, thank you! If you have any questions that pop up as you're implementing the correlation matrices step, please let us know.

jharenza · 2019-12-02T22:14:42Z

Hi @GeoffLyle - hopefully you have been following, but the TPM data you requested became available with release V10 (#273). Excited to see your PRs come through!

GeoffLyle · 2019-12-02T22:18:53Z

@jharenza Thank you! Saw the V10 release. I'll let the team know about the V11 release. Should be a good test to ensure everything still runs correctly.

jharenza · 2019-12-02T22:19:55Z

@GeoffLyle ahh spoke too soon and corrected it. @jaclyn-taroni is running V11 tests currently, then we will merge once everything works properly :).

jaclyn-taroni · 2020-01-14T22:52:04Z

Hi @GeoffLyle and @e-t-k, I wanted to check in. Do you have an idea of when you expect to file a pull request for the outlier thresholds and outlier gene lists? Also, if you are working with the release-v12-20191217 data, I wanted to make you aware that there are stranded poly-A samples in the stranded dataset files, which are mostly comprised of ribodepleted samples (see #374) though this very well may have been apparent from your results. Those samples will come out in the v13 release, which is planned for next week (#373).

e-t-k · 2020-01-15T07:29:46Z

Hi @jaclyn-taroni , I've been working on some other priorities recently, so current estimate for this is early Feb, but I could look into making it sooner if that's behind why you're asking. And thank you for the heads up regarding the stranded PolyA samples; your exclusion of them in the next release works just fine for us. @GeoffLyle also noticed there are some cell line samples in the data set that we'll want to exclude as well, which we were expecting to just implement as a filtering step.

jaclyn-taroni · 2020-01-18T17:13:03Z

Hi @e-t-k,

Thanks for your reply!

I've been working on some other priorities recently, so current estimate for this is early Feb, but I could look into making it sooner if that's behind why you're asking.

We are looking to wind down the analysis phase of the project and prepare the first draft of the OpenPBTA manuscript relatively soon.

If you would like this analysis to be included in the first version of the manuscript, I would recommend filing a pull request as soon as its possible. There is the option of including this analysis for revisions. (For the Deep Review paper that @cgreene previously organized, they continued to add things during revisions beyond what reviewers asked for.)

There also may be more lag time between when a pull request is initially filed and when it is reviewed as we move our focus to writing the manuscript.

Whatever you decide works for us, but I wanted to be transparent about what the next few months of the project will look like! We'd love to include this in the first version if the timing works out.

And thank you for the heads up regarding the stranded PolyA samples; your exclusion of them in the next release works just fine for us.

release-v13-20200116 went out yesterday (#444), so if you're up to date with the AlexsLemonade master branch you'll be able to snag that by rerunning the download bash script.

@GeoffLyle also noticed there are some cell line samples in the data set that we'll want to exclude as well, which we were expecting to just implement as a filtering step.

Ah yes, sounds good. If you're looking to limit your analysis to tumor samples, the pattern that we've used throughout the project is to filter the pbta-histologies.tsv file to rows where sample_type is Tumor and composition is Solid Tissue. (Here's an example in R:

OpenPBTA-analysis/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd

Line 47 in c96bac0

dplyr::filter(sample_type == "Tumor",

.) If you further filter such that all rows are RNA-Seq in experimental_strategy, the Kids_First_Biospecimen_ID will contain all the assay identifiers you're looking for. You may have gotten to the bottom of this already, but the presence of cell lines and different identifiers tripped me up when I first started working with the OpenPBTA data ☺️

utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops.

adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration

TODO: check in a markdown viewer for typoes

@jashapiro

* [#229] comparative rnaseq - utils bugfix utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops. * [#229] comparative rna-seq adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration * [#229] comparative-rnaseq-analysis: add README.md TODO: check in a markdown viewer for typoes * [#229] comparative-rnaseq-analysis: README cleanup * [#513] comparative rnasq--address reviewer comments Addresses @jashapiro's comments in #513 . In particular: Step 01: save normalized_expression to file which I forgot to do earlier. Step 02: - ingests normalized expression file from step 1 instead of original expression file - parameters updated to reflect this - normalize_expression() no longer needed - file path handling all moved to main() - Change output file format - each cell now contains a 4-character string in a fixed order which encodes the analysis results. - update comments at top to reflect this - Output file is now a .tsv.gz instead of .rds and is included in this pull. - no longer saves thresholds file as it's not ingested anyhwere.

Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset).

…ts tsv Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix. Includes new results file which has data identical to previous, but columns in correct order.

…s); convert scratch files to .feather (#544) * update README (TODO: proofread markdown) * implement pull 3: sample filters and general cleanup - add tsv and feather IO functions to utils - add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples - For dropped genes, note if dropped by Expression filter or Variance filter - all internal .rds files converted to .feather - all path handling in step 01 moved to main() * [#323] update CI file, dockerfile, results file for comparative-rnaseq-analysis: - update command lines in CI file; add dependency to dockerfile; add new version of outlier results tsv.gz file with sample filters applied. * [#229] tidy up README markdown. * [#229] analyses/README update comparative-rna-seq * [#229] check MEND QC files to confirm parsed ok Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset). * [#229] maintain consistent column order in outlier results tsv Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix. Includes new results file which has data identical to previous, but columns in correct order.

GeoffLyle · 2020-07-14T23:31:21Z

@jaclyn-taroni We want to add our cohort finding script to this analysis. As it uses the correlation matrix that is already created for finding the gene outlier status of samples, I believe it would fit into this analysis rather than being a separate proposed analysis.

The idea is to calculate the Spearman distance between a focus sample and all other samples. If the Spearman distance is above a certain threshold (typically above the 95th percentile of all Spearman distances), those samples are added to a cohort called "First-Degree Neighbors". A second cohort called "First and Second-Degree Neighbors" contains the first degree neighbors as well as those samples first degree neighbors. A third cohort is "Diseases of top 6 samples", which is a cohort containing all samples with matching diseases as the top 6 samples with the highest Spearman score.

This will involve adapting a currently used script as well as harmonizing the diseases found in the "disease_type_new" column from pbta_histologies.tsv. We believe this level of analysis will not cause any issues with the CI.

jaclyn-taroni · 2020-07-15T00:19:29Z

Hi @GeoffLyle, adding what you've described to analyses/comparative-RNASeq-analysis/ makes sense to me!

We believe this level of analysis will not cause any issues with the CI.

To confirm, you believe that it will run with 8GB RAM and in a reasonable amount of time in CI (10 minutes or less without some kind of output), is that correct?

I wanted to point you to how to obtain the files we use for CI in case you'd like to do some further testing: https://github.com/AlexsLemonade/OpenPBTA-analysis#working-with-the-subset-files-used-in-ci-locally

GeoffLyle added the proposed analysis label Nov 5, 2019

jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Nov 8, 2019

jaclyn-taroni mentioned this issue Nov 10, 2019

Planned data release: v10 #254

Closed

yuankunzhu mentioned this issue Nov 15, 2019

V10 Release #273

Merged

e-t-k mentioned this issue Dec 9, 2019

Create correlation matrices for RSEM TPM polya and stranded datasets #323

Merged

2 tasks

hbeale mentioned this issue Dec 16, 2019

Possible to run docker in cavatica on openPBTA read data? #341

Closed

jaclyn-taroni added the transcriptomic Related to or requires transcriptomic data label Jan 18, 2020

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020

[AlexsLemonade#229] comparative rnaseq - utils bugfix

1104b61

utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops.

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020

[AlexsLemonade#229] comparative rna-seq

75532bf

adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020

[AlexsLemonade#229] comparative-rnaseq-analysis: add README.md

ec77eb1

TODO: check in a markdown viewer for typoes

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020

[AlexsLemonade#229] comparative-rnaseq-analysis: README cleanup

00337b0

e-t-k mentioned this issue Feb 5, 2020

Generate outlier thresholds #513

Merged

5 tasks

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 19, 2020

[AlexsLemonade#229] tidy up README markdown.

4647ac3

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 19, 2020

[AlexsLemonade#229] analyses/README update comparative-rna-seq

9890c02

e-t-k mentioned this issue Feb 19, 2020

(Comparative RNA-Seq Analysis) Apply sample filters (QC + tumor status); convert scratch files to .feather #544

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed Analysis: Comparative RNA-Seq analysis #229

Proposed Analysis: Comparative RNA-Seq analysis #229

GeoffLyle commented Nov 5, 2019

jharenza commented Nov 5, 2019

GeoffLyle commented Nov 5, 2019

jharenza commented Nov 5, 2019

GeoffLyle commented Nov 6, 2019

jharenza commented Nov 7, 2019

jaclyn-taroni commented Nov 7, 2019

GeoffLyle commented Nov 7, 2019

jaclyn-taroni commented Nov 8, 2019

jharenza commented Dec 2, 2019 •

edited

Loading

GeoffLyle commented Dec 2, 2019

jharenza commented Dec 2, 2019

jaclyn-taroni commented Jan 14, 2020

e-t-k commented Jan 15, 2020 via email

jaclyn-taroni commented Jan 18, 2020

GeoffLyle commented Jul 14, 2020

jaclyn-taroni commented Jul 15, 2020

Proposed Analysis: Comparative RNA-Seq analysis #229

Proposed Analysis: Comparative RNA-Seq analysis #229

Comments

GeoffLyle commented Nov 5, 2019

Scientific goals

Proposed methods

Required input data

Proposed timeline

Relevant literature

jharenza commented Nov 5, 2019

GeoffLyle commented Nov 5, 2019

jharenza commented Nov 5, 2019

GeoffLyle commented Nov 6, 2019

jharenza commented Nov 7, 2019

jaclyn-taroni commented Nov 7, 2019

GeoffLyle commented Nov 7, 2019

jaclyn-taroni commented Nov 8, 2019

jharenza commented Dec 2, 2019 • edited Loading

GeoffLyle commented Dec 2, 2019

jharenza commented Dec 2, 2019

jaclyn-taroni commented Jan 14, 2020

e-t-k commented Jan 15, 2020 via email

jaclyn-taroni commented Jan 18, 2020

GeoffLyle commented Jul 14, 2020

jaclyn-taroni commented Jul 15, 2020

jharenza commented Dec 2, 2019 •

edited

Loading