-
Notifications
You must be signed in to change notification settings - Fork 83
Proposed Analysis: Comparative RNA-Seq analysis #229
Comments
hi @GeoffLyle! Thanks for proposing this analysis! We currently have RSEM FPKM values available in our data release: |
@jharenza I believe our analysis could work with the FPKM or Kallisto TPM values. However, our current workflow is set up tot use RSEM TPM values and our learning model was trained on TPM data. Since our team is much more comfortable working with RSEM TPM, it would be great if that data could be provided. By the end of next week works perfectly well for our timeline. Thanks! |
OK, will add to our sprint, thanks! |
Hi @jharenza, I talked with our bioinformatician and she informed me that we are interested in both Transcript Level and Gene Level RSEM TPM. Would it possible to generate both of these files during your sprint? |
@GeoffLyle will do both! |
Hi @GeoffLyle I'm looking forward to these analyses. Thank you for proposing them and thanks @jharenza for your team's help with the RSEM TPM. I wanted to ask about how you were planning to split up the analyses for submission. Would each bullet point under scientific goals be it's own pull request or will some steps be grouped together? I also wanted to note that we are planning to flesh out the molecular subtype information (#19) which may be of scientific interest here. If so, the column |
@jaclyn-taroni Our submission of analyses will follow the scientific goals bullet points. Our first analysis will definitely be Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data. The Generate gene outlier thresholds for ribodeplete samples. and List outlier genes for each ribodeplete sample. may be two separate analyses or combined into one pull request depending on the speed and ease of engineering our algorithm to work with the data. The Discover trends in outlier genes by tumor subtype. and Investigate possible targeted therapeutics based on outlier expression profile. are broad goals we are still determining how to implement. Typically this analysis is done on a N-of-1 patient sample so I will have to explore the data to determine what trends there are and how best to present them. Also we have some modeling software our graduate students have been working on that we would like to adapt to the Open-PBTA data and submit as an analysis. In short, the first 3 scientific goals will be submitted as 2 or 3 pull requests. The last two scientific goals may be submitted in 2+ pull requests. Thanks for the heads up on the |
Sounds good @GeoffLyle, thank you! If you have any questions that pop up as you're implementing the correlation matrices step, please let us know. |
Hi @GeoffLyle - hopefully you have been following, but the TPM data you requested became available with release V10 (#273). Excited to see your PRs come through! |
@jharenza Thank you! Saw the V10 release. I'll let the team know about the V11 release. Should be a good test to ensure everything still runs correctly. |
@GeoffLyle ahh spoke too soon and corrected it. @jaclyn-taroni is running V11 tests currently, then we will merge once everything works properly :). |
Hi @GeoffLyle and @e-t-k, I wanted to check in. Do you have an idea of when you expect to file a pull request for the outlier thresholds and outlier gene lists? Also, if you are working with the |
Hi @jaclyn-taroni ,
I've been working on some other priorities recently, so current estimate
for this is early Feb, but I could look into making it sooner if that's
behind why you're asking.
And thank you for the heads up regarding the stranded PolyA samples; your
exclusion of them in the next release works just fine for us. @GeoffLyle
also noticed there are some cell line samples in the data set that we'll
want to exclude as well, which we were expecting to just implement as a
filtering step.
|
Hi @e-t-k, Thanks for your reply!
We are looking to wind down the analysis phase of the project and prepare the first draft of the OpenPBTA manuscript relatively soon. If you would like this analysis to be included in the first version of the manuscript, I would recommend filing a pull request as soon as its possible. There is the option of including this analysis for revisions. (For the Deep Review paper that @cgreene previously organized, they continued to add things during revisions beyond what reviewers asked for.) There also may be more lag time between when a pull request is initially filed and when it is reviewed as we move our focus to writing the manuscript. Whatever you decide works for us, but I wanted to be transparent about what the next few months of the project will look like! We'd love to include this in the first version if the timing works out.
Ah yes, sounds good. If you're looking to limit your analysis to tumor samples, the pattern that we've used throughout the project is to filter the OpenPBTA-analysis/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd Line 47 in c96bac0
RNA-Seq in experimental_strategy , the Kids_First_Biospecimen_ID will contain all the assay identifiers you're looking for. You may have gotten to the bottom of this already, but the presence of cell lines and different identifiers tripped me up when I first started working with the OpenPBTA data |
utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops.
adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration
TODO: check in a markdown viewer for typoes
* [#229] comparative rnaseq - utils bugfix utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops. * [#229] comparative rna-seq adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration * [#229] comparative-rnaseq-analysis: add README.md TODO: check in a markdown viewer for typoes * [#229] comparative-rnaseq-analysis: README cleanup * [#513] comparative rnasq--address reviewer comments Addresses @jashapiro's comments in #513 . In particular: Step 01: save normalized_expression to file which I forgot to do earlier. Step 02: - ingests normalized expression file from step 1 instead of original expression file - parameters updated to reflect this - normalize_expression() no longer needed - file path handling all moved to main() - Change output file format - each cell now contains a 4-character string in a fixed order which encodes the analysis results. - update comments at top to reflect this - Output file is now a .tsv.gz instead of .rds and is included in this pull. - no longer saves thresholds file as it's not ingested anyhwere.
Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset).
…ts tsv Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix. Includes new results file which has data identical to previous, but columns in correct order.
…s); convert scratch files to .feather (#544) * update README (TODO: proofread markdown) * implement pull 3: sample filters and general cleanup - add tsv and feather IO functions to utils - add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples - For dropped genes, note if dropped by Expression filter or Variance filter - all internal .rds files converted to .feather - all path handling in step 01 moved to main() * [#323] update CI file, dockerfile, results file for comparative-rnaseq-analysis: - update command lines in CI file; add dependency to dockerfile; add new version of outlier results tsv.gz file with sample filters applied. * [#229] tidy up README markdown. * [#229] analyses/README update comparative-rna-seq * [#229] check MEND QC files to confirm parsed ok Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset). * [#229] maintain consistent column order in outlier results tsv Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix. Includes new results file which has data identical to previous, but columns in correct order.
@jaclyn-taroni We want to add our cohort finding script to this analysis. As it uses the correlation matrix that is already created for finding the gene outlier status of samples, I believe it would fit into this analysis rather than being a separate proposed analysis. The idea is to calculate the Spearman distance between a focus sample and all other samples. If the Spearman distance is above a certain threshold (typically above the 95th percentile of all Spearman distances), those samples are added to a cohort called "First-Degree Neighbors". A second cohort called "First and Second-Degree Neighbors" contains the first degree neighbors as well as those samples first degree neighbors. A third cohort is "Diseases of top 6 samples", which is a cohort containing all samples with matching diseases as the top 6 samples with the highest Spearman score. This will involve adapting a currently used script as well as harmonizing the diseases found in the "disease_type_new" column from |
Hi @GeoffLyle, adding what you've described to
To confirm, you believe that it will run with 8GB RAM and in a reasonable amount of time in CI (10 minutes or less without some kind of output), is that correct? I wanted to point you to how to obtain the files we use for CI in case you'd like to do some further testing: https://github.com/AlexsLemonade/OpenPBTA-analysis#working-with-the-subset-files-used-in-ci-locally |
Scientific goals
Proposed methods
Note: This process is the precursor to creating a TumorMap (https://tumormap.ucsc.edu/). While programmatic creation of a TumorMap is not currently open access, we can add instructions for users on how to manually generate a TumorMap on the website.
Required input data
pbta-gene-expression-kallisto.polya.rds
pbta-gene-expression-kallisto.stranded.rds
Is it possible to get RSEM TPM values for this data?
That would make it easier for us to adapt existing algorithms to work with the Open-PBTA data.
Proposed timeline
8 weeks
Relevant literature
Method published in [Vaske et al. Jama Open Network. 2019.] (https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2753519)
The text was updated successfully, but these errors were encountered: