-
Notifications
You must be signed in to change notification settings - Fork 83
add updated collapsed rnaseq scripts #291
add updated collapsed rnaseq scripts #291
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this analysis!
I was surprised at first that there were such low correlations, but I would guess that those are for genes that are very lowly expressed, and therefore most of the variation we are seeing in them is noise. I looked briefly to see if there was much difference if we switched to Spearman correlation, and it might improve some correlations a bit, but not a drastic change.
My one major comment is about the location of the output files. While we will be including these files in the data release which will put them in the data/
folder, we don't really want analysis scripts to write there directly, so as to preserve that as a "protected" location for standardized files. If you write the output files to the analysis folder (or a subfolder, as discussed here), you don't necessarily have to include all of them in the tracked repo (either by just not adding them or making a .gitignore
file in the analysis folder).
On a related note: the analyses/collapse-rnaseq/
folder still contains what appear to be the old versions of pbta-gene-expression-rsem-fpkm-collapsed-table.*
files? If so, it would be good to update those with the latest versions in this PR. You can also delete the gtf file from this folder, as are now using the version in the data/
folder.
Let me know if that all makes sense to you. Thanks again for your contributions!
.circleci/config.yml
Outdated
@@ -28,15 +28,15 @@ jobs: | |||
|
|||
- run: | |||
name: Collapse RSEM (polyA RNA-seq) | |||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.polya.rds | |||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds | |
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds |
We prefer that scripts not write to data/
directly, as that can introduce inconsistencies between the versions of these files that different people are using as releases occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this analysis!
I was surprised at first that there were such low correlations, but I would guess that those are for genes that are very lowly expressed, and therefore most of the variation we are seeing in them is noise. I looked briefly to see if there was much difference if we switched to Spearman correlation, and it might improve some correlations a bit, but not a drastic change.
My one major comment is about the location of the output files. While we will be including these files in the data release whcih will put them in the
data/
folder, we don't really want analysis scripts to write there directly, so as to preserve that a "protected" location for standardized files. If you write the output files to the analysis folder (or a subfolder, as discussed here), you don't necessarily have to include all of them in the tracked repo (either by just not adding them or making a.gitignore
file in the analysis folder).On a related note: the
analyses/collapse-rnaseq/
folder still contains what appear to be the old versions ofpbta-gene-expression-rsem-fpkm-collapsed-table.*
files? If so, it would be good to update those with the latest versions in this PR. You can also delete the gtf file from this folder, as are now using the version in thedata/
folder.Let me know if that all makes sense to you. Thanks again for your contributions!
@jashapiro Thank you for getting back so quickly, and makes sense regarding not writing out to data/
, will make those changes to config.yml and 01-summarize_matrices.R.
.circleci/config.yml
Outdated
|
||
- run: | ||
name: Collapse RSEM (stranded RNA-seq) | ||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.stranded.rds | ||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds | |
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds |
# Rscript analyses/collapse-rnaseq/01-summarize_matrices.R \ | ||
# -i data/pbta-gene-expression-rsem-fpkm.polya.rds \ | ||
# -g data/gencode.v27.primary_assembly.annotation.gtf.gz \ | ||
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \ | |
# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \ |
# Rscript analyses/collapse-rnaseq/01-summarize_matrices.R \ | ||
# -i data/pbta-gene-expression-rsem-fpkm.stranded.rds \ | ||
# -g data/gencode.v27.primary_assembly.annotation.gtf.gz \ | ||
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \ | |
# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The methods look good. Lets get it merged!
Before merging, I'm going to remove the old collapse table files and commit the new ones and remove the GTF file to avoid any confusion. |
Purpose/implementation Section
What scientific question is your analysis addressing?
Is there any correlation between the Ensembl ids that are assigned the same gene symbol?
With regards to: #198 (comment)
What was your approach?
To create a 1-1 relationship between the gene symbol and Ensembl gene identifier, we needed to pick one identifier out of the many that best represented the expression of the gene. Here we do that by choosing the identifier with the maximum mean FPKM across the entire PBTA cohort.
For correlation analysis corresponding to each multi-mapped gene symbol, we computed the Pearson correlation coefficient between the expressed gene identifier that was kept (after collapsing) vs each expressed gene identifier that was discarded. We then took an average of all those coefficients and added to the drops table in the row corresponding to the expressed gene identifier that was kept.
What GitHub issue does your pull request address?
Not an issue but a discussion under this pull request: #198
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is there anything that you want to discuss further?
Any ideas to improve the analysis are welcomed.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
01-summarize_matrices.R creates two tables (one each for polyA and stranded data) in RDS format and 02-analyze-drops.Rmd outputs the tables in markdown format.
What is your summary of the results?
I did not find any striking patterns of correlations between the duplicates, the correlation coefficients were as high as 0.92 in stranded and 1.00 in polyA and as low as -0.26 in stranded and -0.19 in polyA. There were also no biotype specific patterns.
Reproducibility Checklist