add updated collapsed rnaseq scripts #291

komalsrathi · 2019-11-26T18:35:44Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Is there any correlation between the Ensembl ids that are assigned the same gene symbol?
With regards to: #198 (comment)

What was your approach?

To create a 1-1 relationship between the gene symbol and Ensembl gene identifier, we needed to pick one identifier out of the many that best represented the expression of the gene. Here we do that by choosing the identifier with the maximum mean FPKM across the entire PBTA cohort.

For correlation analysis corresponding to each multi-mapped gene symbol, we computed the Pearson correlation coefficient between the expressed gene identifier that was kept (after collapsing) vs each expressed gene identifier that was discarded. We then took an average of all those coefficients and added to the drops table in the row corresponding to the expressed gene identifier that was kept.

What GitHub issue does your pull request address?

Not an issue but a discussion under this pull request: #198

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Any ideas to improve the analysis are welcomed.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

01-summarize_matrices.R creates two tables (one each for polyA and stranded data) in RDS format and 02-analyze-drops.Rmd outputs the tables in markdown format.

What is your summary of the results?

I did not find any striking patterns of correlations between the duplicates, the correlation coefficients were as high as 0.92 in stranded and 1.00 in polyA and as low as -0.26 in stranded and -0.19 in polyA. There were also no biotype specific patterns.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

jashapiro

Thank you for adding this analysis!

I was surprised at first that there were such low correlations, but I would guess that those are for genes that are very lowly expressed, and therefore most of the variation we are seeing in them is noise. I looked briefly to see if there was much difference if we switched to Spearman correlation, and it might improve some correlations a bit, but not a drastic change.

My one major comment is about the location of the output files. While we will be including these files in the data release which will put them in the data/ folder, we don't really want analysis scripts to write there directly, so as to preserve that as a "protected" location for standardized files. If you write the output files to the analysis folder (or a subfolder, as discussed here), you don't necessarily have to include all of them in the tracked repo (either by just not adding them or making a .gitignore file in the analysis folder).

On a related note: the analyses/collapse-rnaseq/ folder still contains what appear to be the old versions of pbta-gene-expression-rsem-fpkm-collapsed-table.* files? If so, it would be good to update those with the latest versions in this PR. You can also delete the gtf file from this folder, as are now using the version in the data/ folder.

Let me know if that all makes sense to you. Thanks again for your contributions!

jashapiro · 2019-11-26T20:38:56Z

.circleci/config.yml

@@ -28,15 +28,15 @@ jobs:

      - run:
          name: Collapse RSEM (polyA RNA-seq)
-          command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.polya.rds
+          command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds


Suggested change

command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds

command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds

We prefer that scripts not write to data/ directly, as that can introduce inconsistencies between the versions of these files that different people are using as releases occur.

Thank you for adding this analysis!

I was surprised at first that there were such low correlations, but I would guess that those are for genes that are very lowly expressed, and therefore most of the variation we are seeing in them is noise. I looked briefly to see if there was much difference if we switched to Spearman correlation, and it might improve some correlations a bit, but not a drastic change.

My one major comment is about the location of the output files. While we will be including these files in the data release whcih will put them in the data/ folder, we don't really want analysis scripts to write there directly, so as to preserve that a "protected" location for standardized files. If you write the output files to the analysis folder (or a subfolder, as discussed here), you don't necessarily have to include all of them in the tracked repo (either by just not adding them or making a .gitignore file in the analysis folder).

On a related note: the analyses/collapse-rnaseq/ folder still contains what appear to be the old versions of pbta-gene-expression-rsem-fpkm-collapsed-table.* files? If so, it would be good to update those with the latest versions in this PR. You can also delete the gtf file from this folder, as are now using the version in the data/ folder.

Let me know if that all makes sense to you. Thanks again for your contributions!

@jashapiro Thank you for getting back so quickly, and makes sense regarding not writing out to data/, will make those changes to config.yml and 01-summarize_matrices.R.

jashapiro · 2019-11-26T20:50:12Z

.circleci/config.yml


      - run:
          name: Collapse RSEM (stranded RNA-seq)
-          command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.stranded.rds
+          command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds


Suggested change

command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds

command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds

jashapiro · 2019-11-26T20:51:28Z

analyses/collapse-rnaseq/01-summarize_matrices.R

+# Rscript analyses/collapse-rnaseq/01-summarize_matrices.R \
+# -i data/pbta-gene-expression-rsem-fpkm.polya.rds \
+# -g data/gencode.v27.primary_assembly.annotation.gtf.gz \
+# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \


Suggested change

# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \

# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \

jashapiro · 2019-11-26T20:51:43Z

analyses/collapse-rnaseq/01-summarize_matrices.R

+# Rscript analyses/collapse-rnaseq/01-summarize_matrices.R \
+# -i data/pbta-gene-expression-rsem-fpkm.stranded.rds \
+# -g data/gencode.v27.primary_assembly.annotation.gtf.gz \
+# -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \


Suggested change

# -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \

# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \

jashapiro

The methods look good. Lets get it merged!

jaclyn-taroni · 2019-12-06T15:45:28Z

Before merging, I'm going to remove the old collapse table files and commit the new ones and remove the GTF file to avoid any confusion.

Now included in download script

add updated collapsed rnaseq scripts

be876a2

jashapiro reviewed Nov 26, 2019

View reviewed changes

no output in data

f8be31c

jashapiro approved these changes Dec 6, 2019

View reviewed changes

jaclyn-taroni added 6 commits December 6, 2019 08:45

Ignore the large collapsed matrix files

de60075

Collapse table file name change

42dc0c6

Rerun analyze drops NB

062a1f3

Remove GTF from this directory

77498a2

Now included in download script

Merge remote-tracking branch 'upstream/master' into collapse-rnaseq

411491a

Use GTF in data directory

eb94c56

jaclyn-taroni merged commit aa44cc6 into AlexsLemonade:master Dec 6, 2019

komalsrathi deleted the collapse-rnaseq branch December 6, 2019 18:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add updated collapsed rnaseq scripts #291

add updated collapsed rnaseq scripts #291

komalsrathi commented Nov 26, 2019

jashapiro left a comment •

edited

Loading

jashapiro Nov 26, 2019

komalsrathi Nov 26, 2019

jashapiro Nov 26, 2019

jashapiro Nov 26, 2019

jashapiro Nov 26, 2019

jashapiro left a comment

jaclyn-taroni commented Dec 6, 2019

	command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds
	command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds

	# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \
	# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \

add updated collapsed rnaseq scripts #291

add updated collapsed rnaseq scripts #291

Conversation

komalsrathi commented Nov 26, 2019

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

jashapiro left a comment • edited Loading

Choose a reason for hiding this comment

jashapiro Nov 26, 2019

Choose a reason for hiding this comment

komalsrathi Nov 26, 2019

Choose a reason for hiding this comment

jashapiro Nov 26, 2019

Choose a reason for hiding this comment

jashapiro Nov 26, 2019

Choose a reason for hiding this comment

jashapiro Nov 26, 2019

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

jaclyn-taroni commented Dec 6, 2019

jashapiro left a comment •

edited

Loading