Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add updated collapsed rnaseq scripts #291

Merged
merged 8 commits into from
Dec 6, 2019

Conversation

komalsrathi
Copy link
Collaborator

Purpose/implementation Section

What scientific question is your analysis addressing?

Is there any correlation between the Ensembl ids that are assigned the same gene symbol?
With regards to: #198 (comment)

What was your approach?

To create a 1-1 relationship between the gene symbol and Ensembl gene identifier, we needed to pick one identifier out of the many that best represented the expression of the gene. Here we do that by choosing the identifier with the maximum mean FPKM across the entire PBTA cohort.

For correlation analysis corresponding to each multi-mapped gene symbol, we computed the Pearson correlation coefficient between the expressed gene identifier that was kept (after collapsing) vs each expressed gene identifier that was discarded. We then took an average of all those coefficients and added to the drops table in the row corresponding to the expressed gene identifier that was kept.

What GitHub issue does your pull request address?

Not an issue but a discussion under this pull request: #198

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Any ideas to improve the analysis are welcomed.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

01-summarize_matrices.R creates two tables (one each for polyA and stranded data) in RDS format and 02-analyze-drops.Rmd outputs the tables in markdown format.

What is your summary of the results?

I did not find any striking patterns of correlations between the duplicates, the correlation coefficients were as high as 0.92 in stranded and 1.00 in polyA and as low as -0.26 in stranded and -0.19 in polyA. There were also no biotype specific patterns.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this analysis!

I was surprised at first that there were such low correlations, but I would guess that those are for genes that are very lowly expressed, and therefore most of the variation we are seeing in them is noise. I looked briefly to see if there was much difference if we switched to Spearman correlation, and it might improve some correlations a bit, but not a drastic change.

My one major comment is about the location of the output files. While we will be including these files in the data release which will put them in the data/ folder, we don't really want analysis scripts to write there directly, so as to preserve that as a "protected" location for standardized files. If you write the output files to the analysis folder (or a subfolder, as discussed here), you don't necessarily have to include all of them in the tracked repo (either by just not adding them or making a .gitignore file in the analysis folder).

On a related note: the analyses/collapse-rnaseq/ folder still contains what appear to be the old versions of pbta-gene-expression-rsem-fpkm-collapsed-table.* files? If so, it would be good to update those with the latest versions in this PR. You can also delete the gtf file from this folder, as are now using the version in the data/ folder.

Let me know if that all makes sense to you. Thanks again for your contributions!

@@ -28,15 +28,15 @@ jobs:

- run:
name: Collapse RSEM (polyA RNA-seq)
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.polya.rds
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds

We prefer that scripts not write to data/ directly, as that can introduce inconsistencies between the versions of these files that different people are using as releases occur.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this analysis!

I was surprised at first that there were such low correlations, but I would guess that those are for genes that are very lowly expressed, and therefore most of the variation we are seeing in them is noise. I looked briefly to see if there was much difference if we switched to Spearman correlation, and it might improve some correlations a bit, but not a drastic change.

My one major comment is about the location of the output files. While we will be including these files in the data release whcih will put them in the data/ folder, we don't really want analysis scripts to write there directly, so as to preserve that a "protected" location for standardized files. If you write the output files to the analysis folder (or a subfolder, as discussed here), you don't necessarily have to include all of them in the tracked repo (either by just not adding them or making a .gitignore file in the analysis folder).

On a related note: the analyses/collapse-rnaseq/ folder still contains what appear to be the old versions of pbta-gene-expression-rsem-fpkm-collapsed-table.* files? If so, it would be good to update those with the latest versions in this PR. You can also delete the gtf file from this folder, as are now using the version in the data/ folder.

Let me know if that all makes sense to you. Thanks again for your contributions!

@jashapiro Thank you for getting back so quickly, and makes sense regarding not writing out to data/, will make those changes to config.yml and 01-summarize_matrices.R.


- run:
name: Collapse RSEM (stranded RNA-seq)
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.stranded.rds
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g data/gencode.v27.primary_assembly.annotation.gtf.gz -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds

# Rscript analyses/collapse-rnaseq/01-summarize_matrices.R \
# -i data/pbta-gene-expression-rsem-fpkm.polya.rds \
# -g data/gencode.v27.primary_assembly.annotation.gtf.gz \
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \
# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds \

# Rscript analyses/collapse-rnaseq/01-summarize_matrices.R \
# -i data/pbta-gene-expression-rsem-fpkm.stranded.rds \
# -g data/gencode.v27.primary_assembly.annotation.gtf.gz \
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \
# -m analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds \

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The methods look good. Lets get it merged!

@jaclyn-taroni
Copy link
Member

Before merging, I'm going to remove the old collapse table files and commit the new ones and remove the GTF file to avoid any confusion.

@jaclyn-taroni jaclyn-taroni merged commit aa44cc6 into AlexsLemonade:master Dec 6, 2019
@komalsrathi komalsrathi deleted the collapse-rnaseq branch December 6, 2019 18:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants