-
Notifications
You must be signed in to change notification settings - Fork 83
Collapse rnaseq #198
Collapse rnaseq #198
Conversation
I can add this to CI @komalsrathi |
Dockerfile
Outdated
@@ -102,5 +102,8 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \ | |||
MASS \ | |||
GGally | |||
|
|||
# Required for summarizing the RNA-seq matrices | |||
RUN Rscript -e "install.packages('reshape2')" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should already be on the container. As I am adding this to CI, I will also remove this and my assumption will get tested.
Hi @komalsrathi - thanks for this contribution! There are a couple things we usually check for when we consider merging identifiers. Could you check:
This will help to understand the effect this change will have on the data. It's helpful to understand the effect, because if one imagines a gene's expression values are randomly determined, those with more multi-mappings would have higher expression levels under this scheme. |
Hi Casey! I have not done an in-depth analyses of the correlation honestly - this is a quick and dirty way to get unique gene symbols to do enrichments against any gene sets. There are about 168 genes that have duplicated Ensembl gene identifiers. The idea is to pick one out of the many duplicates and here I am doing it by choosing the one with the maximum mean across several samples. These genes are mostly SNOs and rRNAs. |
0f05ff3
to
8563cf7
Compare
@komalsrathi : that's really helpful to know - thanks! We're going to have to describe this process in the manuscript. Could you put together a notebook that summarizes what's dropped (number, type of gene) as well as a table with the specific identifiers that got merged? I imagine a tsv file where the first column is the identifier that was merged to and any additional columns are those that were merged into that one, but if there's another standard form that you prefer to use that would be fine as well. Can you also file a PR into the manuscript repository noting this step and linking to the notebook that you produce for the analysis above? This will make sure that we're not scrambling to deal with this at the last minute or - worse - that we forget to describe it in the paper. You don't need to put the specific information into the manuscript as this will get re-run, but leave gaps linked to the notebook so that we can insert the information from the final run before the release. |
@cgreene Ok sounds good. Just to make sure - I should create an R notebook (not an R script) to tabulate the results and push it to this branch |
@komalsrathi : I was thinking that a notebook would be easier because, when it gets run over the final set of data, someone could look at it and pull the numbers out quickly. For example, if you write a section that says:
Then someone can go through and fill in the numbers and/or write about the types, etc in the methods section for the data freeze. I don't think it matters if you use a notebook or if you update this R script to create the tab separated table. The table will probably go either into the supplement or into a data freeze associated with the publication. One way to do this could be to have your |
Ahh - and - yes this would all go in the same directory. It might be |
Ok great! I like this idea - In the R script, I would generate both a collapsed matrix as well as the table output and then use the table as input in the markdown. Thanks!! |
@komalsrathi I am going to update your branch to be in sync with the master branch and add your script to continuous integration so we can test that it can be executed. This will cause commits to be added and pushed to your branch, so you should pull locally before you push any changes. You may find this StackOverflow post helpful if you have already started to make changes to your branch (I tend to use the |
I believe I know what's going on here with CircleCI: https://circleci.com/gh/AlexsLemonade/OpenPBTA-analysis/1032?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link Although the project Docker image has the |
tidyr::gather() has most of the same functionality as tidyr::pivot_longer(). |
@jaclyn-taroni I am refactoring my code so |
7d94ee5
to
7e8cee6
Compare
It looks like
I am going to update your branch to be in sync with the master branch and add both of these to continuous integration. If you force push changes to the branch it will undo these steps. Please use I've requested review from @cansavvy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @komalsrathi ! Thanks for your help! I have some broader questions/comments before I review the details of this PR:
- It would be easier for me to review this if I knew what options you ran
00-create-rsem-files
with specifically. To this end, could you add a README that explains:
- The general purpose of each script?
- What you intend the options for each to be run with?
- How each script is (or is not) dependent on the previous in the sequence. e.g. How does 00- relate to the others?
- Although you added
02-analyze-drops.Rmd
, it doesn't look like it follows what @cgreene suggested above.. Were you still planning on editing this? As it stands, I don't think02-analyze-drops.Rmd
is quite hitting what @cgreene was trying to push for.
What I think you would want is to 1) import the pre-filtered data and then 2) gather some numbers from them based on your filtering and calculation steps that you would then print out either in table form or just text. Here is the simplest example of that:
To copy @cgreene's block with modifications:
The number of identifiers that are dropped due to low expression are:
# Code block to count the number low_genes_to_rm <- which(rowSums(expr[,2:ncol(expr)]) > 0) length(low_genes_to_rm)
I am unsure how you would want to summarize the identifiers as @cgreene mentioned in : "Code block to summarize types of identifiers dropped + count of each type"
. There are a lot of ways you could do that. Maybe he or you have some ideas. But essentially you will want a code block that prints out a number and perhaps all the identifiers. That way when people look at the nb.html output from your notebook, it is easily read and ingested for them.
Note for aesthetic purposes, kable is very handy for printing out nice looking tables.
- Will the
pivot_longer
->gather
change be in an upcoming commit? Because currently I see thatpivot_longer
is currently being used and this is causing the 01-summarize_matrices.R step to fail in CircleCI.
suppressPackageStartupMessages(library(optparse)) | ||
suppressPackageStartupMessages(library(dplyr)) | ||
suppressPackageStartupMessages(library(reshape2)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an example run here just like you have for your 01-
script?
clin <- read.delim(clin, stringsAsFactors = F) | ||
polya <- clin %>% | ||
filter(experimental_strategy == "RNA-Seq" & RNA_library == "poly-A") %>% | ||
as.data.frame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this step is necessary since it would have already had to have been a data.frame to be piped into dplyr::filter
? If it is necessary for a reason I'm not seeing, can you add a comment that explains why?
|
||
# read and merge RSEM genes files | ||
lfiles <- list.files(path = topDir, pattern = "*.rsem.genes.results.gz", recursive = TRUE, full.names = T) | ||
read.rsem <- function(x){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some documentation to this function so it is more clear on what it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow ok, I am still working on this and have not really pushed the final code.
Wow ok, I am still working on this and have not really pushed the final code. |
Hey guys! I will let you know when I have edited and pushed my code - I think it would make sense if you review once I am done with all the changes. Thanks. |
|
Hi @komalsrathi we saw you pushed new code and it wasn't clear to us what the status was of your current PR. No worries! Let us know when you are ready. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this submission. I just did a bit of quick review to see why it was failing CI, and noted one other issue with the Dockerfile that is worth updating at the same time as you submit to be sure CI passes.
For ease of review, and for people who might just want to look at the results downstream, could you also include the knit output of 02-analyze-drops.Rmd
in the repository? Thanks!
.circleci/config.yml
Outdated
@@ -26,6 +26,14 @@ jobs: | |||
# command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('analyses/mutect2-vs-strelka2/01-set-up.Rmd', clean = TRUE); | |||
# rmarkdown::render('analyses/mutect2-vs-strelka2/02-analyze-concordance.Rmd', clean = TRUE)" | |||
|
|||
- run: | |||
name: Collapse RSEM | |||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -e data/pbta-gene-expression-rsem-fpkm.stranded.rds -o data/pbta-gene-expression-collapsed-rsem-fpkm.stranded.rds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -e
option here is not defined in 01-summarize_matrices.R
, causing CI to fail. Should be -i
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the CI probably has the old example run. I updated all the scripts + input parameters. Example runs are on top of the R scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this file to correct this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jashapiro @jaclyn-taroni The circleci on is failing because one of my scripts uses the gencode v27 gtf file. Currently, I am just referring to it under scratch/ but haven’t pushed it because it is quite big. Should we include that in the data release? If not then how do I circumvent the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that having a defined annotation file in the data download might be a good idea; we can suggest it to @jharenza for the next release. Alternatively, we do currently have annotations installed through txdb but I do not know if those would have the information you require.
If not, for now you could add the file to your analysis folder; At 40MB, it really isn't too big for github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gtf is actually 1.1 GB. Do you still think it is a good idea to put on github? If yes, I can put it under scratch/
folder and it should pass the test.
du -sh gencode.v27.primary_assembly.annotation.gtf
1.1G gencode.v27.primary_assembly.annotation.gtf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is uncompressed. If you include the gzipped version at ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.primary_assembly.annotation.gtf.gz
version it will only be 40MB. You can't put it in scratch, as that is not synced to the repository, and won't be present in the CI image.
I see you are reading the gtf with rtracklayer::import
, and while I don't know if that can read compressed files, many packages certainly can. If not, you could write a wrapper shell script that decompresses the file to scratch before running your R scripts. (You might want to do this anyway, to have only a single RUN statement in CI.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jashapiro Thanks for the idea - gzipped worked!
For now, I have the input gtf and the output tables for 01-summarize_matrices.R
under analyses/collapse-rnaseq
. It looks like finally all checks have passed so let me know once the code is reviewed and if there are any major changes to be made.
@jashapiro thanks, I just did. Can you check? |
Co-Authored-By: jashapiro <jashapiro@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution, it looks good. I have a couple of clarification comments, but nothing major. The one addition I would still like to see (in a separate PR, perhaps) is the analysis that @cgreene suggested of correlation between duplicates. While your analysis does make clear that only a small number of protein coding genes are affected, it would still be nice to see that the choice of the highest expressed duplicate is not likely to cause bias in the data as a whole.
- run: | ||
name: Collapse RSEM (polyA RNA-seq) | ||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.polya.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.polya.rds | ||
|
||
- run: | ||
name: Collapse RSEM (stranded RNA-seq) | ||
command: ./scripts/run_in_ci.sh Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i data/pbta-gene-expression-rsem-fpkm.stranded.rds -g analyses/collapse-rnaseq/gencode.v27.primary_assembly.annotation.gtf.gz -m data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds -t analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed-table.stranded.rds | ||
|
||
- run: | ||
name: Analyze RNA-seq dropped genes from both polyA and stranded data | ||
command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render(input = 'analyses/collapse-rnaseq/02-analyze-drops.Rmd', params = list(polya.annot.table = 'pbta-gene-expression-rsem-fpkm-collapsed-table.polya.rds', stranded.annot.table = 'pbta-gene-expression-rsem-fpkm-collapsed-table.stranded.rds'), clean = TRUE)" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested this in another context, but I'll put a note here too. I think it might be easier for somebody coming to this section of the repo to have a single shell script that runs the three parts here. Something like:
collapse_rsem.sh
...
Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i $polya_file
Rscript analyses/collapse-rnaseq/01-summarize_matrices.R -i $stranded_file
Rscript -e "rmarkdown::render(...)"
...
Then CI could just be the single:
- run:
name: Collapse RSEM
command: ./scripts/run_in_ci.sh bash analyses/collapse-rnaseq/collapse_rsem.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess, I could do that later as I am sure this is not high priority. I would like to focus on the real question why I started doing this to begin with - immune deconvolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you should!
Someone else can pick these up once this is merged (I can't do that step).
print(dim(expr.input)) | ||
|
||
# final geneid-symbol-biotype annotation file | ||
print("Generating duplicates table...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table isn't just duplicates, is it? In some places you call it the gene annotation table, which seems a better name.
I don't think it matters much, but I did wonder if you might want to save this as a TSV for easier inspection outside R, since it doesn't seem that this has anything but text in it. Especially since you went to the trouble of converting the boolean values to "Yes/No".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table isnt just duplicates but it tells you what gene ids were kept, removed, which have duplicates gene symbols etc. I just went with the RDS convention that I have been seeing in the data files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before I forget, and hopefully before you move on.. would you please add a quick paragraph about this step and results to the manuscript as suggested earlier #198 (comment)
It is definitely on my to-do, unfortunately just a bit too complicated for me to understand. I don't have the bandwidth to spend too much time figuring it out on my own so I might sit down with @kgaonkar6 or @jharenza next week to add that paragraph. |
Sounds good. I just didn't want it to get lost in the shuffle. |
Please file an issue and tag yourself so it doesn't get lost.
…On Thu, Nov 7, 2019, 11:19 AM jashapiro ***@***.***> wrote:
It is definitely on my to-do, unfortunately just a bit too complicated for
me to understand. I don't have the bandwidth to spend too much time
figuring it out on my own so I might sit down with @kgaonkar6
<https://github.com/kgaonkar6> or @jharenza <https://github.com/jharenza>
next week to add that paragraph.
Sounds good. I just didn't want it to get lost in the shuffle.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#198?email_source=notifications&email_token=AAEEPM5GBDMJRNYSAGGQC5TQSQ52ZA5CNFSM4JHQG4B2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDM6WPY#issuecomment-551152447>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEEPM5LXGIOFLPGKMTRXJLQSQ52ZANCNFSM4JHQG4BQ>
.
|
Can you assign AlexsLemonade/OpenPBTA-manuscript#56 to me? I don't have the permissions. |
Purpose/implementation
_What scientific question is your analysis addressing?
Collapse RNA-seq gene expression to gene symbol x sample identifier matrix
What was your approach?
For each duplicated gene symbol, take the mean across all samples, and pick the one with the maximum mean value
If this is not adding an analysis, describe your changes in this section._
Issue
What GitHub issue does your pull request address?
Directions for reviewers
Tell potential reviewers what kind of feedback you are soliciting.
Are there particular areas that need a closer look?
Just the overall logic
Is there something you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Results
If your pull request includes code that produces scientific results, please summarize the results here.
This can help facilitate discussion around interpretation.
Please state what kinds of results are included (e.g., table, figure).
Docker and continuous integration
Check all those that apply or remove this section if it is not applicable.