Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using salmon quantification with DeSeq2 #437

Open
thkitapci opened this issue Oct 22, 2019 · 17 comments
Open

Using salmon quantification with DeSeq2 #437

thkitapci opened this issue Oct 22, 2019 · 17 comments

Comments

@thkitapci
Copy link

Hi,
I run salmon then used quantmerge to combine the results as

salmon quantmerge --quants cat list_of_quant_folders--column numreads -o Merged_quants.txt

Can I use this as input for the DeSeq2 analysis ? One problem is "numreads" column is not integer and DeSeq2 requires integer input (read counts) can I convert the numbers in this column to integer then use as DeSeq2 input?

Thanks
Hamdi

@roryk
Copy link
Contributor

roryk commented Oct 22, 2019

Hi Hamdi,

http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html has a really great description of how to use the output from Salmon with DESeq2.

@thkitapci
Copy link
Author

Hi @roryk
I know about tximport but is there any way to generate the input data for DESeq2 without using R? I am processing the data on one platform and then transfer to another platform for R/DESeq2 analysis. I would like to be able to generate the output of the first part (salmon) without using an R library.

If it is not possible and I have run R to get the count matrix for DEseq2, I can figure out a way to do it.

DESeq2 input file is a simple matrix of counts and "salmon quantmerge" already generates this, can you please explain to me why an external library is required ? Is there something I am missing that tximport package is doing to the data? Does tximport takes into account gene lengths or library size to generate the output?

Thanks
Best Regards
Hamdi

@tamuanand
Copy link

Hi Hamdi

Bit confused about your logic here - why would you not want to use tximport in R when your next step (DESeq2) is still going to be R?

I am curious to know your reasoning

I am processing the data on one platform and then transfer to another platform for R/DESeq2 analysis. I would like to be able to generate the output of the first part (salmon) without using an R library.

@DustinChen1986
Copy link

This is a really torturous step for none R knowledge to use tximport for data transmission to DESeq2. Main problem is that the count matrix for DESeq2, which can be easily prepared by Python, must be integer, but tximport did not explain how to deal with the decimals.

@qins
Copy link

qins commented Dec 25, 2020

I also wonder why salmon not output original reads counts

update:

Because it is more accurate.

DeSeq2 should accept it.

As for now, maybe we could simply round to the nearest integer

NumReads — This is salmon’s estimate of the number of reads mapping to each transcript that was quantified. It is an “estimate” insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.

@rob-p
Copy link
Collaborator

rob-p commented Dec 25, 2020

If you can already use DESeq2, then using tximport should not make it any harder at all. Given the tximport data, getting it into DESeq2 is as easy as

dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

as shown in the tximport vignette.

Regarding outputting "original read counts"; salmon does output the estimates for the number of reads deriving from each transcript. If the question is, why is this number not an integer, that's because the best estimate (the maximum likelihood estimate) is often not integral. Tools that simply count reads (e.g. HTSeq) produce integer counts, but these are in no way "original read counts" for the corresponding genes, and are usually less accurate (farther from the true number of fragments deriving from a transcript / gene) than the estimates produced by salmon. The fact that the best estimate is often not an integer is a direct result of the fact one is considering a statistical model and taking expectations.

@qins
Copy link

qins commented Dec 25, 2020

Still maybe it's better to have an integer version of read counts file.
There are cloud services and they do not accept non-integer read counts.

@rob-p
Copy link
Collaborator

rob-p commented Dec 25, 2020

You mean like cloud services to perform the DE analysis? It’s always possible to round the non-integer counts to the nearest integer. However, reliable abundance estimation tools (e.g. RSEM) have been around long enough now that it’s worth pushing any cloud service you might be using to properly deal with these types of inputs. We do differential analysis quite commonly with DESeq2, and salmon -> tximport -> DESeq2 is a quite low-friction solution.

@authentic-zz
Copy link

I am also confused about how to use salmon quantification for DeSeq2, because my further use of DeSeq2 is on an online tool, not r. I would be very grateful if someone could answer my doubts.

@DustinChen1986
Copy link

Hi authentic-zz:
I have succeeded to use tximport package to transmit salmon results to DESeq2.
Now I have 6 salmon results named T11.sf, T12.sf, T13.sf,T21.sf,T22.sf,T23.sf. And I have the transcripts and genes relationship table file trans2gene.csv (each line have one transcript and gene separated by tab).
library(DESeq2)
library("tximport")
library("readr")
tx2gene <- read_csv("trans2gene.csv")
samples <- data.frame(run = c('T11', 'T12', 'T13', 'T21', 'T22', 'T23'), condition = c("T1","T1","T1","T2","T2","T2"))
files <- file.path('.', paste(substring(samples$run, 1,3),".sf", sep = "")) # set salmon result files, each sample names plus '.sf'
names(files) <- samples$run
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
head(txi$counts)
dds <- DESeqDataSetFromTximport(txi, colData=samples, design= ~ condition)
dds <- DESeq(dds)
res <- results(dds)
write.table(res,"T1_2.csv", sep = ",", row.names = TRUE)

Hope give you some help.

@authentic-zz
Copy link

Hi DustinChen1986:
Thank you very much for your useful help, but the problem I am currently experiencing is rather special.
The data in my hand is quantified by salmon and sorted into a text file (row name is the ID of the gene, and the column name is the sample name), which seems to be different from the file input by tximport package, which is a troublesome thing.
If you have a corresponding solution, I would be very grateful.

@DustinChen1986
Copy link

To authentic-zz:
You mean you have get the counts into a data frame. I don't know how to transfer the salmon data frame into DESeq2,and I fail to handle data frame from salmon, too. So I have these recommendations to you:

  1. Get the salmon raw results, the sf files, as the tximport input files.
  2. Run the salmon again or choose other pipeline like HISAT2-Stringtie/featurecounts/HTseq.
  3. Try to generate the salmon sf file. The sf file's row name is "Name Length EffectiveLength TPM NumReads". You have the gene ID and counts, so fill the length, effect length, and TPM by tab or something. But I do not sure if this handle is correct. https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats

@Yang-amd
Copy link

@DustinChen1986
Hello DustinChen1986
Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csv“?
Thank you for your patient.
Yang RT

@DustinChen1986
Copy link

@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csv“? Thank you for your patient. Yang RT

The first column is the transcript's names, the second column is the gene's names, and the header is required

@Yang-amd
Copy link

@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csv“? Thank you for your patient. Yang RT

The first column is the transcript's names, the second column is the gene's names, and the header is required

Thank you very much,it helps me a lot! @DustinChen1986

@caodudu
Copy link

caodudu commented May 4, 2023

You mean like cloud services to perform the DE analysis? It’s always possible to round the non-integer counts to the nearest integer. However, reliable abundance estimation tools (e.g. RSEM) have been around long enough now that it’s worth pushing any cloud service you might be using to properly deal with these types of inputs. We do differential analysis quite commonly with DESeq2, and salmon -> tximport -> DESeq2 is a quite low-friction solution.

I noticed that now salmon can export the quant.gene.sf file if I add the parameters"-g xx.gtf". What's difference between this file and the result of tximport? Can I use the result to replace tximport?

@rob-p
Copy link
Collaborator

rob-p commented May 4, 2023

It is possible to output gene counts directly, but using tximport is the preferred and officially supported method. The reason for this is that the gene-level aggregation built into salmon is per-sample, that is each sample is aggregated to the gene level independently. On the other hand, tximport considers all samples in an experiment to determine e.g. the average expressed length of a gene over all samples. This leads to better aggregation. Likewise, tximport (specifically via tximeta) provides a nice interface to track and propagate reference annotation provenance, which ultimately leads to more reproducible analyses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants