-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Effect of --max-copy-number parameter on purity/ploidy and copy number estimates #321
Comments
Hi Dmitry, thanks a lot for your effort here! I indeed struggled a bit with the default for --max-copy-number. Non-focal gains with more than 7 copies are very rare (they do happen though!). 8 is indeed special in that it is even, that means it can explain segments with balanced ~0.5 allelic fraction SNPs. That in turns means you give the search space an additional high ploidy solution to explain the data, assuming many SNPs are still balanced. In high quality, high purity samples, you might still want to include the 8 because they do happen, but be aware that it increases the search space more than you might think. You are right, I should make this more clear in the --help section. I should say the math to punish more complex solutions could be improved. It's kind of a mess because we have the SNPs and log ratio fitting and then also support sub-clonal gains. It's on my TODO to clean this up, but I assume that's partly what you are seeing. It's too cheap for the model to pick a high complex solution. |
Hi Markus, I was also confused by the following:
It seems that "dbSNP membership" and similar is used other places in the documentation and comments instead of "membership in germline databases". In my case, DB is determined from POPAF, which in turn is annotated from gnomAD. Took me a while to understand that POPAF has priority over dbSNP, so the description was misleading regardless of whether dbSNP annotations were available or not.
Shouldn't "> 2" by default be "> 6", as in the new default value of --min-cosmic-cnt? I will create a pull request with proposed changes to the documentation, please have a look whether I got it right. |
Hi @ddrichel , since you have done such a nice simulation benchmark, I thought I'll reach out to you first before I figure out how to do that. Looking for getting some simulated male samples with low-level gains of chrX genes (AR for example). Did you use bam surgeon (ah, see you use syngen) or something and can share a few scripts by any chance? No worries obviously if not! |
Hi Markus, I will be happy to help as far as I can. This was work I have done as an independent contractor for a client, but chances are good that I will get permission to share with you privately. I will reach out to you via e-mail. In summary, we generated WES fastq files with simulated CNVs using synggen. This is more or less straightforward. |
I am running simulations (whole-exome, bam files generated with synggen 1.5, 1 tumor and 2 normals for NormalDB, loess normalization, simulated purity=0.8, expected ploidy~2.8) for PureCN benchmarking, and I was confused by the effect of --max-copy-number, specifically why the copy numbers are off by roughly a factor of 2 if the --max-copy-number is too high. I believe I have figured it out, so this is more of a question whether my understanding is correct, and if it is, a suggestion to improve the part in the documentation that I found to be confusing:
"Defaults are well calibrated and should produce close to ideal results for most samples. A few common cases where changing defaults makes sense:
High purity and high quality: For cancer types with a high expected purity, such as ovarian cancer, AND when quality is expected to be very good (high coverage, young samples), --max-copy-number 8. (PureCN reports copy numbers greater than this value, but will stop fitting SNP allelic fractions to the exact allele-specific copy number because this will get impossible very quickly with high copy numbers - and computationally expensive.)"
The experimental setup is:
My log2 ratio spectrum from *segmentation.pdf:
Plot of best estimates for ploidy/purity for different maxCN:
Plot of CN in simulated truth vs. PureCN output at --max-copy-number 12, segments in truth and PureCN output are matched with bedtools intersect -f 0.5 (reciprocal overlap 0.5):
It seems that at maxCN=12, unreliable (due to noise) high raw log2 ratios in the purity/ploidy search lead to the wrong optimal solution, scaling the CN in the output by ~2x - which intuitively makes sense, as scaling the whole spectrum of log2-ratios by 2x should provide the next-best fit, at lower estimated purity. At --max-copy-number 13 and 14, the newly included, high, but less noisy log2 ratios happen to outweigh the noise, leading to correct global purity/estimates again.
Thanks in advance
Dmitriy
The text was updated successfully, but these errors were encountered: