Multi region curation #362
Comments
Hi @jberg1999, unfortunately joint segmentation of multiple samples is currently not available because the copynumber package was removed from Bioconductor. I will get back to this soon, though, since GATK supports it. ASCAT has a function for this too, but getting a wrapper done would take a bit more work. Joint segmentation will reduce the artifactual differences between regions and thus hopefully reduce the need for manual curation. Having multiple regions makes the curation easier. Not sure the score helps, but visually checking which solution is more likely should be easy (with a bit of experience) for most. Feel free to post the B-allele frequency plot for a few pairs you are not sure about.
Thank you very much for your quick response! I think I will stick with PureCN's segmentation for now and just focus on the curation aspect. I am always wary of manual curation because it can easily lead to bias based on the curator's own preconceived notions of what the results should look like. Ideally, I would like to break curation down into some common-sense rules that I can then apply systematically to each sample or group of samples, so that I can be very transparent about the process. I have been looking at #310 and some of the other issues, and it seems that for any individual sample we switch off of a high ploidy solution based on the following.
Are there any other things that you look for in deciding between higher and lower ploidy solutions? I am not sure what would exclude a low ploidy solution other than extensive LOH. I might show a couple of tricky ones but I would have to ask my collaborators first.
That's a good starting point. If you are lucky and have multiple balanced segments with differing log ratios, it's usually pretty clear what is correct. With a bit of experience, most samples are quite obvious even without those balanced segments. It can become more difficult (i.e. ambiguous) in cancer types with lots of sub-clonal alterations (see the ABSOLUTE paper; NSCLC is probably the worst here). Without too many sub-clonal alterations, low ploidy solutions usually just have a single dominating log-ratio peak around 0 and then small peaks for the gains and losses. High ploidy solutions have multiple peaks generated by the genome duplications and the subsequent gains and losses. If you need to curate lots of samples, like more than 15%, you might be able to tweak your PureCN setup to get better results out of the box. In my benchmarks, I hit a plateau at just over 90% accuracy for ploidy. It's a whack-a-mole thing where every fix then introduces another issue. Improving the sub-clonal calling could help a bit; I started working on it a while ago but got distracted by other things. It probably wouldn't be too difficult to build a deep learning model on the B-allele frequency plot, but there are no plans for that right now.
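The peak pattern described above (one dominant log-ratio peak near 0 for low ploidy, several comparable peaks for high ploidy) can be turned into a rough, repeatable check. The following is only a minimal sketch and not something PureCN does: the function name `count_log_ratio_peaks`, the input format of `(log_ratio, length)` pairs, the bin width, and the `min_fraction` cutoff are all assumptions made for illustration.

```python
from collections import Counter


def count_log_ratio_peaks(segments, bin_width=0.1, min_fraction=0.1):
    """Return (log_ratio, genome_fraction) for each major peak.

    segments: iterable of (log_ratio, length) pairs (hypothetical input format).
    A peak is a histogram bin that holds at least `min_fraction` of the total
    segment length and is heavier than its left neighbour and at least as
    heavy as its right neighbour.
    """
    weights = Counter()
    for log_ratio, length in segments:
        # Length-weighted histogram of segment log ratios.
        weights[round(log_ratio / bin_width)] += length

    total = sum(weights.values())
    peaks = []
    for b, w in weights.items():
        if w / total < min_fraction:
            continue  # too little of the genome to count as a major peak
        if w > weights.get(b - 1, 0) and w >= weights.get(b + 1, 0):
            peaks.append((b * bin_width, w / total))
    return sorted(peaks)


# Made-up example data, not from any real sample.
low_like = [(0.0, 80), (0.02, 10), (0.5, 5), (-0.6, 5)]
high_like = [(-0.4, 30), (0.0, 40), (0.3, 30)]

print(len(count_log_ratio_peaks(low_like)))   # one dominant peak
print(len(count_log_ratio_peaks(high_like)))  # several comparable peaks
```

A count like this would only flag samples for a closer look at the B-allele frequency plot; with many sub-clonal alterations the peaks blur and the heuristic becomes unreliable.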
Hi Markus, Thank you for your help with this! I was able to curate my samples reasonably well based on this discussion. For now I think we can close this issue. If something else related to curation comes up I may reopen it.
Hi Markus,
Thank you for creating such a useful and transparent copy number caller!
I am working on a dataset with multiple regions per patient, and have run PureCN on each sample. I have found that in most cases there is pretty strong agreement between regions of the same tumor, but in other cases there are a lot of conserved breakpoints while the actual copy numbers are ploidy shifted (for example, balanced diploid vs. balanced tetraploid). Do you think it is advisable to do manual curation across regions of the same tumor? If so, do you have any recommendations for how to approach this? I was thinking of mapping the solutions of each region to their equivalent solutions in other regions (either by ploidy or by the actual segmentation) and then determining which set of similar solutions has the highest overall score across all regions. Do you think this makes sense?
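To make the idea concrete, here is a minimal sketch of the "map by ploidy, then sum scores" variant. It is an illustration only, not a PureCN feature: the function name `best_shared_ploidy`, the input layout (region name mapped to a list of `(ploidy, score)` candidates), and the choice to bin ploidy to the nearest integer are all assumptions, and it presumes the scores are comparable across regions and that higher is better.

```python
from collections import defaultdict


def best_shared_ploidy(regions, bin_width=1.0):
    """Pick the ploidy bin with the highest summed score across all regions.

    regions: dict mapping region name -> list of (ploidy, score) candidate
             solutions (hypothetical layout; higher score = better fit).
    Returns (bin_ploidy, total_score), or None if no bin is covered by
    every region.
    """
    per_bin = defaultdict(dict)  # ploidy bin -> {region: best score in bin}
    for region, candidates in regions.items():
        for ploidy, score in candidates:
            b = round(ploidy / bin_width)
            if region not in per_bin[b] or score > per_bin[b][region]:
                per_bin[b][region] = score

    # Only compare bins in which every region has at least one candidate.
    totals = {
        b: sum(scores.values())
        for b, scores in per_bin.items()
        if len(scores) == len(regions)
    }
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best * bin_width, totals[best]


# Made-up candidate solutions for three regions of one tumor.
regions = {
    "R1": [(2.1, -100.0), (3.9, -105.0)],
    "R2": [(4.1, -90.0), (2.0, -120.0)],
    "R3": [(2.2, -80.0), (4.0, -82.0)],
}
print(best_shared_ploidy(regions))
```

In this made-up example two of the three regions rank the near-diploid solution first on their own, yet the near-tetraploid bin has the higher summed score, which is the kind of case where looking across regions could change the call.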