PureCN writes only a fraction of the variants reported in vcf #329

GuidoLeoni · 2023-10-05T11:25:40Z

Dear Team
I ran the following pipeline to estimate the CCF of the variants reported in a mutect 2.1 VCF.
The pipeline goes well but in the "sorted_variants.csv" output are printed out only 30000 variants (almost all are germline)
I expected that all the variants be printed in the final report.
Any tips?
thank you

This is my pipeline

I/home/elisa/anaconda3/envs/musica/bin/Rscript $PURECN/IntervalFile.R --infile exome.bed --fasta /data/PIPELINE/GENOMES/hg38/hg38.fa --outfile $OUT_REF/baits_hg38_intervals.txt --offtarget --genome hg38 --export $OUT_REF/baits_optimized_hg38.bed --force;

awk 'NR==1{print}NR>1{print $0}' REF/baits_hg38_intervals.txt >REF/baits_hg38_intervals2.txt

Calculate and GC-normalize coverage from a BAM file

/home/elisa/anaconda3/envs/musica/bin/Rscript $PURECN/Coverage.R --outdir $OUT_REF --bam $INPUTDIR/EXOME/ALIGNMENT/BAM/final_uniqueTUMOR_sorted.bam --intervals $OUT_REF/baits_hg38_intervals2.txt --cores 40;

/home/elisa/anaconda3/envs/musica/bin/Rscript $PURECN/Coverage.R --outdir $OUT_REF --bam $INPUTDIR/EXOME/ALIGNMENT/BAM/final_uniqueCONTROL_sorted.bam --intervals $OUT_REF/baits_hg38_intervals2.txt --cores 40;
wait;

java $temp -jar $gatk Mutect2 -R $GENOMEDIR/hg38.fa -I $INPUTDIR/EXOME/ALIGNMENT/BAM/final_uniqueTUMOR_sorted.bam -tumor TUMOR -I $INPUTDIR/EXOME/ALIGNMENT/BAM/final_uniqueCONTROL_sorted.bam -normal CONTROL -O mutect.vcf --genotype-germline-sites true --genotype-pon-sites true --native-pair-hmm-threads 50 --germline-resource /data/PIPELINE/GENOMES/hg38/somatic-hg38_af-only-gnomad.hg38.vcf.gz --minimum-allele-fraction 0.10

/home/elisa/anaconda3/envs/musica/bin/Rscript $PURECN/PureCN.R --out $OUT_REF --tumor $OUT_REF/final_uniqueTUMOR_sorted_coverage_loess.txt.gz --normal $OUT_REF/final_uniqueCONTROL_sorted_coverage_loess.txt.gz --sampleid final_uniqueTUMOR_sorted --vcf mutect.vcf --intervals $OUT_REF/baits_hg38_intervals2.txt --genome hg38

this is my sessionInfo

sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /home/elisa/anaconda3/envs/musica/lib/libopenblasp-r0.3.21.so

locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] PureCN_1.20.0 VariantAnnotation_1.36.0
[3] Rsamtools_2.6.0 Biostrings_2.58.0
[5] XVector_0.30.0 SummarizedExperiment_1.20.0
[7] Biobase_2.50.0 GenomicRanges_1.42.0
[9] GenomeInfoDb_1.26.7 IRanges_2.24.1
[11] S4Vectors_0.28.1 MatrixGenerics_1.2.1
[13] matrixStats_1.0.0 BiocGenerics_0.36.1
[15] DNAcopy_1.64.0

loaded via a namespace (and not attached):
[1] httr_1.4.7 VGAM_1.1-9 bit64_4.0.5
[4] splines_4.0.5 askpass_1.2.0 BiocFileCache_1.14.0
[7] blob_1.2.4 BSgenome_1.58.0 GenomeInfoDbData_1.2.4
[10] progress_1.2.2 pillar_1.9.0 RSQLite_2.3.1
[13] lattice_0.21-9 glue_1.6.2 RColorBrewer_1.1-3
[16] colorspace_2.1-0 Matrix_1.6-1.1 XML_3.99-0.14
[19] pkgconfig_2.0.3 biomaRt_2.46.3 zlibbioc_1.36.0
[22] scales_1.2.1 BiocParallel_1.24.1 tibble_3.2.1
[25] openssl_2.1.1 generics_0.1.3 ggplot2_3.4.3
[28] cachem_1.0.8 GenomicFeatures_1.42.3 cli_3.6.1
[31] magrittr_2.0.3 crayon_1.5.2 memoise_2.0.1
[34] fansi_1.0.4 xml2_1.3.5 tools_4.0.5
[37] data.table_1.14.8 prettyunits_1.2.0 hms_1.1.3
[40] formatR_1.14 lifecycle_1.0.3 stringr_1.5.0
[43] Rhdf5lib_1.12.1 munsell_0.5.0 DelayedArray_0.16.3
[46] AnnotationDbi_1.52.0 lambda.r_1.2.4 compiler_4.0.5
[49] rlang_1.1.1 rhdf5_2.34.0 futile.logger_1.4.3
[52] grid_4.0.5 RCurl_1.98-1.12 rhdf5filters_1.2.1
[55] rappdirs_0.3.3 bitops_1.0-7 gtable_0.3.4
[58] DBI_1.1.3 curl_4.3.2 R6_2.5.1
[61] gridExtra_2.3 GenomicAlignments_1.26.0 dplyr_1.1.3
[64] rtracklayer_1.50.0 fastmap_1.1.1 bit_4.0.5
[67] utf8_1.2.3 futile.options_1.0.1 stringi_1.7.12
[70] vctrs_0.6.3 dbplyr_2.3.4 tidyselect_1.2.0

lima1 · 2023-10-09T14:42:42Z

Hi, probably also #320 related? Try installing the latest GitHub version and re-run PureCN.R with --min-base-quality 20.

lima1 · 2023-10-09T14:46:00Z

Also, for test runs using --normal in PureCN.R is ok, but for ideal results, create a normal database with NormalDB.R and then NOT use --normal. This will reduce the noise by a factor of approx. 2. You can use our Docker image that can read GATK GenomicsDB files for the mapping bias database (that one provides the expected allelic fractions of all heterozygous SNPs in your normal samples and adjusts accordingly).

lima1 · 2023-11-13T16:29:18Z

Closing now, feel free to reopen if unrelated to #320

GuidoLeoni · 2023-12-14T13:22:12Z

Dear Markus
I'm still struggling with my issue(after a long break).
The PureCN.R script still print out only ~ 20000 variants marked as germline in _variants.csv file. I miss all the somatic variants in my mutect2 vcf .

I tried to lower the --min-base-quality to 20 but this didn't improve the results
Surely I'm missing something but I don't figure out exactly what

GuidoLeoni · 2023-12-15T08:08:12Z

In case this may be of help
the log file tells me that almost all the somatic variants were removed because not in het in the panel of normal .

INFO [2023-10-24 08:55:04] Loading coverage files...
INFO [2023-10-24 08:55:08] Mean target coverages: 240X (tumor) 235X (normal).
INFO [2023-10-24 08:55:10] Mean coverages: chrX: 113.73, chrY: 120.12, chr1-22: 228.96.
INFO [2023-10-24 08:55:11] Mean coverages: chrX: 121.64, chrY: 120.12, chr1-22: 227.41.
INFO [2023-10-24 08:55:26] Removing 3516 intervals with missing log.ratio.
INFO [2023-10-24 08:55:26] Removing 21 low/high GC targets.
INFO [2023-10-24 08:55:27] Removing 6377 intervals excluded in normalDB.
INFO [2023-10-24 08:55:27] normalDB provided. Setting minimum coverage for segmentation to 0.0015X.
INFO [2023-10-24 08:55:27] Removing 61 low count (< 100 total reads) intervals.
INFO [2023-10-24 08:55:27] Removing 8523 low coverage (< 0.0015X) intervals.
WARN [2023-10-24 08:55:27] Low number of off-target intervals. You might want to exclude them (209557 on-target, 21660 off-target, ratio 0.09).
INFO [2023-10-24 08:55:27] Using 231217 intervals (209557 on-target, 21660 off-target).
INFO [2023-10-24 08:55:27] Ratio of mean on-target vs. off-target read counts: 0.19
INFO [2023-10-24 08:55:27] Mean off-target bin size: 96126
INFO [2023-10-24 08:55:29] AT/GC dropout: 1.03 (tumor), 1.01 (normal), 1.01 (coverage log-ratio).
INFO [2023-10-24 08:55:29] Loading VCF...
INFO [2023-10-24 08:57:03] Found 3115993 variants in VCF file.
INFO [2023-10-24 08:57:09] Removing 69933 triallelic sites.
WARN [2023-10-24 08:57:11] Found GERMQ info field with Phred scaled germline probabilities.
INFO [2023-10-24 08:57:20] Maximum of POPAP INFO is > 1, assuming -log10 scaled values
WARN [2023-10-24 08:57:25] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead.
INFO [2023-10-24 08:57:29] 1951294 (64.1%) variants annotated as likely germline (DB INFO flag).
WARN [2023-10-24 08:58:37] Found 590496 variants with missing allelic fraction starting with chr1:10439_AC/A. Removing them.
INFO [2023-10-24 08:59:06] TUMOR is tumor in VCF file.
INFO [2023-10-24 08:59:22] 1792 homozygous and 1547 heterozygous variants on chrX.
INFO [2023-10-24 08:59:22] Sex from VCF: NA (Fisher's p-value: < 0.0001, odds-ratio: 3.63).
INFO [2023-10-24 08:59:22] Detected MuTect2 VCF.
INFO [2023-10-24 08:59:37] Removing 0 Mutect2 calls due to blacklisted failure reasons.
INFO [2023-10-24 08:59:42] Removing 2053349 non heterozygous (in matched normal) germline SNPs.
INFO [2023-10-24 08:59:42] Base quality scores range from -1 to 37 (offset by 1)
INFO [2023-10-24 08:59:43] Minimum number of supporting reads ranges from 1 to 41, depending on coverage and BQS.
INFO [2023-10-24 09:01:09] Initial testing for significant sample cross-contamination: unlikely
INFO [2023-10-24 09:01:11] Removing 209124 variants with AF < 0.030 or AF >= 1.000 or insufficient supporting reads or depth < 15.
INFO [2023-10-24 09:01:11] Total size of targeted genomic region: 34.90Mb (53.62Mb with 50bp padding).
INFO [2023-10-24 09:01:14] 11.1% of targets contain variants.
INFO [2023-10-24 09:01:14] Removing 166082 variants outside intervals.
INFO [2023-10-24 09:01:14] Found SOMATIC annotation in VCF.

lima1 closed this as completed Nov 13, 2023

GuidoLeoni mentioned this issue Dec 15, 2023

In case this may be of help #338

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PureCN writes only a fraction of the variants reported in vcf #329

PureCN writes only a fraction of the variants reported in vcf #329

GuidoLeoni commented Oct 5, 2023

lima1 commented Oct 9, 2023

lima1 commented Oct 9, 2023

lima1 commented Nov 13, 2023

GuidoLeoni commented Dec 14, 2023 •

edited

Loading

GuidoLeoni commented Dec 15, 2023

PureCN writes only a fraction of the variants reported in vcf #329

PureCN writes only a fraction of the variants reported in vcf #329

Comments

GuidoLeoni commented Oct 5, 2023

Calculate and GC-normalize coverage from a BAM file

lima1 commented Oct 9, 2023

lima1 commented Oct 9, 2023

lima1 commented Nov 13, 2023

GuidoLeoni commented Dec 14, 2023 • edited Loading

GuidoLeoni commented Dec 15, 2023

GuidoLeoni commented Dec 14, 2023 •

edited

Loading