Merge pull request #325 from ddrichel/issue_321
Updated documentation and comments: --max-allele-counts, dbSNP member…
lima1 authored Dec 7, 2023
2 parents 6328404 + 1bc83a4 commit a41f054
Showing 11 changed files with 52 additions and 50 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -69,4 +69,4 @@ biocViews: CopyNumberVariation, Software, Sequencing,
VariantAnnotation, VariantDetection, Coverage, ImmunoOncology
NeedsCompilation: no
ByteCompile: yes
-RoxygenNote: 7.2.3
+RoxygenNote: 7.2.3.9000
12 changes: 6 additions & 6 deletions R/filterVcf.R
@@ -17,7 +17,7 @@
#' which potentially have allelic fractions smaller than 1 due to artifacts or
#' contamination. If a matched normal is available, this value is ignored,
#' because homozygosity can be confirmed in the normal.
-#' @param contamination.range Count variants in dbSNP with allelic fraction
+#' @param contamination.range Count variants in germline databases with allelic fraction
#' in the specified range. If the number of these putative contamination
#' variants exceeds an expected value and if they are found on almost all
#' chromosomes, the sample is flagged as potentially contaminated and extra
@@ -49,7 +49,7 @@
#' the specified size in bp. Requires \code{target.granges}.
#' @param DB.info.flag Flag in INFO of VCF that marks presence in common
#' germline databases. Defaults to \code{DB} that may contain somatic variants
-#' if it is from an unfiltered dbSNP VCF.
+#' if it is from an unfiltered germline database.
#' @return A list with elements \item{vcf}{The filtered \code{CollapsedVCF}
#' object.} \item{flag}{A flag (\code{logical(1)}) if problems were
#' identified.} \item{flag_comment}{A comment describing the flagging.}
@@ -160,7 +160,7 @@ interval.padding = 50, DB.info.flag = "DB") {
ifelse(fractionContaminated > minFractionContaminated, "maybe", "unlikely"))
}

-# do we have many low allelic fraction calls that are in dbSNP on basically
+# do we have many low allelic fraction calls that are in germline databases on basically
# all chromosomes? then we found some contamination
if (sum(runLength(seqnames(rowRanges(vcf[idx]))) > 3) >= 20 &&
fractionContaminated >= minFractionContaminated) {
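The condition in this hunk can be read as "more than 3 putative contamination variants on at least 20 chromosomes, and a large enough contaminated fraction". A minimal base-R sketch of that check follows, assuming the variants are sorted by chromosome so that each chromosome forms one run. The helper `looks_contaminated` and the 0.02 threshold in the usage lines are hypothetical and are not part of PureCN.

```r
# Hypothetical helper, not part of PureCN: mirrors the shape of the check
# sum(runLength(seqnames(...)) > 3) >= 20 && fraction >= minimum
looks_contaminated <- function(chrom, fraction.contaminated,
                               min.fraction.contaminated,
                               min.variants = 3, min.chromosomes = 20) {
    # rle() yields one run per chromosome when the input is sorted
    runs <- rle(as.character(chrom))$lengths
    sum(runs > min.variants) >= min.chromosomes &&
        fraction.contaminated >= min.fraction.contaminated
}

# 22 chromosomes with 4 low allelic fraction calls each
spread <- rep(paste0("chr", 1:22), each = 4)
# many calls, but confined to 5 chromosomes
clustered <- rep(paste0("chr", 1:5), each = 10)

res_spread <- looks_contaminated(spread, 0.03, 0.02)
res_clustered <- looks_contaminated(clustered, 0.03, 0.02)
```

Requiring hits on almost all chromosomes separates genome-wide contamination from a local artifact that produces many low allelic fraction calls in one region.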
@@ -223,7 +223,7 @@ interval.padding = 50, DB.info.flag = "DB") {
}
if (!is.null(info(vcf)[[DB.info.flag]]) &&
sum(info(vcf)[[DB.info.flag]]) < nrow(vcf) / 2) {
-flog.warn("Less than half of variants in dbSNP. Make sure that VCF %s",
+flog.warn("Less than half of variants are likely somatic. Make sure that VCF %s",
"contains both germline and somatic variants.")
}

@@ -501,7 +501,7 @@ function(vcf, tumor.id.in.vcf, allowed = 0.05) {
} else if (!is.null(info(vcf)[[POPAF.info.field]]) &&
max(unlist(info(vcf)[[POPAF.info.field]]), na.rm = TRUE) > 0.05) {
if (max(unlist(info(vcf)[[POPAF.info.field]]), na.rm = TRUE) > 1.1) {
-flog.info("Maximum of POPAP INFO is > 1, assuming -log10 scaled values")
+flog.info("Maximum of POPAF INFO is > 1, assuming -log10 scaled values")
db <- info(vcf)[[POPAF.info.field]] < -log10(min.pop.af)
} else {
db <- info(vcf)[[POPAF.info.field]] > min.pop.af
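The scale detection in this hunk relies on the fact that an allele frequency cannot exceed 1, so a maximum above 1.1 is taken as a sign of -log10 scaled values. A minimal sketch of that branch on plain numeric vectors follows. The helper name `flag_germline_by_popaf` and the `min.pop.af = 0.001` default are assumptions made for illustration, not values taken from this diff.

```r
# Hypothetical helper, not part of PureCN: flag variants as present in a
# germline database from a vector of population allele frequencies.
flag_germline_by_popaf <- function(popaf, min.pop.af = 0.001) {
    if (max(popaf, na.rm = TRUE) > 1.1) {
        # values above 1 cannot be frequencies, so assume -log10 scaling;
        # on that scale, small values correspond to common variants
        popaf < -log10(min.pop.af)
    } else {
        popaf > min.pop.af
    }
}

linear <- c(0.25, 0.00001, 0.02)
scaled <- -log10(linear)

flag_linear <- flag_germline_by_popaf(linear)
flag_scaled <- flag_germline_by_popaf(scaled)
```

Both calls flag the first and third variant. Note that the heuristic depends on at least one rare variant being present: a -log10 scaled vector containing only common variants would have a maximum below 1.1 and be treated as linear.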
@@ -523,7 +523,7 @@ function(vcf, tumor.id.in.vcf, allowed = 0.05) {
newInfo <- DataFrame(
Number = 0,
Type = "Flag",
-Description = "dbSNP Membership",
+Description = "Likely somatic status, based on SOMATIC or Cosmic.CNT info fields, population allele frequency, or dbSNP membership",
row.names = DB.info.flag)
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
info(vcf)[[DB.info.flag]] <- db
4 changes: 2 additions & 2 deletions R/plotAbs.R
@@ -170,7 +170,7 @@ ss) {

.getAFPlotGroups <- function(r, single.mode) {
if (single.mode) {
-groupLevels <- c("dbSNP/germline", "dbSNP/somatic", "novel/somatic",
+groupLevels <- c("dbSNP or POPAF/germline", "dbSNP or POPAF/somatic", "novel/somatic",
"novel/germline", "COSMIC/germline", "COSMIC/somatic", "contamination")
r$group <- groupLevels[1]
r$group[r$prior.somatic < 0.1 & r$ML.SOMATIC] <- groupLevels[2]
@@ -566,7 +566,7 @@ ss) {
y=r$ML.AR[r$ML.SOMATIC][idx.labels],
labels=scatter.labels[idx.labels])
}
-idxSomatic <- !grepl("germline|dbSNP|contamination", as.character(r$group))
+idxSomatic <- !grepl("germline|dbSNP or POPAF|contamination", as.character(r$group))
if (sum(idxSomatic)) {
colSomatic <- mycol.palette$color[match(names(sort(table(r$group[idxSomatic]),
decreasing=TRUE)[1]), mycol.palette$group)]
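To see which plot groups the updated pattern keeps as somatic, the group levels from the first hunk of this file can be run through the same `grepl` call. This is only a sketch on the label vector itself; the variable names are hypothetical.

```r
# Group labels as renamed in this commit (single-sample mode)
group_levels <- c("dbSNP or POPAF/germline", "dbSNP or POPAF/somatic",
    "novel/somatic", "novel/germline", "COSMIC/germline", "COSMIC/somatic",
    "contamination")

# New pattern from this commit
is_somatic_new <- !grepl("germline|dbSNP or POPAF|contamination", group_levels)
# Previous pattern, applied to the renamed labels
is_somatic_old <- !grepl("germline|dbSNP|contamination", group_levels)

somatic_groups <- group_levels[is_somatic_new]
```

Only `novel/somatic` and `COSMIC/somatic` remain. Because "dbSNP" is a substring of the new labels, the previous pattern selects the same groups, so the pattern change keeps the code consistent with the labels rather than altering behavior.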
2 changes: 1 addition & 1 deletion R/readAllelicCountsFile.R
@@ -80,7 +80,7 @@ readAllelicCountsFile <- function(file, format, zero=NULL) {
info(header(vcf)) <- DataFrame(
Number = "0",
Type = "Flag",
-Description = "dbSNP Membership",
+Description = "Likely somatic status, based on SOMATIC or Cosmic.CNT info fields, population allele frequency, or dbSNP membership",
row.names = "DB")

geno(header(vcf)) <- DataFrame(
18 changes: 9 additions & 9 deletions R/runAbsoluteCN.R
@@ -3,10 +3,10 @@
#' This function takes as input tumor and normal control coverage data and
#' a VCF containing allelic fractions of germline variants and somatic
#' mutations. Normal control does not need to be from the same patient.
-#' In case VCF does not contain somatic status, it should contain dbSNP and
-#' optionally COSMIC annotation. Returns purity and ploidy combinations,
-#' sorted by likelihood score. Provides copy number and LOH data, by both
-#' gene and genomic region.
+#' In case VCF does not contain somatic status, it should contain either
+#' dbSNP or population allele frequencies, and optionally COSMIC annotation.
+#' Returns purity and ploidy combinations, sorted by likelihood score.
+#' Provides copy number and LOH data, by both gene and genomic region.
#'
#'
#' @param normal.coverage.file Coverage file of normal control (optional
@@ -29,7 +29,7 @@
#' @param vcf.file VCF file.
#' Optional, but typically needed to select between local optima of similar
#' likelihood. Can also be a \code{CollapsedVCF}, read with the \code{readVcf}
-#' function. Requires a DB info flag for dbSNP membership. The default
+#' function. Requires a DB info flag for likely somatic status. The default
#' \code{fun.setPriorVcf} function will also look for a Cosmic.CNT slot (see
#' \code{cosmic.vcf.file}), containing the hits in the COSMIC database. Again,
#' do not expect very useful results without a VCF file.
@@ -131,10 +131,10 @@
#' likelihood score calculation. Note that bias is reported on an inverse
#' scale; a variant with mapping bias of 1 has no bias.
#' @param max.pon Exclude variants found more than \code{max.pon} times in
-#' pool of normals and not in dbSNP. Requires \code{mapping.bias.file} in
-#' \code{\link{setMappingBiasVcf}}. Should be set to a value high enough
+#' pool of normals and not in germline databases. Requires \code{mapping.bias.file}
+#' in \code{\link{setMappingBiasVcf}}. Should be set to a value high enough
#' to be much more likely an artifact and not a true germline variant not
-#' present in dbSNP.
+#' present in germline databases.
#' @param min.variants.segment Flag segments with fewer variants. The
#' minor copy number estimation is not reliable with insufficient variants.
#' @param iterations Maximum number of iterations in the Simulated Annealing
@@ -189,7 +189,7 @@
#' and indexed with bgzip and tabix, respectively.
#' @param DB.info.flag Flag in INFO of VCF that marks presence in common
#' germline databases. Defaults to \code{DB} that may contain somatic variants
-#' if it is from an unfiltered dbSNP VCF.
+#' if it is from an unfiltered germline database.
#' @param POPAF.info.field As alternative to a flag, use an info field that
#' contains population allele frequencies. The \code{DB} info flag has priority
#' over this field when both exist.
17 changes: 9 additions & 8 deletions R/setPriorVcf.R
@@ -8,21 +8,22 @@
#' function from the VariantAnnotation package.
#' @param prior.somatic Prior probabilities for somatic mutations. First value
#' is for the case when no matched normals are available and the variant is not
-#' in dbSNP (second value). Third value is for variants with MuTect somatic
-#' call. Different from 1, because somatic mutations in segments of copy number
-#' 0 have 0 probability and artifacts can thus have dramatic influence on
+#' in germline databases (second value). Third value is for variants with MuTect
+#' somatic call. Different from 1, because somatic mutations in segments of copy
+#' number 0 have 0 probability and artifacts can thus have dramatic influence on
#' likelihood score. Forth value is for variants not labeled as somatic by
#' MuTect. Last two values are optional, if vcf contains a flag Cosmic.CNT, it
-#' will set the prior probability for variants with CNT > 2 to the first of
+#' will set the prior probability for variants with CNT > 6 to the first of
#' those values in case of no matched normal available (0.995 default). Final
-#' value is for the case that variant is in both dbSNP and COSMIC > 2.
+#' value is for the case that variant is in both germline databases and
+#' COSMIC count > 6.
#' @param tumor.id.in.vcf Id of tumor in case multiple samples are stored in
#' VCF.
#' @param min.cosmic.cnt Minimum number of hits in the COSMIC database to
#' call variant as likely somatic.
#' @param DB.info.flag Flag in INFO of VCF that marks presence in common
#' germline databases. Defaults to \code{DB} that may contain somatic variants
-#' if it is from an unfiltered dbSNP VCF.
+#' if it is from an unfiltered germline database.
#' @param Cosmic.CNT.info.field Info field containing hits in the Cosmic database
#' @return The \code{vcf} with \code{numeric(nrow(vcf))} vector with the
#' prior probability of somatic status for each variant in the
@@ -59,14 +60,14 @@ setPriorVcf <- function(vcf, prior.somatic = c(0.5, 0.0005, 0.999, 0.0001,
if (!is.null(info(vcf)[[Cosmic.CNT.info.field]])) {
flog.info("Found COSMIC annotation in VCF. Requiring %i hits.",
min.cosmic.cnt)
-flog.info("Setting somatic prior probabilities for hits to %f or to %f if in both COSMIC and dbSNP.",
+flog.info("Setting somatic prior probabilities for hits to %f or to %f if in both COSMIC and likely somatic based on dbSNP membership or population allele frequency.",
tmp[5], tmp[6])

prior.somatic[which(info(vcf)[[Cosmic.CNT.info.field]] >= min.cosmic.cnt)] <- tmp[5]
prior.somatic[which(info(vcf)[[Cosmic.CNT.info.field]] >= min.cosmic.cnt &
info(vcf)[[DB.info.flag]])] <- tmp[6]
} else {
-flog.info("Setting somatic prior probabilities for dbSNP hits to %f or to %f otherwise.",
+flog.info("Setting somatic prior probabilities for likely somatic hits to %f or to %f otherwise.",
tmp[2], tmp[1])
}
}
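The prior assignment described in this file for the case without a matched normal can be sketched on plain vectors. The helper `assign_prior_somatic` is hypothetical and only mirrors the indexing visible in the hunk above (`tmp[1]`, `tmp[2]`, `tmp[5]`, `tmp[6]`). In the usage lines, the first four prior values come from the truncated default in the hunk header and the fifth from the documented 0.995; the sixth value and the `min.cosmic.cnt` of 6 are illustrative placeholders, since the source truncates the default vector.

```r
# Hypothetical helper, not part of PureCN: prior probability of somatic
# status per variant when no matched normal is available.
assign_prior_somatic <- function(in.db, cosmic.cnt = NULL, prior,
                                 min.cosmic.cnt) {
    # second value if flagged in germline databases, first value otherwise
    p <- ifelse(in.db, prior[2], prior[1])
    if (!is.null(cosmic.cnt)) {
        hit <- !is.na(cosmic.cnt) & cosmic.cnt >= min.cosmic.cnt
        # enough COSMIC hits: fifth value
        p[hit] <- prior[5]
        # enough COSMIC hits and also flagged in germline databases: sixth value
        p[hit & in.db] <- prior[6]
    }
    p
}

# sixth value (0.5) is a placeholder, see the note above
prior <- c(0.5, 0.0005, 0.999, 0.0001, 0.995, 0.5)
in.db <- c(TRUE, FALSE, FALSE, TRUE)
cosmic.cnt <- c(0, 0, 10, 10)

p <- assign_prior_somatic(in.db, cosmic.cnt, prior, min.cosmic.cnt = 6)
```

The last variant shows why COSMIC annotation matters: a hotspot that also appears in a germline database is not forced to the very low germline-database prior.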
2 changes: 1 addition & 1 deletion inst/extdata/PureCN.R
@@ -105,7 +105,7 @@ option_list <- list(
help = "Maximum considered ploidy [default %default]"),
make_option(c("--max-copy-number"), action = "store", type = "double",
default = max(eval(formals(PureCN::runAbsoluteCN)$test.num.copy)),
-help = "Maximum allele-specific integer copy number [default %default]"),
+help = "Maximum allele-specific integer copy number, only used for fitting allele-specific copy numbers. Higher copy numbers might still be inferred and reported [default %default]"),
make_option(c("--post-optimize"), action = "store_true", default = FALSE,
help = "Post-optimization [default %default]"),
make_option(c("--bootstrap-n"), action = "store", type = "integer", default = 0,
4 changes: 2 additions & 2 deletions man/filterVcfBasic.Rd

Some generated files are not rendered by default.

18 changes: 9 additions & 9 deletions man/runAbsoluteCN.Rd

Some generated files are not rendered by default.

13 changes: 7 additions & 6 deletions man/setPriorVcf.Rd

Some generated files are not rendered by default.

10 changes: 5 additions & 5 deletions vignettes/PureCN.Rnw
@@ -947,7 +947,7 @@ NOISY SEGMENTATION & More than \Rcode{max.segments} \\
NON-ABERRANT & $\geq 99 $\% of genome has identical copy number and $\geq 0.5$\% has second most common state \\
POLYGENOMIC & $\geq 0.75 \times$ \Rcode{max.non.clonal} fraction of the genome in sub-clonal state \\
POOR GOF & GoF $<$ \Rcode{min.gof} \\
-POTENTIAL SAMPLE CONTAMINATION & Significant portion of dbSNP variants potentially cross-contaminated \\
+POTENTIAL SAMPLE CONTAMINATION & Significant portion of variants present in germline databases are potentially cross-contaminated \\
RARE KARYOTYPE & Ploidy $< 1.5$ or $> 4.5$ \\
\bottomrule
\end{tabular}
@@ -1115,12 +1115,12 @@ If a matched normal is not available, it is also helpful to provide
\Rcode{cosmic.vcf.file} (or via a \Rcode{Cosmic.CNT} INFO field in the VCF).
While this has limited effect on purity and ploidy estimation due the sparsity
of hotspot mutations, it often helps in the manual curation to compare how well
-high confidence germline (dbSNP) vs. somatic (COSMIC) variants fit a particular
-purity/ploidy combination.
+high confidence germline (based on dbSNP membership or POPAF) vs. somatic (COSMIC)
+variants fit a particular purity/ploidy combination.

For variant classification (Section~\ref{predictsomatic}), providing COSMIC
-annotation also avoids that hotspot mutations with dbSNP id get assigned a very
-low prior probability of being somatic.
+annotation also avoids that hotspot mutations in germline databases get assigned
+a very low prior probability of being somatic.

\subsection{ExAC and gnomAD annotation}
