FINEMAP credible sets not included in the results #7
Hi @jdblischak, thanks for reporting this in such detail! Here are my thoughts currently; any suggestions are welcome.
Credible Sets
Currently, I only record the top credible set in the multi-finemap file because I was trying to avoid assigning multiple PPs to each SNP, so that FINEMAP.PP stays a single column (all my other echolocatoR code is built around this structure). So basically, if SNPs are in the top CS they're assigned FINEMAP.CS=1, else 0. Perhaps there's a better way to do this that preserves this information without the user having to read in multiple .cred files themselves after running I can think of two strategies currently:
FINEMAP version differences
zstandard
|
@bschilder Thanks for the detailed follow-up. My thoughts below:
I didn't realize that Also, I don't think that is what the code does, at least not for FINEMAP 1.4. As far as I can tell, the CS info from FINEMAP 1.4 is not consulted at all. Instead the algorithm appears to be
I don't understand this. The PIP for a SNP is unaffected by the CS it is assigned. My recommendation would be to obtain the information the same as the PolyFun authors do, unless you have a compelling reason that is incorrect:
I'm happy to help with this. And my apologies if I am missing something obvious.
You should absolutely email the author for help troubleshooting the installation on macOS. There is no public forum for discussing these issues, and the code is closed source, so we can't build a conda package to ease installation. |
Thanks for the feedback @jdblischak! This does get rather confusing, so I may not have explained some things very well. I've tried a number of different approaches to this, so it's entirely possible I could be mixing up strategies from different versions of echolocatoR and FINEMAP (apologies if this is the case).
Credible Sets and Posterior Probabilities
One of the main decision points I've gone back and forth on is whether to use
This is all a bit tricky to explain in words so I'm going to push some of these proposed changes to the dev branch. If you don't mind, perhaps you could then look these over and see if they seem sensible to you?
FINEMAP support
I tried reaching out to the FINEMAP author a couple weeks ago via email but haven't heard back yet (christian.benner@helsinki.fi and cc'ed finemap@christianbenner.com). Not sure if he still uses these emails but I can't seem to find any others. Also tried reaching out via LinkedIn. I'll keep you posted if I have any luck. Thanks again for your help! |
This doesn't make sense to me. If the goal is to produce results that can be compared across the different fine-mapping methods, I don't see why the conditional probabilities should be reported. For susieR you are reporting the PIPs in From your documentation: From the FINEMAP documentation for
From the FINEMAP documentation for
If you want to report the posterior probability that a given SNP is causal, I think you should report In cases where there is only 1 SNP per CS, I don't think it matters whether you report the PIP from As an example, here's code to run the example shipped with FINEMAP:
mkdir /tmp/fmexample
cd /tmp/fmexample
wget http://christianbenner.com/finemap_v1.4_x86_64.tgz
tar xf finemap_v1.4_x86_64.tgz
cd finemap_v1.4_x86_64/
./finemap_v1.4_x86_64 --sss --in-files example/master --dataset 1
Taking a quick look at the results in R, PIP and PP are not identical (though they are correlated). The most dramatic difference is rs6. Its PIP is only 0.04. However, its PP for being included in cred3 is 0.17.
library(data.table)
snp <- fread("example/data.snp")
snp[prob > 0.009, .(rsid, prob)]
# rsid prob
# 1: rs7 1.0000000
# 2: rs30 0.9999690
# 3: rs6 0.0417854
# 4: rs31 0.0107216
# 5: rs8 0.0105046
# 6: rs48 0.0100717
cred2 <- fread("example/data.cred2", skip = 5)
cred2
# index cred1 prob1 cred2 prob2
# 1: 1 rs7 1 rs30 1
cred3 <- fread("example/data.cred3", skip = 5)
cred3[prob3 > 0.02, ]
# index cred1 prob1 cred2 prob2 cred3 prob3
# 1: 1 rs7 1 rs30 1 rs6 0.1733260
# 2: 2 <NA> NA <NA> NA rs31 0.0433775
# 3: 3 <NA> NA <NA> NA rs8 0.0424081
# 4: 4 <NA> NA <NA> NA rs48 0.0405975
# 5: 5 <NA> NA <NA> NA rs46 0.0361650
# 6: 6 <NA> NA <NA> NA rs14 0.0281224
# 7: 7 <NA> NA <NA> NA rs10 0.0253583
# 8: 8 <NA> NA <NA> NA rs11 0.0248557
# 9: 9 <NA> NA <NA> NA rs19 0.0246048
# 10: 10 <NA> NA <NA> NA rs32 0.0242520
# 11: 11 <NA> NA <NA> NA rs20 0.0241354
# 12: 12 <NA> NA <NA> NA rs17 0.0237760
# 13: 13 <NA> NA <NA> NA rs13 0.0233587
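To make the PIP vs. PP comparison above easier to reproduce, here is a minimal sketch that reshapes a wide cred table to long format and joins it to the per-SNP PIPs. The toy tables only copy a few values from the output above. The helper name `cred_to_long` and the column names `cs` and `pp` are my own for illustration, not part of FINEMAP or echolocatoR.

```r
library(data.table)

# Hypothetical helper: turn FINEMAP's wide .cred table (cred1/prob1, cred2/prob2, ...)
# into one row per (credible set, SNP) pair.
cred_to_long <- function(cred) {
  cs_cols <- grep("^cred[0-9]+$", names(cred), value = TRUE)
  cs_ids <- as.integer(sub("^cred", "", cs_cols))
  long <- rbindlist(lapply(cs_ids, function(k) {
    data.table(cs = k,
               rsid = cred[[paste0("cred", k)]],
               pp = cred[[paste0("prob", k)]])
  }))
  long[!is.na(rsid)]  # drop the NA padding of the shorter credible sets
}

# Toy inputs copied from the first rows of the output shown above
snp <- data.table(rsid = c("rs7", "rs30", "rs6", "rs31"),
                  prob = c(1, 0.9999690, 0.0417854, 0.0107216))
cred3 <- data.table(index = 1:2,
                    cred1 = c("rs7", NA), prob1 = c(1, NA),
                    cred2 = c("rs30", NA), prob2 = c(1, NA),
                    cred3 = c("rs6", "rs31"), prob3 = c(0.1733260, 0.0433775))

# One row per SNP: its PIP (prob), the CS it belongs to, and its within-CS PP
res <- merge(snp, cred_to_long(cred3), by = "rsid", all.x = TRUE)
res[order(cs, -pp)]
```

Keeping `prob` (the PIP) and `pp` (the within-CS probability) as separate columns avoids having to choose one of them for a single column.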
This is a reasonable suggestion, but may not be worth the effort. If you report the highest number of credible sets, the user can always apply a PIP threshold to remove the SNPs that are less convincing (and thus remove the less convincing credible sets as well). The earlier credible sets should be unchanged. The FINEMAP example above illustrates this. The PP that there are only 2 causal SNPs is much higher than the PP for 3 causal SNPs. However, if you were to report the results from
% head -n 1 example/data.cred2
# Post-Pr(# of causal SNPs is 2) = 0.773769
% head -n 1 example/data.cred3
# Post-Pr(# of causal SNPs is 3) = 0.202562
Looking at b351c10b4cb25c21446abb3791a8cb0bb4084eda, I think it's good you now detect cred files other than Though for the reason I stated above (and for consistency with PolyFun), I think you should select the cred file with the largest number of configurations, not the fewest, i.e. |
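For reference, the two selection rules discussed above (largest k vs. most probable k) can be compared with a small sketch that parses the first line of each cred file. The header strings are copied from the `head -n 1` output above. `parse_cred_header` is a hypothetical helper, and the regular expression assumes the header format shown here stays the same across files.

```r
# Hypothetical helper: extract k and Post-Pr(k) from the first line of a .cred file
parse_cred_header <- function(header) {
  m <- regmatches(header,
                  regexec("causal SNPs is ([0-9]+)\\) = ([0-9.eE+-]+)", header))[[1]]
  if (length(m) != 3) stop("Unrecognized .cred header: ", header)
  list(k = as.integer(m[2]), post_pr = as.numeric(m[3]))
}

# Header lines copied from the output above
headers <- c("# Post-Pr(# of causal SNPs is 2) = 0.773769",
             "# Post-Pr(# of causal SNPs is 3) = 0.202562")
parsed <- lapply(headers, parse_cred_header)
ks  <- vapply(parsed, function(x) x$k, integer(1))
pps <- vapply(parsed, function(x) x$post_pr, numeric(1))

ks[which.max(ks)]   # the largest k (the file a largest-k rule would pick)
ks[which.max(pps)]  # the k with the highest posterior probability
```

With real output, the header line of each file could be read with `readLines(path, n = 1)` before being passed to the helper.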
Note: When performing fine-mapping via
reprex
locus_dir <- file.path(tempdir(),echodata::locus_dir)
dat <- echofinemap::drop_finemap_cols(echodata::BST1)
LD_matrix <- echodata::BST1_LD_matrix
out <- echoLD::subset_common_snps(LD_matrix, dat)
LD_matrix <- out$LD
dat <- out$DT
dat2 <- echofinemap::FINEMAP(dat=dat,
locus_dir=locus_dir,
LD_matrix=LD_matrix)
results
FINEMAP version differences
I've spent quite a bit of time examining the outputs of each FINEMAP version under different conditions, and I think I've accounted for these in |
@bschilder Thanks for following up and for putting all this work into new functionality and documentation.
In general I think returning more results to the end user is the right thing to do. With your new setup, how does the number of credible sets compare between SuSiE and FINEMAP? That was the main issue I had with the previous iteration (i.e., FINEMAP always showed 1 CS even if it actually identified more). I attempted to run your reprex, but I couldn't get echoconda installed:
|
Regarding the conda installation issue, could you include your OS specs? Haven't encountered this issue on MacOS or Linux docker containers, and can't seem to replicate it now. |
Discovered while troubleshooting RajLabMSSM/echofinemap#7 Confirmed this was also caught by `R CMD check`: https://github.com/RajLabMSSM/echoconda/actions/runs/3028422625/jobs/4873194999#step:23:48
That error happened in a peculiar setup. R is running inside of a singularity container on an HPC environment. In case it's helpful, below I reproduced the installation error and appended the session information:
> remotes::install_github("RajLabMSSM/echoconda", upgrade = FALSE)
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo RajLabMSSM/echoconda@HEAD
✔ checking for file ‘/local/tmp/RtmphmtsG7/remotes561138236260/RajLabMSSM-echoconda-49cac2a/DESCRIPTION’ (351ms)
─ preparing ‘echoconda’:
✔ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Omitted ‘LazyData’ from DESCRIPTION
─ building ‘echoconda_0.99.7.tar.gz’
Installing package into ‘/mnt/user-library/r420-bioc315-20220506’
(as ‘lib’ is unspecified)
* installing *source* package ‘echoconda’ ...
** using non-staged installation via StagedInstall field
Error in data.table::fread(system.file(package = "echoconda", "conda/echoR_versions.tsv.gz")) :
Input is empty or only contains BOM or terminal control characters
Calls: <Anonymous> -> eval -> eval -> get_echoR -> <Anonymous>
Execution halted
ERROR: configuration failed for package ‘echoconda’
* removing ‘/mnt/user-library/r420-bioc315-20220506/echoconda’
Warning message:
In i.p(...) :
installation of package ‘/local/tmp/RtmphmtsG7/file5611322a38a1/echoconda_0.99.7.tar.gz’ had non-zero exit status
> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] processx_3.7.0 compiler_4.2.0 R6_2.5.1 rprojroot_2.0.3
[5] cli_3.3.0 prettyunits_1.1.1 tools_4.2.0 withr_2.5.0
[9] curl_4.3.2 crayon_1.5.1 remotes_2.4.2 callr_3.7.0
[13] pkgbuild_1.3.1 ps_1.7.1 |
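The log above shows `fread()` failing on a path built with `system.file()`. One thing that might be worth ruling out is whether that path resolves to an existing, non-empty file at the time of the call, since `system.file()` returns an empty string when the file is not found. The sketch below is only a diagnostic under that assumption, not a confirmed explanation of the error, and `check_pkg_file` is a hypothetical helper.

```r
# Hypothetical helper: report where system.file() resolves a packaged file
# and whether there is anything on disk to read.
check_pkg_file <- function(pkg, rel_path) {
  path <- system.file(rel_path, package = pkg)  # "" when the file is not found
  found <- nzchar(path) && file.exists(path)
  list(path = path,
       found = found,
       size = if (found) file.size(path) else NA_real_)
}

# Once the package is installed, one could call, e.g.:
# check_pkg_file("echoconda", "conda/echoR_versions.tsv.gz")
str(check_pkg_file("stats", "DESCRIPTION"))
```

A guard like this in front of the `fread()` call would at least turn a missing file into a more direct error message.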
Also, I was able to install echoconda and echofinemap in a more traditional setup with R installed in a conda environment. I just had to install echodata prior to installing echoconda (see PR RajLabMSSM/echoconda#6).
> remotes::install_github("RajLabMSSM/echodata", upgrade = FALSE)
> remotes::install_github("RajLabMSSM/echoconda", upgrade = FALSE)
> library(echoconda)
> remotes::install_github("RajLabMSSM/echofinemap", upgrade = FALSE)
> library(echofinemap)
> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: ~/.conda/envs/test-echo/lib/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] echofinemap_0.99.3 echoconda_0.99.7
loaded via a namespace (and not attached):
[1] colorspace_2.0-3 rjson_0.2.21
[3] ellipsis_0.3.2 downloadR_0.99.4
[5] XVector_0.34.0 GenomicRanges_1.46.1
[7] remotes_2.4.2 DT_0.25
[9] bit64_4.0.5 AnnotationDbi_1.56.1
[11] fansi_1.0.3 xml2_1.3.3
[13] splines_4.1.3 snpStats_1.44.0
[15] R.methodsS3_1.8.2 cachem_1.0.6
[17] echoLD_0.99.7 jsonlite_1.8.0
[19] Rsamtools_2.10.0 dbplyr_2.2.1
[21] png_0.1-7 R.oo_1.25.0
[23] echodata_0.99.12 BiocManager_1.30.18
[25] readr_2.1.2 compiler_4.1.3
[27] httr_1.4.4 basilisk_1.6.0
[29] assertthat_0.2.1 Matrix_1.4-1
[31] fastmap_1.1.0 cli_3.4.0
[33] htmltools_0.5.3 prettyunits_1.1.1
[35] tools_4.1.3 gtable_0.3.1
[37] glue_1.6.2 GenomeInfoDbData_1.2.7
[39] dplyr_1.0.10 rappdirs_0.3.3
[41] Rcpp_1.0.9 Biobase_2.54.0
[43] vctrs_0.4.1 Biostrings_2.62.0
[45] echotabix_0.99.8 rtracklayer_1.54.0
[47] stringr_1.4.1 openxlsx_4.2.5
[49] lifecycle_1.0.2 irlba_2.3.5
[51] restfulr_0.0.15 XML_3.99-0.10
[53] zlibbioc_1.40.0 scales_1.2.1
[55] basilisk.utils_1.6.0 BSgenome_1.62.0
[57] VariantAnnotation_1.40.0 coloc_5.1.0.1
[59] hms_1.1.2 MatrixGenerics_1.6.0
[61] parallel_4.1.3 SummarizedExperiment_1.24.0
[63] susieR_0.12.27 yaml_2.3.5
[65] curl_4.3.2 memoise_2.0.1
[67] reticulate_1.26 gridExtra_2.3
[69] ggplot2_3.3.6 biomaRt_2.50.0
[71] reshape_0.8.9 stringi_1.7.8
[73] RSQLite_2.2.8 S4Vectors_0.32.4
[75] BiocIO_1.4.0 GenomicFeatures_1.46.1
[77] BiocGenerics_0.40.0 filelock_1.0.2
[79] zip_2.2.1 BiocParallel_1.28.3
[81] GenomeInfoDb_1.30.0 rlang_1.0.5
[83] pkgconfig_2.0.3 matrixStats_0.62.0
[85] bitops_1.0-7 lattice_0.20-45
[87] purrr_0.3.4 GenomicAlignments_1.30.0
[89] htmlwidgets_1.5.4 bit_4.0.4
[91] tidyselect_1.1.2 plyr_1.8.7
[93] magrittr_2.0.3 R6_2.5.1
[95] IRanges_2.28.0 generics_0.1.3
[97] piggyback_0.1.4 DelayedArray_0.20.0
[99] DBI_1.1.3 pillar_1.8.1
[101] survival_3.4-0 KEGGREST_1.34.0
[103] RCurl_1.98-1.8 mixsqp_0.3-43
[105] tibble_3.1.8 dir.expiry_1.2.0
[107] crayon_1.5.1 utf8_1.2.2
[109] BiocFileCache_2.2.0 tzdb_0.3.0
[111] viridis_0.6.2 progress_1.2.2
[113] grid_4.1.3 data.table_1.14.2
[115] blob_1.2.3 digest_0.6.29
[117] tidyr_1.2.1 R.utils_2.12.0
[119] stats4_4.1.3 munsell_0.5.0
[121] viridisLite_0.4.1 |
Now that I have echofinemap installed, I tried to run the reprex. I got an error from MungeSumstats. I probably need to update all the packages to their latest versions, but I don't have the time to continue troubleshooting this at the moment.
locus_dir <- file.path(tempdir(),echodata::locus_dir)
dat <- echofinemap::drop_finemap_cols(echodata::BST1)
LD_matrix <- echodata::BST1_LD_matrix
out <- echoLD::subset_common_snps(LD_matrix, dat)
LD_matrix <- out$LD
dat <- out$DT
dat2 <- echofinemap::FINEMAP(dat=dat,
locus_dir=locus_dir,
LD_matrix=LD_matrix)
Loading required namespace: MungeSumstats
Preparing sample size column (N).
Error: 'compute_nsize' is not an exported object from 'namespace:MungeSumstats'
> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /gstore/home/blischaj/.conda/envs/test-echo/lib/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] echofinemap_0.99.3 echoconda_0.99.7
loaded via a namespace (and not attached):
[1] colorspace_2.0-3 rjson_0.2.21
[3] ellipsis_0.3.2 downloadR_0.99.4
[5] XVector_0.34.0 fs_1.5.2
[7] GenomicRanges_1.46.1 remotes_2.4.2
[9] DT_0.25 bit64_4.0.5
[11] AnnotationDbi_1.56.1 fansi_1.0.3
[13] xml2_1.3.3 splines_4.1.3
[15] snpStats_1.44.0 R.methodsS3_1.8.2
[17] cachem_1.0.6 echoLD_0.99.7
[19] jsonlite_1.8.0 Rsamtools_2.10.0
[21] dbplyr_2.2.1 png_0.1-7
[23] R.oo_1.25.0 echodata_0.99.12
[25] BiocManager_1.30.18 readr_2.1.2
[27] compiler_4.1.3 httr_1.4.4
[29] basilisk_1.6.0 assertthat_0.2.1
[31] Matrix_1.4-1 fastmap_1.1.0
[33] gargle_1.2.1 cli_3.4.0
[35] htmltools_0.5.3 prettyunits_1.1.1
[37] tools_4.1.3 gtable_0.3.1
[39] glue_1.6.2 GenomeInfoDbData_1.2.7
[41] dplyr_1.0.10 rappdirs_0.3.3
[43] Rcpp_1.0.9 Biobase_2.54.0
[45] vctrs_0.4.1 Biostrings_2.62.0
[47] echotabix_0.99.8 rtracklayer_1.54.0
[49] stringr_1.4.1 MungeSumstats_1.2.0
[51] openxlsx_4.2.5 lifecycle_1.0.2
[53] irlba_2.3.5 restfulr_0.0.15
[55] XML_3.99-0.10 googleAuthR_2.0.0
[57] zlibbioc_1.40.0 scales_1.2.1
[59] basilisk.utils_1.6.0 BSgenome_1.62.0
[61] VariantAnnotation_1.40.0 coloc_5.1.0.1
[63] hms_1.1.2 MatrixGenerics_1.6.0
[65] parallel_4.1.3 SummarizedExperiment_1.24.0
[67] susieR_0.12.27 yaml_2.3.5
[69] curl_4.3.2 memoise_2.0.1
[71] reticulate_1.26 gridExtra_2.3
[73] ggplot2_3.3.6 biomaRt_2.50.0
[75] reshape_0.8.9 stringi_1.7.8
[77] RSQLite_2.2.8 S4Vectors_0.32.4
[79] BiocIO_1.4.0 GenomicFeatures_1.46.1
[81] BiocGenerics_0.40.0 filelock_1.0.2
[83] zip_2.2.1 BiocParallel_1.28.3
[85] GenomeInfoDb_1.30.0 rlang_1.0.5
[87] pkgconfig_2.0.3 matrixStats_0.62.0
[89] bitops_1.0-7 lattice_0.20-45
[91] purrr_0.3.4 GenomicAlignments_1.30.0
[93] htmlwidgets_1.5.4 bit_4.0.4
[95] tidyselect_1.1.2 plyr_1.8.7
[97] magrittr_2.0.3 R6_2.5.1
[99] IRanges_2.28.0 generics_0.1.3
[101] piggyback_0.1.4 DelayedArray_0.20.0
[103] DBI_1.1.3 pillar_1.8.1
[105] survival_3.4-0 KEGGREST_1.34.0
[107] RCurl_1.98-1.8 mixsqp_0.3-43
[109] tibble_3.1.8 dir.expiry_1.2.0
[111] crayon_1.5.1 utf8_1.2.2
[113] BiocFileCache_2.2.0 tzdb_0.3.0
[115] viridis_0.6.2 progress_1.2.2
[117] grid_4.1.3 data.table_1.14.2
[119] blob_1.2.3 digest_0.6.29
[121] tidyr_1.2.1 R.utils_2.12.0
[123] stats4_4.1.3 munsell_0.5.0
[125] viridisLite_0.4.1 |
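As a quick check for the "not an exported object" error above, the sketch below reports the installed MungeSumstats version and whether it exports `compute_nsize`. `has_export` is a hypothetical helper. The session info above lists MungeSumstats_1.2.0, but which version first exported the function is not something this thread establishes.

```r
# Hypothetical helper: TRUE if `fn` is exported by the installed package `pkg`
has_export <- function(pkg, fn) {
  requireNamespace(pkg, quietly = TRUE) && fn %in% getNamespaceExports(pkg)
}

if (requireNamespace("MungeSumstats", quietly = TRUE)) {
  message("MungeSumstats version: ",
          as.character(utils::packageVersion("MungeSumstats")))
  message("exports compute_nsize: ",
          has_export("MungeSumstats", "compute_nsize"))
}
```

If the second message prints FALSE, updating MungeSumstats would be the first thing to try.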
1. This happens bc
2. Great, thanks for the PR! I must have added
3. You're exactly right, Though it's odd this is coming up for you since I have a minimum version required for
|
This isn't blocking me, so feel free to ignore. But I already had R.utils installed in that environment, so I am skeptical that was the problem. In case it could be useful for future troubleshooting, here are the versions of data.table and R.utils I had:
library("data.table")
## data.table 1.14.2 using 1 threads (see ?getDTthreads). Latest news: r-datatable.com
library("R.utils")
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.
##
## Attaching package: ‘R.oo’
##
## The following object is masked from ‘package:R.methodsS3’:
##
## throw
##
## The following objects are masked from ‘package:methods’:
##
## getClasses, getMethods
##
## The following objects are masked from ‘package:base’:
##
## attach, detach, load, save
##
## R.utils v2.12.0 (2022-06-28 03:20:05 UTC) successfully loaded. See ?R.utils for help.
##
## Attaching package: ‘R.utils’
##
## The following object is masked from ‘package:utils’:
##
## timestamp
##
## The following objects are masked from ‘package:base’:
##
## cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
##
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] R.utils_2.12.0 R.oo_1.25.0 R.methodsS3_1.8.2 data.table_1.14.2
##
## loaded via a namespace (and not attached):
## [1] compiler_4.2.0 tools_4.2.0 |
Also, would you be able to export the |
While exploring the results in the echolocatoR Shiny app, I noticed that FINEMAP never returns more than 1 credible set per locus. This was confusing since SuSiE often returns more than 1 credible set.
I started diving into the code, and I think the issue is that it only looks for data.cred, when FINEMAP returns data.cred#, where # is the number of credible sets. PolyFun obtains the credible sets by searching backwards from the maximum number of causal SNPs until it finds a .cred file (source):
echolocatoR only checks for data.cred:
https://github.com/RajLabMSSM/echolocatoR/blob/b055ac0fb74c914d7600e3afc650a2ffd7149396/R/FINEMAP.R#L336
And then when it is extracting the PIPs from the .snp file, it assigns any SNP that meets the PIP threshold to CS 1.
https://github.com/RajLabMSSM/echolocatoR/blob/b055ac0fb74c914d7600e3afc650a2ffd7149396/R/FINEMAP.R#L363-L364
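The backwards search described above can be sketched in R roughly as follows. This is not PolyFun's or echolocatoR's actual code. `find_cred_file` and its `max_causal` default are hypothetical, and the fallback to an un-suffixed `.cred` file is an assumption about older FINEMAP output.

```r
# Hypothetical helper: return the .cred file with the largest suffix k,
# searching from max_causal downwards.
find_cred_file <- function(prefix, max_causal = 5) {
  for (k in seq(max_causal, 1)) {
    path <- paste0(prefix, ".cred", k)
    if (file.exists(path)) return(path)
  }
  # Assumption: older FINEMAP versions may write a single un-suffixed file
  legacy <- paste0(prefix, ".cred")
  if (file.exists(legacy)) return(legacy)
  NA_character_
}

# Demo with empty placeholder files in a temporary directory
tmp <- tempfile("fm")
dir.create(tmp)
file.create(file.path(tmp, c("data.cred1", "data.cred2", "data.cred3")))
basename(find_cred_file(file.path(tmp, "data")))
```

Returning `NA_character_` when nothing is found lets the caller decide whether to fall back to PIP-only output or stop with an error.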
I attempted to put together a minimal, reproducible example to demonstrate this behavior. However, using the latest version on master, FINEMAP is currently returning NA for both the .CS and the .PP columns.
See here for minimal, reproducible example
See here for the results