ggLocusZoom
has since been deprecated and is now replaced by echoplot
as part of the broader echoverse suite.
The source code for ggLocusZoom
is being kept here simply for posterity.
LocusZoom is a very useful and commonly used tool for visualizing genomic signals in GWAS/QTL data. But when you're making many plots for many loci, it makes sense to do this programmatically.
There is a standalone version of LocusZoom for R and Python, but unfortunately it requires downloading 23GB of data to get all of the databases it relies on. Even if you supply your own data (summary stats, LD, etc.), it's not particularly user friendly. Nor can you customize it further once you have it.
So I made a user-friendly version of it in R that allows you to customize it as a ggplot-like object.
Note: my.LocusZoom now offers the ability to upload all your sum stats at once and make plots for each locus, which helps reduce the redundancy of making each plot individually. That said, it currently doesn't let you customize the resulting figures.
The following tutorial can also be downloaded as an Rscript here.
In command line:
git clone https://github.com/RajLabMSSM/ggLocusZoom.git
cd ggLocusZoom
Optional: You can avoid the hassle of installing different dependencies by
creating a conda environment from this yaml file:
conda env create -f gglocuszoom.yml
conda activate gglocuszoom
- This file can be comma-separated (
.csv
/.txt
) or tab-separated (.tsv
/txt
). - It can also be compressed (
.gz
). - Required columns in summary stats data (one SNP per row):
SNP
: RSID (e.g. rs4698412)P
: Uncorrected p-value from GWAS (e.g. 2.058e-28)CHR
: Chromosome (e.g. 4)POS
: Genomic position in terms of bp (e.g. 15737348)
- An n x n matrix where n is the number of SNPs in your summary stats file. Column and row names must be the RSIDs of the SNPs.
- Can be
.RDS
,.csv
,.tsv
, or.txt
format. - If using this method, you can skip right to step 3. below.
-
If you don't have your own LD matrix, you can download it via the
LD.UKBibank()
function provided here. It will return the path to where the LD matrix was saved. -
NOTE: This can take a very long time (~10-15m/locus) because it has to first import a much larger LD matrix (in
.npz
format) from the Alkes Price lab and then subset it to just the SNPs in your summary stats file. To save space, only the subset that includes your summary stat SNPs will saved saved (as a.RDS
file).- axel: You can drastically speed up this process by installing the program
axel
(download instructions here), which uses multithreading to rapidly download a single file. To do this, setdownload_method="axel"
(default) ANDdownload_full_ld=T
. - wget: If you don't want to install
axel
, you can still usewget
(comes with most computers) to download the .npz/.gz files locally. To do this, setdownload_method="wget"
(default) ANDdownload_full_ld=T
. - For Mount Sinai employees and affiliates only: That said, you can speed up this process a lot by logging into Chimera (Mount Sinai computing cluster). If you set
chimera=T
, thenLD.UKBibank()
will automatically find the UKB LD files that I've pre-downloaded, and imports the one you need.
- axel: You can drastically speed up this process by installing the program
source("./functions/ggLocusZoom.R")
# Optional:
## If you're using the conda enviroment (step 1.5 above), you'll need to tell R to use it with reticulate
reticulate::use_condaenv("")
LD_path <- LD.UKBiobank(# Specify where the summary stats file is.
sumstats_path="./example_data/BST1_Nalls23andMe_2019_subset.txt",
# The folder where you want to save the LD matrix
# (defaults to ./LD)
results_path="./LD",
# If =T , will overwrite a pre-existing LD file with the same name.
force_new_LD=F,
# The same of the locus
# (defaults to "_")
locus="BST1",
# Use pre-downloaded LD files on Chimera computing cluster
## (Mount Sinai employees and affiliates only).
## Must be logged onto Chimera and have
## access to the 'pd-omics' project.
chimera=F,
# [** WARNING **]:
## Only change these defaults if you have plenty of extra storage.
## Each of these files is ~1-3GB.
## Download and save the full .npz/.gz files.
download_full_ld=F,
# You can use either 'wget' or 'axel' to download the files
download_method = "axel",
# Delete the full .gz/.npz ld files after you're done converting them into .RDS
remove_tmps=T)
print(LD_path)
# Import the functions
source("./functions/ggLocusZoom.R")
# Run
gglz <- ggLocusZoom(# Specify where the summary stats file is.
sumstats_path="./example_data/BST1_Nalls23andMe_2019_subset.txt",
# Specify where the pre-computed LD matrix is.
LD_path="./example_data/BST1_UKB-LD.RDS",
# Is the LD matrix in units of r? (as opposed to r2)
LD_units="r",
# Show the plot when it's ready?
show_plot=T,
# Save the plot? (=F if you don't want to)
save_plot="./BST1_ggLocusZoom.png",
# Saved plot height.
height=5,
# Saved plot width.
width=10,
# Optional plot title.
title="",
# Draw a vertical line at the position of the lead SNP.
leadSNP.line=T,
# Plot LD with the leda SNP as a categorical variable
# (very low, low, medium, high, very high) or a continuous variable (0-1).
categorical_r2=T,
# By default (leadGWAS), it will pick the SNP with the lowest p-value
as the index SNP for which to extract LD.
## Alternatively, you can provide an RSID contained within the
dataset as the index SNP (e.g. "rs6849244").
index_snp="leadGWAS",
# Define the limits of the xlim (e.g. c(15239141,16236140)).
## If NULL (default), then the min/max positions in the sumstats will be used.
xlims = NULL,
# The maximum number of transcripts to show per gene in the Gene track
max_transcripts = 3
)
UKB_LD.RDS
: LD matrix from UKBiobank, preprocessed and made available by the Alkes Price lab.BST1_Nalls23andMe_2019_subset.txt
: Summary stats from the PD GWAS in Nalls et al. (2019).
Brian M. Schilder
Raj Lab
Department of Neuroscience
Department of Genetics & Genomic Sciences
Ronald M. Loeb Center for Alzheimer's Disease
Icahn School of Medicine at Mount Sinai