The R package gact is designed for establishing and populating a comprehensive database of genomic associations with complex traits. The package serves two primary functions: infrastructure creation and data acquisition. It facilitates the assembly of a structured repository of single-marker associations, curated to maintain high data quality. Beyond individual genetic markers, the package integrates a broad range of genomic entities, including genes, proteins, and biological complexes (chemical and protein), as well as biological pathways. Linking markers to these entities is intended to aid the biological interpretation of genomic associations with complex traits.
gact provides an infrastructure for efficient processing of large-scale genomic association data, including core functions for:
- Establishing and populating a database of genomic associations.
- Downloading and processing a range of biological databases.
- Downloading and processing summary statistics from genome-wide association studies (GWAS).
- Conducting bioinformatic procedures to link genetic markers with genes, proteins, metabolites, and biological pathways.
- Finemapping of genomic regions using Bayesian Linear Regression models.
- Performing advanced gene set enrichment analysis utilizing a variety of tools and methodologies.
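As an illustration of what linking genetic markers to genes can look like, the following is a minimal base R sketch that assigns markers to genes by genomic position. It is not gact's implementation or API: the function `map_markers_to_genes`, the `flank` parameter, the column names, and the toy data are all hypothetical and chosen only for this example.

```r
# Toy data; the column names are assumptions for this sketch, not gact's schema.
markers <- data.frame(
  rsid = c("rs1", "rs2", "rs3", "rs4"),
  chr  = c(1, 1, 2, 2),
  pos  = c(1500, 9000, 300, 5200),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  gene  = c("GENE_A", "GENE_B", "GENE_C"),
  chr   = c(1, 1, 2),
  start = c(1000, 8000, 5000),
  end   = c(2000, 8500, 6000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each gene, collect the markers on the same
# chromosome that lie within [start - flank, end + flank].
map_markers_to_genes <- function(markers, genes, flank = 0) {
  sets <- lapply(seq_len(nrow(genes)), function(i) {
    hit <- markers$chr == genes$chr[i] &
      markers$pos >= genes$start[i] - flank &
      markers$pos <= genes$end[i] + flank
    markers$rsid[hit]
  })
  names(sets) <- genes$gene
  sets
}

# Named list: one character vector of marker IDs per gene.
marker_sets <- map_markers_to_genes(markers, genes, flank = 500)
print(marker_sets)
```

The result is a named list of marker IDs per gene, which is one simple way to represent marker sets in R. A symmetric flanking window is used here for brevity; strand-aware upstream and downstream windows would need an additional strand column.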
gact constructs gene and genetic marker sets from a range of biological databases including:
- "Ensembl": Gene, protein, and transcript sets derived from the Ensembl database.
- "Regulation": Regulatory genomic feature sets derived from the Ensembl Regulation database.
- "GO": Gene Ontology sets from the GO database.
- "Pathways": Pathway sets from the Reactome and KEGG databases.
- "ProteinComplexes": Protein complex sets derived from the STRING database.
- "ChemicalComplexes": Chemical complex sets derived from the STITCH database.
- "DrugGenes": Drug-gene interaction sets derived from the DrugBank database.
- "DrugATCGenes": Drug ATC gene sets based on the ATC and DrugBank databases.
- "DrugComplexes": Drug-gene complex sets combining information from STRING and DrugBank.
- "DiseaseGenes": Disease-gene sets based on experiments, text mining, and knowledge bases, derived from the DISEASE database.
- "GTEx": GTEx project eQTL sets derived from the GTEx database.
- "GWAScatalog": GWAS catalog sets derived from the GWAScatalog database.
- "VEP": Variant Effect Predictor sets derived from the Ensembl Variant Effect Predictor database.
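To make the role of such sets concrete, here is a minimal base R sketch of a gene set over-representation test based on the hypergeometric distribution. This is a deliberately simple stand-in, not the MAGMA or Bayesian Linear Regression approaches covered in the tutorials, and not gact's API. The function `enrich_hypergeom`, the set names, and the gene IDs are hypothetical.

```r
# Hypothetical gene sets, represented as a named list of gene IDs.
universe <- paste0("g", 1:20)            # all genes tested
sets <- list(
  SET_A = paste0("g", 1:5),
  SET_B = paste0("g", 11:15)
)
hits <- c("g1", "g2", "g3", "g4", "g16") # genes flagged as associated

# Over-representation test: for each set, the probability of observing at
# least `overlap` hits when length(hits) genes are drawn from the universe.
enrich_hypergeom <- function(sets, hits, universe) {
  hits <- intersect(hits, universe)
  res <- lapply(names(sets), function(nm) {
    set <- intersect(sets[[nm]], universe)
    overlap <- length(intersect(set, hits))
    # phyper(q, ..., lower.tail = FALSE) gives P(X > q), so q = overlap - 1
    # yields P(X >= overlap).
    p <- phyper(overlap - 1, length(set), length(universe) - length(set),
                length(hits), lower.tail = FALSE)
    data.frame(set = nm, size = length(set), overlap = overlap, p = p,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, res)
}

enrichment <- enrich_hypergeom(sets, hits, universe)
print(enrichment)
```

The test treats genes as independent draws and ignores gene size and LD between markers, which is one reason gene-level methods that account for these factors are often preferred in practice.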
To install the most recent versions of the gact and qgg packages from GitHub, use the following commands in R:
library(devtools)
devtools::install_github("psoerensen/gact")
devtools::install_github("psoerensen/qgg")
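If the installation is scripted, it can be convenient to install only what is missing. The sketch below uses base R's `requireNamespace` to find packages that are not yet available. The helper name `missing_pkgs` is hypothetical, and the install call is left commented out so the snippet has no side effects.

```r
# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("gact", "qgg"))
print(to_install)

# Uncomment to install the missing packages from GitHub (requires devtools):
# for (pkg in to_install) devtools::install_github(paste0("psoerensen/", pkg))
```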
Below is a set of tutorials used for the gact package:
Download and set up the gact database, which is focused on genomic
associations for complex traits:
Download and install gact database
Download and process genotype data from the 1000 Genomes Project (1000G)
for different ancestries (European, East Asian, South Asian) used in
different genomic analyses:
Download and process 1000G data
Compute sparse linkage disequilibrium (LD) matrices for 1000 Genomes
Project (1000G) data across different ancestries and explore the LD
data, which are used in a number of genomic analyses (LD score regression,
VEGAS gene analysis, Bayesian Linear Regression models):
Compute sparse LD matrices for 1000G data
Download and process genome-wide association summary statistics and
ingest them into the database:
Download and process new GWAS summary statistics
Gene analysis using the VEGAS (Versatile Gene-based Association Study)
approach with the 1000G LD reference data processed above:
Gene analysis using VEGAS
Gene set enrichment analysis (GSEA) based on BLR (Bayesian Linear
Regression) model derived gene-level statistics and MAGMA (Multi-marker
Analysis of GenoMic Annotation) (Bai et al. 2024).
Gene set analysis using MAGMA
Pathway prioritization using single- and multiple-trait Bayesian MAGMA
models and gene-level statistics derived from VEGAS (Gholipourshahraki
et al. 2024).
Gene set analysis using Bayesian MAGMA
Polygenic Prioritization Scoring (PoPS) using BLR models and gene-level
statistics derived from VEGAS (work in progress).
Gene ranking using PoPS
Finemapping with single trait Bayesian Linear Regression models and
simulated data (Shrestha et al. 2023).
Finemapping using BLR models on simulated data
Finemapping of gene and LD regions using single trait Bayesian Linear
Regression models (Shrestha et al. 2023).
Finemapping using BLR models on real data
Polygenic scoring (PGS) using Bayesian Linear Regression models and
biological pathway information (work in progress).
Polygenic scoring using BLR models
Polygenic scoring (PGS) using summary statistics from the PGS Catalog and
biological pathway information.
Polygenic scoring using PGS Catalog
LD score regression for estimating genomic heritability and
correlations.
LD score regression
These notes and scripts were prepared in the BALDER project, funded by the ODIN platform. ODIN is sponsored by the Novo Nordisk Foundation (grant number NNF20SA0061466).
- Rohde PD, Sørensen IF, Sørensen P. 2020. qgg: an R package for large-scale quantitative genetic analyses. Bioinformatics 36:8. doi.org/10.1093/bioinformatics/btz955
- Rohde PD, Sørensen IF, Sørensen P. 2023. Expanded utility of the R package, qgg, with applications within genomic medicine. Bioinformatics 39:11. doi.org/10.1093/bioinformatics/btad656
- Shrestha et al. 2023. Evaluation of Bayesian Linear Regression Models as a Fine Mapping Tool. Submitted. doi.org/10.1101/2023.09.01.555889
- Bai et al. 2024. Evaluation of multiple marker mapping methods using single trait Bayesian Linear Regression models. In preparation.
- Gholipourshahraki et al. 2024. Evaluation of Bayesian Linear Regression Models for Pathway Prioritization. In preparation.