Major refactoring
Rosemeis committed Jan 12, 2025
1 parent f90d41f commit 58ccfd9
Showing 18 changed files with 1,086 additions and 1,089 deletions.
61 changes: 24 additions & 37 deletions README.md
@@ -1,75 +1,63 @@
# PCAngsd

**Version 1.35**
# PCAngsd (v1.36.0)

Framework for analyzing low-depth next-generation sequencing (NGS) data in heterogeneous/structured populations using principal component analysis (PCA). Population structure is inferred by estimating individual allele frequencies in an iterative approach using a truncated SVD model. The covariance matrix is estimated using the estimated individual allele frequencies as prior information for the unobserved genotypes in low-depth NGS data.
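
For intuition, here is a rough *python* sketch of that iterative idea. It is illustrative only and not the estimator implemented in *PCAngsd*: the helper name, the `(N, M, 3)` likelihood layout, the starting frequencies, the clipping bounds and the convergence check are all assumptions made for the example.

```python
import numpy as np

def sketch_individual_freqs(L, e, iters=100, tol=1e-5):
    """L: (N, M, 3) genotype likelihoods for dosages 0/1/2; e: number of PCs."""
    N, M, _ = L.shape
    dosage = np.array([0.0, 1.0, 2.0])
    f = np.full(M, 0.25)                  # assumed starting population frequencies
    Pi = np.tile(f, (N, 1))               # individual allele frequencies
    for _ in range(iters):
        # posterior expected genotypes, using the current frequencies as HWE prior
        prior = np.stack([(1 - Pi)**2, 2*Pi*(1 - Pi), Pi**2], axis=-1)
        post = L * prior
        post /= post.sum(axis=-1, keepdims=True)
        E = post @ dosage                 # (N, M) expected genotype dosages
        # truncated SVD of the centred dosages gives a low-rank reconstruction
        U, S, Vt = np.linalg.svd(E - 2*f, full_matrices=False)
        R = (U[:, :e] * S[:e]) @ Vt[:e]
        Pi_new = np.clip((R + 2*f)/2, 1e-4, 1 - 1e-4)
        if np.sqrt(np.mean((Pi_new - Pi)**2)) < tol:
            return Pi_new, E
        Pi = Pi_new
    return Pi, E
```

The expected dosages and individual allele frequencies produced by such a scheme are what the covariance matrix and the downstream analyses are built on; see the papers under Citation for the exact estimators.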

The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. *PCAngsd* can perform the following analyses:
The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. `pcangsd` can be used for the following analyses:

* Covariance matrix
* Admixture estimation
* Inbreeding coefficients (both per-individual and per-site)
* Inbreeding coefficients (both per-sample and per-site)
* HWE test
* Genome-wide selection scans
* Genotype calling
* Estimate NJ tree of samples


## Get PCAngsd and build
### Dependencies
The *PCAngsd* software relies on the following Python packages that you can install through conda (recommended) or pip:

- numpy
- cython
- scipy

You can create an environment through conda easily or install dependencies through pip as follows:
```
# Conda environment
conda env create -f environment.yml
# pip
pip3 install --user -r requirements.txt
```

## Install and build
## Installation
```bash
# Option 1: Build and install via PyPI
pip install pcangsd

# Option 2: Download source and install via pip
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
pip3 install .
```
pip install .

You can now run *PCAngsd* with the `pcangsd` command.
# Option 3: Download source and install in a new Conda environment
git clone https://github.com/Rosemeis/pcangsd.git
conda env create -f pcangsd/environment.yml
conda activate pcangsd
```
You can now run the program with the `pcangsd` command.

## Usage
### Running *PCAngsd*
*PCAngsd* works directly on genotype likelihood files or PLINK files.
`pcangsd` works directly on genotype likelihood files or PLINK files.
```bash
# See all options
pcangsd -h

# Genotype likelihood file in Beagle format with 2 eigenvectors using 64 threads
pcangsd -b input.beagle.gz -e 2 -t 64 -o pcangsd
pcangsd --beagle input.beagle.gz --eig 2 --threads 64 --out pcangsd
# By default, outputs a log file (pcangsd.log) and the covariance matrix (pcangsd.cov)

# PLINK files (using file-prefix, *.bed, *.bim, *.fam)
pcangsd -p input.plink -e 2 -t 64 -o pcangsd
pcangsd --plink input.plink --eig 2 --threads 64 --out pcangsd

# Perform PC-based selection scan and estimate admixture proportions
pcangsd -b input.beagle.gz -e 2 -t 64 -o pcangsd --selection --admix
pcangsd --beagle input.beagle.gz --eig 2 --threads 64 --out pcangsd --selection --admix
# Outputs the following files:
# log-file (pcangsd.log)
# covariance matrix (pcangsd.cov)
# selection statistics (pcangsd.selection)
# admixture proportions (pcangsd.admix.3.Q)
# ancestral allele frequencies (pcangsd.admix.3.F)
```
*PCAngsd* will output most files in text format.
`pcangsd` will output most files in text format.

Quick example of reading output and creating PCA plot in R:
Quick example of reading output and creating PCA plot in *R*:
```R
C <- as.matrix(read.table("pcangsd.cov")) # Reads estimated covariance matrix
D <- as.matrix(read.table("pcangsd.selection")) # Reads PC based selection statistics
D <- as.matrix(read.table("pcangsd.selection")) # Reads PC-based selection statistics

# Plot PCA
e <- eigen(C)
@@ -79,7 +67,7 @@ plot(e$vectors[,1:2], xlab="PC1", ylab="PC2", main="PCAngsd")
p <- pchisq(D, 1, lower.tail=FALSE)
```

Read files in python and create PCA plot using matplotlib:
Read files in *python* and create PCA plot using matplotlib:
```python
import matplotlib.pyplot as plt
import numpy as np
@@ -102,9 +90,8 @@ p = chi2.sf(D, 1)
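
The admixture proportions written with `--admix` (e.g. `pcangsd.admix.3.Q` above) are plain text as well. A minimal *python* sketch for a stacked barplot, assuming the file holds one row per sample and one column per ancestry component:

```python
import matplotlib.pyplot as plt
import numpy as np

Q = np.loadtxt("pcangsd.admix.3.Q")      # assumed layout: samples x K proportions
samples = np.arange(Q.shape[0])
bottom = np.zeros(Q.shape[0])
for k in range(Q.shape[1]):              # stack one bar segment per component
    plt.bar(samples, Q[:, k], bottom=bottom, width=1.0)
    bottom += Q[:, k]
plt.xlabel("Samples")
plt.ylabel("Admixture proportion")
plt.savefig("admixture.png")
```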

Beagle genotype likelihood files can be generated from BAM files using [ANGSD](https://github.com/ANGSD/angsd). For inference of population structure in genotype data with non-random missingness, we recommend our [EMU](https://github.com/Rosemeis/emu) software, which performs accelerated EM-PCA, though with fewer functionalities than *PCAngsd*.


## Citation
Please cite our papers:
Please cite our papers if you use the `pcangsd` framework:

Population structure: [Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data](http://www.genetics.org/content/210/2/719).\
HWE test: [Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019).\
13 changes: 7 additions & 6 deletions environment.yml
@@ -1,8 +1,9 @@
name: pcangsd
channels:
- defaults
dependencies:
- python
- numpy
- scipy
- cython
- python>=3.10
- cython>3.0.0
- numpy>2.0.0
- scipy>1.14.0
- pip
- pip:
- pcangsd
122 changes: 56 additions & 66 deletions pcangsd/admixture.py
@@ -1,72 +1,58 @@
"""
PCAngsd.
Estimate admixture proportions and ancestral allele frequencies.
"""

__author__ = "Jonas Meisner"

# libraries
import numpy as np
from math import ceil

# Import scripts
from pcangsd import shared_cy
from pcangsd import admixture_cy

##### Admixture estimation #####
def admixNMF(L, P, K, alpha, iter, tole, batch, seed, verbose, t):
m, n = P.shape
np.random.seed(seed)
shuffleP = np.random.permutation(m)
X = P[shuffleP,:] # Shuffled individual allele frequencies
def admixNMF(L, P, K, alpha, iter, tole, batch, rng, verbose):
M, N = P.shape
M_b = ceil(float(M)/batch)
S_i = rng.permuted(np.arange(M, dtype=np.uint32))
X = P[S_i,:] # Shuffled individual allele frequencies

# Initiate factor matrices and containers
Q = np.random.rand(n, K).astype(np.float32, copy=False)
F = rng.random(size=(M, K), dtype=np.float32)
Q = rng.random(size=(N, K), dtype=np.float32)
Q /= np.sum(Q, axis=1, keepdims=True)
QA = np.zeros((n, K), dtype=np.float32)
QB = np.zeros((K, K), dtype=np.float32)
QC = np.zeros((n, K), dtype=np.float32)
F = np.random.rand(m, K).astype(np.float32, copy=False)
Q0 = np.zeros_like(Q)
QA = np.zeros_like(Q)
QC = np.zeros_like(Q)
FB = np.zeros((K, K), dtype=np.float32)
frob_vec = np.zeros(m, dtype=np.float32)
logVec = np.zeros(m)
QB = np.zeros((K, K), dtype=np.float32)

# Batch preparation
m_b = ceil(float(m)/batch)
bIndex = list(range(0, m, m_b))

# CSG-MU - cyclic mini-batch stochastic gradient descent
for i in range(iter):
Q0 = np.copy(Q)
for b in bIndex:
bEnd = min(b + m_b, m)
Xbatch = X[b:bEnd,:]
Fbatch = F[b:bEnd,:]
mInner = Fbatch.shape[0]
pF = 2*(1 + (n*mInner + n*K)//(mInner*K + mInner))
pQ = 2*(1 + (n*mInner + mInner*K)//(n*K + n))
for i in np.arange(iter):
memoryview(Q0.ravel())[:] = memoryview(Q.ravel())
for b in np.arange(batch):
bEnd = min((b+1)*M_b, M)
X_batch = X[(b*M_b):bEnd,:]
F_batch = F[(b*M_b):bEnd,:]
M_i = F_batch.shape[0]
pF = 2*(1 + (N*M_i + N*K)//(M_i*K + M_i))
pQ = 2*(1 + (N*M_i + M_i*K)//(N*K + N))

# Update F
FA = np.dot(Xbatch, Q)
FA = np.dot(X_batch, Q)
np.dot(Q.T, Q, out=FB)
for inner in range(pF):
F_prev = np.copy(Fbatch)
FC = np.dot(Fbatch, FB)
admixture_cy.updateF(Fbatch, FA, FC)
for inner in np.arange(pF):
F_prev = np.copy(F_batch)
FC = np.dot(F_batch, FB)
admixture_cy.updateF(F_batch, FA, FC)
if inner == 0:
F_init = shared_cy.frobenius(Fbatch, F_prev)
F_init = shared_cy.frobenius(F_batch, F_prev)
else:
if (shared_cy.frobenius(Fbatch, F_prev) <= (0.1*F_init)):
if (shared_cy.frobenius(F_batch, F_prev) <= (0.1*F_init)):
break

# Update Q
np.dot(Xbatch.T, Fbatch, out=QA)
np.dot(Fbatch.T, Fbatch, out=QB)
for inner in range(pQ):
np.dot(X_batch.T, F_batch, out=QA)
np.dot(F_batch.T, F_batch, out=QB)
for inner in np.arange(pQ):
Q_prev = np.copy(Q)
np.dot(Q, QB, out=QC)
admixture_cy.updateQ(Q, QA, QC, alpha)
Q /= np.sum(Q, axis=1, keepdims=True)
if inner == 0:
Q_init = shared_cy.frobenius(Q, Q_prev)
else:
@@ -76,44 +62,42 @@ def admixNMF(L, P, K, alpha, iter, tole, batch, seed, verbose, t):
# Measure difference
diff = shared_cy.rmse2d(Q, Q0)
if verbose:
print(f"CSG-MU ({i+1}).\tRMSE={np.round(diff,9)}")
print(f"CSG-MU ({i+1}).\tRMSE={diff:.9f}")
if diff < tole:
print("Converged.")
break
del X, Xbatch, Fbatch, QA, QB, QC, FA, FB, FC, Q_prev, F_prev
del X, X_batch, F_batch, QA, QB, QC, FA, FB, FC, Q_prev, F_prev

# Reshuffle columns of F
F = F[np.argsort(shuffleP)]
F = F[np.argsort(S_i)]

# Frobenius and log-likelihood
X = np.dot(F, Q.T)
shared_cy.frobeniusThread(X, P, frob_vec, t)
print(f"Frobenius error: {np.round(np.sqrt(np.sum(frob_vec)),5)}")
admixture_cy.loglike(L, X, logVec, t)
logLike = np.sum(logVec, dtype=float)
print(f"Log-likelihood: {np.round(logLike, 5)}")
del logVec
return Q, F, logLike
F_err = shared_cy.frobeniusMulti(X, P)
L_lik = admixture_cy.loglike(L, X)
print(f"Frobenius error: {F_err:.5f}")
print(f"Log-likelihood: {L_lik:.5f}")
return Q, F, L_lik

##### Automatic search for alpha #####
def alphaSearch(L, P, K, aEnd, iter, tole, batch, seed, depth, t):
def alphaSearch(L, P, K, aEnd, iter, tole, batch, depth, rng):
# First search
aMin = 0
aMax = aEnd
aMid = (aMin + aMax)/2.0
aStep = (aMin + aMax)/4.0
print(f"Depth=1,\tRunning Alpha={aMin}")
QB, FB, lB = admixNMF(L, P, K, aMin, iter, tole, batch, seed, False, t)
QB, FB, lB = admixNMF(L, P, K, aMin, iter, tole, batch, rng, False)
aL = 0
aB = aMin
print(f"Depth=1,\tRunning Alpha={aMid}")
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
aL = 1
aB = aMid
print(f"Depth=1,\tRunning Alpha={aMax}")
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
aL = 2
@@ -127,30 +111,36 @@ def alphaSearch(L, P, K, aEnd, iter, tole, batch, seed, depth, t):
aMid = [aMin, aMid, aMax][aL]
aMin = aMid - aStep
aMax = aMid + aStep
for d in range(depth-1):
for d in np.arange(depth-1):
if aMin == 0:
print(f"Depth={d+2},\tRunning Alpha={aMid}")
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
memoryview(QB.ravel())[:] = memoryview(QT)
memoryview(FB.ravel())[:] = memoryview(FT)
lB = lT
aL = 1
aB = aMid
else:
print(f"Depth={d+2},\tRunning Alpha={aMin}")
QT, FT, lT = admixNMF(L, P, K, aMin, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMin, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
memoryview(QB.ravel())[:] = memoryview(QT)
memoryview(FB.ravel())[:] = memoryview(FT)
lB = lT
aL = 0
aB = aMin
else:
print(f"Depth={d+2},\tRunning Alpha={aMax}")
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
memoryview(QB.ravel())[:] = memoryview(QT)
memoryview(FB.ravel())[:] = memoryview(FT)
lB = lT
aL = 2
aB = aMax
else:
argL = 1
aL = 1
aStep /= 2.0
if aMin == 0:
aMax = aMid
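
A recurring pattern in the refactored code is `memoryview(dst.ravel())[:] = memoryview(src.ravel())`, which replaces calls like `Q0 = np.copy(Q)`. For C-contiguous arrays, `ravel()` returns a view, so the assignment copies values into the existing buffer instead of allocating a new array, much like `np.copyto(dst, src)`. A minimal self-contained sketch (array names and sizes are illustrative):

```python
import numpy as np

src = np.arange(6, dtype=np.float32).reshape(2, 3)
dst = np.zeros_like(src)

# ravel() gives views of C-contiguous arrays, so this writes into dst's own
# buffer without creating a temporary copy of src.
memoryview(dst.ravel())[:] = memoryview(src.ravel())
assert np.array_equal(src, dst)
```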