Major refactoring
Rosemeis committed Jan 12, 2025
1 parent f90d41f commit 58ccfd9
Showing 18 changed files with 1,086 additions and 1,089 deletions.
61 changes: 24 additions & 37 deletions README.md
@@ -1,75 +1,63 @@
# PCAngsd

**Version 1.35**
# PCAngsd (v1.36.0)

Framework for analyzing low-depth next-generation sequencing (NGS) data in heterogeneous/structured populations using principal component analysis (PCA). Population structure is inferred by estimating individual allele frequencies in an iterative approach using a truncated SVD model. The covariance matrix is estimated using the estimated individual allele frequencies as prior information for the unobserved genotypes in low-depth NGS data.
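
For intuition, here is a rough *python* sketch of that iterative idea. It is illustrative only and not the estimator implemented in *PCAngsd*: the helper name, the `(N, M, 3)` likelihood layout, the starting frequencies, the clipping bounds and the convergence check are all assumptions made for the example.

```python
import numpy as np

def sketch_individual_freqs(L, e, iters=100, tol=1e-5):
    """L: (N, M, 3) genotype likelihoods for dosages 0/1/2; e: number of PCs."""
    N, M, _ = L.shape
    dosage = np.array([0.0, 1.0, 2.0])
    f = np.full(M, 0.25)                  # assumed starting population frequencies
    Pi = np.tile(f, (N, 1))               # individual allele frequencies
    for _ in range(iters):
        # posterior expected genotypes, using the current frequencies as HWE prior
        prior = np.stack([(1 - Pi)**2, 2*Pi*(1 - Pi), Pi**2], axis=-1)
        post = L * prior
        post /= post.sum(axis=-1, keepdims=True)
        E = post @ dosage                 # (N, M) expected genotype dosages
        # truncated SVD of the centred dosages gives a low-rank reconstruction
        U, S, Vt = np.linalg.svd(E - 2*f, full_matrices=False)
        R = (U[:, :e] * S[:e]) @ Vt[:e]
        Pi_new = np.clip((R + 2*f)/2, 1e-4, 1 - 1e-4)
        if np.sqrt(np.mean((Pi_new - Pi)**2)) < tol:
            return Pi_new, E
        Pi = Pi_new
    return Pi, E
```

The expected dosages and individual allele frequencies produced by such a scheme are what the covariance matrix and the downstream analyses are built on; see the papers under Citation for the exact estimators.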

The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. *PCAngsd* can perform the following analyses:
The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. `pcangsd` can be used for the following analyses:

* Covariance matrix
* Admixture estimation
* Inbreeding coefficients (both per-individual and per-site)
* Inbreeding coefficients (both per-sample and per-site)
* HWE test
* Genome-wide selection scans
* Genotype calling
* Estimate NJ tree of samples


## Get PCAngsd and build
### Dependencies
The *PCAngsd* software relies on the following Python packages that you can install through conda (recommended) or pip:

- numpy
- cython
- scipy

You can create an environment through conda easily or install dependencies through pip as follows:
```
# Conda environment
conda env create -f environment.yml
# pip
pip3 install --user -r requirements.txt
```

## Install and build
## Installation
```bash
# Option 1: Build and install via PyPI
pip install pcangsd

# Option 2: Download source and install via pip
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
pip3 install .
```
pip install .

You can now run *PCAngsd* with the `pcangsd` command.
# Option 3: Download source and install in a new Conda environment
git clone https://github.com/Rosemeis/pcangsd.git
conda env create -f pcangsd/environment.yml
conda activate pcangsd
```
You can now run the program with the `pcangsd` command.

## Usage
### Running *PCAngsd*
*PCAngsd* works directly on genotype likelihood files or PLINK files.
`pcangsd` works directly on genotype likelihood files or PLINK files.
```bash
# See all options
pcangsd -h

# Genotype likelihood file in Beagle format with 2 eigenvectors using 64 threads
pcangsd -b input.beagle.gz -e 2 -t 64 -o pcangsd
pcangsd --beagle input.beagle.gz --eig 2 --threads 64 --out pcangsd
# By default, outputs a log file (pcangsd.log) and the covariance matrix (pcangsd.cov)

# PLINK files (using file-prefix, *.bed, *.bim, *.fam)
pcangsd -p input.plink -e 2 -t 64 -o pcangsd
pcangsd --plink input.plink --eig 2 --threads 64 --out pcangsd

# Perform PC-based selection scan and estimate admixture proportions
pcangsd -b input.beagle.gz -e 2 -t 64 -o pcangsd --selection --admix
pcangsd --beagle input.beagle.gz --eig 2 --threads 64 --out pcangsd --selection --admix
# Outputs the following files:
# log-file (pcangsd.log)
# covariance matrix (pcangsd.cov)
# selection statistics (pcangsd.selection)
# admixture proportions (pcangsd.admix.3.Q)
# ancestral allele frequencies (pcangsd.admix.3.F)
```
*PCAngsd* will output most files in text format.
`pcangsd` will output most files in text format.

Quick example of reading output and creating PCA plot in R:
Quick example of reading output and creating PCA plot in *R*:
```R
C <- as.matrix(read.table("pcangsd.cov")) # Reads estimated covariance matrix
D <- as.matrix(read.table("pcangsd.selection")) # Reads PC based selection statistics
D <- as.matrix(read.table("pcangsd.selection")) # Reads PC-based selection statistics

# Plot PCA
e <- eigen(C)
@@ -79,7 +67,7 @@ plot(e$vectors[,1:2], xlab="PC1", ylab="PC2", main="PCAngsd")
p <- pchisq(D, 1, lower.tail=FALSE)
```

Read files in python and create PCA plot using matplotlib:
Read files in *python* and create PCA plot using matplotlib:
```python
import matplotlib.pyplot as plt
import numpy as np
@@ -102,9 +90,8 @@ p = chi2.sf(D, 1)
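
The admixture proportions written with `--admix` (e.g. `pcangsd.admix.3.Q` above) are plain text as well. A minimal *python* sketch for a stacked barplot, assuming the file holds one row per sample and one column per ancestry component:

```python
import matplotlib.pyplot as plt
import numpy as np

Q = np.loadtxt("pcangsd.admix.3.Q")      # assumed layout: samples x K proportions
samples = np.arange(Q.shape[0])
bottom = np.zeros(Q.shape[0])
for k in range(Q.shape[1]):              # stack one bar segment per component
    plt.bar(samples, Q[:, k], bottom=bottom, width=1.0)
    bottom += Q[:, k]
plt.xlabel("Samples")
plt.ylabel("Admixture proportion")
plt.savefig("admixture.png")
```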

Beagle genotype likelihood files can be generated from BAM files using [ANGSD](https://github.com/ANGSD/angsd). For inference of population structure in genotype data with non-random missingness, we recommend our [EMU](https://github.com/Rosemeis/emu) software, which performs accelerated EM-PCA, though with fewer functionalities than *PCAngsd*.


## Citation
Please cite our papers:
Please cite our papers if you use the `pcangsd` framework:

Population structure: [Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data](http://www.genetics.org/content/210/2/719).\
HWE test: [Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019).\
13 changes: 7 additions & 6 deletions environment.yml
@@ -1,8 +1,9 @@
name: pcangsd
channels:
- defaults
dependencies:
- python
- numpy
- scipy
- cython
- python>=3.10
- cython>3.0.0
- numpy>2.0.0
- scipy>1.14.0
- pip
- pip:
- pcangsd
122 changes: 56 additions & 66 deletions pcangsd/admixture.py
@@ -1,72 +1,58 @@
"""
PCAngsd.
Estimate admixture proportions and ancestral allele frequencies.
"""

__author__ = "Jonas Meisner"

# libraries
import numpy as np
from math import ceil

# Import scripts
from pcangsd import shared_cy
from pcangsd import admixture_cy

##### Admixture estimation #####
def admixNMF(L, P, K, alpha, iter, tole, batch, seed, verbose, t):
m, n = P.shape
np.random.seed(seed)
shuffleP = np.random.permutation(m)
X = P[shuffleP,:] # Shuffled individual allele frequencies
def admixNMF(L, P, K, alpha, iter, tole, batch, rng, verbose):
M, N = P.shape
M_b = ceil(float(M)/batch)
S_i = rng.permuted(np.arange(M, dtype=np.uint32))
X = P[S_i,:] # Shuffled individual allele frequencies

# Initiate factor matrices and containers
Q = np.random.rand(n, K).astype(np.float32, copy=False)
F = rng.random(size=(M, K), dtype=np.float32)
Q = rng.random(size=(N, K), dtype=np.float32)
Q /= np.sum(Q, axis=1, keepdims=True)
QA = np.zeros((n, K), dtype=np.float32)
QB = np.zeros((K, K), dtype=np.float32)
QC = np.zeros((n, K), dtype=np.float32)
F = np.random.rand(m, K).astype(np.float32, copy=False)
Q0 = np.zeros_like(Q)
QA = np.zeros_like(Q)
QC = np.zeros_like(Q)
FB = np.zeros((K, K), dtype=np.float32)
frob_vec = np.zeros(m, dtype=np.float32)
logVec = np.zeros(m)
QB = np.zeros((K, K), dtype=np.float32)

# Batch preparation
m_b = ceil(float(m)/batch)
bIndex = list(range(0, m, m_b))

# CSG-MU - cyclic mini-batch stochastic gradient descent
for i in range(iter):
Q0 = np.copy(Q)
for b in bIndex:
bEnd = min(b + m_b, m)
Xbatch = X[b:bEnd,:]
Fbatch = F[b:bEnd,:]
mInner = Fbatch.shape[0]
pF = 2*(1 + (n*mInner + n*K)//(mInner*K + mInner))
pQ = 2*(1 + (n*mInner + mInner*K)//(n*K + n))
for i in np.arange(iter):
memoryview(Q0.ravel())[:] = memoryview(Q.ravel())
for b in np.arange(batch):
bEnd = min((b+1)*M_b, M)
X_batch = X[(b*M_b):bEnd,:]
F_batch = F[(b*M_b):bEnd,:]
M_i = F_batch.shape[0]
pF = 2*(1 + (N*M_i + N*K)//(M_i*K + M_i))
pQ = 2*(1 + (N*M_i + M_i*K)//(N*K + N))

# Update F
FA = np.dot(Xbatch, Q)
FA = np.dot(X_batch, Q)
np.dot(Q.T, Q, out=FB)
for inner in range(pF):
F_prev = np.copy(Fbatch)
FC = np.dot(Fbatch, FB)
admixture_cy.updateF(Fbatch, FA, FC)
for inner in np.arange(pF):
F_prev = np.copy(F_batch)
FC = np.dot(F_batch, FB)
admixture_cy.updateF(F_batch, FA, FC)
if inner == 0:
F_init = shared_cy.frobenius(Fbatch, F_prev)
F_init = shared_cy.frobenius(F_batch, F_prev)
else:
if (shared_cy.frobenius(Fbatch, F_prev) <= (0.1*F_init)):
if (shared_cy.frobenius(F_batch, F_prev) <= (0.1*F_init)):
break

# Update Q
np.dot(Xbatch.T, Fbatch, out=QA)
np.dot(Fbatch.T, Fbatch, out=QB)
for inner in range(pQ):
np.dot(X_batch.T, F_batch, out=QA)
np.dot(F_batch.T, F_batch, out=QB)
for inner in np.arange(pQ):
Q_prev = np.copy(Q)
np.dot(Q, QB, out=QC)
admixture_cy.updateQ(Q, QA, QC, alpha)
Q /= np.sum(Q, axis=1, keepdims=True)
if inner == 0:
Q_init = shared_cy.frobenius(Q, Q_prev)
else:
@@ -76,44 +62,42 @@ def admixNMF(L, P, K, alpha, iter, tole, batch, seed, verbose, t):
# Measure difference
diff = shared_cy.rmse2d(Q, Q0)
if verbose:
print(f"CSG-MU ({i+1}).\tRMSE={np.round(diff,9)}")
print(f"CSG-MU ({i+1}).\tRMSE={diff:.9f}")
if diff < tole:
print("Converged.")
break
del X, Xbatch, Fbatch, QA, QB, QC, FA, FB, FC, Q_prev, F_prev
del X, X_batch, F_batch, QA, QB, QC, FA, FB, FC, Q_prev, F_prev

# Reshuffle columns of F
F = F[np.argsort(shuffleP)]
F = F[np.argsort(S_i)]

# Frobenius and log-likelihood
X = np.dot(F, Q.T)
shared_cy.frobeniusThread(X, P, frob_vec, t)
print(f"Frobenius error: {np.round(np.sqrt(np.sum(frob_vec)),5)}")
admixture_cy.loglike(L, X, logVec, t)
logLike = np.sum(logVec, dtype=float)
print(f"Log-likelihood: {np.round(logLike, 5)}")
del logVec
return Q, F, logLike
F_err = shared_cy.frobeniusMulti(X, P)
L_lik = admixture_cy.loglike(L, X)
print(f"Frobenius error: {F_err:.5f}")
print(f"Log-likelihood: {L_lik:.5f}")
return Q, F, L_lik

##### Automatic search for alpha #####
def alphaSearch(L, P, K, aEnd, iter, tole, batch, seed, depth, t):
def alphaSearch(L, P, K, aEnd, iter, tole, batch, depth, rng):
# First search
aMin = 0
aMax = aEnd
aMid = (aMin + aMax)/2.0
aStep = (aMin + aMax)/4.0
print(f"Depth=1,\tRunning Alpha={aMin}")
QB, FB, lB = admixNMF(L, P, K, aMin, iter, tole, batch, seed, False, t)
QB, FB, lB = admixNMF(L, P, K, aMin, iter, tole, batch, rng, False)
aL = 0
aB = aMin
print(f"Depth=1,\tRunning Alpha={aMid}")
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
aL = 1
aB = aMid
print(f"Depth=1,\tRunning Alpha={aMax}")
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
aL = 2
@@ -127,30 +111,36 @@ def alphaSearch(L, P, K, aEnd, iter, tole, batch, seed, depth, t):
aMid = [aMin, aMid, aMax][aL]
aMin = aMid - aStep
aMax = aMid + aStep
for d in range(depth-1):
for d in np.arange(depth-1):
if aMin == 0:
print(f"Depth={d+2},\tRunning Alpha={aMid}")
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMid, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
memoryview(QB.ravel())[:] = memoryview(QT)
memoryview(FB.ravel())[:] = memoryview(FT)
lB = lT
aL = 1
aB = aMid
else:
print(f"Depth={d+2},\tRunning Alpha={aMin}")
QT, FT, lT = admixNMF(L, P, K, aMin, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMin, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
memoryview(QB.ravel())[:] = memoryview(QT)
memoryview(FB.ravel())[:] = memoryview(FT)
lB = lT
aL = 0
aB = aMin
else:
print(f"Depth={d+2},\tRunning Alpha={aMax}")
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, seed, False, t)
QT, FT, lT = admixNMF(L, P, K, aMax, iter, tole, batch, rng, False)
if lT > lB:
QB, FB, lB = np.copy(QT), np.copy(FT), lT
memoryview(QB.ravel())[:] = memoryview(QT)
memoryview(FB.ravel())[:] = memoryview(FT)
lB = lT
aL = 2
aB = aMax
else:
argL = 1
aL = 1
aStep /= 2.0
if aMin == 0:
aMax = aMid
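
A recurring pattern in the refactored code is `memoryview(dst.ravel())[:] = memoryview(src.ravel())`, which replaces calls like `Q0 = np.copy(Q)`. For C-contiguous arrays, `ravel()` returns a view, so the assignment copies values into the existing buffer instead of allocating a new array, much like `np.copyto(dst, src)`. A minimal self-contained sketch (array names and sizes are illustrative):

```python
import numpy as np

src = np.arange(6, dtype=np.float32).reshape(2, 3)
dst = np.zeros_like(src)

# ravel() gives views of C-contiguous arrays, so this writes into dst's own
# buffer without creating a temporary copy of src.
memoryview(dst.ravel())[:] = memoryview(src.ravel())
assert np.array_equal(src, dst)
```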