ABC 4 pop

Model
Parameters
Summary statistics
Usage
Results

Model

ABC_4pop is made to investigate various models of speciation between four populations/species/gene-pools. The following topology of the species tree is the only assumed to date: ( (A, B), (C, D) ). Alternative models only differ by their temporal + geographical patterns of gene flow.
Thus, gene flow can be unidirectional ( A→C ) or bidirectional ( A↔C ).

Parameters

Tsplit_AB : time of split between population A and population B. Units of times are in 4.N generations where N is the effective population size of the reference population (better explained in the original document describing ms found here)
Tsplit_CD : time of split between population C and population D.
Tsplit : time of split between clade AB and clade CD.

T_SC_AC : time of secondary contact between population A and population C. Gene flow occured at rates M_AC and M_CA between the present time and T_SC_AC. Those two migration rates are equal to zero for periods older than T_SC_AC (backward in time). Units of migration rates are 4.N.m (better explained in the original document describing ms found here)
T_SC_BD : time of secondary contact between population B and population D.

N_popA, N_popB, N_popC and N_popD are the effective population sizes of the current populations A, B, C and D.
Na_AB is the effective population size of the ancestral population of A and B.
Na is the effective population size of the ancestral population of A, B, C and D.

Prior distributions will feed the coalescent simulator msnsam, and are generated by priorgen. Priorgen does not have to be executed by users, but can be easily modified by them in order to change the prior boundaries, some probability distributions (uniform, Beta, normal, etc ... ).
Values for all parameters are written in the file priorfile.txt, one line per multilocus simulations.

Summary statistics

Summary statistics are directly computed from the msnsam's output. For each locus, mscalc will compute:

Statistics	Description
bialsites	number of SNPs in the alignment
sf XY	number of fixed differences between species X and Y / locus length
sx X	number of exclusively polymorphic positions in species X / locus length
ss XY	number of shared biallelic positions between species X and Y/ locus length
pi X	Tajima’s Theta within species X
theta X	Watterson’s Theta witin species X
pearson_r_pi XY	correlation’s coefficient for pi over orthologs between X and Y
pearson_r_theta XY	correlation’s coefficient for theta over orthologs between X and Y
Dtaj X	Tajima’s D for species X
div XY	raw divergence Dxy measured between X and Y
netdiv XY	net divergence Da measured between X and Y
minDiv XY	smallest divergence measured between one individual from X and one from Y
maxDiv XY	highest divergence measured between one individual from X and one from Y
Gmin XY	minimum divergence between one sequence from X and one from Y minDivXY divided by the average divXY
Gmax XY	maxDivXY/divXY
FST XY	FST between X and Y compute as 1-(pi_X + pi_Z) / (2*pi_XY)
D XY_Z	ABBA-BABA’s D statistics where pop_1 = X, pop_2 = Y and pop_3 = Z
fd XY_Z	ABBA-BABA’s fd statistics where pop_1 = X, pop_2 = Y and pop_3 = Z

An array of statistics corresponding to the average statistics computed over loci and their standard deviation will be returned every multilocus simulation and written in the file ABCstat.txt.

Usage

To run the simulations, simply use the following command:

ABC_4pop.py [model] [migration] [number of multilocus simulations]  
  
Ex: ABC_4pop.py SC_2M_2N AC 10

This example will run 10 multilocus simulations, with a secondary contact between population A and C.
Migration is heterogeneous over the genome (2M) as well as the effective population size (2N).
All of the genomic heterogeneities are modeled by a rescaled Beta distribution.

ABC_4pop.py will simply execute the pipeline

priorgen | msnsam | mscalc

The output files are ABCstat.txt (containing the computed summary statistics) and priorfile.txt (containing the parameter values used to simulate the data from which the summary statistics were calculated).

Model in [SI, SC_1M_1N, SC_2M_1N, SC_1M_2N, SC_2M_2N]
Migration in [none, A, B, C, D, AC, BD, ACBD]

This package requires:
pypy (has to be linked to the user's bin)
numpy
mscalc_4pop.py (has to be linked to the user's bin)
priorgen_4pop.py (has to be linked to the user's bin)
ABC_4pop.py (has to be linked to the user's bin)
msnsam

Statistical comparison between "observation" and "simulations" can be made using various R libraries (abc, abcrf).

Results

Testing for 'ongoing migration' versus 'current isolation'

Four models were compared: secondary contacts (i) between A and C, ii) between A/C and B/D, iii) between B and D) and strict isolation (SI).
The classification error of the ABC approach when it comes to specifically comparing these 4 models are shown in the following table:

Classification error corresponds to the percentage over 5,000 of simulated datasets under the model Mi that were not statistically supported as being Mi. Those errors are lying between 1.5% and 2.76%.

Testing for the directionality of introgression

For a given pair of gene pools A and C, one can easily test whether the introgression occurred unidirectionally (A→C or A←C) versus bidirectionally (A↔C).

Confusion matrix (using all statistics)

	SC A←C	SC A↔C	SC A↔C and B↔D	SC B←D	SC B↔D	SC A→C	SC B→D	SI	class.error
SC A←C	4693	108	28	3	4	65	1	98	0.0614
SC A↔C	307	4298	16	3	3	338	2	33	0.1404
SC A↔C and B↔ D	25	30	4844	17	32	13	22	17	0.0312
SC B←D	6	6	20	4734	105	2	50	77	0.0532
SC B↔D	8	6	19	327	4282	1	318	39	0.1436
SC A→C	60	101	24	2	3	4734	2	74	0.0532
SC B→D	13	7	14	48	105	9	4725	79	0.055
SI	20	13	19	13	12	12	9	4902	0.0196

Adding the unidirectional models increased the classification error for bidirectional models compared to the previous analysis.
The following figures show the join distributions for M(A←C) and M(A→C) of the datasets simulated under the model "A↔C" but supported as being:
"A↔C"
"A←C"
"A→C"

Figure A: Combinations of the two migration rates (M(A←C) and M(A→C)) producing datasets that were correctly supported as being the A↔C model.
Figure B: Combinations of the two migration rates (M(A←C) and M(A→C)) producing datasets that were uncorrectly supported as being the A←C model.
Figure C: Combinations of the two migration rates (M(A←C) and M(A→C)) producing datasets that were uncorrectly supported as being the A→C model.
Cases where the difference between the two migration rates is large will be interpreted as unidirectional. Thus, the ABC analysis tends to consider an "A↔C" model as being "A→C" when the migration rate M(A→C) is much greater than M(A←C).
Even if the model is labelled as bidirectional, randomly chosen values for migration rates produce some simulated datasets for which migration is biologically unidirectional.

Effects of statistics on model confusion

The same model comparison was also realized by removing statistics depending on an available sequenced outgroup (no ABBA-BABA stats analysis), by removing statistics depending on the identification of gametic phases (no Gmin stats analysis) or by only keeping statistics describing a folded site-frequency-spectrum (no ABBA-BABA and Gmin stats analysis).
For each analysis, the error rate in model classification was also measured for 5,000 of randomly simulated datasets under each of the 8 alternative models. This error rate simply corresponds to the rate of misclassification.

Targets	All stats	No ABBA-BABA stats	No Gmin stats	No ABBA-BABA and Gmin stats
SC A← C	0.0614	0.0788	0.064	0.0942
SC A ↔ C	0.1404	0.1612	0.1586	0.1768
SC A ↔ C and B ↔ D	0.0312	0.0334	0.0474	0.0442
SC B ← D	0.0532	0.0736	0.06	0.0938
SC B ↔ D	0.1436	0.1608	0.16	0.1714
SC A → C	0.0532	0.0752	0.0654	0.0966
SC B → D	0.055	0.074	0.064	0.0956
SI	0.0196	0.0234	0.0212	0.0396

Removing ABBA-BABA or Gmin statistics will individually increase the classification error by a low factor, however, this reduction begins to be worrisome when both categories of statistics are neglected. It should always be borne in mind that model errors concern border line cases between two models.

ABBA-BABA fd and D statistics as well as Gmin have an important power to classify models, i.e, they reduce the dispersion of models along branches of each decision tree (measured by the Gini index).

However, it is important to note that these statistics are either sensitive to an available outgroup (ABBA-BABA), or are dependent on good quality inferences of haplotype phases (Gmin).
Although they are the most informative on simulated data, biases in their measurements on real data can induce biases in inferences. Excluding these statistics does not greatly reduce inferential power. The remaining 111 statistics do not depend as much on external groups or third-party inferences, and they contain enough combined information to make model selection.
The choice of whether or not to include them in the inferences should belong only to the experimenter.

Estimating the parameters

SI model

Parameters of the SI model were estimated for 500 simulated datasets.

Each point represent a simulated dataset.
The x-axis shows the real value to estimate.
The y-axis shows the estimated value by using a random-forest approach.
The red line is a line of slope 1 passing through the origin.

Results for current populations are shown only for population A, and shows similar relationships between real values and estimated values for B, C and D.
In the same way for Na_AB and Na_CD, and for Tsplit_AB and Tsplit_CD.

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
pictures		pictures
produce_input_files		produce_input_files
ABC_4pop.py		ABC_4pop.py
ABC_4pop_GoF_posterior.py		ABC_4pop_GoF_posterior.py
ABC_4pop_v2.py		ABC_4pop_v2.py
README.md		README.md
bpfile		bpfile
mscalc_4pop.py		mscalc_4pop.py
mscalc_4pop_monolocus.py		mscalc_4pop_monolocus.py
priorgen_4pop.py		priorgen_4pop.py
priorgen_4pop_GoF_posterior.py		priorgen_4pop_GoF_posterior.py
priorgen_4pop_v2.py		priorgen_4pop_v2.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ABC 4 pop

Model

Parameters

Summary statistics

Usage

Results

Testing for 'ongoing migration' versus 'current isolation'

Testing for the directionality of introgression

Confusion matrix (using all statistics)

Effects of statistics on model confusion

Estimating the parameters

SI model

Effect of gene flow on parameters estimates

Effective population size of population A

Effective population size of the ancestral population

Effective population size of the ancestral population between A and B

Time of split between (A,B) and (C,D)

Time of split between A and B

About

Releases

Packages

Contributors 2

Languages

popgenomics/ABC_4pop

Folders and files

Latest commit

History

Repository files navigation

ABC 4 pop

Model

Parameters

Summary statistics

Usage

Results

Testing for 'ongoing migration' versus 'current isolation'

Testing for the directionality of introgression

Confusion matrix (using all statistics)

Effects of statistics on model confusion

Estimating the parameters

SI model

Effect of gene flow on parameters estimates

Effective population size of population A

Effective population size of the ancestral population

Effective population size of the ancestral population between A and B

Time of split between (A,B) and (C,D)

Time of split between A and B

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages