Skip to content

ABC analysis to investigate speciation by using genomics data from four populations allowing for genomic heterogeneities in Ne and M

Notifications You must be signed in to change notification settings

popgenomics/ABC_4pop

Repository files navigation

ABC 4 pop

Model

ABC_4pop is made to investigate various models of speciation between four populations/species/gene-pools. The following topology of the species tree is the only assumed to date: ( (A, B), (C, D) ). Alternative models only differ by their temporal + geographical patterns of gene flow.
Thus, gene flow can be unidirectional ( A→C ) or bidirectional ( A↔C ).

model

Parameters

Tsplit_AB : time of split between population A and population B. Units of times are in 4.N generations where N is the effective population size of the reference population (better explained in the original document describing ms found here)
Tsplit_CD : time of split between population C and population D.
Tsplit : time of split between clade AB and clade CD.

T_SC_AC : time of secondary contact between population A and population C. Gene flow occured at rates M_AC and M_CA between the present time and T_SC_AC. Those two migration rates are equal to zero for periods older than T_SC_AC (backward in time). Units of migration rates are 4.N.m (better explained in the original document describing ms found here)
T_SC_BD : time of secondary contact between population B and population D.

N_popA, N_popB, N_popC and N_popD are the effective population sizes of the current populations A, B, C and D.
Na_AB is the effective population size of the ancestral population of A and B.
Na is the effective population size of the ancestral population of A, B, C and D.

Prior distributions will feed the coalescent simulator msnsam, and are generated by priorgen. Priorgen does not have to be executed by users, but can be easily modified by them in order to change the prior boundaries, some probability distributions (uniform, Beta, normal, etc ... ).
Values for all parameters are written in the file priorfile.txt, one line per multilocus simulations.

Summary statistics

Summary statistics are directly computed from the msnsam's output. For each locus, mscalc will compute:

Statistics Description
bialsites number of SNPs in the alignment
sf XY number of fixed differences between species X and Y / locus length
sx X number of exclusively polymorphic positions in species X / locus length
ss XY number of shared biallelic positions between species X and Y/ locus length
pi X Tajima’s Theta within species X
theta X Watterson’s Theta witin species X
pearson_r_pi XY correlation’s coefficient for pi over orthologs between X and Y
pearson_r_theta XY correlation’s coefficient for theta over orthologs between X and Y
Dtaj X Tajima’s D for species X
div XY raw divergence Dxy measured between X and Y
netdiv XY net divergence Da measured between X and Y
minDiv XY smallest divergence measured between one individual from X and one from Y
maxDiv XY highest divergence measured between one individual from X and one from Y
Gmin XY minimum divergence between one sequence from X and one from Y minDivXY divided by the average divXY
Gmax XY maxDivXY/divXY
FST XY FST between X and Y compute as 1-(pi_X + pi_Z) / (2*pi_XY)
D XY_Z ABBA-BABA’s D statistics where pop_1 = X, pop_2 = Y and pop_3 = Z
fd XY_Z ABBA-BABA’s fd statistics where pop_1 = X, pop_2 = Y and pop_3 = Z

An array of statistics corresponding to the average statistics computed over loci and their standard deviation will be returned every multilocus simulation and written in the file ABCstat.txt.

Usage

To run the simulations, simply use the following command:

ABC_4pop.py [model] [migration] [number of multilocus simulations]  
  
Ex: ABC_4pop.py SC_2M_2N AC 10  

This example will run 10 multilocus simulations, with a secondary contact between population A and C.
Migration is heterogeneous over the genome (2M) as well as the effective population size (2N).
All of the genomic heterogeneities are modeled by a rescaled Beta distribution.

ABC_4pop.py will simply execute the pipeline

priorgen | msnsam | mscalc

The output files are ABCstat.txt (containing the computed summary statistics) and priorfile.txt (containing the parameter values used to simulate the data from which the summary statistics were calculated).

Model in [SI, SC_1M_1N, SC_2M_1N, SC_1M_2N, SC_2M_2N]
Migration in [none, A, B, C, D, AC, BD, ACBD]

This package requires:
pypy (has to be linked to the user's bin)
numpy
mscalc_4pop.py (has to be linked to the user's bin)
priorgen_4pop.py (has to be linked to the user's bin)
ABC_4pop.py (has to be linked to the user's bin)
msnsam

Statistical comparison between "observation" and "simulations" can be made using various R libraries (abc, abcrf).

Results

Testing for 'ongoing migration' versus 'current isolation'

Four models were compared: secondary contacts (i) between A and C, ii) between A/C and B/D, iii) between B and D) and strict isolation (SI).
The classification error of the ABC approach when it comes to specifically comparing these 4 models are shown in the following table:
confusion matrix
Classification error corresponds to the percentage over 5,000 of simulated datasets under the model Mi that were not statistically supported as being Mi. Those errors are lying between 1.5% and 2.76%.

Testing for the directionality of introgression

For a given pair of gene pools A and C, one can easily test whether the introgression occurred unidirectionally (A→C or A←C) versus bidirectionally (A↔C).

Confusion matrix (using all statistics)

SC A←C SC A↔C SC A↔C and B↔D SC B←D SC B↔D SC A→C SC B→D SI class.error
SC A←C 4693 108 28 3 4 65 1 98 0.0614
SC A↔C 307 4298 16 3 3 338 2 33 0.1404
SC A↔C and B↔ D 25 30 4844 17 32 13 22 17 0.0312
SC B←D 6 6 20 4734 105 2 50 77 0.0532
SC B↔D 8 6 19 327 4282 1 318 39 0.1436
SC A→C 60 101 24 2 3 4734 2 74 0.0532
SC B→D 13 7 14 48 105 9 4725 79 0.055
SI 20 13 19 13 12 12 9 4902 0.0196

Adding the unidirectional models increased the classification error for bidirectional models compared to the previous analysis.
The following figures show the join distributions for M(A←C) and M(A→C) of the datasets simulated under the model "A↔C" but supported as being:
"A↔C"
"A←C"
"A→C"
space_param

Figure A: Combinations of the two migration rates (M(A←C) and M(A→C)) producing datasets that were correctly supported as being the A↔C model.
Figure B: Combinations of the two migration rates (M(A←C) and M(A→C)) producing datasets that were uncorrectly supported as being the A←C model.
Figure C: Combinations of the two migration rates (M(A←C) and M(A→C)) producing datasets that were uncorrectly supported as being the A→C model.
Cases where the difference between the two migration rates is large will be interpreted as unidirectional. Thus, the ABC analysis tends to consider an "A↔C" model as being "A→C" when the migration rate M(A→C) is much greater than M(A←C).
Even if the model is labelled as bidirectional, randomly chosen values for migration rates produce some simulated datasets for which migration is biologically unidirectional.

Effects of statistics on model confusion

The same model comparison was also realized by removing statistics depending on an available sequenced outgroup (no ABBA-BABA stats analysis), by removing statistics depending on the identification of gametic phases (no Gmin stats analysis) or by only keeping statistics describing a folded site-frequency-spectrum (no ABBA-BABA and Gmin stats analysis).
For each analysis, the error rate in model classification was also measured for 5,000 of randomly simulated datasets under each of the 8 alternative models. This error rate simply corresponds to the rate of misclassification.

Targets All stats No ABBA-BABA stats No Gmin stats No ABBA-BABA and Gmin stats
SC A← C 0.0614 0.0788 0.064 0.0942
SC A ↔ C 0.1404 0.1612 0.1586 0.1768
SC A ↔ C and B ↔ D 0.0312 0.0334 0.0474 0.0442
SC B ← D 0.0532 0.0736 0.06 0.0938
SC B ↔ D 0.1436 0.1608 0.16 0.1714
SC A → C 0.0532 0.0752 0.0654 0.0966
SC B → D 0.055 0.074 0.064 0.0956
SI 0.0196 0.0234 0.0212 0.0396

Removing ABBA-BABA or Gmin statistics will individually increase the classification error by a low factor, however, this reduction begins to be worrisome when both categories of statistics are neglected. It should always be borne in mind that model errors concern border line cases between two models.

ABBA-BABA fd and D statistics as well as Gmin have an important power to classify models, i.e, they reduce the dispersion of models along branches of each decision tree (measured by the Gini index).
variable_importance

However, it is important to note that these statistics are either sensitive to an available outgroup (ABBA-BABA), or are dependent on good quality inferences of haplotype phases (Gmin).
Although they are the most informative on simulated data, biases in their measurements on real data can induce biases in inferences. Excluding these statistics does not greatly reduce inferential power. The remaining 111 statistics do not depend as much on external groups or third-party inferences, and they contain enough combined information to make model selection.
The choice of whether or not to include them in the inferences should belong only to the experimenter.

Estimating the parameters

SI model

Parameters of the SI model were estimated for 500 simulated datasets.
parameters_SI
Each point represent a simulated dataset.
The x-axis shows the real value to estimate.
The y-axis shows the estimated value by using a random-forest approach.
The red line is a line of slope 1 passing through the origin.

Results for current populations are shown only for population A, and shows similar relationships between real values and estimated values for B, C and D.
In the same way for Na_AB and Na_CD, and for Tsplit_AB and Tsplit_CD.

Effect of gene flow on parameters estimates

Effective population size of population A

parameters_tsplit_5models

Effective population size of the ancestral population

parameters_tsplit_5models

Effective population size of the ancestral population between A and B

parameters_tsplit_5models

Time of split between (A,B) and (C,D)

parameters_tsplit_5models

Time of split between A and B

parameters_tsplit_5models

About

ABC analysis to investigate speciation by using genomics data from four populations allowing for genomic heterogeneities in Ne and M

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages