This repository was archived by the owner on Feb 6, 2024. It is now read-only.

Substitution Models

minh edited this page Nov 14, 2015 · 59 revisions

All common substitution models and usages.


IQ-TREE supports a wide range of substitution models, including advanced partition and mixture models. This guide gives detailed information on all available models.

TIP: If you do not know which model to use, simply run IQ-TREE with the standard model selection (-m TEST option) or the new model selection procedure (-m TESTNEW). IQ-TREE then automatically determines the best-fit model for your data.

DNA models

Base substitution rates

IQ-TREE includes all common DNA models (ordered by complexity):

Model Explanation
JC or JC69 Equal substitution rates and equal base frequencies (Jukes and Cantor, 1969).
F81 Equal rates but unequal base freq. (Felsenstein, 1981).
K80 or K2P Unequal transition/transversion rates and equal base freq. (Kimura, 1980).
HKY or HKY85 Unequal transition/transversion rates and unequal base freq. (Hasegawa, Kishino and Yano, 1985).
TN or TN93 Like HKY but unequal purine/pyrimidine rates (Tamura and Nei, 1993).
TNe Like TN but equal base freq.
K81 or K3P Three substitution types model and equal base freq. (Kimura, 1981).
K81u Like K81 but unequal base freq.
TPM2 AC=AT, AG=CT, CG=GT and equal base freq.
TPM2u Like TPM2 but unequal base freq.
TPM3 AC=CG, AG=CT, AT=GT and equal base freq.
TPM3u Like TPM3 but unequal base freq.
TIM Transition model, AC=GT, AT=CG and unequal base freq.
TIMe Like TIM but equal base freq.
TIM2 AC=AT, CG=GT and unequal base freq.
TIM2e Like TIM2 but equal base freq.
TIM3 AC=CG, AT=GT and unequal base freq.
TIM3e Like TIM3 but equal base freq.
TVM Transversion model, AG=CT and unequal base freq.
TVMe Like TVM but equal base freq.
SYM Symmetric model with unequal rates and equal base freq. (Zharkikh, 1994).
GTR General time reversible model with unequal rates and unequal base freq. (Tavare, 1986).

Moreover, IQ-TREE supports arbitrarily restricted DNA models via a 6-digit code. The 6 digits define the equality constraints among the 6 nucleotide substitution rates, in the order A-C, A-G, A-T, C-G, C-T and G-T: rates that share a digit are constrained to be equal. For example, 010010 means that the A-G rate is equal to the C-T rate and the remaining four substitution rates are equal to each other. Thus, 010010 is equivalent to the K80 or HKY model (depending on whether base frequencies are equal or not). 123450 is equivalent to the GTR or SYM model, as such a 6-digit code defines no restriction.
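To make the 6-digit code concrete, the following Python sketch groups the six rates by the digit assigned to them. It is only an illustration of the notation described above, not IQ-TREE code; the helper names rate_classes and free_rate_count are hypothetical.

```python
# The six substitution rates, in the order used by the 6-digit code.
RATE_PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def rate_classes(code):
    """Group the six rates by the digit given to them, e.g. in '010010'."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected exactly six digits, got %r" % code)
    classes = {}
    for pair, digit in zip(RATE_PAIRS, code):
        classes.setdefault(digit, []).append(pair)
    return classes


def free_rate_count(code):
    # One class acts as the reference rate, so the number of free rate
    # parameters is the number of distinct classes minus one.
    return len(set(code)) - 1


print(rate_classes("010010"))     # two classes: transitions vs. transversions
print(free_rate_count("123450"))  # no restriction, as in GTR or SYM
```

Rates that share a digit end up in the same list, which makes it easy to check a hand-written code against the model it is meant to express.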

If users want to fix model parameters, append to the model name an opening curly bracket {, followed by the comma-separated rate parameters and a closing curly bracket }. For example, GTR{1.0,2.0,1.5,3.7,2.8,1.0} specifies 6 substitution rates A-C=1.0, A-G=2.0, A-T=1.5, C-G=3.7, C-T=2.8 and G-T=1.0.
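The mapping between the values in curly brackets and the six rates can be sketched in Python as follows. The function parse_fixed_rates is a hypothetical helper for reading such a string; it only handles the six-rate case shown in the GTR example and says nothing about how IQ-TREE parses model strings internally.

```python
import re

# Order in which the six rates appear inside the curly brackets.
RATE_PAIRS = ("A-C", "A-G", "A-T", "C-G", "C-T", "G-T")


def parse_fixed_rates(model):
    """Split e.g. 'GTR{1.0,2.0,1.5,3.7,2.8,1.0}' into name and rate per pair."""
    match = re.fullmatch(r"([A-Za-z0-9]+)\{([^{}]*)\}", model)
    if match is None:
        raise ValueError("no {...} parameter list in %r" % model)
    name = match.group(1)
    values = [float(x) for x in match.group(2).split(",")]
    if len(values) != len(RATE_PAIRS):
        raise ValueError("expected 6 rates, got %d" % len(values))
    return name, dict(zip(RATE_PAIRS, values))


name, rates = parse_fixed_rates("GTR{1.0,2.0,1.5,3.7,2.8,1.0}")
print(name, rates["C-G"], rates["C-T"])
```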

Base frequencies

Users can specify three different kinds of base frequencies:

FreqType Explanation
+F Empirical base frequencies. This is the default if the model has unequal base freq.
+FQ Equal base frequencies.
+FO Optimized base frequencies by maximum-likelihood.

For example, GTR+FO optimizes base frequencies by ML whereas GTR+F (default) counts base frequencies directly from the alignment.

Finally, users can fix base frequencies with e.g. GTR+F{0.1,0.2,0.3,0.4} to fix the corresponding frequencies of A, C, G and T (must sum up to 1.0).
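As a rough illustration of what "counting base frequencies from the alignment" means, and of the sum-to-one requirement for fixed frequencies, consider the Python sketch below. Both helper functions are hypothetical, and the treatment of gaps and ambiguity codes (they are simply ignored here) is an assumption that may differ from what IQ-TREE does.

```python
from collections import Counter


def empirical_base_freqs(sequences):
    """Count A, C, G, T over all sequences, ignoring gaps and ambiguity codes."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous nucleotides found")
    return [counts[b] / total for b in "ACGT"]


def fixed_freq_suffix(freqs, tol=1e-6):
    """Format frequencies as a +F{...} suffix after checking they sum to 1."""
    if len(freqs) != 4 or abs(sum(freqs) - 1.0) > tol:
        raise ValueError("need four frequencies summing to 1.0")
    return "+F{" + ",".join("%g" % f for f in freqs) + "}"


print(empirical_base_freqs(["AACG-", "TTCGN"]))
print("GTR" + fixed_freq_suffix([0.1, 0.2, 0.3, 0.4]))
```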

Protein models

Amino-acid exchange rate matrices

IQ-TREE supports all common empirical amino-acid exchange rate matrices:

Model Explanation
BLOSUM62 BLOcks SUbstitution Matrix (Henikoff and Henikoff, 1992). Note that BLOSUM62 is not recommended as it was designed mainly for sequence alignments.
cpREV chloroplast matrix (Adachi et al., 2000).
Dayhoff General matrix (Dayhoff et al., 1978).
DCMut Revised Dayhoff matrix (Kosiol and Goldman, 2005).
FLU Influenza virus (Dang et al., 2010).
HIVb HIV matrix (Dang et al., 2010).
HIVw HIV matrix (Dang et al., 2010).
JTT General matrix (Jones et al., 1992).
JTTDCMut Revised JTT matrix (Kosiol and Goldman, 2005).
LG General matrix (Le and Gascuel, 2008).
mtART Mitochondrial Arthropoda (Abascal et al., 2007).
mtMAM Mitochondrial Mammalia (Yang et al., 1998).
mtREV Mitochondrial Vertebrate (Adachi and Hasegawa, 1996).
mtZOA Mitochondrial Metazoa (Animals) (Rota-Stabelli et al., 2009).
Poisson Equal amino-acid exchange rates and frequencies.
PMB Probability Matrix from Blocks, revised BLOSUM matrix (Veerassamy et al., 2004).
rtREV Retrovirus (Dimmic et al., 2002).
VT General matrix (Mueller and Vingron, 2000).
WAG General matrix (Whelan and Goldman, 2001).

Moreover, IQ-TREE also supports a series of protein mixture models:

Model Explanation
C10, ..., C60 10- to 60-profile mixture models under Gamma rate heterogeneity (Le et al., 2008a).
EX2 Two-matrix model for exposed/buried AA sites (Le et al., 2008b).
EX3 Three-matrix model for highly exposed/intermediate/buried AA sites (Le et al., 2008b).
EHO Three-matrix model for extended/helix/other sites (Le et al., 2008b).
UL2, UL3 Unsupervised-learning variants of EX2 and EX3, respectively.
EX_EHO Six-matrix model combining EX2 and EHO (Le and Gascuel, 2010).
LG4M Four-matrix model fused with Gamma rate heterogeneity (Le et al., 2012).
LG4X Four-matrix model fused with FreeRate heterogeneity (Le et al., 2012).
CF4 Five-profile mixture model (Wang et al., 2008).

One can even combine a protein matrix with a profile mixture model like:

  • LG+C20: Applying LG matrix for all 20 mixture classes.
  • JTT+CF4+G: Applying JTT matrix for all 5 mixture classes and Gamma rate heterogeneity.
  • JTTCF4G: Alias for JTT+CF4+G.

Moreover, one can override the Gamma rate by FreeRate heterogeneity:

  • LG+C20+R4: Like LG+C20 but replace Gamma by FreeRate heterogeneity.

If the matrix name does not match any of the above listed models, IQ-TREE assumes that it is a file containing AA exchange rates and frequencies in PAML format. It contains the lower diagonal part of the matrix and 20 AA frequencies, e.g.:

0.425093 
0.276818 0.751878 
0.395144 0.123954 5.076149 
2.489084 0.534551 0.528768 0.062556 
0.969894 2.807908 1.695752 0.523386 0.084808 
1.038545 0.363970 0.541712 5.243870 0.003499 4.128591 
2.066040 0.390192 1.437645 0.844926 0.569265 0.267959 0.348847 
0.358858 2.426601 4.509238 0.927114 0.640543 4.813505 0.423881 0.311484 
0.149830 0.126991 0.191503 0.010690 0.320627 0.072854 0.044265 0.008705 0.108882 
0.395337 0.301848 0.068427 0.015076 0.594007 0.582457 0.069673 0.044261 0.366317 4.145067 
0.536518 6.326067 2.145078 0.282959 0.013266 3.234294 1.807177 0.296636 0.697264 0.159069 0.137500 
1.124035 0.484133 0.371004 0.025548 0.893680 1.672569 0.173735 0.139538 0.442472 4.273607 6.312358 0.656604 
0.253701 0.052722 0.089525 0.017416 1.105251 0.035855 0.018811 0.089586 0.682139 1.112727 2.592692 0.023918 1.798853 
1.177651 0.332533 0.161787 0.394456 0.075382 0.624294 0.419409 0.196961 0.508851 0.078281 0.249060 0.390322 0.099849 0.094464 
4.727182 0.858151 4.008358 1.240275 2.784478 1.223828 0.611973 1.739990 0.990012 0.064105 0.182287 0.748683 0.346960 0.361819 1.338132 
2.139501 0.578987 2.000679 0.425860 1.143480 1.080136 0.604545 0.129836 0.584262 1.033739 0.302936 1.136863 2.020366 0.165001 0.571468 6.472279 
0.180717 0.593607 0.045376 0.029890 0.670128 0.236199 0.077852 0.268491 0.597054 0.111660 0.619632 0.049906 0.696175 2.457121 0.095131 0.248862 0.140825 
0.218959 0.314440 0.612025 0.135107 1.165532 0.257336 0.120037 0.054679 5.306834 0.232523 0.299648 0.131932 0.481306 7.803902 0.089613 0.400547 0.245841 3.151815 
2.547870 0.170887 0.083688 0.037967 1.959291 0.210332 0.245034 0.076701 0.119013 10.649107 1.702745 0.185202 1.898718 0.654683 0.296501 0.098369 2.188158 0.189510 0.249313 

0.079066 0.055941 0.041977 0.053052 0.012937 0.040767 0.071586 0.057337 0.022355 0.062157 0.099081 0.064600 0.022951 0.042302 0.044040 0.061197 0.053287 0.012066 0.034155 0.069147 

(This is an example of an LG matrix taken from the PAML package.) Note that the amino-acid order in this file is:

 A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
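The layout of such a file (190 lower-triangle exchange rates followed by 20 frequencies, in the amino-acid order given above) can be read with a few lines of Python. This is a minimal sketch under the assumption that the file contains only whitespace-separated numbers; parse_paml_matrix is a hypothetical helper, not part of IQ-TREE or PAML.

```python
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"


def parse_paml_matrix(text):
    """Read 190 lower-triangle exchange rates followed by 20 frequencies."""
    numbers = [float(tok) for tok in text.split()]
    n = len(AA_ORDER)
    n_rates = n * (n - 1) // 2           # 190 entries below the diagonal
    if len(numbers) < n_rates + n:
        raise ValueError("expected at least %d numbers" % (n_rates + n))
    rates = [[0.0] * n for _ in range(n)]
    pos = 0
    for i in range(1, n):                # row i holds i values: columns 0..i-1
        for j in range(i):
            rates[i][j] = rates[j][i] = numbers[pos]  # matrix is symmetric
            pos += 1
    freqs = numbers[pos:pos + n]
    return rates, freqs


def exchange_rate(rates, aa1, aa2):
    # Look up a rate by one-letter amino-acid codes.
    return rates[AA_ORDER.index(aa1)][AA_ORDER.index(aa2)]
```

With the example matrix above stored in a string, exchange_rate(rates, "A", "R") would return the first number of the file (0.425093), since the first row of the lower triangle holds the single A-R entry.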

Amino-acid frequencies

By default, AA frequencies are given by the model. Users can change this with:

FreqType Explanation
+F empirical AA frequencies from the data.
+FO ML optimized AA frequencies from the data.
+FQ Equal AA frequencies.

Users can also specify AA frequencies with, e.g.:

+F{0.079066,0.055941,0.041977,0.053052,0.012937,0.040767,0.071586,0.057337,0.022355,0.062157,0.099081,0.064600,0.022951,0.042302,0.044040,0.061197,0.053287,0.012066,0.034155,0.069147}

(Example corresponds to the AA frequencies of the LG matrix).

Codon models

To apply a codon model, use the option -st CODON to tell IQ-TREE that the alignment contains protein-coding sequences (otherwise, IQ-TREE assumes that it contains DNA sequences and applies DNA models). This implicitly applies the standard genetic code. You can change to another genetic code by appending the appropriate ID to the CODON keyword:

Code Genetic code meaning
CODON1 The Standard Code (same as -st CODON)
CODON2 The Vertebrate Mitochondrial Code
CODON3 The Yeast Mitochondrial Code
CODON4 The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
CODON5 The Invertebrate Mitochondrial Code
CODON6 The Ciliate, Dasycladacean and Hexamita Nuclear Code
CODON9 The Echinoderm and Flatworm Mitochondrial Code
CODON10 The Euplotid Nuclear Code
CODON11 The Bacterial, Archaeal and Plant Plastid Code
CODON12 The Alternative Yeast Nuclear Code
CODON13 The Ascidian Mitochondrial Code
CODON14 The Alternative Flatworm Mitochondrial Code
CODON16 Chlorophycean Mitochondrial Code
CODON21 Trematode Mitochondrial Code
CODON22 Scenedesmus obliquus Mitochondrial Code
CODON23 Thraustochytrium Mitochondrial Code
CODON24 Pterobranchia Mitochondrial Code
CODON25 Candidate Division SR1 and Gracilibacteria Code

(The IDs follow the specification at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).

Codon substitution rates

IQ-TREE supports several codon models:

Model Explanation
MG Nonsynonymous/synonymous (dn/ds) rate ratio (Muse and Gaut, 1994).
MGK Like MG with additional transition/transversion (ts/tv) rate ratio.
MG1KTS or MGKAP2 Like MG with a transition rate (Kosiol et al., 2007).
MG1KTV or MGKAP3 Like MG with a transversion rate (Kosiol et al., 2007).
MG2K or MGKAP4 Like MG with a transition rate and a transversion rate (Kosiol et al., 2007).
GY Nonsynonymous/synonymous and transition/transversion rate ratios (Goldman and Yang, 1994).
GY1KTS or GYKAP2 Like GY with a transition rate (Kosiol et al., 2007).
GY1KTV or GYKAP3 Like GY with a transversion rate (Kosiol et al., 2007).
GY2K or GYKAP4 Like GY with a transition rate and a transversion rate (Kosiol et al., 2007).
ECMK07 or KOSI07 Empirical codon model (Kosiol et al., 2007).
ECMrest Restricted version of ECMK07 that allows only one nucleotide exchange.
ECMS05 or SCHN05 Empirical codon model (Schneider et al., 2005).

The last three models (ECMK07, ECMrest and ECMS05) are called empirical codon models, whereas the others are called mechanistic codon models.

Moreover, IQ-TREE supports combined empirical-mechanistic codon models using an underscore separator (_). For example:

  • ECMK07_GY2K: The combined ECMK07 and GY2K model, in which each rate entry is the product of the corresponding entries of the two rate matrices.

Thus, there can be many such combinations.
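The entry-wise multiplication used by the combined models can be sketched in Python as follows. The function combine_rates is a hypothetical helper and the tiny 3x3 matrices are made-up toy values standing in for the full codon rate matrices; how IQ-TREE normalizes the combined matrix afterwards is not covered here.

```python
def combine_rates(empirical, mechanistic):
    """Multiply two equally sized rate matrices entry by entry."""
    if len(empirical) != len(mechanistic):
        raise ValueError("matrices differ in size")
    return [
        [e * m for e, m in zip(e_row, m_row)]
        for e_row, m_row in zip(empirical, mechanistic)
    ]


# Toy stand-ins for an empirical and a mechanistic rate matrix.
toy_empirical = [[0.0, 2.0, 0.5], [2.0, 0.0, 1.0], [0.5, 1.0, 0.0]]
toy_mechanistic = [[0.0, 3.0, 1.0], [3.0, 0.0, 0.2], [1.0, 0.2, 0.0]]
print(combine_rates(toy_empirical, toy_mechanistic))
```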

If the model name does not match any of the above listed models, IQ-TREE assumes that it is a file containing codon exchange rates and frequencies in PAML format. It contains the lower diagonal part of the matrix and codon frequencies. For an example, see http://www.ebi.ac.uk/goldman/ECM/.

NOTICE: Branch lengths under codon models are interpreted as number of nucleotide substitutions per codon site. Thus, they are typically 3 times longer than under DNA models.

Codon frequencies

IQ-TREE supports the following codon frequencies:

FreqType Explanation
+F Empirical codon frequencies counted from the data.
+FQ Equal codon frequencies.
+F1X4 Unequal nucleotide frequencies but equal nt frequencies over three codon positions.
+F3X4 Unequal nucleotide frequencies and unequal nt frequencies over three codon positions.

If not specified, the default codon frequency will be +F3X4 for MG-type models, +F for GY-type models and given by the model for empirical codon models.
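The difference between +F1X4 and +F3X4 is easiest to see in code. The Python sketch below follows the usual construction of these frequencies (product of nucleotide frequencies, with stop codons removed and the remainder renormalized); that IQ-TREE computes them in exactly this way is an assumption, the function names are hypothetical, and only the standard genetic code is handled.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code only


def codon_freqs(position_freqs):
    """Codon frequencies from per-position nucleotide frequencies.

    position_freqs is a list of three dicts mapping base -> frequency.
    Stop codons are dropped and the remainder renormalized to sum to 1.
    """
    raw = {}
    for codon in ("".join(c) for c in product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue
        f = 1.0
        for pos, base in enumerate(codon):
            f *= position_freqs[pos][base]
        raw[codon] = f
    total = sum(raw.values())
    return {codon: f / total for codon, f in raw.items()}


def f1x4(nt_freqs):
    # One set of nucleotide frequencies shared by all three codon positions.
    return codon_freqs([nt_freqs] * 3)


def f3x4(pos1, pos2, pos3):
    # A separate set of nucleotide frequencies for each codon position.
    return codon_freqs([pos1, pos2, pos3])
```

+F1X4 therefore needs 4 nucleotide frequencies (3 free parameters), whereas +F3X4 needs 12 (9 free parameters).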

Binary and morphological models

The binary alignments should contain state 0 and 1, whereas for morphological data, the valid states are 0 to 9 and A to Z.

Model Explanation
JC2 Jukes-Cantor type model for binary data.
GTR2 General time reversible model for binary data.
MK Jukes-Cantor type model for morphological data.
ORDERED Allowing exchange of neighboring states only.

Except for GTR2, which has unequal state frequencies, all models have equal state frequencies.
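For the Jukes-Cantor type models (JC2 with 2 states, MK with k states), the transition probabilities have a simple closed form when branch lengths are measured in expected substitutions per site. The Python sketch below evaluates this textbook formula; mk_transition_prob is a hypothetical helper name, not an IQ-TREE function.

```python
import math


def mk_transition_prob(k, t, same_state):
    """Transition probability under a k-state Jukes-Cantor type model.

    t is the branch length in expected substitutions per site.
    Returns the probability of staying in the same state, or of
    ending in one particular other state.
    """
    decay = math.exp(-k * t / (k - 1))
    if same_state:
        return 1.0 / k + (k - 1) / k * decay
    return 1.0 / k - decay / k


# Binary data (JC2) and a 5-state morphological character (MK).
print(mk_transition_prob(2, 0.1, True), mk_transition_prob(2, 0.1, False))
print(mk_transition_prob(5, 0.1, True), mk_transition_prob(5, 0.1, False))
```

With k = 4 the formula reduces to the familiar JC69 expression for DNA.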

NOTICE: If morphological alignments do not contain constant sites (typically the case), then an ascertainment bias correction model (+ASC) should be applied to correct the branch lengths for the absence of constant sites.

Ascertainment bias correction

An ascertainment bias correction (+ASC) model (Lewis, 2001) should be applied if the alignment does not contain constant sites (such as morphological or SNP data). For example:

  • MK+ASC: For morphological data.
  • GTR+ASC: For SNP data.

+ASC will correct the likelihood conditioned on variable sites. Without +ASC, the branch lengths might be overestimated.
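The conditioning on variable sites can be written down in a few lines. The Python sketch below follows the general idea of Lewis (2001), dividing each site likelihood by the probability that a site is variable; it assumes the per-site likelihoods and the likelihoods of the constant patterns have already been computed elsewhere, and asc_log_likelihood is a hypothetical name rather than IQ-TREE's implementation.

```python
import math


def asc_log_likelihood(site_log_liks, constant_pattern_liks):
    """Condition the log-likelihood on every site being variable.

    site_log_liks: uncorrected log-likelihood of each observed site.
    constant_pattern_liks: likelihood of each possible constant pattern
        (one per state, e.g. all-0 and all-1 for binary data).
    """
    p_constant = sum(constant_pattern_liks)
    if not 0.0 <= p_constant < 1.0:
        raise ValueError("constant-pattern probability must lie in [0, 1)")
    # Each site likelihood is divided by P(site is variable).
    correction = math.log(1.0 - p_constant)
    return sum(logl - correction for logl in site_log_liks)


# Two observed variable sites of a binary alignment (made-up likelihoods).
print(asc_log_likelihood([math.log(0.1), math.log(0.2)], [0.3, 0.2]))
```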

Rate heterogeneity across sites

IQ-TREE supports all common models of rate heterogeneity across sites:

RateType Explanation
+I allowing for a proportion of invariable sites.
+G discrete Gamma model (Yang, 1994) with default 4 rate categories. The number of categories can be changed with e.g. +G8.
+I+G invariable site plus discrete Gamma model (Gu et al., 1995).
+R FreeRate model (Yang, 1995; Soubrier et al., 2012) that generalizes the +G model by relaxing the assumption of Gamma-distributed rates. The number of categories can be specified with e.g. +R6 (default 4 categories if not specified). The FreeRate model typically fits data better than the +G model and is recommended for analysis of large data sets.

TIP: The new model selection procedure (-m TESTNEW option) tests the FreeRate model, whereas the standard procedure (-m TEST) does not.

Users can fix the parameters of the model. For example, +I{0.2} will fix the proportion of invariable sites (pinvar) to 0.2; +G{0.9} will fix the Gamma shape parameter (alpha) to 0.9; +I{0.2}+G{0.9} will fix both pinvar and alpha. To fix the FreeRate model parameters, use the syntax +Rk{w1,r1,...,wk,rk} (replacing k with the number of categories). Here, w1, ..., wk are the weights and r1, ..., rk the rates for each category.
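The interleaved weight/rate order of the +Rk{...} syntax is easy to get wrong by hand, so a small helper can assemble it. The Python sketch below is hypothetical; in particular, the check that the weights sum to 1 reflects the assumption that they are category proportions, and the values in the example are made up.

```python
def free_rate_suffix(weights, rates, tol=1e-6):
    """Format +Rk{w1,r1,...,wk,rk} after basic sanity checks."""
    if len(weights) != len(rates) or not weights:
        raise ValueError("need one rate per weight")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1.0")
    pairs = []
    for w, r in zip(weights, rates):
        pairs += ["%g" % w, "%g" % r]   # weight first, then its rate
    return "+R%d{%s}" % (len(weights), ",".join(pairs))


def mean_rate(weights, rates):
    # Weighted average rate across all categories.
    return sum(w * r for w, r in zip(weights, rates))


print("GTR" + free_rate_suffix([0.25, 0.75], [0.4, 1.2]))
print(mean_rate([0.25, 0.75], [0.4, 1.2]))
```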

NOTICE: For the +G model IQ-TREE implements the mean approximation approach (Yang, 1994). The same is done in RAxML and PhyML. However, some programs like TREE-PUZZLE implement the median approximation approach, which makes the resulting log-likelihood not comparable. IQ-TREE can change to this approach via the -gmedian option.
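To see how the mean and median approximations differ, the Python sketch below computes the category rates of a discrete Gamma distribution with mean 1 in both ways, following the description in Yang (1994). It is a self-contained illustration using a power series for the incomplete gamma function and bisection for the quantiles; all function names are hypothetical and this is not the code used by IQ-TREE, RAxML, PhyML or TREE-PUZZLE.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(p, alpha):
    """Rate r with CDF(r) = p for a Gamma(shape=alpha, rate=alpha) variable."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):                 # plain bisection
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma(alpha, k=4, median=False):
    """Rates of k equally probable categories (mean or median approach)."""
    if median:
        rates = [gamma_quantile((2 * i + 1) / (2.0 * k), alpha) for i in range(k)]
        scale = k / sum(rates)           # rescale so that the mean rate is 1
        return [r * scale for r in rates]
    cuts = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # P(alpha + 1, alpha * cut) is the integral of r * density up to the cut.
    upper = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (upper[i + 1] - upper[i]) for i in range(k)]


print(discrete_gamma(0.5, 4))               # mean approach
print(discrete_gamma(0.5, 4, median=True))  # median approach
```

Both variants yield rates that average to 1, but the individual category rates differ, which is why log-likelihoods obtained under the two approaches should not be compared directly.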
