Nucleotide Substitution

This simulation illustrates the true and estimated divergence between two DNA sequences based on an explicit model of mutational change.

The input parameters are the two transition rates and the four possible transversion rates, along with the equilibrium nucleotide frequencies. Together, these parameters make up a nucleotide substitution model. The Example model parameters radio buttons set all of the model parameters at once to correspond to one of the commonly used nucleotide substitution models. For example, the JC69 button sets all four base frequencies to be equal and all six of the base change rates to be equal, as assumed under that model.

The Jukes-Cantor or JC69 model assumes that all types of substitutions occur at one rate, and that equilibrium base frequencies are all 25%. The Kimura 1980 or K80 model (also called the Kimura two-parameter or K2P model) assumes that transitions and transversions occur at different rates, and that equilibrium base frequencies are all 25%. The Tamura 92 model is an extension of the K2P model: it also assumes different rates of transition and transversion, but allows equilibrium base frequencies to be unequal through G+C content bias. The Tamura-Nei model assumes that the two types of transitions (between purines and between pyrimidines) occur at different rates, that all transversions share one rate, and that equilibrium base frequencies are unequal.
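The relationship among these four models can be sketched as one rate-matrix builder with progressively fewer constraints. This is a minimal illustration, not the simulation's own code: the function names, the default rate values, and the convention of multiplying each rate by the frequency of the target base are assumptions made for the example.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def rate_matrix(freqs, purine_ts, pyrimidine_ts, transversion):
    """Build a 4x4 instantaneous rate matrix as a nested dict q[from][to].

    Assumed convention: the rate of change i -> j is a rate multiplier
    times the equilibrium frequency of the target base j.
    """
    q = {}
    for i in BASES:
        q[i] = {}
        for j in BASES:
            if i == j:
                continue
            if i in PURINES and j in PURINES:
                rate = purine_ts          # A <-> G transition
            elif i not in PURINES and j not in PURINES:
                rate = pyrimidine_ts      # C <-> T transition
            else:
                rate = transversion       # purine <-> pyrimidine
            q[i][j] = rate * freqs[j]
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i].values())
    return q


EQUAL_FREQS = {b: 0.25 for b in BASES}


def jc69(rate=1.0):
    # One rate for every change, equal base frequencies.
    return rate_matrix(EQUAL_FREQS, rate, rate, rate)


def k80(transition=2.0, transversion=1.0):
    # Separate transition and transversion rates, equal base frequencies.
    return rate_matrix(EQUAL_FREQS, transition, transition, transversion)


def t92(gc=0.6, transition=2.0, transversion=1.0):
    # Like K80, but base frequencies are set by the G+C content.
    freqs = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    return rate_matrix(freqs, transition, transition, transversion)


def tn93(freqs, purine_ts=3.0, pyrimidine_ts=2.0, transversion=1.0):
    # Two transition rates, one transversion rate, arbitrary base frequencies.
    return rate_matrix(freqs, purine_ts, pyrimidine_ts, transversion)
```

Under this parameterization, JC69, K80, and Tamura 92 are all obtained from the Tamura-Nei form by constraining rates or base frequencies to be equal.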

After setting the model parameters, press Run and view the graph. The x-axis is time and the y-axis is divergence (the proportion of sites diverged). One line in the plot shows the apparent divergence, or p-distance, between the two sequences. Compare this with the line showing the actual amount of divergence, which is based on the total number of sites that have experienced substitutions in the two sequences. The plot also shows four model-corrected estimates of divergence: JC69, K80 or K2P, Tamura 92, and Tamura-Nei 93. (Line styles in the plot vary with how the simulation is implemented; see the plot legend.)
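As a rough illustration of how a p-distance is turned into a model-corrected estimate, the sketch below applies the standard JC69 and K80 distance formulas to a pair of aligned sequences. The function names and the two short example sequences are hypothetical, and the simulation itself may compute these quantities differently.

```python
import math

PURINES = {"A", "G"}


def difference_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites differing by a transition
    and by a transversion in two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Same chemical class (both purines or both pyrimidines) = transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


def jc69_distance(p):
    """JC69 correction: d = -3/4 * ln(1 - 4p/3)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(p_ts, q_tv):
    """K80 correction: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    return -0.5 * math.log((1.0 - 2.0 * p_ts - q_tv) * math.sqrt(1.0 - 2.0 * q_tv))


# Hypothetical example sequences (two transition differences in ten sites).
seq1 = "ACGTACGTAC"
seq2 = "ACGTGCGTAT"
p_ts, q_tv = difference_proportions(seq1, seq2)
p = p_ts + q_tv  # apparent divergence (p-distance)
print("p-distance:", p)
print("JC69 estimate:", jc69_distance(p))
print("K80 estimate:", k80_distance(p_ts, q_tv))
```

Both corrections return values larger than the p-distance, and both become undefined (the logarithm's argument reaches zero or below) once the observed differences approach their saturation level.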

Under all nucleotide substitution models, apparent divergence approaches an asymptote and levels off as time increases, while the true divergence continues to increase linearly. (It is possible for the true divergence to exceed 1.0 or 100% since it is based on the number of sites that experience a substitution, and each site can experience more than one substitution.) Apparent divergence saturates because substitutions occur at sites that have already experienced a substitution, so-called "multiple hits."

The correction provided by a given distance metric in the graph is more or less accurate depending on the underlying nucleotide substitution model. The model-corrected divergence estimates all provide some degree of improvement over the p-distance, bringing the estimate of divergence closer to the actual amount of divergence. How effective a nucleotide substitution model is at correcting the apparent divergence depends on the pattern of substitution in the evolving sequences. When the sequences evolve with a simple pattern (e.g. one rate of change and all equilibrium base frequencies equal to 0.25), JC69 does a decent job as a divergence correction. When the sequences evolve with a more complex pattern, the nucleotide substitution models with more parameters often do a better job. Note, however, that nucleotide substitution models with more parameters are not necessarily better divergence corrections.
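The saturation of apparent divergence can be made concrete for the simplest case. Assuming sequences that evolve exactly under JC69, the expected p-distance for a true divergence of d substitutions per site is 3/4 * (1 - exp(-4d/3)), which can never exceed 0.75. The short sketch below tabulates this; the function name and the chosen values of d are illustrative only.

```python
import math


def expected_p_distance_jc69(d):
    """Expected apparent divergence under JC69 for a true divergence d
    (expected substitutions per site between the two sequences)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# True divergence keeps growing; apparent divergence levels off below 0.75.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true divergence {d:4.1f}  apparent divergence {expected_p_distance_jc69(d):.3f}")
```

At small d the two values are nearly equal, which is why the lines in the plot start out together and separate only as multiple hits accumulate.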

For more background, see chapter 8 in Hamilton, 2009.

Divergence between two DNA sequences

Apparent divergence (p-distance) and divergence estimated with nucleotide substitution models

Equilibrium nucleotide frequencies

Rates of state changes between nucleotides
Transitions (purines)
r(A → G)
r(G → A)
Transitions (pyrimidines)
r(C → T)
r(T → C)
Transversions
r(A → C) r(C → A)
r(A → T) r(T → A)
r(G → C) r(C → G)
r(G → T) r(T → G)
Example model parameters

Nuc Substitution Plot