Package summary
This Julia package implements statistical inference using maximum likelihood estimation (MLE) to estimate a 4leaf phylogenetic tree from DNA sequence data. The assumed model is the CavendarFarrisNeyman (CFN) model from phylogenetics.
Unlike other methods, the implementation in this package does not rely on a heuristic search to solve the global optimization problem. Instead, the solution to the MLE problem is obtained through direct computation of the critical points of the likelihood function using both analytical methods as well as tools from numerical algebraic geometry.
Quickstart instructions
First, open a Julia REPL from the command line by running
julia threads=auto
The optional flag threads=auto
is recommended for best performance. (Once Julia is running, you can
check the number of threads with Threads.nthreads()
.)
Second, to get the package and all dependencies, run
using Pkg; Pkg.add("FourLeafMLE")
This step may take a few minutes.
Finally, load the package by running
using FourLeafMLE
You can now run all the functions in the export list of FourLeafMLE.jl. Note that the first time you run a function in Julia, it will take some time to precompile.
As an example, suppose our data take the form of the site pattern frequency vector
[123,43,54,83,21,32,24,51]
. We can perform maximum likelihood estimation on this data by by
running
fourLeafMLE([123,43,54,83,21,32,24,51])
To obtain a list of all local maxima over both interior and boundary components of the model, run
listMaxima([123,43,54,83,21,32,24,51])
To replicate Figure 1 in this readme, run
sequence_length=1000
x_values, y_values, categorical_results = generate_classification_plot_data(sequence_length, .01, .01, .99, .99, 300, 300, Hadamard_mode=true)
make_classification_plot(sequence_length, x_values, y_values, categorical_results, Hadamard_mode=true)
This will save an .svg image of the plot to the working directory.
Additional documentation
For more information about the estimation functions, see the function docstrings, for example by running
@doc fourLeafMLE
A detailed overview of how to interpret the output of the estimation functions is also povided in supplementaldocumentation.pdf
Background: tree topologies and the tree of life
One of the core motivations in evolutionary biology is to reconstruct the tree of life that describes how species have evolved over time. A common approach is to use DNA sequence alignment data from presentday taxa to infer a phylogenetic tree, understood as a leaflabelled tree with edges representing ancestral populations, vertices representing common ancestors, and leaves representing extant taxa. The edge weights, or branch lengths represent some measure of evolutionary time (usually expected number of mutations per site), and the tree topology describes clade structure of the taxa. For instance, Figure 1 has four leaves that are labeled in such a way to indicate species $1$ and $2$ have a more recent common ancestor with one another than with species $3$ or $4$.
A phylogenetic tree thus represents a hypothesis about the evolutionary history of a set of taxa; the statistical inference problem is to infer the tree topology and branch lengths from a DNA sequence of fixed length. Our implementation solves polynomial systems to perform this inference.
Maximum likelihood estimation and quartet methods
A standard approach to inferring a phylogenetic tree is to use maximum
likelihood estimation. By summarizing data as a vector $(u_1,\dots,u_n)$
of counts, the goal is to maximize the likelihood of the data. In other
words, we seek the value of $\theta$ that maximizes
$p_1^{u_1}\cdots p_n^{u_n}$ where $p_i$ is the probability of observing
event $i$ and each $p_i$ depends on the parameters $\theta$ of the
model. The maximum likelihood problem addressed in fourLeafMLE
involves estimating the branch lengths and unrooted tree topology of a
4leaf tree from sequence data of a fixed length $k$ generated according
to the CavendarFarrisNeyman (CFN) model [12].
Considerable research has focused on understanding the properties of maximum likelihood estimation. Even in the simplest cases of 3 and 4leaf trees, the problem exhibits substantial complexity, with a general solution known for 3leaf trees, but not 4leaf trees [1,4,6,7,8,9,11,16]. The $4$leaf case considered here is of special interest first due to the popularity of quartetbased inference methods for inferring both phylogenetic trees and networks [15], and second because it is the simplest case in which the phenomenon of longbranch attraction can be observed, a form of estimation bias which is only partially understood [2,14].
Two challenges in optimizing the likelihood function are its nonconvex nature and the presence of numerous boundary cases during optimization. Moreover, there can be multiple distinct maximizers of the likelihood function, and the maximizers can have branch lengths which are infinitely long or zero [5,13]. This package resolves these two obstacles for the CFN model by using computer algebra and the theory of maximum likelihood (ML) degrees [10]. One feature distinguishing this software is that it implements a framework to characterize case when the tree estimates have infinite or zerolength branches. We believe this is useful in obtaining an understanding of the ways that maximum likelihood inference can fail, and as a first step leads to Figure 1.
Running the maximum likelihood estimator
Our main contribution puts algebra into practice by implementing a custom tailored solver for algebraic statisticians. Given data in the form of a site frequency vector, the main functionality is to return a list of global maximizers of the likelihood function. The output includes information about each of the maximizers, for example the tree configuration and optimal branch lengths.
SITE_PATTERN_DATA = [212, 107, 98, 115, 114, 89, 102, 163]
fourLeafMLE(SITE_PATTERN_DATA)
which has output
1element Vector{Vector{Any}}:
[2036.1212979788797,
"R1",
[1],
[0.1380874841, 0.46347429951, 0.5231552324, 0.3975875363, 0.6835395124],
"θ1, θ2, θ3, θ4, θ5",
"binary quartet with topology τ=1234 and edge parameters (θ₁,...,θ₅) = (0.1380874841,...,0.6835395124)"]
For complete details on the code, see the code documentation provided here, but now we
give a broad overview about our software. The CFN model is a
parameterized semialgebraic set in $\mathbb{R}^8$ with dimension $6$. We
optimize the likelihood function by partitioning the semialgebraic set
according to boundaries from the parameterization, similiar to the
approach in [1] for a different model. There are $10$
different boundaries up to symmetry. All but one of them have ML degree
less than two. The last one has ML degree 92 while the main component of
the semialgebraic set has ML degree fourteen.^{1} We optimize the
likelihood function on each of the boundary cases by solving the
likelihood equations [10] by using analytic expressions when the
ML degree is less than two and
Homotopy Continuation.jl
[@HomotopyContinuation.jl] to compute the
critical points otherwise.
Putting into practice: Visualizing the geometry of data
Our method is fast. On a desktop machine with 12 i510400 CPUs
(2.90GHz), the fourLeafMLE
solves on average 7.8 global optimization
problems per second. This is sufficiently fast to create highquality
visualizations of the geometry of the maximum likelihood problem.
For example, consider 4leaf trees of the following form:
The following plot consists of a grid of $90,000$ points, each representing a choice of Hadamard edge length parameters $\theta_x,\theta_y$ (see [12]) for the above tree:
Tests
To run the tests in test/runtests.jl, open Julia with working directory path/to/FourLeafMLE/
and run
using Pkg; Pkg.activate(".")
Pkg.test("FourLeafMLE")
Note this will return a (harmless) deprecation warning about the use of LU
. Most of the tests involve
generating random 4leaf trees and then performing maximum likelihood estimation on each tree using the
model as data. When a test is passed, that means the randomlygenerated trees were all (or almost all)
correctly estimated. Some of these tests will fail on occasion due to approximation errors in the
arithmetic. Additional commentary and details about failure rates and causes can be found in
test/runtests.jl
Citations

Allman ES, Rhodes JA. Phylogenetic invariants for the general Markov model of sequence mutation. Mathematical Biosciences. 2003;186(2):113144. doi:10.1016/j.mbs.2003.08.004

Bergsten J. A review of longbranch attraction. Cladistics. 2005;21(2):163193.

Breiding P, Timme S. HomotopyContinuation.jl: A Package for Homotopy Continuation in Julia. In: International Congress on Mathematical Software. Springer; 2018:458465.

Chor B, Hendy M, Penny D. Analytic solutions for three taxon ML trees with variable rates across sites. Discrete Applied Mathematics. 2007;155(67):750758.

Chor B, Hendy MD, Holland BR, Penny D. Multiple Maxima of Likelihood in Phylogenetic Trees: An Analytic Approach. Molecular Biology and Evolution. 10 2000;17(10):15291541. doi:10.1093/oxfordjournals.molbev.a026252

Chor B, Snir S. Molecular clock fork phylogenies: Closed form analytic maximum likelihood solutions. Systematic Biology. 2004;53(6):963967.

GarciaPuente LD, Porter J. Small Phylogenetic Trees https://www.coloradocollege.edu/aapps/ldg/smalltrees/smalltrees_0.html. Published online 2007.

Hill M, Roch S, Rodriguez JI. Maximum Likelihood Estimation for Unrooted 3Leaf Trees: An Analytic Solution for the CFN Model. bioRxiv. Published online 2024:20242002.

Hobolth A, Wiuf C. Maximum likelihood estimation and natural pairwise estimating equations are identical for three sequences and a symmetric 2state substitution model. Theoretical Population Biology. Published online 2024.

Hoşten S, Khetan A, Sturmfels B. Solving the likelihood equations. Found Comput Math. 2005;5(4):389407. doi:10.1007/s1020800401568

Kosta D, Kubjas K. Maximum likelihood estimation of symmetric groupbased models via numerical algebraic geometry. Bull Math Biol. 2019;81(2):337360. doi:10.1007/s1153801805232

Semple C, Steel M, Phylogenetics. Vol 24. Oxford University Press on Demand; 2003.

Steel M. The Maximum Likelihood Point for a Phylogenetic Tree is not Unique. Systematic Biology. 1994;43(4):560564. Accessed August 21, 2023. http://www.jstor.org/stable/2413552

Susko E, Roger AJ. Long Branch Attraction Biases in Phylogenetics. Systematic Biology. 02 2021;70(4):838843. doi:10.1093/sysbio/syab001

Warnow T. Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge University Press; 2017. doi:10.1017/9781316882313

Yang Z. Complexity of the simplest phylogenetic estimation problem. Proceedings of the Royal Society of London Series B: Biological Sciences. 2000;267(1439):109116.

The numbers 14 and 92 were computed previously in [7]. ↩