FourLeafMLE.V
— MethodV(θ,τ)
Description
Auxilliary function. Given a vector θ of Hadamard parameters and a topology τ∈{1,2,3} (= 13|23,12|34,14|23), output a vector consisting of those products of the Hadamard parameters (corresponding to certain paths on the quartet with topology τ) which are used to compute the probability of the site patterns.
Example
`[V(θ,τ) for τ in 1:4]'
To see the effect symbolically, run
`@var θ[1:8]; computeProbabilityVector(θ,1)'
FourLeafMLE.computeProbabilityVector
— MethodcomputeProbabilityVector(θ,τ)
Description
Equivalent to the function `computeprobabilityvector', but faster. So see the documentation for that function.
FourLeafMLE.compute_R1_or_R2_critical_points_with_homotopy
— MethodcomputeR1orR2criticalpointswithhomotopy(SITEPATTERN_DATA,τ)
Arguments
- `SITEPATTERNDATA' : length 8 vector of site pattern counts
- `τ' : the tree topology (takes values 1,2,3 or 4). See function description.
Description:
Use HomotopyContinuation to compute the nonnegative real critical points to the score equations for the CFN model on a quartet tree with topology τ∈{1,2,3} (corresponding to 13|23, 12|23, 14|23 respectively) given data. The case τ=4 corresponds to the star tree with four leaves.
Output:
A list of critical points for the specified data and topology. Each critical point is a probability measure which may or may not correspond to a tree with finite and positive branch lengths.
FourLeafMLE.compute_all_complex_critical_points
— MethodSolve a random system. To get a start system for each type.
FourLeafMLE.compute_coefficient_matrix
— MethodCompute the matrix of coefficients used for computing likelihoods for a quartet with topology τ. (They're Hadamard matrices. See Semple and Steel, chapter 8.6).
FourLeafMLE.compute_hadamard_parameters
— Methodcomputehadamardparameters(p,τ)
Description
Calculate the Hadamard edge parameters θ such that the CFN process run on a 4-leaf tree with topology τ and edge parameters θ induces the site pattern distribution given by p.
FourLeafMLE.compute_probability_vector
— Methodcompute_probability_vector(θ,τ)
Arguments
`θ' : A vector θ = [θ₁,...,θ₅] Hadamard branch parameters.
`τ' : A number in {1,2,3,4}, corresonding to the topology of the tree. Here, 1,2,3 correspond to quartet topologies 13|23, 12|34, and 14|23 respectively, and 4 correspond to the star tree topology (a four-leaf tree with internal branch length zero.
Description
Compute the CFN probability model corresponding to the 4-leaf tree with topology τ and branch lengths θ. Uses Formula from Semple and Steel's book on Phylogenetics. See Cor 8.6.6 on p201.
Examples
`computeprobabilityvector([.2,.3,.02,.4,.7],1)'
To display the formulas symbolically, run
`@var θ[1:5]; computeprobabilityvector(θ,1)'
FourLeafMLE.compute_x_vector
— Methodcomputexvector(p,τ)
Desccription:
See in-line code commentary.
FourLeafMLE.euclidean_distance
— MethodCompute the Euclidean distance between two vectors x and y.
FourLeafMLE.fourLeafMLE
— MethodfourLeafMLE(SITE_PATTERN_DATA)
Description
Attempt to return a unique global maximum by computing the local maxima over the interior of the model as well as all submodels, and returning a list of only those which maximize the likelihood.
Arguments
SITE_PATTERN_DATA
: A site frequency vector of length 8, specifying the number of times each site pattern is observed in the data. The patterns are ordered as[xxxx, xxxy, xxyx, xxyy, xyxx, xyxy, xyyx, xyyy]
where x and y signify different nucleotides. For example, the vectorSITE_PATTERN_DATA=[14,2,5,10,7,4,3,6]
means that pattern xxxx is observed 14 times, pattern xxxy is observed 2 times, pattern xxyx is observe 5 times, and so forth.
Output
A list of vectors corresponding to trees (or submodels) which maximize the likelihood. Each vector corresponds to a located maximum and takes the form
[logL,
reduced tree class,
list of compatible binary quartet topologies,
branch lengths (excluding those which equal 0 or 1),
branch length names,
labels,
verbal description]
Here, the 'reduced tree class' refers to a scheme developed by the authors to categorize submodels by the type of computations required to find their local maxima. There are 10 nontrivial classes (i.e., which having nonzero likelihoods for generic data), denoted R1, R2, ..., R10. For example, R1 = binary quartet, R2 = star tree, and R3 through R10 refer to additional submodel types whose details can be found in the comments of the file aux-functions-for-R3-through-R10.jl
.
If two or more local maxima are found which maximize the likelihood, then this function will return more than one maximum likelihood estimator. There are two possible reasons for this (1) the MLE is indeed not unique, or (2) the floating point arithmetic is insufficiently precise to distinguish which maxima has greater likelihood.
Example usage:
random_hadamard_edge_parameters=rand(5)
test_model=computeProbabilityVector(random_hadamard_edge_parameters,1)
N=1000
SITE_PATTERN_DATA = rand(Multinomial(N,test_model))
fourLeafMLE(SITE_PATTERN_DATA)
listMaxima(SITE_PATTERN_DATA)
FourLeafMLE.generate_averaged_classification_plot_data
— Methodgenerate_averaged_classification_plot_data(sequence_length,
m,
lower_x, lower_y, upper_x, upper_y,
number_of_x_values, number_of_y_values;
Hadamard_mode=false)
Description
Experimental function (along with make_averaged_classification_plot
) to produce a color-averaged Felsentstein plot representing many replicates at each pixel.
FourLeafMLE.generate_classification_plot_data
— Methodgenerate_classification_plot_data(sequence_length,
lower_x, lower_y, upper_x, upper_y,
number_of_x_values, number_of_y_values;
Hadamard_mode=false)
Description
For each pair (x,y) with x and y within some user-specified range, do the following:
(1) generate a single datapoint of length sequence_length
bp from a quartet tree model with topology 12|34 such that the branch lengths of leaves 1 and 3 are y and all other branch lengths are x,
(2) infer the maximum likelihood tree estimate(s) for that datapoint,
(3) determine which binary quartet topologies are compatible with the resulting maximum likelihood estimate (τ=1 is 12|34, τ=2 is 13|23, and τ=3 is 14|23).
Then save the pair (x,y) together with the inferred compatible topology classification.
Arguments
sequence_length
: length in bp of the data to be generated for each parameter regime (x,y)lower_x
,lower_y
,upper_x
,upper_y
: these specify the lower and upper bounds for the values of x and ynumber_of_x_values, number_of_y_values
: the number of (evenly space) values to be chosen and sampled from the interval with specified upper and lower bounds. This determines the step size and hence the resolution of the plot; higher values will result in a plot that is less pixelated but will take longer to produce.Hadamard_mode
: Optional parameter which determines which of 2 modes the function is run in. WhenHadamard_mode = false
, x and y are interpreted as being branch lengths measured in expected number of mutations per site (i.e., evolutionary distance). When `Hadamard_mode = true1, x and y are interpreted as being Hadamard edge parameters, (i.e., which are obtained by applying the function d -> exp(-2d) to the evolutionary distances).
FourLeafMLE.generate_sample_complexity_plot_data
— Methodgenerate_sample_complexity_plot_data
Description
Experimental function to ascertain the relationship between the observed long- branch attraction effects, and known sample complexity lower bounds on the amount of data required for high probability of correct inference.
Warning
This function is experimental and not fully tested.
FourLeafMLE.get_p
— Methodgetp(newtonresult)
Extract the probability measure from the Newton-refined critical point. Note this is just the first 8 elements of the critical point (the rest of the entries are Lagrange multipliers that we don't need anymore). Since the critical points are known to be real, take only the real part.
FourLeafMLE.isNonnegative
— MethodReturn true iff all elemnts of x are nonnegative.
FourLeafMLE.isValid
— MethodisValid(arg1,arg2,optional_arg3)
Description
Check that the critical point corresponds to nonnegative or positive Hadamard values. If these conditions are not satisfied, then the model should be thown out because it's not an FBM critical point (but is rather an IBM critical point).
FourLeafMLE.listMaxima
— MethodlistMaxima(SITE_PATTERN_DATA)
Description
Return a list of local maxima or critical points, sorted by log-likelihood. The only difference with the function fourLeafMLE()
is that listMaxima
does not filter out critical points which do not achieve the global optimum. See the documentation for fourLeafMLE()
.
Example usage:
random_hadamard_edge_parameters=rand(5)
test_model=computeProbabilityVector(random_hadamard_edge_parameters,1)
N=1000
SITE_PATTERN_DATA = rand(Multinomial(N,test_model))
listMaxima(SITE_PATTERN_DATA)
FourLeafMLE.make_averaged_classification_plot
— Methodmake_averaged_classification_plot(sequence_length,
x_values, y_values, scores;
topology=nothing,
marker_size=nothing,
Hadamard_mode=false)
Description
Using data from generate_averaged_classification_plot_data
, makes a Felsenstein plot with points representing the propotion of samples whose MLE is concordant with the specified topology.
Output
Saves a plot in the directory `plots/'
FourLeafMLE.make_classification_plot
— Methodmake_classification_plot(sequence_length,
x_values, y_values,
categorical_data;
marker_size=nothing,
Hadamard_mode=false)
Description
Makes a plot using data obtained from the function generate_classification_plot_data
.
Arguments
sequence_length
: the size of the data, in base pairs, to be generated for each parameter regime (i.e., for each point on the plot)x_values
,y_values
,categorical_data
: the output ofgenerate_classification_plot_data
.marker_size
: Optional parameter. Determines the size of the squares made by the scatter plot. Depending on the step size (which is automatically deduced from the vectorsx_values
andy_values
), this may need to be adjusted. Ideally,marker_size
should be large enough that the squares are visible and are almost or exactly adjacent, but not so large that they overlap. If no value is supplied by the user, a sensible default will be chosen, which is 171 divided bymax(number_of_x_values,number_of_y_values)
.Hadamard_mode
: Optional parameter. Specifes the plotting mode. If true, thenx_values
andy_values
are interpreted as Hadamard edge parameters. If false, then they are interpreted to be branch length distances measured in expected number of mutations per site. This affects the axes and labels of the plot. In order to obtain good-looking plots, the value of this variable should be the same as the value used when generating the data (otherwise the boxes won't be evenly spaced).
Output
A plot like thos shown in the abstract. The plot is saved as an .svg file to "plots/hadamard-plot.svg".
Example
To replicate the figure in the abstract, run
sequence_length=1000
x_values, y_values, categorical_results = generate_classification_plot_data(sequence_length, .01, .01, .99, .99, 300, 300, Hadamard_mode=true)
make_classification_plot(sequence_length, x_values, y_values, categorical_results, Hadamard_mode=true);
This will save an .svg image of the plot to the working directory.
FourLeafMLE.make_sample_complexity_plot
— Methodmake_sample_complexity_plot
Description
Experimental function to ascertain the relationship between the observed long- branch attraction effects, and known sample complexity lower bounds on the amount of data required for high probability of correct inference.
Warning
This function is experimental and not fully tested.
FourLeafMLE.refine_critical_points
— MethodRefine all critical points by applying Newton's method. Output: a list of refined probability measures.
FourLeafMLE.test_data_genericity
— Methodtest_data_genericity(SITE_PATTERN_DATA)
Description
Test whether a data vector is generic. Here, 'generic' means that for each pair of leaves, i and j, there exists at least one site such that the base pair at leaf i differs from the base pair at leaf j. The functions in this package cannot handle non-generic data without errors.
Examples
test_data_genericity([1,0,0,0,0,0,0,0])
test_data_genericity([1,3,5,0,0,9,5,4])
FourLeafMLE.test_random_binary_quartet_model
— Functiontestrandombinaryquartetmodel(branchlengthdistribution)
Description
Test whether the edge parameters can be correctly estimated from the true model when the true model is a quartet tree with random topology and uniformly-chosen random edge parameters. Success occurs when (1) the estimate is unique, (2) its topology is compatible with that of the true topology model parameter, (3) a boundary model is not inferred, and (4) the Euclidean distance between the estimated edge parameters and true edge parameters is smaller than 1e-10. This test is probabilistic and occasionally fails due to the limits of approximate arithmetic on very pathological trees (i.e. trees very close to the boundary of the model.
Arguments
branch_length_distribution
: the probability distribution from which branch lengths are drawn. Defaults to uniform.
Output
If all 4 conditions are met, the test returns 'nothing', otherwise it returns a vector containing details about what went wrong.
FourLeafMLE.test_random_infinite_branch_model
— Methodtestrandominfinitebranchmodel(N_branches)
Description
Test whether the estimator returns a unique MLE whose estimated topology is compatible with the true tree, when the model one or more more infinitely-long branches (and the model is used as the data)
Arguments
N_branches
: Number of branches of infinite length (the branches are randomly-chosen)
Output
Returns nothing
if the test is successful. Otherwise it returns a vector containing details about what went wrong.
FourLeafMLE.test_random_zero_branch_model
— Functiontestrandomzerobranchmodels(branchlengthdistribution)
Description
Generates a random tree tree model (with random topology, and random branch lengths, and with 1-2 randomly-chosen branches having lengths zero evolutionary distance). Using the model as data, it then tests whether the estimator returns a unique MLE whose estimated topology is consistent with the true tree.
Arguments (optional)
branch_length_distribution
: the probability distribution from which branch lengths are drawn. Defaults to uniform.
Output
If the MLE is not unique, or if the MLE incorrectly estimated the topology, a vector is returned which contains details about what happened. Otherwise, the function returns nothing.