FourLeafMLE.VMethod

V(θ,τ)

Description

Auxilliary function. Given a vector θ of Hadamard parameters and a topology τ∈{1,2,3} (= 13|23,12|34,14|23), output a vector consisting of those products of the Hadamard parameters (corresponding to certain paths on the quartet with topology τ) which are used to compute the probability of the site patterns.

Example

`[V(θ,τ) for τ in 1:4]'

To see the effect symbolically, run

`@var θ[1:8]; computeProbabilityVector(θ,1)'

FourLeafMLE.computeProbabilityVectorMethod

computeProbabilityVector(θ,τ)

Description

Equivalent to the function `computeprobabilityvector', but faster. So see the documentation for that function.

FourLeafMLE.compute_R1_or_R2_critical_points_with_homotopyMethod

computeR1orR2criticalpointswithhomotopy(SITEPATTERN_DATA,τ)

Arguments

  • `SITEPATTERNDATA' : length 8 vector of site pattern counts
  • `τ' : the tree topology (takes values 1,2,3 or 4). See function description.

Description:

Use HomotopyContinuation to compute the nonnegative real critical points to the score equations for the CFN model on a quartet tree with topology τ∈{1,2,3} (corresponding to 13|23, 12|23, 14|23 respectively) given data. The case τ=4 corresponds to the star tree with four leaves.

Output:

A list of critical points for the specified data and topology. Each critical point is a probability measure which may or may not correspond to a tree with finite and positive branch lengths.

FourLeafMLE.compute_coefficient_matrixMethod

Compute the matrix of coefficients used for computing likelihoods for a quartet with topology τ. (They're Hadamard matrices. See Semple and Steel, chapter 8.6).

FourLeafMLE.compute_hadamard_parametersMethod

computehadamardparameters(p,τ)

Description

Calculate the Hadamard edge parameters θ such that the CFN process run on a 4-leaf tree with topology τ and edge parameters θ induces the site pattern distribution given by p.

FourLeafMLE.compute_probability_vectorMethod
compute_probability_vector(θ,τ)

Arguments

  • `θ' : A vector θ = [θ₁,...,θ₅] Hadamard branch parameters.

  • `τ' : A number in {1,2,3,4}, corresonding to the topology of the tree. Here, 1,2,3 correspond to quartet topologies 13|23, 12|34, and 14|23 respectively, and 4 correspond to the star tree topology (a four-leaf tree with internal branch length zero.

Description

Compute the CFN probability model corresponding to the 4-leaf tree with topology τ and branch lengths θ. Uses Formula from Semple and Steel's book on Phylogenetics. See Cor 8.6.6 on p201.

Examples

`computeprobabilityvector([.2,.3,.02,.4,.7],1)'

To display the formulas symbolically, run

`@var θ[1:5]; computeprobabilityvector(θ,1)'

FourLeafMLE.fourLeafMLEMethod
fourLeafMLE(SITE_PATTERN_DATA)

Description

Attempt to return a unique global maximum by computing the local maxima over the interior of the model as well as all submodels, and returning a list of only those which maximize the likelihood.

Arguments

  • SITE_PATTERN_DATA : A site frequency vector of length 8, specifying the number of times each site pattern is observed in the data. The patterns are ordered as [xxxx, xxxy, xxyx, xxyy, xyxx, xyxy, xyyx, xyyy] where x and y signify different nucleotides. For example, the vector SITE_PATTERN_DATA=[14,2,5,10,7,4,3,6] means that pattern xxxx is observed 14 times, pattern xxxy is observed 2 times, pattern xxyx is observe 5 times, and so forth.

Output

A list of vectors corresponding to trees (or submodels) which maximize the likelihood. Each vector corresponds to a located maximum and takes the form


[logL,
 reduced tree class,
 list of compatible binary quartet topologies,
 branch lengths (excluding those which equal 0 or 1),
 branch length names,
 labels,
 verbal description]

Here, the 'reduced tree class' refers to a scheme developed by the authors to categorize submodels by the type of computations required to find their local maxima. There are 10 nontrivial classes (i.e., which having nonzero likelihoods for generic data), denoted R1, R2, ..., R10. For example, R1 = binary quartet, R2 = star tree, and R3 through R10 refer to additional submodel types whose details can be found in the comments of the file aux-functions-for-R3-through-R10.jl.

If two or more local maxima are found which maximize the likelihood, then this function will return more than one maximum likelihood estimator. There are two possible reasons for this (1) the MLE is indeed not unique, or (2) the floating point arithmetic is insufficiently precise to distinguish which maxima has greater likelihood.

Example usage:


random_hadamard_edge_parameters=rand(5)
test_model=computeProbabilityVector(random_hadamard_edge_parameters,1)
N=1000
SITE_PATTERN_DATA = rand(Multinomial(N,test_model))
fourLeafMLE(SITE_PATTERN_DATA)
listMaxima(SITE_PATTERN_DATA)
FourLeafMLE.generate_averaged_classification_plot_dataMethod
generate_averaged_classification_plot_data(sequence_length,
                                           m,
                                           lower_x, lower_y, upper_x, upper_y,
                                           number_of_x_values, number_of_y_values;
                                           Hadamard_mode=false)

Description

Experimental function (along with make_averaged_classification_plot) to produce a color-averaged Felsentstein plot representing many replicates at each pixel.

FourLeafMLE.generate_classification_plot_dataMethod
generate_classification_plot_data(sequence_length,
                                  lower_x, lower_y, upper_x, upper_y,
                                  number_of_x_values, number_of_y_values;
                                  Hadamard_mode=false)

Description

For each pair (x,y) with x and y within some user-specified range, do the following:

(1) generate a single datapoint of length sequence_length bp from a quartet tree model with topology 12|34 such that the branch lengths of leaves 1 and 3 are y and all other branch lengths are x,

(2) infer the maximum likelihood tree estimate(s) for that datapoint,

(3) determine which binary quartet topologies are compatible with the resulting maximum likelihood estimate (τ=1 is 12|34, τ=2 is 13|23, and τ=3 is 14|23).

Then save the pair (x,y) together with the inferred compatible topology classification.

Arguments

  • sequence_length : length in bp of the data to be generated for each parameter regime (x,y)

  • lower_x, lower_y, upper_x, upper_y : these specify the lower and upper bounds for the values of x and y

  • number_of_x_values, number_of_y_values : the number of (evenly space) values to be chosen and sampled from the interval with specified upper and lower bounds. This determines the step size and hence the resolution of the plot; higher values will result in a plot that is less pixelated but will take longer to produce.

  • Hadamard_mode : Optional parameter which determines which of 2 modes the function is run in. When Hadamard_mode = false, x and y are interpreted as being branch lengths measured in expected number of mutations per site (i.e., evolutionary distance). When `Hadamard_mode = true1, x and y are interpreted as being Hadamard edge parameters, (i.e., which are obtained by applying the function d -> exp(-2d) to the evolutionary distances).

FourLeafMLE.generate_sample_complexity_plot_dataMethod
generate_sample_complexity_plot_data

Description

Experimental function to ascertain the relationship between the observed long- branch attraction effects, and known sample complexity lower bounds on the amount of data required for high probability of correct inference.

Warning

This function is experimental and not fully tested.

FourLeafMLE.get_pMethod

getp(newtonresult)

Extract the probability measure from the Newton-refined critical point. Note this is just the first 8 elements of the critical point (the rest of the entries are Lagrange multipliers that we don't need anymore). Since the critical points are known to be real, take only the real part.

FourLeafMLE.isValidMethod

isValid(arg1,arg2,optional_arg3)

Description

Check that the critical point corresponds to nonnegative or positive Hadamard values. If these conditions are not satisfied, then the model should be thown out because it's not an FBM critical point (but is rather an IBM critical point).

FourLeafMLE.listMaximaMethod
listMaxima(SITE_PATTERN_DATA)

Description

Return a list of local maxima or critical points, sorted by log-likelihood. The only difference with the function fourLeafMLE() is that listMaxima does not filter out critical points which do not achieve the global optimum. See the documentation for fourLeafMLE().

Example usage:


random_hadamard_edge_parameters=rand(5)
test_model=computeProbabilityVector(random_hadamard_edge_parameters,1)
N=1000
SITE_PATTERN_DATA = rand(Multinomial(N,test_model))
listMaxima(SITE_PATTERN_DATA)
FourLeafMLE.make_averaged_classification_plotMethod
make_averaged_classification_plot(sequence_length,
                                  x_values, y_values, scores;
                                  topology=nothing,
                                  marker_size=nothing,
                                  Hadamard_mode=false)

Description

Using data from generate_averaged_classification_plot_data, makes a Felsenstein plot with points representing the propotion of samples whose MLE is concordant with the specified topology.

Output

Saves a plot in the directory `plots/'

FourLeafMLE.make_classification_plotMethod
make_classification_plot(sequence_length,
                         x_values, y_values,
                         categorical_data;
                         marker_size=nothing,
                         Hadamard_mode=false)

Description

Makes a plot using data obtained from the function generate_classification_plot_data.

Arguments

  • sequence_length : the size of the data, in base pairs, to be generated for each parameter regime (i.e., for each point on the plot)

  • x_values, y_values, categorical_data : the output of generate_classification_plot_data.

  • marker_size : Optional parameter. Determines the size of the squares made by the scatter plot. Depending on the step size (which is automatically deduced from the vectors x_values and y_values), this may need to be adjusted. Ideally, marker_size should be large enough that the squares are visible and are almost or exactly adjacent, but not so large that they overlap. If no value is supplied by the user, a sensible default will be chosen, which is 171 divided by max(number_of_x_values,number_of_y_values).

  • Hadamard_mode : Optional parameter. Specifes the plotting mode. If true, then x_values and y_values are interpreted as Hadamard edge parameters. If false, then they are interpreted to be branch length distances measured in expected number of mutations per site. This affects the axes and labels of the plot. In order to obtain good-looking plots, the value of this variable should be the same as the value used when generating the data (otherwise the boxes won't be evenly spaced).

Output

A plot like thos shown in the abstract. The plot is saved as an .svg file to "plots/hadamard-plot.svg".

Example

To replicate the figure in the abstract, run

    sequence_length=1000
    x_values, y_values, categorical_results = generate_classification_plot_data(sequence_length, .01, .01, .99, .99, 300, 300, Hadamard_mode=true)
    make_classification_plot(sequence_length, x_values, y_values, categorical_results, Hadamard_mode=true);

This will save an .svg image of the plot to the working directory.

FourLeafMLE.make_sample_complexity_plotMethod
make_sample_complexity_plot

Description

Experimental function to ascertain the relationship between the observed long- branch attraction effects, and known sample complexity lower bounds on the amount of data required for high probability of correct inference.

Warning

This function is experimental and not fully tested.

FourLeafMLE.test_data_genericityMethod
test_data_genericity(SITE_PATTERN_DATA)

Description

Test whether a data vector is generic. Here, 'generic' means that for each pair of leaves, i and j, there exists at least one site such that the base pair at leaf i differs from the base pair at leaf j. The functions in this package cannot handle non-generic data without errors.

Examples

test_data_genericity([1,0,0,0,0,0,0,0])

test_data_genericity([1,3,5,0,0,9,5,4])
FourLeafMLE.test_random_binary_quartet_modelFunction

testrandombinaryquartetmodel(branchlengthdistribution)

Description

Test whether the edge parameters can be correctly estimated from the true model when the true model is a quartet tree with random topology and uniformly-chosen random edge parameters. Success occurs when (1) the estimate is unique, (2) its topology is compatible with that of the true topology model parameter, (3) a boundary model is not inferred, and (4) the Euclidean distance between the estimated edge parameters and true edge parameters is smaller than 1e-10. This test is probabilistic and occasionally fails due to the limits of approximate arithmetic on very pathological trees (i.e. trees very close to the boundary of the model.

Arguments

  • branch_length_distribution : the probability distribution from which branch lengths are drawn. Defaults to uniform.

Output

If all 4 conditions are met, the test returns 'nothing', otherwise it returns a vector containing details about what went wrong.

FourLeafMLE.test_random_infinite_branch_modelMethod

testrandominfinitebranchmodel(N_branches)

Description

Test whether the estimator returns a unique MLE whose estimated topology is compatible with the true tree, when the model one or more more infinitely-long branches (and the model is used as the data)

Arguments

  • N_branches : Number of branches of infinite length (the branches are randomly-chosen)

Output

Returns nothing if the test is successful. Otherwise it returns a vector containing details about what went wrong.

FourLeafMLE.test_random_zero_branch_modelFunction

testrandomzerobranchmodels(branchlengthdistribution)

Description

Generates a random tree tree model (with random topology, and random branch lengths, and with 1-2 randomly-chosen branches having lengths zero evolutionary distance). Using the model as data, it then tests whether the estimator returns a unique MLE whose estimated topology is consistent with the true tree.

Arguments (optional)

  • branch_length_distribution : the probability distribution from which branch lengths are drawn. Defaults to uniform.

Output

If the MLE is not unique, or if the MLE incorrectly estimated the topology, a vector is returned which contains details about what happened. Otherwise, the function returns nothing.