Docstrings · Kpax3.jl

Kpax3.AminoAcidData — Type

Genetic data

Description

Amino acid data and its metadata.

Fields

data Multiple sequence alignment (MSA) encoded as a binary (UInt8) matrix
id units' ids
ref reference sequence, i.e. a vector of the same length as the original sequences storing the values of homogeneous sites. SNPs are instead represented by a value of 29
val vector with unique values per MSA site
key vector with indices of each value

Details

Let n be the total number of units and m_l be the total number of unique values observed at SNP l. Define m = m_1 + ... + m_L, where L is the total number of SNPs.

data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.

The value associated with row j can be obtained by val[j] while the SNP position by findall(ref .== 29)[key[j]].
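The lookup described above can be sketched as follows. The vectors ref, val and key below are toy values invented for illustration, not output from a real dataset, and the comparison is written with broadcasting (.==) so that it returns one Boolean per site.

```julia
# Toy vectors, invented for illustration only.
ref = UInt8[1, 29, 5, 29, 7]   # sites 2 and 4 are SNPs (marked by 29)
val = UInt8[3, 8, 2, 4, 9]     # values observed at the SNPs
key = [1, 1, 2, 2, 2]          # val[1:2] belong to the 1st SNP, val[3:5] to the 2nd

# Positions of the SNPs within the original sequences
snp_sites = findall(ref .== 29)

j = 4
value_j = val[j]               # value encoded by the j-th row of data
site_j = snp_sites[key[j]]     # site of the original alignment it refers to
```

Here key[j] selects which SNP the j-th value belongs to, and snp_sites maps that SNP back to its position in the full alignment.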

References

Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.

Kpax3.KSettings — Type

User-defined settings for a Kpax3 run

Description

Fields

Kpax3.CategoricalData — Method

CSV data

Description

Generic data and its metadata.

Fields

data original data matrix encoded as a binary (UInt8) matrix
id units' ids
ref reference observation, i.e. a vector of the same length as the original observations storing the values of homogeneous sites. Polymorphisms are instead represented by the string "."
val vector with unique values per dataset column
key vector with indices of each value

Details

Let n be the total number of units and m_l be the total number of unique values observed at polymorphic column l. Define m = m_1 + ... + m_L, where L is the total number of polymorphic columns.

data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.

The value associated with row j can be obtained by val[j] while the polymorphic position by findall(ref .== ".")[key[j]].

References

Kpax3.categorical2binary — Method

Convert categorical (string) data to binary

Description

Convert a string matrix to a binary (indicator) matrix.

Usage

categorical2binary(data, missval)

Arguments

data String matrix to be converted
missval Value to be considered missing

Value

A tuple containing the following variables:

bindata Original data matrix encoded as a binary (indicator) matrix
val vector with unique values per MSA site
key vector with indices of each value

Example

If data consists of just the following three units

  C A
A G C
C T
C C G

then bindata will be a 9-by-3 indicator matrix, with one row per value listed in val and one column per unit, while

val = ["A", "C", "A", "C", "G", "C", "T", "C", "G"] (i.e. AC ACG CT CG)
key = [ 1,   1,   2,   2,   2,   3,   3,   4,   4 ] (i.e. 11 222 33 44)

Values equal to missval (here missing data) are discarded.
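To make the relationship between data, bindata, val and key concrete, here is a minimal sketch of this kind of conversion. It is not the package's implementation: the function name indicator_sketch is hypothetical, rows are assumed to be sites and columns units, the missing value is assumed to be the empty string, and the position of the missing entry in the third row of the toy matrix is an assumption. Sites with no observed value are simply skipped, which may differ from how Kpax3 numbers key.

```julia
# Sketch only: not the package's implementation.
# Assumes rows of `data` are sites and columns are units.
function indicator_sketch(data::AbstractMatrix{<:AbstractString}, missval::AbstractString)
    val = String[]
    key = Int[]
    rows = Vector{UInt8}[]
    for site in axes(data, 1)
        # Unique non-missing values observed at this site, in sorted order
        observed = sort(unique(filter(!=(missval), data[site, :])))
        for v in observed
            push!(val, v)
            push!(key, site)
            # One indicator row per (site, value) pair: 1 where the unit has value v
            push!(rows, UInt8.(data[site, :] .== v))
        end
    end
    if isempty(rows)
        return zeros(UInt8, 0, size(data, 2)), val, key
    end
    # Stack the indicator rows into an m-by-n matrix
    bindata = permutedims(reduce(hcat, rows))
    return bindata, val, key
end

# Toy data modelled on the example above; "" marks a missing entry (assumed)
toy = ["" "C" "A"; "A" "G" "C"; "C" "T" ""; "C" "C" "G"]
bindata, val, key = indicator_sketch(toy, "")
```

With this toy matrix, val and key reproduce the vectors listed in the example, and each row of bindata flags the units that carry the corresponding value.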

Kpax3.categorical2binary — Method

Convert categorical (integer) data to binary

Description

Convert an integer matrix to a binary (indicator) matrix.

Usage

categorical2binary(data, maxval, missval)

Arguments

data Integer matrix to be converted
maxval Theoretical maximum value observable in data
missval Value to be considered missing

Value

A tuple containing the following variables:

bindata Original data matrix encoded as a binary (indicator) matrix
val vector with unique values per MSA site
key vector with indices of each value

Example

If data consists of just the following three units

then bindata will be equal to

while

val = [1, 2, 1, 2, 3, 2, 4, 2, 3] (i.e. 12 123 24 23)
key = [1, 1, 2, 2, 2, 3, 3, 4, 4] (i.e. 11 222 33 44)

0 values (here missing data) are discarded.

Kpax3.normalizepartition — Method

Remove "gaps" and non-positive values from the partition.

Example: [1, 1, 0, 1, -2, 4, 0] -> [3, 3, 2, 3, 1, 4, 2]
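One relabelling that reproduces the example is to replace each label by the rank of its value among the sorted unique labels. The sketch below assumes that rule; the name normalize_labels_sketch is hypothetical, and the package's own function may order the clusters differently on other inputs.

```julia
# Sketch only: relabel a partition with consecutive integers 1, 2, ..., k,
# ranking the original labels by their sorted order (an assumed rule that
# matches the documented example).
function normalize_labels_sketch(p::AbstractVector{<:Integer})
    labels = sort(unique(p))
    lookup = Dict(v => i for (i, v) in enumerate(labels))
    return [lookup[v] for v in p]
end

normalized = normalize_labels_sketch([1, 1, 0, 1, -2, 4, 0])
```

Units that shared a label before the call still share one afterwards; only the label values change.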

Kpax3.readfasta — Method

readfasta(ifile::String, protein::Bool, miss::Vector{UInt8}, l::Int, verbose::Bool, verbosestep::Int)

Read data in FASTA format and convert it to an integer matrix. Sequences are required to be aligned. Only polymorphic columns are stored.

Arguments

ifile Path to the input data file
protein true if reading protein data or false if reading DNA data
miss Characters (as UInt8) to be considered missing. Use miss = zeros(UInt8, 1) if all characters are to be considered valid. Default characters for miss are:

- DNA data: ?, *, #, -, b, d, h, k, m, n, r, s, v, w, x, y, j, z
- Protein data: ?, *, #, -, b, j, x, z

l Sequence length. If unknown, it is better to choose a value which is surely greater than the real sequence length. If l is found to be insufficient, the array size is dynamically increased (not recommended from a performance point of view). Default value should be sufficient for most datasets

verbose If true, print status reports
verbosestep Print a status report every verbosestep read sequences

Details

When computing evolutionary distances, don't put the gap symbol - among the missing values. Indeed, indels are an important piece of information for genetic distances.

FASTA data is encoded as standard 7-bit ASCII codes, using the lowercase code of each character. The only exception is Uracil, which is given the same value (116) as Thymidine, i.e. each 'u' is silently converted to 't' when reading DNA data. The conversion tables are the following:

+--------------------------------------+
| Conversion table (DNA)               |
+--------------------------------------+
| Nucleotide          | Code | Integer |
+---------------------+------+---------+
| Adenosine           | A    | 97      |
| Cytosine            | C    | 99      |
| Guanine             | G    | 103     |
| Thymidine           | T    | 116     |
| Uracil              | U    | 116     |
| Purine (A or G)     | R    | 114     |
| Pyrimidine (C or T) | Y    | 121     |
| Keto                | K    | 107     |
| Amino               | M    | 109     |
| Strong Interaction  | S    | 115     |
| Weak Interaction    | W    | 119     |
| Not A               | B    | 98      |
| Not C               | D    | 100     |
| Not G               | H    | 104     |
| Not T or U          | V    | 118     |
| Any                 | N    | 110     |
| Gap                 | -    | 45      |
| Masked              | X    | 120     |
+--------------------------------------+

+----------------------------------------------+
| Conversion table (PROTEIN)                   |
+----------------------------------------------+
| Amino Acid                  | Code | Integer |
+-----------------------------+------+---------+
| Alanine                     | A    | 97      |
| Arginine                    | R    | 114     |
| Asparagine                  | N    | 110     |
| Aspartic acid               | D    | 100     |
| Cysteine                    | C    | 99      |
| Glutamine                   | Q    | 113     |
| Glutamic acid               | E    | 101     |
| Glycine                     | G    | 103     |
| Histidine                   | H    | 104     |
| Isoleucine                  | I    | 105     |
| Leucine                     | L    | 108     |
| Lysine                      | K    | 107     |
| Methionine                  | M    | 109     |
| Phenylalanine               | F    | 102     |
| Proline                     | P    | 112     |
| Pyrrolysine                 | O    | 111     |
| Selenocysteine              | U    | 117     |
| Serine                      | S    | 115     |
| Threonine                   | T    | 116     |
| Tryptophan                  | W    | 119     |
| Tyrosine                    | Y    | 121     |
| Valine                      | V    | 118     |
| Asparagine or Aspartic acid | B    | 98      |
| Glutamine or Glutamic acid  | Z    | 122     |
| Leucine or Isoleucine       | J    | 106     |
| Gap                         | -    | 45      |
| Translation stop            | *    | 42      |
| Any                         | X    | 120     |
+----------------------------------------------+
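The DNA encoding in the tables above can be illustrated with a short sketch. It assumes that characters are lowercased before being stored (the integers in the tables are the lowercase ASCII codes) and that 'u' is mapped to 't'. The function name encode_dna_sketch is hypothetical, and readfasta itself may perform these steps differently.

```julia
# Sketch only: encode a DNA string as lowercase ASCII codes,
# mapping Uracil ('u') to the code of Thymidine ('t').
# Assumes the input contains ASCII characters only.
function encode_dna_sketch(seq::AbstractString)
    codes = UInt8.(collect(lowercase(seq)))
    codes[codes .== UInt8('u')] .= UInt8('t')
    return codes
end

dna_codes = encode_dna_sketch("ACGU-")
rna_same_as_dna = encode_dna_sketch("acgu") == encode_dna_sketch("ACGT")
```

Because 'u' and 't' share one code, an RNA sequence and its DNA counterpart produce the same integer vector.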

Value

A tuple containing the following variables:

data Multiple Sequence Alignment (MSA) encoded as a UInt8 matrix
id Units' ids
ref Reference sequence, i.e. a vector of the same length as the original sequences storing the values of homogeneous sites. SNPs are instead represented by a value of 46 ('.')
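The split between polymorphic columns (kept in data) and homogeneous sites (kept in ref) can be sketched as below. This is an illustration under simplifying assumptions, not the function's actual code: the alignment is already an in-memory UInt8 matrix with sites as rows and at least one unit as a column, missing characters are ignored, and the name split_sites_sketch is hypothetical.

```julia
# Sketch only: separate polymorphic sites from homogeneous ones.
# Rows of `msa` are sites, columns are units (assumed layout).
function split_sites_sketch(msa::AbstractMatrix{UInt8})
    # A site is polymorphic if any unit differs from the first unit
    polymorphic = [any(!=(msa[s, 1]), view(msa, s, :)) for s in axes(msa, 1)]
    ref = msa[:, 1]                      # copy of the first unit's sequence
    ref[polymorphic] .= UInt8('.')       # mark SNPs with 46 ('.')
    return msa[polymorphic, :], ref
end

toy_msa = UInt8.(['a' 'a' 'a'; 'c' 'g' 'c'; 't' 't' 't'])
snps, toy_ref = split_sites_sketch(toy_msa)
```

In this toy alignment only the second site varies, so it is the only row kept, and it is the only position marked with '.' in the reference.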

Kpax3.EwensPitman — Type

Ewens-Pitman distribution

Description

The Ewens-Pitman distribution is a discrete probability distribution on the partitions of the set N = {1, ..., n} (n = 1, 2, ...). It is a two-parameter generalization of the Ewens sampling formula. The latter is also known as the Chinese Restaurant Process or as the marginal distribution of a Dirichlet Process.

Fields

α Real number (see details)
θ Real number (see details)

Details

The two parameters must satisfy one of the following conditions:

α < 0 and θ = -Lα for some L ∈ {1, 2, ...}
0 ≤ α < 1 and θ > -α

The special case α = 0 and θ > 0 corresponds to the Ewens sampling formula.
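As a rough illustration of these constraints, the sketch below classifies a pair (α, θ) into the four cases that this page documents as EwensPitmanNAPT, EwensPitmanZAPT, EwensPitmanPAZT and EwensPitmanPAUT. The function ep_case_sketch and the symbols it returns are hypothetical labels chosen for this example; they are not part of the package's API.

```julia
# Sketch only: classify (α, θ) according to the admissible parameter regions.
# Returns a symbol naming the case, or :invalid if the constraints fail.
function ep_case_sketch(α::Real, θ::Real)
    if α < 0
        # θ must equal -Lα for a positive integer L.
        # Note: with floating point inputs this exact check can fail
        # because of rounding (e.g. α = -0.1, θ = 0.3).
        L = -θ / α
        return (L >= 1 && isinteger(L)) ? :NAPT : :invalid
    elseif α == 0
        return θ > 0 ? :ZAPT : :invalid
    elseif α < 1
        θ > -α || return :invalid
        return θ == 0 ? :PAZT : :PAUT
    else
        return :invalid
    end
end

cases = [ep_case_sketch(-0.5, 2.0), ep_case_sketch(0, 1.0),
         ep_case_sketch(0.5, 0.0), ep_case_sketch(0.5, -0.25)]
```

Splitting the admissible region this way mirrors the separate types below, each of which stores only the parameters that are free in its case.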

References

Aldous, D. J. (1985) Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII — 1983. Lecture Notes in Mathematics 1117, 1-198. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/BFb0099421.

Gnedin, A. and Pitman, J. (2006) Exchangeable Gibbs partitions and Stirling triangles. Journal of Mathematical Sciences 138(3), 5674-5685. http://dx.doi.org/10.1007/s10958-006-0335-z.

Kerov, S. (2006) Coherent random allocations, and the Ewens-Pitman formula. Journal of Mathematical Sciences 138(3). http://dx.doi.org/10.1007/s10958-006-0338-9.

Pitman, J. (1995) Exchangeable and Partially Exchangeable Random Partitions. Probability Theory and Related Fields 102(2), 145-158. http://dx.doi.org/10.1007%2FBF01213386.

Pitman, J. (2006) Combinatorial Stochastic Processes. In École d'Été de Probabilités de Saint-Flour XXXII – 2002. Lecture Notes in Mathematics 1875. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/b11601500.

Kpax3.EwensPitmanNAPT — Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with α < 0 and θ = -Lα > 0 for some L ∈ {1, 2, ...}.

Fields

α Real number less than zero
L Integer number greater than zero

Kpax3.EwensPitmanPAUT — Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with 0 < α < 1, θ > -α and θ ≠ 0.

Fields

α Real number greater than zero and less than one
θ Real number greater than -α and different from zero

Kpax3.EwensPitmanPAZT — Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with 0 < α < 1 and θ = 0.

Fields

α Real number greater than zero and less than one

Kpax3.EwensPitmanZAPT — Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with α = 0 and θ > 0. This is equivalent to the Ewens sampling formula.

Fields

θ Real number greater than zero

Kpax3.KCSVError — Type

Kpax3 Exception

Description

Exception for a wrongly formatted CSV file.

Fields

msg Optional argument with a descriptive error string

Kpax3.KDomainError — Type

Kpax3 Exception

Description

Provides a message explaining the reason for the DomainError exception.

Fields

msg Optional argument with a descriptive error string

Kpax3.KFASTAError — Type

Kpax3 Exception

Description

Exception for a wrongly formatted FASTA file.

Fields

msg Optional argument with a descriptive error string

Kpax3.KInputError — Type

Kpax3 Exception

Description

Exception for wrong data read from a source.

Fields

msg Optional argument with a descriptive error string

Kpax3.NucleotideData — Type

Genetic data

Description

DNA data and its metadata.

Fields

data Multiple sequence alignment (MSA) encoded as a binary (UInt8) matrix
id units' ids
ref reference sequence, i.e. a vector of the same length as the original sequences storing the values of homogeneous sites. SNPs are instead represented by a value of 29
val vector with unique values per MSA site
key vector with indices of each value

Details

Let n be the total number of units and m_l be the total number of unique values observed at SNP l. Define m = m_1 + ... + m_L, where L is the total number of SNPs.

data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.

The value associated with row j can be obtained by val[j] while the SNP position by findall(ref .== 29)[key[j]].

References

Kpax3.dPriorRow — Method

Density of the Ewens-Pitman distribution

Description

Probability of a partition according to the Ewens-Pitman distribution.

Usage

dPriorRow(ep, p)
dPriorRow(ep, n, k, m)

Arguments