Kpax3.AminoAcidData
— Type
Genetic data
Description
Amino acid data and its metadata.
Fields
data
Multiple sequence alignment (MSA) encoded as a binary (UInt8) matrix
id
units' ids
ref
reference sequence, i.e. a vector of the same length as the original sequences, storing the values of homogeneous sites. SNPs are instead represented by a value of 29
val
vector with unique values per MSA site
key
vector with indices of each value
Details
Let n be the total number of units and m_l be the total number of unique values observed at SNP l. Define m = m_1 + ... + m_L, where L is the total number of SNPs.
data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.
The value associated with row j can be obtained by val[j] while the SNP position by findall(ref .== 29)[key[j]].
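The lookup described above can be sketched on toy metadata. The vectors below are made-up values chosen for illustration (they are not produced by Kpax3), and the helper names `row_value` and `row_position` are hypothetical. Note that the comparison with 29 is broadcast element-wise with `.==`.

```julia
# Toy metadata (hypothetical values, not produced by Kpax3)
ref = UInt8[97, 29, 99, 29, 29]          # 29 marks the SNPs at sites 2, 4 and 5
val = UInt8[97, 99, 97, 103, 99, 116]    # observed values, grouped by SNP
key = [1, 1, 2, 2, 3, 3]                 # SNP index of each row of `data`

# Positions of the SNPs within the original alignment
snp_positions = findall(ref .== 29)

# Value and alignment position associated with row j of the indicator matrix
row_value(j) = Char(val[j])
row_position(j) = snp_positions[key[j]]

println(row_value(4), " at site ", row_position(4))
```

Here row 4 of `data` refers to the second SNP, which sits at site 4 of the alignment.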
References
Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.
Kpax3.KSettings
— Type
User-defined settings for a Kpax3 run
Description
Fields
Kpax3.CategoricalData
— Method
CSV data
Description
Generic data and its metadata.
Fields
data
original data matrix encoded as a binary (UInt8) matrix
id
units' ids
ref
reference observation, i.e. a vector of the same length as the original observations, storing the values of homogeneous sites. Polymorphisms are instead represented by the string "."
val
vector with unique values per dataset column
key
vector with indices of each value
Details
Let n be the total number of units and m_l be the total number of unique values observed at polymorphic column l. Define m = m_1 + ... + m_L, where L is the total number of polymorphic columns.
data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.
The value associated with row j can be obtained by val[j] while the polymorphic position by findall(ref .== ".")[key[j]].
References
Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.
Kpax3.categorical2binary
— Method
Convert categorical (string) data to binary
Description
Convert a string matrix to a binary (indicator) matrix.
Usage
categorical2binary(data, missval)
Arguments
data
String matrix to be converted
missval
Value to be considered missing
Value
A tuple containing the following variables:
bindata
Original data matrix encoded as a binary (indicator) matrix
val
vector with unique values per MSA site
key
vector with indices of each value
Example
If data consists of just the following three units (one unit per column; blank cells are missing values)
   C  A
A  G  C
C  T
C  C  G
then bindata will be equal to
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 1 0
1 1 0
0 0 1
while
val = ["A", "C", "A", "C", "G", "C", "T", "C", "G"] (i.e. AC ACG CT CG)
key = [ 1, 1, 2, 2, 2, 3, 3, 4, 4 ] (i.e. 11 222 33 44)
Empty values (here missing data) are discarded.
Kpax3.categorical2binary
— Method
Convert categorical (integer) data to binary
Description
Convert an integer matrix to a binary (indicator) matrix.
Usage
categorical2binary(data, maxval, missval)
Arguments
data
Integer matrix to be converted
maxval
Theoretical maximum value observable in data
missval
Value to be considered missing
Value
A tuple containing the following variables:
bindata
Original data matrix encoded as a binary (indicator) matrix
val
vector with unique values per MSA site
key
vector with indices of each value
Example
If data consists of just the following three units (one unit per column)
0 2 1
1 3 2
2 4 0
2 2 3
then bindata will be equal to
0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 1 0
1 1 0
0 0 1
while
val = [1, 2, 1, 2, 3, 2, 4, 2, 3] (i.e. 12 123 24 23)
key = [1, 1, 2, 2, 2, 3, 3, 4, 4] (i.e. 11 222 33 44)
0 values (here missing data) are discarded.
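The example above can be reproduced with a minimal sketch of the conversion. The function `toy_categorical2binary` is a hypothetical re-implementation written for illustration, not the Kpax3 source. It assumes that each row of the input is a site and each column a unit, that values within a site are listed in increasing order, and it leaves out the `maxval` argument.

```julia
# Hypothetical re-implementation for illustration; not the Kpax3 source.
function toy_categorical2binary(data::Matrix{Int}, missval::Int)
    nsites, nunits = size(data)
    val = Int[]   # observed values, site by site
    key = Int[]   # site index of each entry of `val`
    for l in 1:nsites
        for v in sort(unique(data[l, :]))
            v == missval && continue   # missing values are discarded
            push!(val, v)
            push!(key, l)
        end
    end
    # One row per (site, value) pair, one column per unit
    bindata = zeros(UInt8, length(val), nunits)
    for j in eachindex(val), i in 1:nunits
        bindata[j, i] = data[key[j], i] == val[j]
    end
    return bindata, val, key
end

data = [0 2 1; 1 3 2; 2 4 0; 2 2 3]
bindata, val, key = toy_categorical2binary(data, 0)
```

Building `val` and `key` first and filling the indicator matrix afterwards keeps the allocation of `bindata` to a single call, at the cost of a second pass over the data.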
Kpax3.normalizepartition
— Method
Remove "gaps" and non-positive values from the partition.
Example: [1, 1, 0, 1, -2, 4, 0] -> [3, 3, 2, 3, 1, 4, 2]
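One way to obtain the relabelling shown in the example is to replace each label by its rank among the sorted distinct labels. The sketch below does exactly that. `toy_normalizepartition` is a hypothetical name, and the rank-based rule is inferred from the single example above, so the actual Kpax3 method may order labels differently in other cases.

```julia
# Hypothetical helper, not the Kpax3 implementation: relabel clusters by rank.
function toy_normalizepartition(p::Vector{Int})
    labels = sort(unique(p))   # distinct labels in increasing order
    # Each label is replaced by its position in `labels`, so the result
    # uses exactly the integers 1, ..., length(labels)
    return [searchsortedfirst(labels, x) for x in p]
end

normalized = toy_normalizepartition([1, 1, 0, 1, -2, 4, 0])
println(normalized)
```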
Kpax3.readfasta
— Method
readfasta(ifile::String, protein::Bool, miss::Vector{UInt8}, l::Int, verbose::Bool, verbosestep::Int)
Read data in FASTA format and convert it to an integer matrix. Sequences are required to be aligned. Only polymorphic columns are stored.
Arguments
ifile
Path to the input data file
protein
true if reading protein data or false if reading DNA data
miss
Characters (as UInt8) to be considered missing. Use miss = zeros(UInt8, 1) if all characters are to be considered valid. Default characters for miss are:
- DNA data: ?, *, #, -, b, d, h, k, m, n, r, s, v, w, x, y, j, z
- Protein data: ?, *, #, -, b, j, x, z
l
Sequence length. If unknown, it is better to choose a value that is surely greater than the real sequence length. If l is found to be insufficient, the array size is dynamically increased (not recommended from a performance point of view). The default value should be sufficient for most datasets
verbose
If true, print status reports
verbosestep
Print a status report every verbosestep read sequences
Details
When computing evolutionary distances, don't put the gap symbol - among the missing values. Indeed, indels are an important piece of information for genetic distances.
FASTA data is encoded as standard 7-bit ASCII codes. The only exception is Uracil, which is given the same value 116 as Thymidine, i.e. each 'u' is silently converted to 't' when reading DNA data. The conversion tables are the following:
+--------------------------------------+
| Conversion table (DNA)               |
+---------------------+------+---------+
| Nucleotide          | Code | Integer |
+---------------------+------+---------+
| Adenosine           | A    | 97      |
| Cytosine            | C    | 99      |
| Guanine             | G    | 103     |
| Thymidine           | T    | 116     |
| Uracil              | U    | 116     |
| Purine (A or G)     | R    | 114     |
| Pyrimidine (C or T) | Y    | 121     |
| Keto                | K    | 107     |
| Amino               | M    | 109     |
| Strong Interaction  | S    | 115     |
| Weak Interaction    | W    | 119     |
| Not A               | B    | 98      |
| Not C               | D    | 100     |
| Not G               | H    | 104     |
| Not T or U          | V    | 118     |
| Any                 | N    | 110     |
| Gap                 | -    | 45      |
| Masked              | X    | 120     |
+---------------------+------+---------+

+----------------------------------------------+
| Conversion table (PROTEIN)                   |
+-----------------------------+------+---------+
| Amino Acid                  | Code | Integer |
+-----------------------------+------+---------+
| Alanine                     | A    | 97      |
| Arginine                    | R    | 114     |
| Asparagine                  | N    | 110     |
| Aspartic acid               | D    | 100     |
| Cysteine                    | C    | 99      |
| Glutamine                   | Q    | 113     |
| Glutamic acid               | E    | 101     |
| Glycine                     | G    | 103     |
| Histidine                   | H    | 104     |
| Isoleucine                  | I    | 105     |
| Leucine                     | L    | 108     |
| Lysine                      | K    | 107     |
| Methionine                  | M    | 109     |
| Phenylalanine               | F    | 102     |
| Proline                     | P    | 112     |
| Pyrrolysine                 | O    | 111     |
| Selenocysteine              | U    | 117     |
| Serine                      | S    | 115     |
| Threonine                   | T    | 116     |
| Tryptophan                  | W    | 119     |
| Tyrosine                    | Y    | 121     |
| Valine                      | V    | 118     |
| Asparagine or Aspartic acid | B    | 98      |
| Glutamine or Glutamic acid  | Z    | 122     |
| Leucine or Isoleucine       | J    | 106     |
| Gap                         | -    | 45      |
| Translation stop            | *    | 42      |
| Any                         | X    | 120     |
+-----------------------------+------+---------+
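The integer codes in the tables are the ASCII codes of the lowercase letters (for instance 97 is 'a'). A small sketch of the DNA encoding under that reading follows. `toy_encode_dna` is a hypothetical helper, not part of Kpax3, and the lowercasing step is an assumption inferred from the table values.

```julia
# Hypothetical helper mirroring the DNA conversion table; not part of Kpax3.
function toy_encode_dna(seq::AbstractString)
    # 7-bit ASCII codes of the lowercase characters (assumed from the table)
    codes = Vector{UInt8}(lowercase(seq))
    # Uracil is given the same code as Thymidine
    replace!(codes, UInt8('u') => UInt8('t'))
    return codes
end

codes = toy_encode_dna("ACGU-")
println(Int.(codes))
```

The gap symbol keeps its own code (45), so it is only treated as missing if it is listed in `miss`.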
Value
A tuple containing the following variables:
data
Multiple Sequence Alignment (MSA) encoded as a UInt8 matrix
id
Units' ids
ref
Reference sequence, i.e. a vector of the same length as the original sequences, storing the values of homogeneous sites. SNPs are instead represented by a value of 46 ('.')
Kpax3.EwensPitman
— Type
Ewens-Pitman distribution
Description
The Ewens-Pitman distribution is a discrete probability distribution on the partitions of the set N = {1, ..., n} (n = 1, 2, ...). It is a two-parameter generalization of the Ewens sampling formula. The latter is also known as the Chinese Restaurant Process or as the marginal distribution of a Dirichlet Process.
Fields
α
Real number (see details)
θ
Real number (see details)
Details
The two parameters must satisfy one of the following conditions:
- α < 0 and θ = -Lα for some L ∈ {1, 2, ...}
- 0 ≤ α < 1 and θ > -α
The special case α = 0 and θ > 0 corresponds to the Ewens sampling formula.
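For reference, the probability that the two-parameter model assigns to a set partition with block sizes n_1, ..., n_k is usually written (Pitman, 1995) as the product of (θ + iα) for i = 1, ..., k - 1, divided by the rising factorial (θ + 1)_{n-1}, times the product over blocks of (1 - α)_{n_j - 1}. The sketch below evaluates that expression directly. `rising` and `toy_eppf` are hypothetical names, not Kpax3 functions, and the plain product form is chosen for readability rather than numerical stability.

```julia
# Rising factorial x (x + 1) ... (x + m - 1); equals 1 when m == 0
function rising(x, m)
    r = 1.0
    for i in 0:(m - 1)
        r *= x + i
    end
    return r
end

# Hypothetical helper: probability of one set partition with the given
# block sizes under the two-parameter (α, θ) model
function toy_eppf(α, θ, blocksizes)
    n = sum(blocksizes)
    k = length(blocksizes)
    p = 1.0
    for i in 1:(k - 1)
        p *= θ + i * α
    end
    p /= rising(θ + 1, n - 1)
    for nj in blocksizes
        p *= rising(1 - α, nj - 1)
    end
    return p
end

# With n = 3 there is one partition of shape [3], three of shape [2, 1]
# and one of shape [1, 1, 1]; their probabilities should sum to one.
total = toy_eppf(0.5, 0.3, [3]) + 3 * toy_eppf(0.5, 0.3, [2, 1]) +
        toy_eppf(0.5, 0.3, [1, 1, 1])
println(total)
```

Setting α = 0 recovers the Ewens sampling formula, e.g. `toy_eppf(0.0, 1.0, [3])` is the probability of the single-block partition of three elements.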
References
Aldous, D. J. (1985) Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII - 1983. Lecture Notes in Mathematics 1117, 1-198. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/BFb0099421.
Gnedin, A. and Pitman, J. (2006) Exchangeable Gibbs partitions and Stirling triangles. Journal of Mathematical Sciences 138(3), 5674-5685. http://dx.doi.org/10.1007/s10958-006-0335-z.
Kerov, S. (2006) Coherent random allocations, and the Ewens-Pitman formula. Journal of Mathematical Sciences 138(3). http://dx.doi.org/10.1007/s10958-006-0338-9.
Pitman, J. (1995) Exchangeable and Partially Exchangeable Random Partitions. Probability Theory and Related Fields 102(2), 145-158. http://dx.doi.org/10.1007%2FBF01213386.
Pitman, J. (2006) Combinatorial Stochastic Processes. In École d'Été de Probabilités de Saint-Flour XXXII - 2002. Lecture Notes in Mathematics 1875. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/b11601500.
Kpax3.EwensPitmanNAPT
— Type
Ewens-Pitman distribution
Description
Ewens-Pitman distribution with α < 0 and θ = -Lα > 0 for some L ∈ {1, 2, ...}.
Fields
α
Real number less than zero
L
Integer number greater than zero
Kpax3.EwensPitmanPAUT
— Type
Ewens-Pitman distribution
Description
Ewens-Pitman distribution with 0 < α < 1, θ > -α and θ ≠ 0.
Fields
α
Real number greater than zero and less than one
θ
Real number greater than -α but different from zero
Kpax3.EwensPitmanPAZT
— Type
Ewens-Pitman distribution
Description
Ewens-Pitman distribution with 0 < α < 1 and θ = 0.
Fields
α
Real number greater than zero and less than one
Kpax3.EwensPitmanZAPT
— Type
Ewens-Pitman distribution
Description
Ewens-Pitman distribution with α = 0 and θ > 0. This is equivalent to the Ewens sampling formula.
Fields
θ
Real number greater than zero
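The four parameter regimes documented above can be summarised by a small classifier. `toy_ep_case` is a hypothetical helper, not a Kpax3 function; it only returns the name of the matching type as a symbol. The test θ = -Lα is done with exact floating-point arithmetic, which is an assumption that can fail for values such as α = -0.1, so treat this as a sketch.

```julia
# Hypothetical helper: name of the documented type matching (α, θ),
# or :invalid when the pair lies outside the admissible region
function toy_ep_case(α::Real, θ::Real)
    if α < 0
        L = -θ / α   # must be a positive integer
        return (L >= 1 && isinteger(L)) ? :EwensPitmanNAPT : :invalid
    elseif α == 0
        return θ > 0 ? :EwensPitmanZAPT : :invalid
    elseif α < 1
        θ == 0 && return :EwensPitmanPAZT
        return θ > -α ? :EwensPitmanPAUT : :invalid
    end
    return :invalid
end

println(toy_ep_case(-0.5, 1.5))
println(toy_ep_case(0.5, -0.2))
```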
Kpax3.KCSVError
— Type
Kpax3 Exception
Description
Exception for a wrongly formatted CSV file.
Fields
msg
Optional argument with a descriptive error string
Kpax3.KDomainError
— Type
Kpax3 Exception
Description
Provides a message explaining the reason for the DomainError exception.
Fields
msg
Optional argument with a descriptive error string
Kpax3.KFASTAError
— Type
Kpax3 Exception
Description
Exception for a wrongly formatted FASTA file.
Fields
msg
Optional argument with a descriptive error string
Kpax3.KInputError
— Type
Kpax3 Exception
Description
Exception for wrong data read from a source.
Fields
msg
Optional argument with a descriptive error string
Kpax3.NucleotideData
— Type
Genetic data
Description
DNA data and its metadata.
Fields
data
Multiple sequence alignment (MSA) encoded as a binary (UInt8) matrix
id
units' ids
ref
reference sequence, i.e. a vector of the same length as the original sequences, storing the values of homogeneous sites. SNPs are instead represented by a value of 29
val
vector with unique values per MSA site
key
vector with indices of each value
Details
Let n be the total number of units and m_l be the total number of unique values observed at SNP l. Define m = m_1 + ... + m_L, where L is the total number of SNPs.
data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.
The value associated with row j can be obtained by val[j] while the SNP position by findall(ref .== 29)[key[j]].
References
Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.
Kpax3.dPriorRow
— Method
Density of the Ewens-Pitman distribution
Description
Probability of a partition according to the Ewens-Pitman distribution.
Usage
dPriorRow(ep, p)
dPriorRow(ep, n, k, m)
Arguments
ep
Object of (super)type EwensPitman
p
Vector of integers representing a partition
n
Set size (Integer)
k
Number of blocks (Integer)
m
Vector of integers representing block sizes
Details
Examples
Kpax3.logcondpostC
— Method
log p(C | R, X)
Kpax3.logcondpostS
— Method
log p(S | R, X)
Kpax3.logposterior
— Method
Log posterior minus the log(normalizing constant)
Kpax3.logpriorC
— Method
log p(C | R)