Kpax3.AminoAcidData - Type

Genetic data

Description

Amino acid data and its metadata.

Fields

  • data Multiple sequence alignment (MSA) encoded as a binary (UInt8) matrix
  • id units' ids
  • ref reference sequence, i.e. a vector of the same length as the original sequences, storing the values of homogeneous sites. SNPs are instead represented by a value of 29

  • val vector with unique values per MSA site
  • key vector with indices of each value

Details

Let n be the total number of units and m_l be the total number of unique values observed at SNP l. Define m = m_1 + ... + m_L, where L is the total number of SNPs.

data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.

The value associated with row j can be obtained by val[j], while the SNP position is given by findall(ref .== 29)[key[j]].
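The indexing described above can be sketched with a few toy vectors. The values below are invented for illustration and are not real Kpax3 output; the comparison uses broadcasting (.==) so that it is applied element-wise.

```julia
# Toy vectors, invented for illustration only (not real Kpax3 output).
# ref marks SNP sites with 29; other entries are values of homogeneous sites.
ref = UInt8[1, 29, 3, 29, 29]
val = UInt8[1, 2, 1, 3, 2, 4]    # value represented by each row of data
key = [1, 1, 2, 2, 3, 3]         # SNP index (1..L) of each row of data

snp_sites = findall(ref .== 29)  # positions of the SNPs in the original alignment

j = 4
value_j = val[j]                 # value represented by row j
site_j = snp_sites[key[j]]       # alignment position that row j refers to
```

Here row 4 of data would refer to the second SNP, which sits at position 4 of the original alignment.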

References

Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.

Kpax3.KSettings - Type

User-defined settings for a Kpax3 run

Description

Fields

Kpax3.CategoricalData - Method

CSV data

Description

Generic data and its metadata.

Fields

  • data original data matrix encoded as a binary (UInt8) matrix
  • id units' ids
  • ref reference observation, i.e. a vector of the same length as the original observations, storing the values of homogeneous sites. Polymorphisms are instead represented by the string "."

  • val vector with unique values per dataset column
  • key vector with indices of each value

Details

Let n be the total number of units and m_l be the total number of unique values observed at polymorphic column l. Define m = m_1 + ... + m_L, where L is the total number of polymorphic columns.

data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.

The value associated with row j can be obtained by val[j], while the polymorphic position is given by findall(ref .== ".")[key[j]].

References

Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.

Kpax3.categorical2binary - Method

Convert categorical (string) data to binary

Description

Convert a string matrix to a binary (indicator) matrix.

Usage

categorical2binary(data, missval)

Arguments

  • data String matrix to be converted
  • missval Value to be considered missing

Value

A tuple containing the following variables:

  • bindata Original data matrix encoded as a binary (indicator) matrix
  • val vector with unique values per MSA site
  • key vector with indices of each value

Example

If data consists of just the following three units (one per column, with blank entries denoting missing values)

  C A
A G C
C T
C C G

then bindata will be equal to

0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 1 0
1 1 0
0 0 1

while

val = ["A", "C", "A", "C", "G", "C", "T", "C", "G"] (i.e. AC ACG CT CG)
key = [ 1,   1,   2,   2,   2,   3,   3,   4,   4 ] (i.e. 11 222 33 44)

Blank values (here missing data) are discarded.

Kpax3.categorical2binary - Method

Convert categorical (integer) data to binary

Description

Convert an integer matrix to a binary (indicator) matrix.

Usage

categorical2binary(data, maxval, missval)

Arguments

  • data Integer matrix to be converted
  • maxval Theoretical maximum value observable in data
  • missval Value to be considered missing

Value

A tuple containing the following variables:

  • bindata Original data matrix encoded as a binary (indicator) matrix
  • val vector with unique values per MSA site
  • key vector with indices of each value

Example

If data consists of just the following three units (one per column)

0 2 1
1 3 2
2 4 0
2 2 3

then bindata will be equal to

0 0 1
0 1 0
1 0 0
0 0 1
0 1 0
1 0 0
0 1 0
1 1 0
0 0 1

while

val = [1, 2, 1, 2, 3, 2, 4, 2, 3] (i.e. 12 123 24 23)
key = [1, 1, 2, 2, 2, 3, 3, 4, 4] (i.e. 11 222 33 44)

0 values (here missing data) are discarded.
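The example above can be reproduced with a short sketch. The function name categorical2binary_sketch is hypothetical, it omits the maxval argument, and it is not the Kpax3 implementation; it only assumes that, for each site (row of data), one indicator row is emitted per sorted unique non-missing value.

```julia
# Hypothetical re-implementation for illustration; not the Kpax3 source.
function categorical2binary_sketch(data::Matrix{Int}, missval::Int)
    nsites = size(data, 1)
    rows = Vector{Vector{UInt8}}()
    val = Int[]
    key = Int[]
    for s in 1:nsites
        site = data[s, :]
        # sorted unique non-missing values observed at site s
        vals = sort(unique(filter(x -> x != missval, site)))
        for v in vals
            push!(rows, UInt8.(site .== v))  # 1 where the unit carries value v
            push!(val, v)
            push!(key, s)
        end
    end
    # stack the indicator vectors as rows of an m-by-n matrix
    bindata = permutedims(reduce(hcat, rows))
    return bindata, val, key
end

data = [0 2 1; 1 3 2; 2 4 0; 2 2 3]
bindata, val, key = categorical2binary_sketch(data, 0)
```

With the data above, bindata has 9 rows and 3 columns, one row for each (site, value) pair.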

Kpax3.normalizepartition - Method

Remove "gaps" and non-positive values from the partition.

Example: [1, 1, 0, 1, -2, 4, 0] -> [3, 3, 2, 3, 1, 4, 2]
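One way to obtain the mapping shown in the example is to replace each label by its rank among the sorted unique labels. The sketch below does exactly that; relabel_partition is a hypothetical name, and the actual Kpax3 code may proceed differently.

```julia
# Hypothetical sketch: relabel a partition so that its labels are 1, 2, ..., k,
# preserving the ordering of the original labels.
function relabel_partition(p::Vector{Int})
    labels = sort(unique(p))                             # e.g. [-2, 0, 1, 4]
    rank = Dict(v => i for (i, v) in enumerate(labels))  # -2 => 1, 0 => 2, ...
    return [rank[v] for v in p]
end

normalized = relabel_partition([1, 1, 0, 1, -2, 4, 0])
```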

Kpax3.readfasta - Method

readfasta(ifile::String, protein::Bool, miss::Vector{UInt8}, l::Int, verbose::Bool, verbosestep::Int)

Read data in FASTA format and convert it to an integer matrix. Sequences are required to be aligned. Only polymorphic columns are stored.

Arguments

  • ifile Path to the input data file
  • protein true if reading protein data or false if reading DNA data
  • miss Characters (as UInt8) to be considered missing. Use miss = zeros(UInt8, 1) if all characters are to be considered valid. Default characters for miss are:
      - DNA data: ?, *, #, -, b, d, h, k, m, n, r, s, v, w, x, y, j, z
      - Protein data: ?, *, #, -, b, j, x, z
  • l Sequence length. If unknown, it is better to choose a value that is surely greater than the real sequence length. If l is found to be insufficient, the array size is dynamically increased (not recommended from a performance point of view). The default value should be sufficient for most datasets

  • verbose If true, print status reports
  • verbosestep Print a status report every verbosestep read sequences

Details

When computing evolutionary distances, don't put the gap symbol - among the missing values. Indeed, indels are an important piece of information for genetic distances.

FASTA data is encoded as standard 7-bit ASCII codes. The only exception is Uracil, which is given the same value as Thymidine (116), i.e. each 'u' is silently converted to 't' when reading DNA data. The conversion tables are the following:

+--------------------------------------+
| Conversion table (DNA)               |
+---------------------+------+---------+
| Nucleotide          | Code | Integer |
+---------------------+------+---------+
| Adenosine           | A    |      97 |
| Cytosine            | C    |      99 |
| Guanine             | G    |     103 |
| Thymidine           | T    |     116 |
| Uracil              | U    |     116 |
| Purine (A or G)     | R    |     114 |
| Pyrimidine (C or T) | Y    |     121 |
| Keto                | K    |     107 |
| Amino               | M    |     109 |
| Strong Interaction  | S    |     115 |
| Weak Interaction    | W    |     119 |
| Not A               | B    |      98 |
| Not C               | D    |     100 |
| Not G               | H    |     104 |
| Not T or U          | V    |     118 |
| Any                 | N    |     110 |
| Gap                 | -    |      45 |
| Masked              | X    |     120 |
+---------------------+------+---------+

+----------------------------------------------+
| Conversion table (PROTEIN)                   |
+-----------------------------+------+---------+
| Amino Acid                  | Code | Integer |
+-----------------------------+------+---------+
| Alanine                     | A    |      97 |
| Arginine                    | R    |     114 |
| Asparagine                  | N    |     110 |
| Aspartic acid               | D    |     100 |
| Cysteine                    | C    |      99 |
| Glutamine                   | Q    |     113 |
| Glutamic acid               | E    |     101 |
| Glycine                     | G    |     103 |
| Histidine                   | H    |     104 |
| Isoleucine                  | I    |     105 |
| Leucine                     | L    |     108 |
| Lysine                      | K    |     107 |
| Methionine                  | M    |     109 |
| Phenylalanine               | F    |     102 |
| Proline                     | P    |     112 |
| Pyrrolysine                 | O    |     111 |
| Selenocysteine              | U    |     117 |
| Serine                      | S    |     115 |
| Threonine                   | T    |     116 |
| Tryptophan                  | W    |     119 |
| Tyrosine                    | Y    |     121 |
| Valine                      | V    |     118 |
| Asparagine or Aspartic acid | B    |      98 |
| Glutamine or Glutamic acid  | Z    |     122 |
| Leucine or Isoleucine       | J    |     106 |
| Gap                         | -    |      45 |
| Translation stop            | *    |      42 |
| Any                         | X    |     120 |
+-----------------------------+------+---------+
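The integer codes in the tables are the ASCII codes of the lowercase letters (for instance 97 is 'a'). The following sketch shows how such an encoding could be obtained for a single sequence. The helper encode_sequence is hypothetical and not part of Kpax3; it assumes that characters are lowercased before encoding, as the tables suggest.

```julia
# Hypothetical helper for illustration; not part of the Kpax3 API.
# Encode a sequence as lowercase ASCII codes; for DNA data, map 'u' to 't'.
function encode_sequence(seq::AbstractString, protein::Bool)
    codes = collect(codeunits(lowercase(seq)))  # Vector{UInt8} of ASCII codes
    if !protein
        # Uracil receives the same code as Thymidine (116)
        replace!(codes, UInt8('u') => UInt8('t'))
    end
    return codes
end

dna = encode_sequence("ACGU-", false)
aa = encode_sequence("U", true)   # Selenocysteine keeps its own code
```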

Value

A tuple containing the following variables:

  • data Multiple Sequence Alignment (MSA) encoded as a UInt8 matrix
  • id Units' ids
  • ref Reference sequence, i.e. a vector of the same length as the original sequences, storing the values of homogeneous sites. SNPs are instead represented by a value of 46 ('.')

Kpax3.EwensPitman - Type

Ewens-Pitman distribution

Description

The Ewens-Pitman distribution is a discrete probability distribution on the partitions of the set N = {1, ..., n} (n = 1, 2, ...). It is a two-parameter generalization of the Ewens sampling formula. The latter is also known as the Chinese Restaurant Process or as the marginal distribution of a Dirichlet Process.

Fields

  • α Real number (see details)
  • θ Real number (see details)

Details

The two parameters must satisfy either

  • α < 0 and θ = -Lα for some L ∈ {1, 2, ...}
  • 0 ≤ α < 1 and θ > -α

The special case α = 0 and θ > 0 corresponds to the Ewens sampling formula.
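The two admissible parameter regions can be checked programmatically. The function below is a hypothetical helper written for illustration, not part of the Kpax3 API; it tests the α < 0 case up to floating point tolerance.

```julia
# Hypothetical helper, not part of the Kpax3 API.
# Returns true if (α, θ) satisfies one of the two conditions listed above.
function is_valid_ewens_pitman(α::Real, θ::Real)
    if α < 0
        # θ must equal -Lα for some positive integer L
        L = -θ / α
        return L >= 1 && isapprox(L, round(L))
    else
        return α < 1 && θ > -α
    end
end

is_valid_ewens_pitman(0.0, 1.0)    # Ewens sampling formula case
is_valid_ewens_pitman(-1.0, 3.0)   # α < 0 with L = 3
is_valid_ewens_pitman(0.5, -0.5)   # θ is not greater than -α
```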

References

Aldous, D. J. (1985) Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII — 1983. Lecture Notes in Mathematics 1117, 1-198. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/BFb0099421.

Gnedin, A. and Pitman, J. (2006) Exchangeable Gibbs partitions and Stirling triangles. Journal of Mathematical Sciences 138(3), 5674-5685. http://dx.doi.org/10.1007/s10958-006-0335-z.

Kerov, S. (2006) Coherent random allocations, and the Ewens-Pitman formula. Journal of Mathematical Sciences 138(3). http://dx.doi.org/10.1007/s10958-006-0338-9.

Pitman, J. (1995) Exchangeable and Partially Exchangeable Random Partitions. Probability Theory and Related Fields 102(2), 145-158. http://dx.doi.org/10.1007%2FBF01213386.

Pitman, J. (2006) Combinatorial Stochastic Processes. In École d'Été de Probabilités de Saint-Flour XXXII – 2002. Lecture Notes in Mathematics 1875. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/b11601500.

Kpax3.EwensPitmanNAPT - Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with α < 0 and θ = -Lα > 0 for some L ∈ {1, 2, ...}.

Fields

  • α Real number less than zero
  • L Integer number greater than zero

Kpax3.EwensPitmanPAUT - Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with 0 < α < 1, θ > -α and θ ≠ 0.

Fields

  • α Real number greater than zero and less than one
  • θ Real number greater than -α and different from zero

Kpax3.EwensPitmanPAZT - Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with 0 < α < 1 and θ = 0.

Fields

  • α Real number greater than zero and less than one

Kpax3.EwensPitmanZAPT - Type

Ewens-Pitman distribution

Description

Ewens-Pitman distribution with α = 0 and θ > 0. This is equivalent to the Ewens sampling formula.

Fields

  • θ Real number greater than zero

Kpax3.KCSVError - Type

Kpax3 Exception

Description

Exception for a wrongly formatted CSV file.

Fields

  • msg Optional argument with a descriptive error string

Kpax3.KDomainError - Type

Kpax3 Exception

Description

Provides a message explaining the reason for the DomainError exception.

Fields

  • msg Optional argument with a descriptive error string

Kpax3.KFASTAError - Type

Kpax3 Exception

Description

Exception for a wrongly formatted FASTA file.

Fields

  • msg Optional argument with a descriptive error string

Kpax3.KInputError - Type

Kpax3 Exception

Description

Exception for invalid data read from a source.

Fields

  • msg Optional argument with a descriptive error string

Kpax3.NucleotideData - Type

Genetic data

Description

DNA data and its metadata.

Fields

  • data Multiple sequence alignment (MSA) encoded as a binary (UInt8) matrix
  • id units' ids
  • ref reference sequence, i.e. a vector of the same length as the original sequences, storing the values of homogeneous sites. SNPs are instead represented by a value of 29

  • val vector with unique values per MSA site
  • key vector with indices of each value

Details

Let n be the total number of units and m_l be the total number of unique values observed at SNP l. Define m = m_1 + ... + m_L, where L is the total number of SNPs.

data is an m-by-n indicator matrix, i.e. data[j, i] is 1 if unit i possesses value j, 0 otherwise.

The value associated with row j can be obtained by val[j], while the SNP position is given by findall(ref .== 29)[key[j]].

References

Pessia A., Grad Y., Cobey S., Puranen J. S. and Corander J. (2015). K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets. Microbial Genomics 1(1). http://dx.doi.org/10.1099/mgen.0.000025.

Kpax3.dPriorRow - Method

Density of the Ewens-Pitman distribution

Description

Probability of a partition according to the Ewens-Pitman distribution.

Usage

dPriorRow(ep, p)
dPriorRow(ep, n, k, m)

Arguments

  • ep Object of (super)type EwensPitman
  • p Vector of integers representing a partition
  • n Set size (Integer)
  • k Number of blocks (Integer)
  • m Vector of integers representing block sizes
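For orientation, the probability of a partition under the two-parameter Ewens-Pitman distribution depends only on the block sizes. The sketch below evaluates the standard formula from the Pitman references with rising factorials. The names rising and ewens_pitman_prob are hypothetical, the computation is done on the natural scale for readability, and nothing here is taken from the Kpax3 implementation of dPriorRow.

```julia
# Rising factorial (x)_m = x (x + 1) ... (x + m - 1), with (x)_0 = 1.
rising(x::Real, m::Integer) = prod(Float64[x + i for i in 0:(m - 1)])

# Hypothetical sketch (not the Kpax3 code): probability of a partition of
# n = sum(m) elements into k = length(m) blocks with sizes m[1], ..., m[k]:
#   prod_{i=1}^{k-1} (θ + iα) / (θ + 1)_{n-1} * prod_{j=1}^{k} (1 - α)_{m[j]-1}
function ewens_pitman_prob(α::Real, θ::Real, m::Vector{Int})
    n = sum(m)
    k = length(m)
    num = prod(Float64[θ + i * α for i in 1:(k - 1)])
    blocks = prod(Float64[rising(1 - α, mj - 1) for mj in m])
    return num * blocks / rising(θ + 1, n - 1)
end

# The five partitions of a 3-element set: one with block sizes (3),
# three with block sizes (2, 1), one with block sizes (1, 1, 1).
total = ewens_pitman_prob(0.5, 0.5, [3]) +
        3 * ewens_pitman_prob(0.5, 0.5, [2, 1]) +
        ewens_pitman_prob(0.5, 0.5, [1, 1, 1])
```

Summing over all partitions of a small set, as done for total, is a quick sanity check that the probabilities add up to one.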

Details

Examples